Benjamin Cremer

@benjamincremer
Organizer of the PHP Usergroup Münster
Lead Developer at CHECK24 Hotel
Linux / Operations Guy / On-Call

Using elks and large telegraph poles to squash tiny bugs

Application metrics and log aggregation with open source tools

Disclaimer

  • This has no storyline
  • It has no fancy graphics
  • It contains a lot of practical information
  • Examples are taken from CHECK24 Hotel
  • The stack runs on a high double-digit number of production nodes

Topics

  • Log aggregation and visualization
  • System Monitoring
  • Application Metrics
  • Profiling
  • Alerting and Escalating

Log Aggregation

Logfiles are everywhere

  • Application logfiles: myapp/logs/app.log
  • Daemon logfiles: /var/log/nginx/*.log, mysql, fpm etc.
  • systemd journal

grep and tail -f are great tools, as long as you only have a single node
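On one box, the classics go a long way. A quick sketch (path and log lines are illustrative, matching the application log format shown later):

```shell
# Create a sample application log (illustrative path and content)
log=/tmp/app.log
printf '%s\n' \
  '[2021-01-21 16:37:11 - WARN (4)]: User logged in. ~>{"username":"benjamin","status":"admin"}' \
  '[2021-01-21 16:37:12 - INFO (6)]: Healthcheck ok. ~>{}' > "$log"

# All WARN lines that mention a specific user
grep 'WARN' "$log" | grep '"username":"benjamin"'

# Live view while reproducing a bug (blocks until Ctrl-C):
# tail -f "$log" | grep --line-buffered 'ERROR'
```

This stops working once the same request touches a dozen servers, which is what the rest of the stack is for.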

Requirements

  • Aggregate logs from various sources and servers to a central storage
  • Transform unstructured loglines into structured documents
  • Fulltext Search and Filters

Filebeat

  • Collects logfiles from services (nginx, fpm) and applications
  • Tagging and simple filtering
  • Runs on each node/container that generates logfiles
  • Sends data to logstash for further processing

Log Format for Application logs


// Example for Laminas Log Writer
'options' => [
  'stream' => __DIR__ . '/logs/app.log',
  'formatter' => [
    'name' => Simple::class,
    'options' => [
      // Kibana log format
      'format'         => '[%timestamp% - %priorityName% (%priority%)]: %message% ~>%extra%',
      'dateTimeFormat' => 'Y-m-d H:i:s',
    ],
  ],
],
                    

$container->get(LoggerInterface::class)->warning(
    'User logged in.',
    ['username' => 'benjamin', 'status' => 'admin']
);
            

$ cat logs/app.log
[2021-01-21 16:37:11 - WARN (4)]: User logged in. ~>{"username":"benjamin","status":"admin"}
    

filebeat.inputs:
- type: log
  paths:
    - /srv/application/logs/*.log
  fields:
    log_type: application_log_format
  fields_under_root: true
  exclude_lines: ['DEBUG']

- type: log
  paths:
    - /var/log/messages
    - /var/log/syslog
  fields:
    log_type: syslog
  fields_under_root: true
  exclude_lines: ['DEBUG']

output.logstash:
    hosts: ["192.168.1.10:5043"]
            

More Beats

  • journalbeat
  • metricbeat
  • winlogbeat
  • ...

Logstash

  • Parse & transform data on the fly
  • Derive structure from unstructured data with grok
  • Lookup geo coordinates from IP addresses
  • Anonymize or exclude sensitive fields, and ease overall processing

Logstash

Input(s) -> Filter(s) -> Output(s)


# /etc/logstash/conf.d/01-input-beats.conf
input {
    beats {
        port => "5043"
    }
}
            

# /etc/logstash/conf.d/99-output.elasticsearch.conf
output {
    elasticsearch {
        hosts => ["192.168.1.100:9200"]
        manage_template => false
        index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    }
}
            

# /etc/logstash/conf.d/10-filter-application_log_format.conf
filter {
    if [log_type] == "application_log_format" {
        grok {
            match => { "message" => "%{APPLICATION_LOG_FORMAT}" }
        }
        date {
            match => [ "timestamp", "yyyy-MM-dd HH:mm:ss" ]
            remove_field => ["timestamp"]
        }
        json {
            source => "jsonmsg"
        }
        mutate {
            uppercase => [ "severity" ]
        }
        if [severity] == "WARNING" {
            mutate {
                update => { "severity" => "WARN" }
            }
        }
    }
}
            

Grok Pattern


APPLICATION_LOG_FORMAT \[%{TIMESTAMP_ISO8601:timestamp} - %{WORD:severity} (\(%{NUMBER:warnlevelnum}\))?\]:(?m)%{GREEDYDATA:short_message} ~>%{GREEDYDATA:jsonmsg}

$ cat logs/app.log
[2021-01-21 16:37:11 - WARN (4)]: User logged in. ~>{"username":"benjamin","status":"admin"}
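As a sanity check outside Logstash, the shape of the pattern can be approximated with a plain extended regex (a rough sketch; grok's named captures are dropped):

```shell
# Sample line in the application log format
line='[2021-01-21 16:37:11 - WARN (4)]: User logged in. ~>{"username":"benjamin","status":"admin"}'

# Rough ERE equivalent of APPLICATION_LOG_FORMAT: timestamp, severity,
# optional numeric level, message, then the JSON context after "~>"
echo "$line" | grep -E '^\[[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - [A-Z]+ (\([0-9]+\))?\]:.* ~>.*$'
```

For iterating on the real grok pattern, Kibana's Grok Debugger is the more comfortable tool.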
    

Correlation ID

  • Unique ID created "at the edge"
  • Created for every frontend pageload
  • Passed down to subsystems (e.g. background jobs) via HTTP header
  • Find all requests related to an error
  • Header: C24-Correlation-Id
  • Logfile field: correlation_id

final class CorrelationIdMiddleware implements MiddlewareInterface
{
    private const HEADER_NAME = 'C24-Correlation-Id';

    public function process(Request $request, RequestHandlerInterface $handler): Response
    {
        $correlationID = $this->getCorrelationIdFromRequestHeader($request);
        if ($correlationID === null) {
            $correlationID = CorrelationId::create();
        }

        $response = $handler->handle(
            $request->withAttribute(CorrelationId::class, $correlationID)
        );

        return $response->withHeader(self::HEADER_NAME, $correlationID->toString());
    }
}
            

class CorrelationIdProcessor implements ProcessorInterface
{
    public function process(array $event): array
    {
        $event['extra']['correlation_id'] = $this->requestContext->getCorrelationId();

        return $event;
    }
}
            

System Monitoring and Metrics

TIG Stack (Telegraf, InfluxDB, and Grafana)

Telegraf

Agent written in Go that collects, processes, aggregates, and writes metrics

Runs on each host and pushes data to InfluxDB

Telegraf

  • Single binary with no external dependencies
  • Minimal memory footprint
  • Plugin system allows for new inputs and outputs
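Telegraf is configured with a single TOML file. A minimal sketch (the plugin names are real, the addresses and database name are illustrative):

```toml
# /etc/telegraf/telegraf.conf (minimal sketch)

# Host metrics
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]

# Push everything to InfluxDB
[[outputs.influxdb]]
  urls = ["http://192.168.1.100:8086"]
  database = "telegraf"
```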

InfluxDB

Scalable time series database for metrics, events and real-time analytics.

Grafana

  • An open source analytics and interactive visualization tool
  • Provides charts, graphs, and alerts
  • Can be connected to multiple supported data sources
  • Expandable through a plug-in system

Grafana

End users can create complex monitoring dashboards using interactive query builders.

Application Metrics

Application Metrics

  • Technical metrics (e.g. outgoing provider requests/s, avg. response time)
  • Business metrics (bookings per hour, error rate, offer count by provider)

Telegraf statsd Plugin

  • Implements the StatsD protocol
  • Listens on a UDP port on localhost
  • The PHP application writes metrics to that UDP port
  • Nearly zero overhead (fire-and-forget UDP)
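The wire protocol is simple enough to exercise from a shell. A sketch, assuming Telegraf's statsd input listens on the default UDP port 8125; metric names are illustrative, and the `|#key:value` tag suffix is the DogStatsD-style syntax that Telegraf understands when its datadog extensions are enabled:

```shell
# A StatsD line is <metric>:<value>|<type>, optionally followed by tags
counter='google_api_request_count:1|c|#type:geocode'
timing='mvc_request_cpu_time:42|ms'
echo "$counter"
echo "$timing"

# Fire-and-forget over UDP; the send succeeds even if nothing is listening.
echo "$counter" | nc -u -w0 127.0.0.1 8125 || true
echo "$timing"  | nc -u -w0 127.0.0.1 8125 || true
```

Because there is no connection and no reply, a misconfigured or absent collector never slows the application down.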

Application Metrics in Code


interface MetricServiceInterface
{
    public function increment(string $metric, int $delta = 1, array $tags = []): void;

    public function decrement(string $metric, int $delta = 1, array $tags = []): void;

    public function timing(string $metric, int $timeInMs, array $tags = []): void;

    public function gauge(string $metric, int $value, array $tags = []): void;
}
                    

Application Metrics


public function geocodeRequest(array $params): Response
{
    $googleResponse = $this->fetchUrl(self::GOOGLE_GEOCODE_URI, $params);
    $this->metricService->increment(MetricName::GOOGLE_API_REQUEST_COUNT, 1, ['type' => 'geocode']);

    return $googleResponse;
}
            

Measure CPU-Time

Gist

$rUsage = getrusage();
$startUTime = $rUsage['ru_utime.tv_usec'] + $rUsage['ru_utime.tv_sec'] * 1000000;
$startSTime = $rUsage['ru_stime.tv_usec'] + $rUsage['ru_stime.tv_sec'] * 1000000;

// execute your code

$rUsage = getrusage();
$user   = ($rUsage['ru_utime.tv_usec'] + $rUsage['ru_utime.tv_sec'] * 1000000 - $startUTime) / 1000;
$system = ($rUsage['ru_stime.tv_usec'] + $rUsage['ru_stime.tv_sec'] * 1000000 - $startSTime) / 1000;

$cpuTimeMilliseconds = $user + $system;

$this->metricService->timing(
    MetricName::MVC_REQUEST_CPU_TIME,
    (int) $cpuTimeMilliseconds,
    $tags
);
            

Live Demo

Docker Stack for Telegraf StatsD InfluxDB Grafana

https://github.com/bcremer/docker-telegraf-influx-grafana-stack

Grafana Annotations


#!/usr/bin/env bash

currentTime=$(date +%s%N | cut -b1-13)
tag="${bamboo.deploy.release}" # Bamboo variable, substituted at deploy time

# POST to Grafana
curl -X POST \
  https://grafana/api/annotations \
  -H 'Authorization: Bearer deadbeefapitoken' \
  -H 'Content-Type: application/json' \
  -d "{
  \"time\":${currentTime},
  \"tags\":[\"Prod-Release\", \"HOAPI-Prod-Release\", \"HOAPI-${tag}\"],
  \"text\":\"HOAPI ${tag} was released\"
}"
            

Database Monitoring

PMM - Percona Monitoring and Management

  • Open source platform for managing and monitoring MySQL performance
  • Developed by Percona
  • Based on Grafana and Prometheus (VictoriaMetrics since version 2.12)
  • Runs an agent on the DB server
  • Reads the slow log and the performance schema

PMM query analytics

Which query has more impact?

Query A: 120ms latency
Query B: 800ms latency
            

It Depends

Query rate in queries per second (QPS); Load = latency × QPS
Query A: 120ms latency @ 20 QPS   = Load 2.400
Query B: 800ms latency @ 0.5 QPS  = Load 0.400
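Load here is latency multiplied by query rate, i.e. the average number of executions of that query in flight at any instant. The arithmetic, sketched in awk:

```shell
# Load = latency (seconds) * rate (queries/second)
awk 'BEGIN {
  printf "Query A load: %.3f\n", 0.120 * 20    # 120 ms at 20 QPS
  printf "Query B load: %.3f\n", 0.800 * 0.5   # 800 ms at 0.5 QPS
}'
```

So the "fast" query can dominate total server load; this is why PMM ranks queries by aggregate impact rather than by per-query latency.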
            

Slow Query Log sampling

Gather a representative sample of queries


[mysqld]
slow_query_log                    = ON
slow_query_log_file               = /var/log/mysql/mysql-slow.log

# log all queries regardless of runtime
long_query_time                   = 0

# randomly log every 1/log_slow_rate_limit query
# 100 = one percent of queries will be logged
log_slow_rate_limit               = 100
                    

[mysqld]
# See: https://www.percona.com/doc/percona-monitoring-and-management/conf-mysql.html#pmm-conf-mysql-slow-log-settings
slow_query_log                    = ON
slow_query_log_file               = /var/log/mysql/mysql-slow.log
log-queries-not-using-indexes
long_query_time                   = 0
log_slow_rate_limit               = 100
log_slow_rate_type                = query
log_slow_verbosity                = full
log_slow_admin_statements         = ON
log_slow_slave_statements         = ON
slow_query_log_always_write_time  = 1
slow_query_log_use_global_control = all
innodb_monitor_enable=all
userstat=1
                    

PHP application performance and profiling

Tideways

Tideways

  • Originally based on xhprof
  • PHP Extension
  • Aggregation daemon on localhost (UDP)

N+1 Queries

Alerts

Alerts

  • Grafana thresholds and webhooks
  • Own tooling (business hours, flexible thresholds)
  • On-call and escalation management using PagerDuty, PagerTree, etc.

Recap

Alternatives

  • Commercial: Splunk, Datadog
  • Open source: TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor)
  • Alternative Datasource: Prometheus
  • Overlapping: Kibana has dashboards and metrics too

Thank you

Please leave me feedback.

@benjamincremer

Q and A