Logging & Metrics Stack

The MDH Media homelab uses Grafana Loki for centralized log aggregation, Prometheus for metrics storage, and Grafana Alloy as the unified collection agent for both logs and metrics.

Architecture

flowchart TD
    subgraph Hosts[Monitored Hosts]
        A1[VM/LXC 1<br/>Alloy]
        A2[VM/LXC 2<br/>Alloy]
        A3[Caddy<br/>Alloy]
    end

    subgraph Grafana[Grafana VM - 192.168.1.230]
        Loki[Loki<br/>:3100]
        Prom[Prometheus<br/>:9090]
        Graf[Grafana<br/>Dashboards]
    end

    A1 -->|logs| Loki
    A2 -->|logs| Loki
    A3 -->|logs| Loki
    A1 -->|metrics| Prom
    A2 -->|metrics| Prom
    A3 -->|metrics| Prom
    Loki --> Graf
    Prom --> Graf

Services

Grafana Loki

  • Host: Grafana VM (VMID 118, pve2)
  • IP: 192.168.1.230
  • Port: 3100
  • Role: Log aggregation and storage
  • Config: /etc/loki/loki.yaml
  • Data: /var/lib/loki

Prometheus

  • Host: Grafana VM (VMID 118, pve2)
  • IP: 192.168.1.230
  • Port: 9090
  • Role: Metrics storage (via remote_write API)
  • Config: /etc/prometheus/prometheus.yml
  • Data: /var/lib/prometheus

Grafana Alloy

  • Role: Unified log and metrics collection agent
  • Config: /etc/alloy/config.alloy
  • Service: systemctl status alloy
  • UI: http://localhost:12345 (when enabled)
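The debug UI listens on localhost only by default. A sketch for exposing it, assuming the Debian package's /etc/default/alloy environment file (the CUSTOM_ARGS variable and flag name are the packaged mechanism in current Alloy releases, but verify against your installed version):

```
# /etc/default/alloy — expose the Alloy debug UI on all interfaces
CUSTOM_ARGS="--server.http.listen-addr=0.0.0.0:12345"
```

Restart with sudo systemctl restart alloy for the change to take effect.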

Collects:

  • System metrics: CPU, memory, disk, filesystem, load, network
  • Systemd journal logs
  • Syslog and auth logs
  • Application-specific logs (Caddy, Docker, etc.)

Installation

Loki (Grafana VM)

Run on the Grafana VM (VMID 118) to set up the log aggregation server:

curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-loki.sh
chmod +x install-loki.sh
sudo ./install-loki.sh

# Or specify retention period in days
sudo ./install-loki.sh 60  # 60 days retention

The script will:

  1. Install Loki from official Grafana repositories
  2. Create optimized configuration at /etc/loki/loki.yaml
  3. Set up data directories at /var/lib/loki
  4. Enable automatic log retention/compaction
  5. Start and enable the systemd service

After installation, add Loki as a data source in Grafana:

  1. Go to Connections → Data Sources → Add data source
  2. Select Loki
  3. Set URL to http://localhost:3100
  4. Click Save & Test

Prometheus (Grafana VM)

Run on the Grafana VM (VMID 118) to set up the metrics storage server:

curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-prometheus.sh
chmod +x install-prometheus.sh
sudo ./install-prometheus.sh

# Or specify retention period in days
sudo ./install-prometheus.sh 60  # 60 days retention

The script will:

  1. Install Prometheus from official repositories
  2. Configure storage at /var/lib/prometheus
  3. Enable the remote_write receiver for Alloy agents
  4. Set up automatic data retention
  5. Start and enable the systemd service

After installation, add Prometheus as a data source in Grafana:

  1. Go to Connections → Data Sources → Add data source
  2. Select Prometheus
  3. Set URL to http://localhost:9090
  4. Click Save & Test
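Instead of adding data sources through the UI (the steps above for both Loki and Prometheus), Grafana can also provision them from a file loaded at startup. A sketch, assuming Grafana's standard provisioning directory (the filename is illustrative):

```
# /etc/grafana/provisioning/datasources/homelab.yaml (hypothetical filename)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```

Grafana reads provisioning files on startup, so restart the grafana-server service after adding the file.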

Alloy (Generic Host)

Use for any VM/LXC to collect system logs and metrics:

curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-alloy.sh
chmod +x install-alloy.sh
sudo ./install-alloy.sh

# Or specify custom Grafana VM IP
sudo ./install-alloy.sh 192.168.1.230

Features:

  • Collects system metrics (CPU, memory, disk, filesystem, load, network)
  • Forwards metrics to Prometheus via remote_write
  • Collects systemd journal, syslog, and auth logs
  • Forwards logs to Loki

Alloy (Caddy Reverse Proxy)

Specialized script for the Caddy container (CT 112, pve1) with JSON access log parsing:

curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-alloy-caddy.sh
chmod +x install-alloy-caddy.sh
sudo ./install-alloy-caddy.sh

# Or specify custom Grafana VM IP
sudo ./install-alloy-caddy.sh 192.168.1.230

Features:

  • Collects system metrics (CPU, memory, disk, filesystem, load, network)
  • Parses Caddy JSON access logs at /var/lib/caddy/logs/access.log
  • Extracts labels for filtering: request_host, request_method, status
  • Collects Caddy systemd service logs
  • Creates human-readable log lines from JSON
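The label extraction described above can be sketched as an Alloy processing pipeline. The JSON field paths assume Caddy's default structured access-log schema; this is illustrative, not the script's exact config:

```
loki.process "caddy_access" {
  forward_to = [loki.write.default.receiver]

  // Pull fields out of each JSON log line (JMESPath expressions)
  stage.json {
    expressions = {
      status         = "status",
      request_host   = "request.host",
      request_method = "request.method",
    }
  }

  // Promote the extracted fields to Loki labels for filtering
  stage.labels {
    values = {
      status         = "",
      request_host   = "",
      request_method = "",
    }
  }
}
```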

Manual Installation

# Add Grafana repository (Debian/Ubuntu - DEB822 format)
sudo mkdir -p /etc/apt/keyrings
sudo wget -q -O /etc/apt/keyrings/grafana.asc https://apt.grafana.com/gpg-full.key
sudo chmod 644 /etc/apt/keyrings/grafana.asc

cat << 'EOF' | sudo tee /etc/apt/sources.list.d/grafana.sources
Types: deb
URIs: https://apt.grafana.com
Suites: stable
Components: main
Signed-By: /etc/apt/keyrings/grafana.asc
EOF

# Install
sudo apt-get update
sudo apt-get install alloy

# Edit /etc/alloy/config.alloy, then enable and start the service
sudo systemctl enable --now alloy

Configuration

Default Log Sources

| Source          | Type            | Labels                                |
|-----------------|-----------------|---------------------------------------|
| Systemd journal | All services    | job=systemd-journal, unit=<service>   |
| /var/log/syslog | System logs     | job=syslog                            |
| /var/log/auth.log | Auth logs     | job=auth                              |
| /var/log/messages | System messages | job=messages                        |

Caddy Log Labels

| Label          | Description         | Example              |
|----------------|---------------------|----------------------|
| request_host   | Destination service | grafana.mdhmedia.uk  |
| request_method | HTTP method         | GET, POST            |
| status         | HTTP status code    | 200, 404, 502        |

Adding Custom Log Sources

Edit /etc/alloy/config.alloy:

local.file_match "myapp" {
  path_targets = [
    {
      __path__ = "/var/log/myapp/*.log",
      job      = "myapp",
    },
  ]
}

loki.source.file "myapp" {
  targets    = local.file_match.myapp.targets
  forward_to = [loki.write.default.receiver]
  tail_from_end = true
}

Reload configuration:

sudo systemctl reload alloy

Node Metrics Configuration

The scripts configure system metrics collection using prometheus.exporter.unix:

prometheus.exporter.unix "node" {
  set_collectors = ["cpu", "meminfo", "diskstats", "filesystem", "loadavg", "netdev", "uname"]
  filesystem {
    fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
  }
}

prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.remote_write.default.receiver]
  scrape_interval = "15s"
  job_name   = "node"
}
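The forward_to targets in these snippets (loki.write.default and prometheus.remote_write.default) are defined elsewhere in the generated config. A minimal sketch of those write components, using the Grafana VM address documented on this page:

```
prometheus.remote_write "default" {
  endpoint {
    // Prometheus remote_write receiver (requires the receiver flag, see Troubleshooting)
    url = "http://192.168.1.230:9090/api/v1/write"
  }
}

loki.write "default" {
  endpoint {
    // Loki push API
    url = "http://192.168.1.230:3100/loki/api/v1/push"
  }
}
```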

Example Configurations

Docker Container Logs

local.file_match "docker" {
  path_targets = [
    {
      __path__         = "/var/lib/docker/containers/*/*.log",
      job              = "docker",
      __path_exclude__ = "/var/lib/docker/containers/*/*-json.log.*.gz",
    },
  ]
}

loki.source.file "docker" {
  targets    = local.file_match.docker.targets
  forward_to = [loki.write.default.receiver]
  tail_from_end = true
}

Nginx Logs

local.file_match "nginx" {
  path_targets = [
    {__path__ = "/var/log/nginx/access.log", job = "nginx", type = "access"},
    {__path__ = "/var/log/nginx/error.log", job = "nginx", type = "error"},
  ]
}

loki.source.file "nginx" {
  targets    = local.file_match.nginx.targets
  forward_to = [loki.write.default.receiver]
  tail_from_end = true
}

Querying Logs

LogQL Examples

# All logs from a specific host
{host="docker1"}

# Systemd journal logs for a specific unit
{job="systemd-journal", unit="nginx.service"}

# Search for errors across all hosts
{env="homelab"} |= "error"

# Auth failures
{job="auth"} |= "Failed password"

# Top error-producing hosts (last hour)
sum by (host) (count_over_time({env="homelab"} |= "error" [1h]))

# SSH login attempts
{job="auth"} |~ "sshd.*Accepted|sshd.*Failed"

# Systemd service failures
{job="systemd-journal"} |= "Failed to start"

# Caddy requests by service
{job="caddy", request_host="grafana.mdhmedia.uk"}

# Caddy 5xx errors
{job="caddy", status=~"5.."}
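Because Caddy access logs are structured JSON, fields that were not promoted to labels can still be parsed at query time. For example, a per-service latency percentile (assuming Caddy's duration field, which reports request duration in seconds):

```
# 95th percentile request duration by service (seconds)
quantile_over_time(0.95,
  {job="caddy"} | json | unwrap duration [5m]
) by (request_host)
```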

Querying Metrics

PromQL Examples

# CPU usage by host (percentage)
100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage by host (percentage)
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk usage by host and mountpoint (percentage)
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

# Network traffic (bytes/sec)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# System load average (1 minute)
node_load1

# Top 5 hosts by CPU usage
topk(5, 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))

# Hosts with disk usage > 80%
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20

# All metrics from a specific host
{host="caddy"}

Dashboards

Homelab Logs

  • File: grafana/dashboards/loki-homelab-logs.json

| Section  | Panels                                                                  |
|----------|-------------------------------------------------------------------------|
| Overview | Total logs, active hosts, errors, warnings, auth failures, successful logins |
| Volume   | Log volume by host, log volume by job                                   |
| Errors   | Error/warning trends, errors by host, errors by job                     |
| Security | SSH login attempts (success/fail), failed logins by host                |
| Systemd  | Logs by systemd unit                                                    |
| Browser  | Full log browser with search, error-only log view                       |

Variables: host, job, search

Caddy Access Logs

  • File: grafana/dashboards/caddy-access-logs.json

| Section          | Panels                                                       |
|------------------|--------------------------------------------------------------|
| Traffic Overview | Total requests, 2xx/3xx/4xx/5xx counts, requests/sec         |
| Status Codes     | Requests by status code over time (stacked)                  |
| Services         | Requests by service/host, top services bar chart             |
| Distribution     | Request methods pie chart, status code donut chart           |
| Error Analysis   | Error rate trends, errors by service, error log viewer       |
| Caddy Service    | Caddy systemd service errors/warnings                        |
| Access Browser   | Full access log browser with filters                         |

Variables: host (service), status, search

Homelab Overview

  • File: grafana/dashboards/homelab-overview.json

| Section               | Panels                                                  |
|-----------------------|---------------------------------------------------------|
| Infrastructure Health | Total hosts, hosts online, errors (24h), warnings (24h) |
| Host Status           | Host status table (up/down, last seen)                  |
| Systemd Services      | Service status, recent failures                         |
| Security              | Auth events, failed logins                              |
| Caddy                 | Requests, errors, traffic by service                    |
| System Metrics        | CPU, memory, disk, network (Prometheus)                 |

Variables: host, job, timeRange

Importing Dashboards

  1. In Grafana, go to Dashboards → Import
  2. Click Upload JSON file
  3. Select the dashboard JSON file
  4. Select your data sources (Loki for logs, Prometheus for metrics)
  5. Click Import

Troubleshooting

Alloy Not Starting

sudo systemctl status alloy
sudo journalctl -u alloy -f
alloy fmt /etc/alloy/config.alloy  # Check syntax and format

Common config errors:

  • Unrecognized attribute: options like fs_types_exclude must go inside their nested block (e.g., filesystem { })
  • Missing quotes: string values must be quoted
  • Invalid regex: check escape characters in patterns

No Logs Appearing in Loki

  1. Check Alloy is running: systemctl status alloy
  2. Test Loki connectivity: curl http://192.168.1.230:3100/ready
  3. Check firewall: Ensure port 3100 is open
  4. Verify permissions:
    sudo usermod -aG systemd-journal alloy
    sudo usermod -aG adm alloy
    sudo systemctl restart alloy
    

High Memory Usage (Alloy)

Reduce batch size in /etc/alloy/config.alloy:

loki.write "default" {
  endpoint {
    url = "http://192.168.1.230:3100/loki/api/v1/push"
    batch_size = "512KiB"
    batch_wait = "2s"
  }
}

Loki Not Starting

sudo systemctl status loki
sudo journalctl -u loki -f
loki -config.file=/etc/loki/loki.yaml -verify-config

Loki High Disk Usage

Adjust retention in /etc/loki/loki.yaml:

limits_config:
  retention_period: 720h  # 30 days

compactor:
  retention_enabled: true
  retention_delete_delay: 2h
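Loki durations here are expressed in hours, so a day-based retention policy has to be converted first; a quick sketch:

```shell
# Convert a retention policy in days to the hour value loki.yaml expects
days=30
echo "retention_period: $((days * 24))h"
```

Set retention_period to the printed value, then restart Loki.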

Loki Out of Memory

Reduce ingester settings in /etc/loki/loki.yaml:

ingester:
  chunk_idle_period: 30m
  max_chunk_age: 1h
  chunk_target_size: 1048576

Prometheus Not Starting

sudo systemctl status prometheus
sudo journalctl -u prometheus -f
promtool check config /etc/prometheus/prometheus.yml

No Metrics in Prometheus

  1. Check Alloy is running: systemctl status alloy
  2. Test Prometheus connectivity: curl http://192.168.1.230:9090/-/ready
  3. Check firewall: Ensure port 9090 is open
  4. Verify remote_write is enabled in Prometheus (check for --web.enable-remote-write-receiver flag)
  5. Check Alloy config has correct Prometheus URL

Prometheus High Disk Usage

Adjust retention in the systemd override at /etc/systemd/system/prometheus.service.d/override.conf by changing the retention flag on the service's ExecStart line:

--storage.tsdb.retention.time=15d

Then reload:

sudo systemctl daemon-reload
sudo systemctl restart prometheus
