# Logging & Metrics Stack

The MDH Media homelab uses Grafana Loki for centralized log aggregation, Prometheus for metrics storage, and Grafana Alloy as the unified collection agent for both logs and metrics.
## Architecture

```mermaid
flowchart TD
    subgraph Hosts[Monitored Hosts]
        A1[VM/LXC 1<br/>Alloy]
        A2[VM/LXC 2<br/>Alloy]
        A3[Caddy<br/>Alloy]
    end
    subgraph Grafana[Grafana VM - 192.168.1.230]
        Loki[Loki<br/>:3100]
        Prom[Prometheus<br/>:9090]
        Graf[Grafana<br/>Dashboards]
    end
    A1 -->|logs| Loki
    A2 -->|logs| Loki
    A3 -->|logs| Loki
    A1 -->|metrics| Prom
    A2 -->|metrics| Prom
    A3 -->|metrics| Prom
    Loki --> Graf
    Prom --> Graf
```
## Services

### Grafana Loki

- Host: Grafana VM (VMID 118, pve2)
- IP: 192.168.1.230
- Port: 3100
- Role: Log aggregation and storage
- Config: `/etc/loki/loki.yaml`
- Data: `/var/lib/loki`
### Prometheus

- Host: Grafana VM (VMID 118, pve2)
- IP: 192.168.1.230
- Port: 9090
- Role: Metrics storage (via remote_write API)
- Config: `/etc/prometheus/prometheus.yml`
- Data: `/var/lib/prometheus`
### Grafana Alloy

- Role: Unified log and metrics collection agent
- Config: `/etc/alloy/config.alloy`
- Service: `systemctl status alloy`
- UI: `http://localhost:12345` (when enabled)

Collects:

- System metrics: CPU, memory, disk, filesystem, load, network
- Systemd journal logs
- Syslog and auth logs
- Application-specific logs (Caddy, Docker, etc.)
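Whatever a host collects, every Alloy pipeline ultimately forwards into a pair of write components pointed at the Grafana VM. A minimal sketch of those two components in `/etc/alloy/config.alloy`, using the addresses above (the `"default"` labels match the component references used elsewhere on this page):

```alloy
// Ship logs to Loki on the Grafana VM
loki.write "default" {
  endpoint {
    url = "http://192.168.1.230:3100/loki/api/v1/push"
  }
}

// Ship metrics to Prometheus via its remote_write receiver
prometheus.remote_write "default" {
  endpoint {
    url = "http://192.168.1.230:9090/api/v1/write"
  }
}
```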
## Installation

### Loki (Grafana VM)

Run on the Grafana VM (VMID 118) to set up the log aggregation server:

```bash
curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-loki.sh
chmod +x install-loki.sh
sudo ./install-loki.sh

# Or specify retention period in days
sudo ./install-loki.sh 60  # 60 days retention
```
The script will:

- Install Loki from the official Grafana repositories
- Create an optimized configuration at `/etc/loki/loki.yaml`
- Set up data directories at `/var/lib/loki`
- Enable automatic log retention/compaction
- Start and enable the systemd service

After installation, add Loki as a data source in Grafana:

- Go to Connections → Data Sources → Add data source
- Select Loki
- Set URL to `http://localhost:3100`
- Click Save & Test
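Alternatively, data sources can be provisioned from a file instead of through the UI. A sketch assuming a standard Grafana package layout (the filename is arbitrary):

```yaml
# /etc/grafana/provisioning/datasources/loki.yaml (hypothetical filename)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
```

Grafana reads provisioning files at startup, so restart the service after adding one.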
### Prometheus (Grafana VM)

Run on the Grafana VM (VMID 118) to set up the metrics storage server:

```bash
curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-prometheus.sh
chmod +x install-prometheus.sh
sudo ./install-prometheus.sh

# Or specify retention period in days
sudo ./install-prometheus.sh 60  # 60 days retention
```
The script will:

- Install Prometheus from the official repositories
- Configure storage at `/var/lib/prometheus`
- Enable the remote_write receiver for Alloy agents
- Set up automatic data retention
- Start and enable the systemd service

After installation, add Prometheus as a data source in Grafana:

- Go to Connections → Data Sources → Add data source
- Select Prometheus
- Set URL to `http://localhost:9090`
- Click Save & Test
### Alloy (Generic Host)

Use on any VM/LXC to collect system logs and metrics:

```bash
curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-alloy.sh
chmod +x install-alloy.sh
sudo ./install-alloy.sh

# Or specify a custom Grafana VM IP
sudo ./install-alloy.sh 192.168.1.230
```
Features:
- Collects system metrics (CPU, memory, disk, filesystem, load, network)
- Forwards metrics to Prometheus via remote_write
- Collects systemd journal, syslog, and auth logs
- Forwards logs to Loki
### Alloy (Caddy Reverse Proxy)

Specialized script for the Caddy VM (CT 112, pve1) with JSON access log parsing:

```bash
curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-alloy-caddy.sh
chmod +x install-alloy-caddy.sh
sudo ./install-alloy-caddy.sh

# Or specify a custom Grafana VM IP
sudo ./install-alloy-caddy.sh 192.168.1.230
```
Features:

- Collects system metrics (CPU, memory, disk, filesystem, load, network)
- Parses Caddy JSON access logs at `/var/lib/caddy/logs/access.log`
- Extracts labels for filtering: `request_host`, `request_method`, `status`
- Collects Caddy systemd service logs
- Creates human-readable log lines from JSON
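The label extraction presumably boils down to a `loki.process` pipeline along these lines (component names and JSON field paths here are illustrative, not necessarily what the script generates):

```alloy
loki.process "caddy_access" {
  forward_to = [loki.write.default.receiver]

  // Pull fields out of Caddy's JSON access log entries
  stage.json {
    expressions = {
      request_host   = "request.host",
      request_method = "request.method",
      status         = "status",
    }
  }

  // Promote the extracted fields to Loki labels for filtering
  stage.labels {
    values = {
      request_host   = "",
      request_method = "",
      status         = "",
    }
  }
}
```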
### Manual Installation

```bash
# Add Grafana repository (Debian/Ubuntu - DEB822 format)
sudo mkdir -p /etc/apt/keyrings
sudo wget -q -O /etc/apt/keyrings/grafana.asc https://apt.grafana.com/gpg-full.key
sudo chmod 644 /etc/apt/keyrings/grafana.asc

cat << 'EOF' | sudo tee /etc/apt/sources.list.d/grafana.sources
Types: deb
URIs: https://apt.grafana.com
Suites: stable
Components: main
Signed-By: /etc/apt/keyrings/grafana.asc
EOF

# Install
sudo apt-get update
sudo apt-get install alloy

# Configure /etc/alloy/config.alloy, then start
sudo systemctl enable --now alloy
```
## Configuration

### Default Log Sources

| Source | Type | Labels |
|---|---|---|
| Systemd journal | All services | `job=systemd-journal`, `unit=<service>` |
| `/var/log/syslog` | System logs | `job=syslog` |
| `/var/log/auth.log` | Auth logs | `job=auth` |
| `/var/log/messages` | System messages | `job=messages` |
### Caddy Log Labels

| Label | Description | Example |
|---|---|---|
| `request_host` | Destination service | `grafana.mdhmedia.uk` |
| `request_method` | HTTP method | `GET`, `POST` |
| `status` | HTTP status code | `200`, `404`, `502` |
### Adding Custom Log Sources

Edit `/etc/alloy/config.alloy`:

```alloy
local.file_match "myapp" {
  path_targets = [
    {
      __path__ = "/var/log/myapp/*.log",
      job      = "myapp",
    },
  ]
}

loki.source.file "myapp" {
  targets       = local.file_match.myapp.targets
  forward_to    = [loki.write.default.receiver]
  tail_from_end = true
}
```
Apply the change by restarting the service: `sudo systemctl restart alloy`.
### Node Metrics Configuration

The scripts configure system metrics collection using `prometheus.exporter.unix`:

```alloy
prometheus.exporter.unix "node" {
  set_collectors = ["cpu", "meminfo", "diskstats", "filesystem", "loadavg", "netdev", "uname"]

  filesystem {
    fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
  }
}

prometheus.scrape "node" {
  targets         = prometheus.exporter.unix.node.targets
  forward_to      = [prometheus.remote_write.default.receiver]
  scrape_interval = "15s"
  job_name        = "node"
}
```
## Example Configurations

### Docker Container Logs

```alloy
local.file_match "docker" {
  path_targets = [
    {
      __path__         = "/var/lib/docker/containers/*/*.log",
      __path_exclude__ = "/var/lib/docker/containers/*/*-json.log.*.gz",
      job              = "docker",
    },
  ]
}

loki.source.file "docker" {
  targets       = local.file_match.docker.targets
  forward_to    = [loki.write.default.receiver]
  tail_from_end = true
}
```
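Docker's `json-file` driver wraps each line in a JSON envelope (`log`, `stream`, `time`). If you want the bare message stored in Loki rather than the envelope, you could route the source through a `loki.process` stage first; a sketch (component name is illustrative):

```alloy
loki.process "docker" {
  forward_to = [loki.write.default.receiver]

  // Unwrap Docker's json-file format into the bare log line,
  // with the stream (stdout/stderr) exposed as a label
  stage.docker {}
}
```

With this in place, the `forward_to` in the `loki.source.file "docker"` block above would point at `[loki.process.docker.receiver]` instead of the writer.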
### Nginx Logs

```alloy
local.file_match "nginx" {
  path_targets = [
    {__path__ = "/var/log/nginx/access.log", job = "nginx", type = "access"},
    {__path__ = "/var/log/nginx/error.log",  job = "nginx", type = "error"},
  ]
}

loki.source.file "nginx" {
  targets       = local.file_match.nginx.targets
  forward_to    = [loki.write.default.receiver]
  tail_from_end = true
}
```
## Querying Logs

### LogQL Examples

```logql
# All logs from a specific host
{host="docker1"}

# Systemd journal logs for a specific unit
{job="systemd-journal", unit="nginx.service"}

# Search for errors across all hosts
{env="homelab"} |= "error"

# Auth failures
{job="auth"} |= "Failed password"

# Top error-producing hosts (last hour)
sum by (host) (count_over_time({env="homelab"} |= "error" [1h]))

# SSH login attempts
{job="auth"} |~ "sshd.*Accepted|sshd.*Failed"

# Systemd service failures
{job="systemd-journal"} |= "Failed to start"

# Caddy requests by service
{job="caddy", request_host="grafana.mdhmedia.uk"}

# Caddy 5xx errors
{job="caddy", status=~"5.."}
```
## Querying Metrics

### PromQL Examples

```promql
# CPU usage by host (percentage)
100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage by host (percentage)
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk usage by host and mountpoint (percentage)
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

# Network traffic (bytes/sec)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# System load average (1 minute)
node_load1

# Top 5 hosts by CPU usage
topk(5, 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))

# Hosts with disk usage > 80% (i.e. less than 20% available)
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20

# All metrics from a specific host
{host="caddy"}
```
## Dashboards

### Homelab Logs

- File: `grafana/dashboards/loki-homelab-logs.json`

| Section | Panels |
|---|---|
| Overview | Total logs, active hosts, errors, warnings, auth failures, successful logins |
| Volume | Log volume by host, log volume by job |
| Errors | Error/warning trends, errors by host, errors by job |
| Security | SSH login attempts (success/fail), failed logins by host |
| Systemd | Logs by systemd unit |
| Browser | Full log browser with search, error-only log view |

Variables: `host`, `job`, `search`
### Caddy Access Logs

- File: `grafana/dashboards/caddy-access-logs.json`

| Section | Panels |
|---|---|
| Traffic Overview | Total requests, 2xx/3xx/4xx/5xx counts, requests/sec |
| Status Codes | Requests by status code over time (stacked) |
| Services | Requests by service/host, top services bar chart |
| Distribution | Request methods pie chart, status code donut chart |
| Error Analysis | Error rate trends, errors by service, error log viewer |
| Caddy Service | Caddy systemd service errors/warnings |
| Access Browser | Full access log browser with filters |

Variables: `host` (service), `status`, `search`
### Homelab Overview

- File: `grafana/dashboards/homelab-overview.json`

| Section | Panels |
|---|---|
| Infrastructure Health | Total hosts, hosts online, errors (24h), warnings (24h) |
| Host Status | Host status table (up/down, last seen) |
| Systemd Services | Service status, recent failures |
| Security | Auth events, failed logins |
| Caddy | Requests, errors, traffic by service |
| System Metrics | CPU, memory, disk, network (Prometheus) |

Variables: `host`, `job`, `timeRange`
### Importing Dashboards
- In Grafana, go to Dashboards → Import
- Click Upload JSON file
- Select the dashboard JSON file
- Select your data sources (Loki for logs, Prometheus for metrics)
- Click Import
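For a repeatable setup, the same dashboards can be file-provisioned instead of imported by hand. A sketch assuming the JSON files are copied onto the Grafana VM (filename and dashboard directory are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/homelab.yaml (hypothetical filename)
apiVersion: 1
providers:
  - name: homelab
    folder: Homelab
    type: file
    options:
      path: /var/lib/grafana/dashboards
```

Grafana picks up provisioned dashboards on restart and keeps them in sync with the files on disk.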
## Troubleshooting

### Alloy Not Starting

```bash
sudo systemctl status alloy
sudo journalctl -u alloy -f
alloy fmt /etc/alloy/config.alloy  # Check syntax and format
```

Common config errors:

- Unrecognized attribute: options like `fs_types_exclude` must sit inside their nested block (e.g. `filesystem { }`)
- Missing quotes: string values must be quoted
- Invalid regex: check escape characters in patterns
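The first of those errors is easy to hit with `prometheus.exporter.unix`; a before/after illustration:

```alloy
// Wrong - fs_types_exclude at the top level triggers
// an "unrecognized attribute" error:
//
//   prometheus.exporter.unix "node" {
//     fs_types_exclude = "^(tmpfs)$"
//   }

// Right - it belongs inside the nested filesystem block:
prometheus.exporter.unix "node" {
  filesystem {
    fs_types_exclude = "^(tmpfs)$"
  }
}
```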
### No Logs Appearing in Loki

- Check Alloy is running: `systemctl status alloy`
- Test Loki connectivity: `curl http://192.168.1.230:3100/ready`
- Check firewall: ensure port 3100 is open on the Grafana VM
- Verify permissions: the `alloy` user must be able to read the log files (on Debian-based systems, membership in the `adm` group covers most of `/var/log`)
### High Memory Usage (Alloy)

Reduce the batch size in `/etc/alloy/config.alloy`:

```alloy
loki.write "default" {
  endpoint {
    url        = "http://192.168.1.230:3100/loki/api/v1/push"
    batch_size = "512KiB"
    batch_wait = "2s"
  }
}
```
### Loki Not Starting

```bash
sudo systemctl status loki
sudo journalctl -u loki -f
loki -config.file=/etc/loki/loki.yaml -verify-config
```
### Loki High Disk Usage

Adjust retention in `/etc/loki/loki.yaml`:

```yaml
limits_config:
  retention_period: 720h # 30 days

compactor:
  retention_enabled: true
  retention_delete_delay: 2h
```
### Loki Out of Memory

Reduce the ingester settings in `/etc/loki/loki.yaml`:
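The right values depend on log volume; the knobs that matter are the chunk lifecycle settings. An illustrative starting point (values are assumptions, not tuned for this setup):

```yaml
ingester:
  chunk_idle_period: 30m     # flush chunks that stop receiving data sooner
  max_chunk_age: 1h          # cap how long a chunk can sit in memory
  chunk_target_size: 1048576 # aim for ~1 MiB compressed chunks
```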
### Prometheus Not Starting

```bash
sudo systemctl status prometheus
sudo journalctl -u prometheus -f
promtool check config /etc/prometheus/prometheus.yml
```
### No Metrics in Prometheus

- Check Alloy is running: `systemctl status alloy`
- Test Prometheus connectivity: `curl http://192.168.1.230:9090/-/ready`
- Check firewall: ensure port 9090 is open
- Verify remote_write is enabled in Prometheus (check for the `--web.enable-remote-write-receiver` flag)
- Check the Alloy config has the correct Prometheus URL
### Prometheus High Disk Usage

Adjust retention via a systemd override at `/etc/systemd/system/prometheus.service.d/override.conf`, then reload systemd and restart Prometheus.
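A sketch of such an override, assuming the Debian package's binary path and the flags implied by the install script (adjust the paths and retention to your setup):

```ini
# /etc/systemd/system/prometheus.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.enable-remote-write-receiver
```

Apply with `sudo systemctl daemon-reload && sudo systemctl restart prometheus`.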