# Logging & Metrics Stack

The MDH Media homelab uses Grafana Loki for centralized log aggregation, Prometheus for metrics storage, and Grafana Alloy as the unified collection agent for both logs and metrics.
## Architecture

```mermaid
flowchart TD
    subgraph Hosts[Monitored Hosts]
        A1[VM/LXC 1<br/>Alloy]
        A2[VM/LXC 2<br/>Alloy]
        A3[Caddy<br/>Alloy]
    end
    subgraph Grafana[Grafana VM - 192.168.1.230]
        Loki[Loki<br/>:3100]
        Prom[Prometheus<br/>:9090]
        Graf[Grafana<br/>Dashboards]
    end
    A1 -->|logs| Loki
    A2 -->|logs| Loki
    A3 -->|logs| Loki
    A1 -->|metrics| Prom
    A2 -->|metrics| Prom
    A3 -->|metrics| Prom
    Loki --> Graf
    Prom --> Graf
```
## Services

### Grafana Loki

- Host: Grafana VM (VMID 118, pve2)
- IP: 192.168.1.230
- Port: 3100
- Role: Log aggregation and storage
- Config: `/etc/loki/loki.yaml`
- Data: `/var/lib/loki`
### Prometheus

- Host: Grafana VM (VMID 118, pve2)
- IP: 192.168.1.230
- Port: 9090
- Role: Metrics storage (via remote_write API)
- Config: `/etc/prometheus/prometheus.yml`
- Data: `/var/lib/prometheus`
### Grafana Alloy

- Role: Unified log and metrics collection agent
- Config: `/etc/alloy/config.alloy`
- Service: `systemctl status alloy`
- UI: `http://localhost:12345` (when enabled)

Collects:

- System metrics: CPU, memory, disk, filesystem, load, network
- Systemd journal logs
- Syslog and auth logs
- Application-specific logs (Caddy, Docker, etc.)
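Whatever a host collects, every Alloy pipeline ultimately forwards into a pair of write components pointed at the Grafana VM. A minimal sketch of those two components in `/etc/alloy/config.alloy`, using the addresses above (the `"default"` labels match the component references used elsewhere on this page):

```alloy
// Ship logs to Loki on the Grafana VM
loki.write "default" {
  endpoint {
    url = "http://192.168.1.230:3100/loki/api/v1/push"
  }
}

// Ship metrics to Prometheus via its remote_write receiver
prometheus.remote_write "default" {
  endpoint {
    url = "http://192.168.1.230:9090/api/v1/write"
  }
}
```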
## Installation

### Loki (Grafana VM)

Run on the Grafana VM (VMID 118) to set up the log aggregation server:

```bash
curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-loki.sh
chmod +x install-loki.sh
sudo ./install-loki.sh

# Or specify retention period in days
sudo ./install-loki.sh 60  # 60 days retention
```
The script will:

- Install Loki from the official Grafana repositories
- Create an optimized configuration at `/etc/loki/loki.yaml`
- Set up data directories at `/var/lib/loki`
- Enable automatic log retention/compaction
- Start and enable the systemd service

After installation, add Loki as a data source in Grafana:

- Go to Connections → Data Sources → Add data source
- Select Loki
- Set URL to `http://localhost:3100`
- Click Save & Test
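Alternatively, data sources can be provisioned from a file instead of through the UI. A sketch assuming a standard Grafana package layout (the filename is arbitrary):

```yaml
# /etc/grafana/provisioning/datasources/loki.yaml (hypothetical filename)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
```

Grafana reads provisioning files at startup, so restart the service after adding one.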
### Prometheus (Grafana VM)

Run on the Grafana VM (VMID 118) to set up the metrics storage server:

```bash
curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-prometheus.sh
chmod +x install-prometheus.sh
sudo ./install-prometheus.sh

# Or specify retention period in days
sudo ./install-prometheus.sh 60  # 60 days retention
```
The script will:

- Install Prometheus from the official repositories
- Configure storage at `/var/lib/prometheus`
- Enable the remote_write receiver for Alloy agents
- Set up automatic data retention
- Start and enable the systemd service

After installation, add Prometheus as a data source in Grafana:

- Go to Connections → Data Sources → Add data source
- Select Prometheus
- Set URL to `http://localhost:9090`
- Click Save & Test
### Alloy (Generic Host)

Use on any VM/LXC to collect system logs and metrics:

```bash
curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-alloy.sh
chmod +x install-alloy.sh
sudo ./install-alloy.sh

# Or specify a custom Grafana VM IP
sudo ./install-alloy.sh 192.168.1.230
```
Features:
- Collects system metrics (CPU, memory, disk, filesystem, load, network)
- Forwards metrics to Prometheus via remote_write
- Collects systemd journal, syslog, and auth logs
- Forwards logs to Loki
### Alloy (Caddy Reverse Proxy)

Specialized script for the Caddy VM (CT 112, pve1) with JSON access log parsing:

```bash
curl -O https://raw.githubusercontent.com/MDHMatt/mdhmedia/main/scripts/install-alloy-caddy.sh
chmod +x install-alloy-caddy.sh
sudo ./install-alloy-caddy.sh

# Or specify a custom Grafana VM IP
sudo ./install-alloy-caddy.sh 192.168.1.230
```
Features:

- Collects system metrics (CPU, memory, disk, filesystem, load, network)
- Parses Caddy JSON access logs at `/var/lib/caddy/logs/access.log`
- Extracts labels for filtering: `request_host`, `request_method`, `status`
- Collects Caddy systemd service logs
- Creates human-readable log lines from JSON
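The label extraction presumably boils down to a `loki.process` pipeline along these lines (component names and JSON field paths here are illustrative, not necessarily what the script generates):

```alloy
loki.process "caddy_access" {
  forward_to = [loki.write.default.receiver]

  // Pull fields out of Caddy's JSON access log entries
  stage.json {
    expressions = {
      request_host   = "request.host",
      request_method = "request.method",
      status         = "status",
    }
  }

  // Promote the extracted fields to Loki labels for filtering
  stage.labels {
    values = {
      request_host   = "",
      request_method = "",
      status         = "",
    }
  }
}
```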
### Manual Installation

```bash
# Add Grafana repository (Debian/Ubuntu - DEB822 format)
sudo mkdir -p /etc/apt/keyrings
sudo wget -q -O /etc/apt/keyrings/grafana.asc https://apt.grafana.com/gpg-full.key
sudo chmod 644 /etc/apt/keyrings/grafana.asc

cat << 'EOF' | sudo tee /etc/apt/sources.list.d/grafana.sources
Types: deb
URIs: https://apt.grafana.com
Suites: stable
Components: main
Signed-By: /etc/apt/keyrings/grafana.asc
EOF

# Install
sudo apt-get update
sudo apt-get install alloy

# Configure /etc/alloy/config.alloy, then start
sudo systemctl enable --now alloy
```
## Configuration

### Default Log Sources

| Source | Type | Labels |
|---|---|---|
| Systemd journal | All services | `job=systemd-journal`, `unit=<service>` |
| `/var/log/syslog` | System logs | `job=syslog` |
| `/var/log/auth.log` | Auth logs | `job=auth` |
| `/var/log/messages` | System messages | `job=messages` |
### Caddy Log Labels

| Label | Description | Example |
|---|---|---|
| `request_host` | Destination service | `grafana.mdhmedia.uk` |
| `request_method` | HTTP method | `GET`, `POST` |
| `status` | HTTP status code | `200`, `404`, `502` |
### Adding Custom Log Sources

Edit `/etc/alloy/config.alloy`:

```alloy
local.file_match "myapp" {
  path_targets = [
    {
      __path__ = "/var/log/myapp/*.log",
      job      = "myapp",
    },
  ]
}

loki.source.file "myapp" {
  targets       = local.file_match.myapp.targets
  forward_to    = [loki.write.default.receiver]
  tail_from_end = true
}
```
Apply the change by restarting the service: `sudo systemctl restart alloy`.
### Node Metrics Configuration

The scripts configure system metrics collection using `prometheus.exporter.unix`:

```alloy
prometheus.exporter.unix "node" {
  set_collectors = ["cpu", "meminfo", "diskstats", "filesystem", "loadavg", "netdev", "uname"]

  filesystem {
    fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
  }
}

prometheus.scrape "node" {
  targets         = prometheus.exporter.unix.node.targets
  forward_to      = [prometheus.remote_write.default.receiver]
  scrape_interval = "15s"
  job_name        = "node"
}
```
## Example Configurations

### Docker Container Logs

```alloy
local.file_match "docker" {
  path_targets = [
    {
      __path__         = "/var/lib/docker/containers/*/*.log",
      __path_exclude__ = "/var/lib/docker/containers/*/*-json.log.*.gz",
      job              = "docker",
    },
  ]
}

loki.source.file "docker" {
  targets       = local.file_match.docker.targets
  forward_to    = [loki.write.default.receiver]
  tail_from_end = true
}
```
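Docker's `json-file` driver wraps each line in a JSON envelope (`log`, `stream`, `time`). If you want the bare message stored in Loki rather than the envelope, you could route the source through a `loki.process` stage first; a sketch (component name is illustrative):

```alloy
loki.process "docker" {
  forward_to = [loki.write.default.receiver]

  // Unwrap Docker's json-file format into the bare log line,
  // with the stream (stdout/stderr) exposed as a label
  stage.docker {}
}
```

With this in place, the `forward_to` in the `loki.source.file "docker"` block above would point at `[loki.process.docker.receiver]` instead of the writer.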
### Nginx Logs

```alloy
local.file_match "nginx" {
  path_targets = [
    {__path__ = "/var/log/nginx/access.log", job = "nginx", type = "access"},
    {__path__ = "/var/log/nginx/error.log",  job = "nginx", type = "error"},
  ]
}

loki.source.file "nginx" {
  targets       = local.file_match.nginx.targets
  forward_to    = [loki.write.default.receiver]
  tail_from_end = true
}
```
## Querying Logs

### LogQL Examples

```logql
# All logs from a specific host
{host="docker1"}

# Systemd journal logs for a specific unit
{job="systemd-journal", unit="nginx.service"}

# Search for errors across all hosts
{env="homelab"} |= "error"

# Auth failures
{job="auth"} |= "Failed password"

# Top error-producing hosts (last hour)
sum by (host) (count_over_time({env="homelab"} |= "error" [1h]))

# SSH login attempts
{job="auth"} |~ "sshd.*Accepted|sshd.*Failed"

# Systemd service failures
{job="systemd-journal"} |= "Failed to start"

# Caddy requests by service
{job="caddy", request_host="grafana.mdhmedia.uk"}

# Caddy 5xx errors
{job="caddy", status=~"5.."}
```
## Querying Metrics

### PromQL Examples

```promql
# CPU usage by host (percentage)
100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage by host (percentage)
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk usage by host and mountpoint (percentage)
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

# Network traffic (bytes/sec)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# System load average (1 minute)
node_load1

# Top 5 hosts by CPU usage
topk(5, 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))

# Hosts with disk usage > 80% (i.e. less than 20% available)
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20

# All metrics from a specific host
{host="caddy"}
```
## Dashboards

### Homelab Logs

- File: `grafana/dashboards/loki-homelab-logs.json`

| Section | Panels |
|---|---|
| Overview | Total logs, active hosts, errors, warnings, auth failures, successful logins |
| Volume | Log volume by host, log volume by job |
| Errors | Error/warning trends, errors by host, errors by job |
| Security | SSH login attempts (success/fail), failed logins by host |
| Systemd | Logs by systemd unit |
| Browser | Full log browser with search, error-only log view |

Variables: `host`, `job`, `search`
### Caddy Access Logs

- File: `grafana/dashboards/caddy-access-logs.json`

| Section | Panels |
|---|---|
| Traffic Overview | Total requests, 2xx/3xx/4xx/5xx counts, requests/sec |
| Status Codes | Requests by status code over time (stacked) |
| Services | Requests by service/host, top services bar chart |
| Distribution | Request methods pie chart, status code donut chart |
| Error Analysis | Error rate trends, errors by service, error log viewer |
| Caddy Service | Caddy systemd service errors/warnings |
| Access Browser | Full access log browser with filters |

Variables: `host` (service), `status`, `search`
### Homelab Overview

- File: `grafana/dashboards/homelab-overview.json`

| Section | Panels |
|---|---|
| Infrastructure Health | Total hosts, hosts online, errors (24h), warnings (24h) |
| Host Status | Host status table (up/down, last seen) |
| Systemd Services | Service status, recent failures |
| Security | Auth events, failed logins |
| Caddy | Requests, errors, traffic by service |
| System Metrics | CPU, memory, disk, network (Prometheus) |

Variables: `host`, `job`, `timeRange`
### Importing Dashboards
- In Grafana, go to Dashboards → Import
- Click Upload JSON file
- Select the dashboard JSON file
- Select your data sources (Loki for logs, Prometheus for metrics)
- Click Import
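For a repeatable setup, the same dashboards can be file-provisioned instead of imported by hand. A sketch assuming the JSON files are copied onto the Grafana VM (filename and dashboard directory are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/homelab.yaml (hypothetical filename)
apiVersion: 1
providers:
  - name: homelab
    folder: Homelab
    type: file
    options:
      path: /var/lib/grafana/dashboards
```

Grafana picks up provisioned dashboards on restart and keeps them in sync with the files on disk.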
## Troubleshooting

### Alloy Not Starting

```bash
sudo systemctl status alloy
sudo journalctl -u alloy -f
alloy fmt /etc/alloy/config.alloy  # Check syntax and format
```

Common config errors:

- Unrecognized attribute: options like `fs_types_exclude` must sit inside their nested block (e.g. `filesystem { }`)
- Missing quotes: string values must be quoted
- Invalid regex: check escape characters in patterns
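The first of those errors is easy to hit with `prometheus.exporter.unix`; a before/after illustration:

```alloy
// Wrong - fs_types_exclude at the top level triggers
// an "unrecognized attribute" error:
//
//   prometheus.exporter.unix "node" {
//     fs_types_exclude = "^(tmpfs)$"
//   }

// Right - it belongs inside the nested filesystem block:
prometheus.exporter.unix "node" {
  filesystem {
    fs_types_exclude = "^(tmpfs)$"
  }
}
```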
### No Logs Appearing in Loki

- Check Alloy is running: `systemctl status alloy`
- Test Loki connectivity: `curl http://192.168.1.230:3100/ready`
- Check firewall: ensure port 3100 is open on the Grafana VM
- Verify permissions: the `alloy` user must be able to read the log files (on Debian-based systems, membership in the `adm` group covers most of `/var/log`)
### High Memory Usage (Alloy)

Reduce the batch size in `/etc/alloy/config.alloy`:

```alloy
loki.write "default" {
  endpoint {
    url        = "http://192.168.1.230:3100/loki/api/v1/push"
    batch_size = "512KiB"
    batch_wait = "2s"
  }
}
```
### Loki Not Starting

```bash
sudo systemctl status loki
sudo journalctl -u loki -f
loki -config.file=/etc/loki/loki.yaml -verify-config
```
### Loki High Disk Usage

Adjust retention in `/etc/loki/loki.yaml`:

```yaml
limits_config:
  retention_period: 720h # 30 days

compactor:
  retention_enabled: true
  retention_delete_delay: 2h
```
### Loki Out of Memory

Reduce the ingester settings in `/etc/loki/loki.yaml`:
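The right values depend on log volume; the knobs that matter are the chunk lifecycle settings. An illustrative starting point (values are assumptions, not tuned for this setup):

```yaml
ingester:
  chunk_idle_period: 30m     # flush chunks that stop receiving data sooner
  max_chunk_age: 1h          # cap how long a chunk can sit in memory
  chunk_target_size: 1048576 # aim for ~1 MiB compressed chunks
```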
### Prometheus Not Starting

```bash
sudo systemctl status prometheus
sudo journalctl -u prometheus -f
promtool check config /etc/prometheus/prometheus.yml
```
### No Metrics in Prometheus

- Check Alloy is running: `systemctl status alloy`
- Test Prometheus connectivity: `curl http://192.168.1.230:9090/-/ready`
- Check firewall: ensure port 9090 is open
- Verify remote_write is enabled in Prometheus (check for the `--web.enable-remote-write-receiver` flag)
- Check the Alloy config has the correct Prometheus URL
### Prometheus High Disk Usage

Adjust retention via a systemd override at `/etc/systemd/system/prometheus.service.d/override.conf`, then reload systemd and restart Prometheus.
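A sketch of such an override, assuming the Debian package's binary path and the flags implied by the install script (adjust the paths and retention to your setup):

```ini
# /etc/systemd/system/prometheus.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.enable-remote-write-receiver
```

Apply with `sudo systemctl daemon-reload && sudo systemctl restart prometheus`.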