What Metrics Actually Matter
Most teams monitor queues reactively — they look at the dashboard after something breaks.
The goal of a proper queue dashboard is proactive visibility: catch problems
before they impact users. These are the five metrics worth tracking:
Queue Depth (Pending Jobs) — jobs waiting to be processed. Sustained growth = your workers can't keep up.
Throughput — jobs processed per minute. Drops indicate worker crashes or slow jobs blocking the queue.
Wait Time (Latency) — time from dispatch to execution start. High latency means your workers are underpowered for the load.
Failure Rate — percentage of jobs that fail. Spike = a code bug, dependency down, or config change.
Job Duration — average processing time per job class. Growing duration = performance regression or resource exhaustion.
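Wait time (metric 3) is the one queue drivers rarely expose directly, but you can measure it per job by stamping the dispatch time on the job itself. A minimal sketch — the `dispatchedAt` property is our own convention, not a Laravel built-in, and `Metrics` stands in for whatever metrics client you use:

```php
<?php
// Sketch: measure queue wait time by recording the dispatch timestamp.
// Public properties are serialized with the job, so dispatchedAt survives
// the round-trip through the queue.

class SendEmail implements \Illuminate\Contracts\Queue\ShouldQueue
{
    use \Illuminate\Bus\Queueable;

    public float $dispatchedAt;

    public function __construct()
    {
        // Captured when the job is dispatched
        $this->dispatchedAt = microtime(true);
    }

    public function handle(): void
    {
        // Wait time = dispatch-to-execution-start, in milliseconds
        $waitMs = (microtime(true) - $this->dispatchedAt) * 1000;
        Metrics::histogram('queue.job.wait_time', $waitMs, [
            'job' => class_basename($this),
        ]);

        // ... actual work ...
    }
}
```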
Horizon Built-in Metrics
If you're on Redis + Horizon, you already have a built-in metrics system.
Horizon's metrics are stored in Redis and displayed in the /horizon dashboard.
Take Regular Snapshots
Horizon needs a scheduled snapshot command to record metrics over time.
Without it, the Metrics tab shows no historical data:
// app/Console/Kernel.php — inside the schedule() method
$schedule->command('horizon:snapshot')->everyFiveMinutes();
Horizon Metrics API
Horizon exposes its metrics via an internal API you can query programmatically:
use Laravel\Horizon\Contracts\MetricsRepository;
$metrics = app(MetricsRepository::class);
// Get throughput for a job class (jobs per minute)
$throughput = $metrics->throughputForJob(App\Jobs\SendEmail::class);
// Get average runtime in milliseconds
$runtime = $metrics->runtimeForJob(App\Jobs\SendEmail::class);
// Get queue throughput (jobs per minute)
$queueThroughput = $metrics->throughputForQueue('emails');
// Wait time comes from a separate service, not MetricsRepository:
$waitTime = app(\Laravel\Horizon\WaitTimeCalculator::class)->calculateFor('redis:emails'); // seconds
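These calls can back a lightweight health endpoint for uptime checks or other services. A hedged sketch — the route path, queue names, and threshold are arbitrary choices, not Horizon conventions:

```php
<?php
// routes/web.php — sketch of a JSON health endpoint built on Horizon's metrics API
use Illuminate\Support\Facades\Route;
use Laravel\Horizon\Contracts\MetricsRepository;
use Laravel\Horizon\WaitTimeCalculator;

Route::get('/queue-health', function (MetricsRepository $metrics, WaitTimeCalculator $waitTimes) {
    // Estimated seconds until a freshly dispatched job would start running
    $wait = $waitTimes->calculateFor('redis:emails');

    return response()->json([
        'throughput_per_min' => $metrics->throughputForQueue('emails'),
        'avg_runtime_ms'     => $metrics->runtimeForJob(App\Jobs\SendEmail::class),
        'wait_seconds'       => $wait,
        'healthy'            => $wait < 30, // arbitrary threshold
    ]);
});
```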
Exporting Custom Metrics
For teams that need more than Horizon provides — historical data beyond 1 hour,
cross-service dashboards, or custom business metrics — export queue data to a time-series database.
Custom Metrics Collector (Scheduled Command)
// app/Console/Commands/CollectQueueMetrics.php
// Note: the jobs/failed_jobs tables only exist for the database queue driver.
use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;

class CollectQueueMetrics extends Command
{
    protected $signature = 'metrics:queue-collect';
    protected $description = 'Collect queue metrics and push to time-series DB';

    public function handle(): void
    {
        $queues = ['default', 'emails', 'payments', 'notifications'];

        foreach ($queues as $queue) {
            // Pending = waiting, not yet picked up by a worker
            $pending = DB::table('jobs')
                ->where('queue', $queue)
                ->whereNull('reserved_at')
                ->count();

            // Reserved = currently being processed
            $reserved = DB::table('jobs')
                ->where('queue', $queue)
                ->whereNotNull('reserved_at')
                ->count();

            // Failed in the last 5 minutes
            $failed = DB::table('failed_jobs')
                ->where('queue', $queue)
                ->where('failed_at', '>=', now()->subMinutes(5))
                ->count();

            // Push to your metrics system (Metrics here is your own client/facade)
            Metrics::gauge("queue.pending.{$queue}", $pending);
            Metrics::gauge("queue.reserved.{$queue}", $reserved);
            Metrics::counter("queue.failed.{$queue}", $failed);
        }
    }
}
// Schedule every minute
$schedule->command('metrics:queue-collect')->everyMinute()->withoutOverlapping();
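The `Metrics` facade used above is not a Laravel built-in — it stands for whatever client your app wraps. As one possibility, here is a minimal StatsD-style client sketch; the class name is our own, but the `name:value|type` line format over UDP is the standard StatsD wire protocol:

```php
<?php
// Minimal StatsD-style client sketch. StatsD's wire format is plain text
// over UDP: "metric.name:value|type" where type is g (gauge), c (counter),
// or ms (timing).

class StatsdClient
{
    public function __construct(
        private string $host = '127.0.0.1',
        private int $port = 8125,
    ) {}

    /** Build a StatsD line, e.g. "queue.pending.emails:42|g". */
    public static function format(string $metric, float|int $value, string $type): string
    {
        return "{$metric}:{$value}|{$type}";
    }

    public function gauge(string $metric, float|int $value): void
    {
        $this->send(self::format($metric, $value, 'g'));
    }

    public function counter(string $metric, int $value = 1): void
    {
        $this->send(self::format($metric, $value, 'c'));
    }

    private function send(string $line): void
    {
        // Fire-and-forget UDP; metrics must never take the app down on failure.
        $socket = @fsockopen("udp://{$this->host}", $this->port);
        if ($socket !== false) {
            @fwrite($socket, $line);
            fclose($socket);
        }
    }
}
```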
Track Per-Job Duration in Middleware
// app/Jobs/Middleware/RecordMetrics.php
class RecordMetrics
{
    public function handle(object $job, \Closure $next): void
    {
        $jobClass = class_basename($job);
        $start = microtime(true);
        $status = 'success';

        try {
            $next($job);
        } catch (\Throwable $e) {
            $status = 'failure';
            throw $e; // re-throw so the queue worker still records the failure
        } finally {
            $duration = (microtime(true) - $start) * 1000; // ms

            // Push to StatsD, Prometheus pushgateway, Datadog, etc.
            Metrics::histogram('queue.job.duration', $duration, [
                'job' => $jobClass,
                'queue' => $job->queue ?? 'default',
                'status' => $status,
            ]);
        }
    }
}
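Attaching the middleware uses Laravel's standard `middleware()` hook on the job class — the framework runs everything it returns around `handle()`:

```php
<?php
// In any queued job class — Laravel calls middleware() before handle().
use App\Jobs\Middleware\RecordMetrics;

class SendEmail implements \Illuminate\Contracts\Queue\ShouldQueue
{
    use \Illuminate\Bus\Queueable;

    public function middleware(): array
    {
        return [new RecordMetrics];
    }

    public function handle(): void
    {
        // ... actual work ...
    }
}
```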
Prometheus + Grafana Setup
For production-grade observability, combine Prometheus (metrics storage) with Grafana (visualization).
Laravel metrics flow into Prometheus via a scrape endpoint or Pushgateway.
Option 1: Prometheus Pushgateway (Simplest)
composer require promphp/prometheus_client_php
composer require promphp/prometheus_push_gateway_php
// In your metrics collector command:
use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis as PrometheusRedis;
use PrometheusPushGateway\PushGateway;

$registry = new CollectorRegistry(new PrometheusRedis(['host' => 'redis']));

$pendingGauge = $registry->getOrRegisterGauge(
    'laravel',              // namespace
    'queue_pending_jobs',   // metric name
    'Number of pending jobs per queue',
    ['queue']               // label names
);
$pendingGauge->set($pending, [$queue]);

// Push to the Prometheus Pushgateway
$pushGateway = new PushGateway('pushgateway:9091');
$pushGateway->pushAdd($registry, 'laravel_queue', ['instance' => gethostname()]);
Option 2: Scrape Endpoint (More Robust)
// routes/web.php — Prometheus scrape endpoint
// Bind CollectorRegistry as a singleton using the same Redis-backed storage
// your collector writes to; otherwise app() resolves an empty registry.
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;

Route::get('/metrics', function () {
    $registry = app(CollectorRegistry::class);
    $renderer = new RenderTextFormat();

    return response($renderer->render($registry->getMetricFamilySamples()), 200, [
        'Content-Type' => RenderTextFormat::MIME_TYPE,
    ]);
})->middleware('auth.prometheus'); // e.g. IP allowlist so only the Prometheus scraper can reach it
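On the Prometheus side, the endpoint is registered as a scrape target in `prometheus.yml`. A minimal sketch — the hostname and interval are placeholders for your environment:

```yaml
# prometheus.yml — scrape the Laravel /metrics endpoint every 30s
scrape_configs:
  - job_name: laravel_queue
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ['app.example.internal:80']
```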
Essential Grafana Dashboard Panels
Here are the panels every Laravel queue Grafana dashboard should have, with their Prometheus queries:
Panel 1: Queue Depth by Queue (Stat/Gauge)
# PromQL
laravel_queue_pending_jobs{queue="emails"}
laravel_queue_pending_jobs{queue="payments"}
# Alert threshold: > 500 pending = warning, > 2000 = critical
Panel 2: Job Throughput (Time Series)
# Jobs processed per minute, grouped by job class
rate(laravel_queue_job_duration_count[5m]) * 60
# Color code: green > 100/min, yellow > 50/min, red < 10/min
Panel 3: Job Duration Percentiles (Heatmap)
# P50, P95, P99 job duration
histogram_quantile(0.50, rate(laravel_queue_job_duration_bucket[5m]))
histogram_quantile(0.95, rate(laravel_queue_job_duration_bucket[5m]))
histogram_quantile(0.99, rate(laravel_queue_job_duration_bucket[5m]))
# P99 > 5s = investigate slow jobs
Panel 4: Failure Rate (Stat)
# Failure rate as percentage
rate(laravel_queue_job_duration_count{status="failure"}[5m])
/
rate(laravel_queue_job_duration_count[5m])
* 100
# Alert: > 5% failure rate = warning, > 20% = critical
Panel 5: Failed Jobs Table (Table)
# Not from Prometheus — pull from database in a separate panel
# Use Grafana MySQL/PostgreSQL datasource:
SELECT
    queue,
    SUBSTRING_INDEX(exception, '\n', 1) AS error,
    COUNT(*) AS count,
    DATE_FORMAT(MAX(failed_at), '%H:%i') AS last_seen
FROM failed_jobs
WHERE failed_at >= NOW() - INTERVAL 1 HOUR
GROUP BY queue, error
ORDER BY count DESC
LIMIT 20;
Intelligent Alerting Rules
Good alerts fire when something needs human attention. Bad alerts fire on every blip
and cause alert fatigue — the team starts ignoring them. Here are battle-tested alert rules:
Rule 1: Sustained Queue Backlog
# Grafana Alert Rule — fires only if condition is true for 10+ minutes
# This prevents alerts from transient bursts that clear quickly
WHEN avg() OF query(laravel_queue_pending_jobs{queue="payments"}, 10m) > 200
FOR 10m
SEVERITY: critical
MESSAGE: "Payments queue backlog: {{ value }} jobs pending for 10+ minutes"
Rule 2: Failure Rate Spike
# Alert when failure rate > 10% for 5 minutes
WHEN avg() OF (failure_rate_query) > 10
FOR 5m
SEVERITY: warning
MESSAGE: "Queue failure rate {{ value }}% on {{ queue }}"
Rule 3: Worker Throughput Drop
# Alert when throughput drops > 80% below 1-hour average
# This catches worker crashes even when the queue depth is still low
WHEN (current_throughput / avg_throughput_1h) < 0.2
FOR 3m
SEVERITY: critical
MESSAGE: "Queue throughput dropped to {{ value }} jobs/min — workers may be down"
Rule 4: Job Duration Regression
# Alert when P95 job duration is 3x higher than the daily average
# Catches memory leaks, slow queries, and dependency degradation
WHEN p95_duration > (daily_avg_p95 * 3)
FOR 5m
SEVERITY: warning
MESSAGE: "Job {{ job_class }} P95 duration {{ value }}ms — 3x above normal"
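If alerting lives in Prometheus rather than Grafana, rules 1 and 3 translate roughly to a rules file like this. The metric names match the earlier examples; the thresholds and group name are placeholders:

```yaml
# alerts.yml — sketch of Prometheus alerting rules for rules 1 and 3
groups:
  - name: laravel_queue
    rules:
      - alert: PaymentsQueueBacklog
        expr: avg_over_time(laravel_queue_pending_jobs{queue="payments"}[10m]) > 200
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Payments queue backlog: {{ $value }} jobs pending for 10+ minutes"

      - alert: QueueThroughputDrop
        # current 5m throughput vs. the 1h average of the same rate (subquery)
        expr: |
          rate(laravel_queue_job_duration_count[5m])
            / avg_over_time(rate(laravel_queue_job_duration_count[5m])[1h:5m]) < 0.2
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Queue throughput dropped below 20% of the 1h average"
```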
Conclusion
A great queue dashboard tells you the current health, recent trends, and failure details
without requiring manual investigation. Build towards these goals:
Use Horizon's built-in metrics for Redis queues — start here before investing in Prometheus
Schedule horizon:snapshot every 5 minutes so historical data is available in the Metrics tab
Export metrics to Prometheus + Grafana for long-term retention and cross-service dashboards
Build 5 core panels: queue depth, throughput, duration percentiles, failure rate, failed job table
Write alerts with a "FOR N minutes" condition — never alert on transient blips
The throughput-drop alert is your most important one — it catches total worker failure even before the queue starts backing up