Quick Comparison
| Product | Best For | Rating |
|---|---|---|
| Datadog Logs | Best Overall | 4.7/5 |
| Grafana Loki | Best Budget | 4.6/5 |
| Splunk Enterprise | Best Premium | 4.7/5 |
| Elastic Stack | Best for Self Hosting | 4.5/5 |
| Vector by Datadog | Best Compact | 4.6/5 |
I have built or operated log aggregation systems at three companies of different sizes. The patterns that scale are different from what tutorials suggest.
The Components
Collection: Agent on each host/container that ships logs.
- Fluent Bit (lightweight, edge)
- Vector (modern, performant)
- Fluentd (mature, plugin ecosystem)
Transport/Buffer: Decouples log source from destination.
- Kafka (mature, scalable)
- Redpanda (Kafka-compatible, faster)
- Kinesis (AWS-native)
Processing: Parse, enrich, route logs.
- Vector (single tool for collection + processing)
- Logstash (Java-heavy)
- Custom (Go/Python services)
Storage/Query:
- Elasticsearch (mature, expensive)
- Loki (Prometheus model for logs, cheap)
- OpenSearch (Elasticsearch fork)
- Commercial (Datadog, Splunk, New Relic)
Visualization:
- Kibana (with Elasticsearch)
- Grafana (with Loki, multiple sources)
- Commercial dashboards
Architecture Patterns
Direct ingestion (simple)
Apps → Log Agent → Storage → Query
Use case: small startups, low volume. Single point of failure.
Buffered ingestion
Apps → Agent → Kafka → Processor → Storage → Query
Use case: most production systems. Buffer absorbs spikes. Processor enriches/routes. Survives downstream outages.
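The decoupling idea can be sketched inside a single process. This is only an analogy for the Kafka stage, assuming a bounded in-memory queue: `Buffer`, `Offer`, and `Drain` are hypothetical names, and unlike Kafka this queue is not durable and drops lines once it is full. What it does show is that the producer never blocks on a slow or absent consumer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Buffer is a bounded in-memory queue between log producers and a shipper.
// It stands in for the buffer stage; a real deployment would use a durable log.
type Buffer struct {
	ch      chan string
	dropped atomic.Int64
}

func NewBuffer(capacity int) *Buffer {
	return &Buffer{ch: make(chan string, capacity)}
}

// Offer enqueues a line without ever blocking the caller.
// When the buffer is full, the line is dropped and counted.
func (b *Buffer) Offer(line string) bool {
	select {
	case b.ch <- line:
		return true
	default:
		b.dropped.Add(1)
		return false
	}
}

// Drain removes up to max lines that are buffered right now.
func (b *Buffer) Drain(max int) []string {
	var out []string
	for len(out) < max {
		select {
		case line := <-b.ch:
			out = append(out, line)
		default:
			return out
		}
	}
	return out
}

// Dropped reports how many lines were discarded because the buffer was full.
func (b *Buffer) Dropped() int64 { return b.dropped.Load() }

func main() {
	buf := NewBuffer(3)
	for i := 0; i < 5; i++ { // a spike of 5 lines into a buffer of 3
		buf.Offer(fmt.Sprintf("line %d", i))
	}
	fmt.Println("shipped:", len(buf.Drain(10)), "dropped:", buf.Dropped())
	// prints "shipped: 3 dropped: 2"
}
```

Exposing the drop counter as a metric is one way to tell when the buffer is undersized for real spikes.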
Multi-destination
Apps → Agent → Kafka → [Hot Storage] + [Cold Archive] + [SIEM]
Use case: compliance + operational + security needs in parallel. Same log goes to multiple destinations with different retention.
Tool Recommendations by Scale
Solo developer / Side project: Loki + Grafana self-hosted. Free. Fits on a small server.
Early startup (5-20 engineers): Loki + Grafana, managed or self-hosted. Either is workable at small scale.
Mid-size SaaS (50-200 engineers):
- Option A: Self-hosted ELK or Loki. Ops investment 0.5-1 engineer.
- Option B: Commercial (Datadog Logs).
Enterprise (500+ engineers):
- Multi-tier storage (hot + warm + cold)
- Custom-built or Splunk Enterprise
- Dedicated logging team
Cost Optimization
Sampling: Drop low-value logs at the source. Common rules:
- Drop DEBUG/TRACE in production
- Sample successful requests 10% (keep all errors)
- Drop health check logs entirely
- Drop synthetic test traffic
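The rules above can be sketched as a single filter function. This is a minimal sketch under assumptions: the `Entry` fields, the `/healthz` path, and the `Synthetic` flag are illustrative names, not a standard schema. The 10% sample is chosen by hashing the request ID, so every line from one request gets the same keep-or-drop decision.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Entry holds only the fields the sampling rules need (hypothetical schema).
type Entry struct {
	Level     string // "TRACE", "DEBUG", "INFO", "ERROR", ...
	Path      string
	Status    int
	RequestID string
	Synthetic bool // set upstream for synthetic test traffic
}

// Keep reports whether an entry should be shipped.
func Keep(e Entry) bool {
	// Rules that drop outright: debug noise, synthetic traffic, health checks.
	if e.Level == "DEBUG" || e.Level == "TRACE" || e.Synthetic || e.Path == "/healthz" {
		return false
	}
	// Errors are always kept.
	if e.Level == "ERROR" || e.Status >= 400 {
		return true
	}
	// Successful requests are sampled at roughly 10%, keyed on the request ID.
	if e.Status >= 200 && e.Status < 400 {
		h := fnv.New32a()
		h.Write([]byte(e.RequestID))
		return h.Sum32()%10 == 0
	}
	// Anything else (for example non-request application logs) is kept.
	return true
}

func main() {
	kept := 0
	for i := 0; i < 1000; i++ {
		e := Entry{Level: "INFO", Path: "/api/orders", Status: 200, RequestID: fmt.Sprintf("req-%d", i)}
		if Keep(e) {
			kept++
		}
	}
	fmt.Printf("kept %d of 1000 successful requests\n", kept)
}
```

Hash-based sampling is a design choice here; a random draw would give the same rate but could keep the start of a request and drop its end.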
Structured logging: Faster to query than unstructured text. Index only the fields you query.
Field reduction: Don't log full request/response bodies. Log a structured event with key fields.
Retention tiering: Hot 7 days, warm 30 days, archive 1 year. Most queries don't need long retention.
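A tiering policy like this reduces to a small age-to-tier mapping. The sketch below assumes the windows are cumulative ages (hot until day 7, warm until day 30, archive until day 365, then delete); `TierFor` is a hypothetical helper, and the final delete step is an assumption the text implies but does not state.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// TierFor maps a log's age to a storage tier.
// Boundaries are read as cumulative ages, which is an interpretation of the text.
func TierFor(age time.Duration) string {
	switch {
	case age < 7*day:
		return "hot"
	case age < 30*day:
		return "warm"
	case age < 365*day:
		return "archive"
	default:
		return "delete"
	}
}

func main() {
	for _, d := range []int{1, 10, 90, 400} {
		fmt.Printf("%d days old -> %s\n", d, TierFor(time.Duration(d)*day))
	}
}
```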
Pre-aggregation: For metric-derived-from-logs use cases, pre-aggregate before storage. Reduces query cost dramatically.
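As an illustration of pre-aggregation, the sketch below collapses individual request events into one count per minute and status class, so the store holds a few rows instead of every line. The `Event` and `Key` types and the one-minute bucket are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Event is one request log line reduced to what the metric needs.
type Event struct {
	Time   time.Time
	Status int
}

// Key identifies one aggregate row.
type Key struct {
	Minute int64  // Unix seconds, truncated to the start of the minute
	Class  string // "2xx", "4xx", "5xx", ...
}

// Aggregate counts events per minute and status class.
func Aggregate(events []Event) map[Key]int {
	counts := make(map[Key]int)
	for _, e := range events {
		k := Key{
			Minute: e.Time.Unix() / 60 * 60,
			Class:  fmt.Sprintf("%dxx", e.Status/100),
		}
		counts[k]++
	}
	return counts
}

// sampleEvents returns four events spanning two minutes (arbitrary timestamps).
func sampleEvents() []Event {
	base := time.Unix(1700000040, 0) // a minute-aligned instant
	return []Event{
		{base.Add(5 * time.Second), 200},
		{base.Add(30 * time.Second), 204},
		{base.Add(59 * time.Second), 500},
		{base.Add(60 * time.Second), 200},
	}
}

func main() {
	// Four raw events become three aggregate rows; map order is not fixed.
	for k, n := range Aggregate(sampleEvents()) {
		fmt.Println(k.Minute, k.Class, n)
	}
}
```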
Common Mistakes
Logging everything: Most teams log 2-5x more than they ever query. Reduce volume before scaling infrastructure.
Unstructured logs: Slow to query, hard to parse. Always log structured (JSON) in modern systems.
No retention policy: Logs accumulate forever. Disk costs balloon. Set retention on day one.
Manual log parsing: Building custom parsers for every format becomes a maintenance burden. Standardize log format across services.
Single point of failure: Direct app-to-storage without buffer. Downstream outage causes log loss or app blocking.
Over-investing in infrastructure: Spending more per month on the logging system than the observability value it returns. Match investment to actual queries.
Sample Implementation (Go service)
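import (
	"os"
	"time"

	"github.com/rs/zerolog"
)

// Base logger: every line carries a timestamp, service name, and build version.
// buildVersion, userID, cart, and start are assumed to be defined by the caller.
logger := zerolog.New(os.Stdout).
	With().
	Timestamp().
	Str("service", "api").
	Str("version", buildVersion).
	Logger()

// Structured log: one event with typed fields instead of a formatted string.
logger.Info().
	Str("user_id", userID).
	Str("action", "checkout").
	Int("items", len(cart.Items)).
	Float64("total", cart.Total).
	Dur("duration", time.Since(start)).
	Msg("checkout completed")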
Always: timestamp, service identifier, structured fields, no PII in fields.
Never: log full credit cards, passwords, session tokens, user PII without explicit policy.
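One way to enforce the "never" list is to pass fields through a redaction step before they reach the logger. This is a minimal sketch using only the standard library; the key names in `sensitiveKeys` are examples, and `Redact` is a hypothetical helper rather than part of zerolog.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveKeys lists field names whose values must never be logged.
// The list is illustrative; a real one comes from your data policy.
var sensitiveKeys = map[string]bool{
	"password":      true,
	"card_number":   true,
	"session_token": true,
	"email":         true,
}

// Redact returns a copy of fields with sensitive values masked.
// The input map is left untouched.
func Redact(fields map[string]any) map[string]any {
	out := make(map[string]any, len(fields))
	for k, v := range fields {
		if sensitiveKeys[strings.ToLower(k)] {
			out[k] = "[REDACTED]"
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	fields := map[string]any{"user_id": "u123", "action": "login", "password": "hunter2"}
	fmt.Println(Redact(fields))
	// prints "map[action:login password:[REDACTED] user_id:u123]"
}
```

A denylist only catches names you thought of; an allowlist of loggable fields is stricter, at the cost of more upkeep.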
Logging Patterns
Request lifecycle logging:
{"event": "request.start", "request_id": "abc", "method": "POST", "path": "/api/orders"}
{"event": "request.end", "request_id": "abc", "status": 200, "duration_ms": 145}
Use request_id to trace all logs from the same request.
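The start/end pair can be produced by HTTP middleware. The sketch below uses only the standard library and a plain callback instead of a specific logging library; `WithRequestLog`, `LogFunc`, and `serveOnce` are hypothetical names, and reading the ID from an `X-Request-ID` header is a common convention assumed here, not something the text prescribes.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"os"
	"time"
)

// LogFunc receives one structured event; the caller decides how to write it.
type LogFunc func(fields map[string]any)

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// newRequestID returns 8 random bytes as hex, or "unknown" if the RNG fails.
func newRequestID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// WithRequestLog emits request.start and request.end events sharing a request_id.
func WithRequestLog(log LogFunc, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID") // assumed header name
		if id == "" {
			id = newRequestID()
		}
		start := time.Now()
		log(map[string]any{"event": "request.start", "request_id": id,
			"method": r.Method, "path": r.URL.Path})

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		log(map[string]any{"event": "request.end", "request_id": id,
			"status": rec.status, "duration_ms": time.Since(start).Milliseconds()})
	})
}

// serveOnce pushes a single in-memory POST /api/orders through the middleware.
func serveOnce(log LogFunc, status int) {
	h := WithRequestLog(log, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
	req := httptest.NewRequest(http.MethodPost, "/api/orders", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
}

func main() {
	enc := json.NewEncoder(os.Stdout) // one JSON object per line
	serveOnce(func(f map[string]any) { _ = enc.Encode(f) }, http.StatusOK)
}
```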
Error logging:
{"level": "error", "request_id": "abc", "error": "payment.failed", "error_code": "card_declined", "retryable": true}
Structured errors enable alerting and trending.
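To make the alerting and trending point concrete: once errors are structured, counting them by code is a few lines. The sketch assumes JSON log lines with the field names shown in the example; `ErrorEvent` and `CountByCode` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ErrorEvent mirrors the structured error line shown in the text.
type ErrorEvent struct {
	Level     string `json:"level"`
	RequestID string `json:"request_id"`
	Error     string `json:"error"`
	ErrorCode string `json:"error_code"`
	Retryable bool   `json:"retryable"`
}

// CountByCode tallies error-level lines by error_code.
// Lines that are not valid JSON or not error-level are skipped.
func CountByCode(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		var e ErrorEvent
		if err := json.Unmarshal([]byte(line), &e); err != nil || e.Level != "error" {
			continue
		}
		counts[e.ErrorCode]++
	}
	return counts
}

func main() {
	lines := []string{
		`{"level":"error","request_id":"abc","error":"payment.failed","error_code":"card_declined","retryable":true}`,
		`{"level":"error","request_id":"def","error":"payment.failed","error_code":"card_declined","retryable":true}`,
		`{"level":"info","request_id":"ghi","event":"request.end"}`,
	}
	fmt.Println(CountByCode(lines))
	// prints "map[card_declined:2]"
}
```

An alert rule is then a threshold on one of these counts, which is not practical when the error is buried in free text.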
Business event logging:
{"event": "user.signed_up", "user_id": "u123", "source": "google_oauth", "plan": "free"}
Business events serve product analytics, not just engineering observability.
Compliance Considerations
GDPR/CCPA: Don't log PII (email, name, address) without explicit policy. Use hashed identifiers. Implement deletion (right to be forgotten).
HIPAA: Healthcare data requires encrypted storage, access controls, audit trails on log access itself.
PCI-DSS: Card data never logged. Tokenized references only.
SOC 2 / ISO 27001: Logs themselves are auditable. Maintain integrity (no modification), access controls, retention proof.
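For the hashed identifiers mentioned under GDPR/CCPA, a keyed hash is one common approach. The sketch below uses HMAC-SHA256 from the standard library; `HashID` is a hypothetical helper, and the key shown is a placeholder that would come from a secret store. This is pseudonymization rather than anonymization, so whether it satisfies a given policy is a question for whoever owns compliance.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// HashID returns a keyed, hex-encoded hash of a raw identifier.
// The same key and input always give the same output, so logs stay joinable.
func HashID(key []byte, id string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(id))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	key := []byte("placeholder-key-load-from-secret-store") // assumption: real key is managed elsewhere
	fmt.Println(HashID(key, "alice@example.com"))
}
```

Using a key instead of a bare hash means someone holding only the logs cannot confirm a guessed email by hashing it themselves.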
Cost-Benefit Reality
Logging cost: $X. Query value: 5-10x logging cost (incidents resolved, bugs prevented).
If your query value isn't 5x cost, reduce volume or switch tools. Logs are infrastructure, not a data product. Optimize for incident response time, not exhaustive history.
Recommended Stack for Most Teams
Startups + early growth:
- Vector or Fluent Bit for collection
- Kafka or Redpanda for buffer (skip for very small scale)
- Loki for storage
- Grafana for query/dashboards
- Self-hosted or Grafana Cloud
Mid-size SaaS with budget:
- Same collection but commercial backend (Datadog, New Relic Logs)
- Trade engineering time for managed service
Enterprise:
- Splunk or self-built on Elasticsearch
- Multi-tier storage
- Dedicated platform team
The right answer depends on engineering bandwidth vs cost tolerance. There's no universal correct choice.
Frequently asked questions
ELK Stack vs Loki vs commercial?
ELK (Elasticsearch+Logstash+Kibana): mature, complex, expensive to run at scale. Loki: cheaper alternative, simpler queries. Commercial (Datadog, Splunk): expensive but turnkey. Match tool to team size and budget.
What's a reasonable log volume?
Small startup: 10-100 GB/day. Mid-size SaaS: 500 GB-5 TB/day. Enterprise: 10-100+ TB/day. Cost scales linearly with volume. Most teams overspend on logs by not sampling adequately.
Should I log everything?
No. Log structured events. Avoid logging full request/response bodies (privacy, cost, noise). Use trace IDs to correlate. Most teams log 2-5x more than they ever query.
Retention policy?
Hot tier (queryable): 7-30 days. Warm tier (slower queries): 30-90 days. Cold tier (archived): 6 months-7 years for compliance. Most queries hit last 7 days of hot tier.
Cost per GB?
It depends on the backend: self-hosted is priced per GB stored, while Loki/Mimir and Datadog/Splunk each have their own per-GB rates. Costs add up: a large yearly log bill at a mid-size SaaS is common.