Quick Comparison

Product             Best For               Rating
Datadog Logs        Best Overall           4.7/5
Grafana Loki        Best Budget            4.6/5
Splunk Enterprise   Best Premium           4.7/5
Elastic Stack       Best for Self Hosting  4.5/5
Vector by Datadog   Best Compact           4.6/5

I have built or operated log aggregation systems at three companies of different sizes. The patterns that scale are different from what tutorials suggest.

The Components

Collection: Agent on each host/container that ships logs.

  • Fluent Bit (lightweight, edge)
  • Vector (modern, performant)
  • Fluentd (mature, plugin ecosystem)

Transport/Buffer: Decouples log source from destination.

  • Kafka (mature, scalable)
  • Redpanda (Kafka-compatible, faster)
  • Kinesis (AWS-native)

Processing: Parse, enrich, route logs.

  • Vector (single tool for collection + processing)
  • Logstash (Java-heavy)
  • Custom (Go/Python services)

Storage/Query:

  • Elasticsearch (mature, expensive)
  • Loki (Prometheus model for logs, cheap)
  • OpenSearch (Elasticsearch fork)
  • Commercial (Datadog, Splunk, New Relic)

Visualization:

  • Kibana (with Elasticsearch)
  • Grafana (with Loki, multiple sources)
  • Commercial dashboards

Architecture Patterns

Direct ingestion (simple)

Apps → Log Agent → Storage → Query

Use case: small startups, low volume. Single point of failure.

Buffered ingestion

Apps → Agent → Kafka → Processor → Storage → Query

Use case: most production systems. Buffer absorbs spikes. Processor enriches/routes. Survives downstream outages.
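The decoupling is easier to see in miniature. Below is a minimal in-process sketch, assuming a bounded Go channel stands in for Kafka; `Buffer`, `Offer`, and `Drain` are hypothetical names, and a real queue would persist to disk rather than drop on overflow.

```go
package main

import "fmt"

// Buffer is a hypothetical in-memory stand-in for a durable queue such as Kafka.
// It is meant for a single producer goroutine: the dropped counter is not synchronized.
type Buffer struct {
	ch      chan string
	dropped int
}

func NewBuffer(capacity int) *Buffer {
	return &Buffer{ch: make(chan string, capacity)}
}

// Offer enqueues a line if there is room and reports whether it was accepted.
// It never blocks, so a slow or unavailable backend cannot stall the application.
func (b *Buffer) Offer(line string) bool {
	select {
	case b.ch <- line:
		return true
	default:
		b.dropped++
		return false
	}
}

// Drain hands up to limit buffered lines to sink, as a processor would once
// the downstream is healthy again. It returns the number of lines delivered.
func (b *Buffer) Drain(limit int, sink func(string)) int {
	n := 0
	for n < limit {
		select {
		case line := <-b.ch:
			sink(line)
			n++
		default:
			return n
		}
	}
	return n
}

func main() {
	buf := NewBuffer(3)
	for i := 0; i < 5; i++ { // a spike of 5 lines while the downstream is unavailable
		buf.Offer(fmt.Sprintf("line-%d", i))
	}
	delivered := buf.Drain(10, func(s string) { fmt.Println("stored:", s) })
	fmt.Println("delivered:", delivered, "dropped:", buf.dropped)
}
```

The design choice to notice is the non-blocking `Offer`: when the buffer is full the application loses logs instead of blocking. Sizing the buffer for your worst expected outage is what turns that trade-off into "survives downstream outages".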

Multi-destination

Apps → Agent → Kafka → [Hot Storage] + [Cold Archive] + [SIEM]

Use case: compliance + operational + security needs in parallel. Same log goes to multiple destinations with different retention.
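A rough sketch of the fan-out decision, assuming each destination has its own retention and its own acceptance rule. The destination names, retention values, and the "auth logs go to the SIEM" rule are illustrative assumptions, not recommendations.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// Destination pairs a sink name with its retention and an acceptance rule.
type Destination struct {
	Name      string
	Retention time.Duration
	Accept    func(level, source string) bool
}

// destinations is a hypothetical routing table.
var destinations = []Destination{
	{"hot-storage", 7 * day, func(level, _ string) bool { return level != "debug" }},
	{"cold-archive", 365 * day, func(_, _ string) bool { return true }},
	{"siem", 90 * day, func(_, source string) bool { return source == "auth" }},
}

// routeTo returns the name of every destination that should receive a copy of the log.
func routeTo(level, source string) []string {
	var out []string
	for _, d := range destinations {
		if d.Accept(level, source) {
			out = append(out, d.Name)
		}
	}
	return out
}

func main() {
	fmt.Println(routeTo("info", "auth"))     // all three destinations
	fmt.Println(routeTo("debug", "billing")) // cold archive only
}
```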

Tool Recommendations by Scale

Solo developer / Side project: Loki + Grafana self-hosted. Free. Fits on small server.

Early startup (5-20 engineers): Loki + Grafana, managed or self-hosted. Either is workable at small scale.

Mid-size SaaS (50-200 engineers):

  • Option A: Self-hosted ELK or Loki. Ops investment 0.5-1 engineer.
  • Option B: Commercial (Datadog Logs).

Enterprise (500+ engineers):

  • Multi-tier storage (hot + warm + cold)
  • Custom-built or Splunk Enterprise
  • Dedicated logging team

Cost Optimization

Sampling: Drop low-value logs at the source. Common rules:

  • Drop DEBUG/TRACE in production
  • Sample successful requests 10% (keep all errors)
  • Drop health check logs entirely
  • Drop synthetic test traffic
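These rules fit in one filter function. A minimal sketch, assuming the `Entry` fields shown and a `/healthz` health check path (both hypothetical). It samples by hashing the request ID rather than rolling a random number, so every line of a given request is kept or dropped together. Drop rules run before the keep-errors rule here, which means an error on a health check is still dropped; reorder the cases if you want the opposite.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Entry holds only the fields the sampling rules need. Field names are assumptions.
type Entry struct {
	Level     string // "trace", "debug", "info", "error", ...
	Path      string
	RequestID string
	Status    int
	Synthetic bool // set by whatever tags synthetic test traffic
}

// keep reports whether an entry should be shipped downstream.
func keep(e Entry) bool {
	switch {
	case e.Level == "debug" || e.Level == "trace":
		return false // drop DEBUG/TRACE in production
	case e.Path == "/healthz":
		return false // drop health checks entirely
	case e.Synthetic:
		return false // drop synthetic test traffic
	case e.Level == "error" || e.Status >= 500:
		return true // keep all remaining errors
	}
	// Sample roughly 10% of successful requests, deterministically per request ID.
	h := fnv.New32a()
	h.Write([]byte(e.RequestID))
	return h.Sum32()%10 == 0
}

func main() {
	kept := 0
	for i := 0; i < 1000; i++ {
		e := Entry{Level: "info", Path: "/api/orders", RequestID: fmt.Sprintf("req-%d", i), Status: 200}
		if keep(e) {
			kept++
		}
	}
	fmt.Println("kept", kept, "of 1000 successful requests")
}
```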

Structured logging: Faster to query than unstructured text. Index only the fields you query.

Field reduction: Don't log full request/response bodies. Log a structured event with the key fields.

Retention tiering: Hot 7 days, warm 30 days, archive 1 year. Most queries don't need long retention.
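As a sketch, the tiering rule is a lookup on log age. This assumes the numbers are cumulative age cut-offs (up to 7 days hot, up to 30 days warm, up to 1 year archive) and that anything older is deleted; `tierFor` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// tierFor maps a log's age to a storage tier using a 7/30/365-day split.
func tierFor(age time.Duration) string {
	switch {
	case age <= 7*day:
		return "hot"
	case age <= 30*day:
		return "warm"
	case age <= 365*day:
		return "archive"
	default:
		return "delete"
	}
}

func main() {
	for _, age := range []time.Duration{2 * day, 14 * day, 200 * day, 400 * day} {
		fmt.Printf("%4.0f days old -> %s\n", age.Hours()/24, tierFor(age))
	}
}
```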

Pre-aggregation: For metric-derived-from-logs use cases, pre-aggregate before storage. Reduces query cost dramatically.
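For example, if the only question you ask of access logs is "how many requests per minute per status", you can store the counts instead of the lines. A minimal sketch, assuming events carry a Unix timestamp in seconds; `Event`, `Key`, and `aggregate` are hypothetical names.

```go
package main

import "fmt"

// Event is one raw access-log line reduced to the fields the metric needs.
type Event struct {
	UnixSec int64 // event time as Unix seconds (an assumption, for simplicity)
	Path    string
	Status  int
}

// Key identifies one aggregation bucket: a minute, a path, and a status code.
type Key struct {
	Minute int64 // Unix time divided by 60
	Path   string
	Status int
}

// aggregate collapses raw events into per-minute counts.
func aggregate(events []Event) map[Key]int {
	counts := make(map[Key]int)
	for _, e := range events {
		counts[Key{e.UnixSec / 60, e.Path, e.Status}]++
	}
	return counts
}

func main() {
	events := []Event{
		{0, "/api/orders", 200},
		{59, "/api/orders", 200},
		{60, "/api/orders", 200},
		{61, "/api/orders", 500},
	}
	counts := aggregate(events)
	fmt.Println("raw events:", len(events), "stored rows:", len(counts))
}
```

The saving grows with traffic: thousands of identical lines per minute still produce one row per bucket.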

Common Mistakes

Logging everything: Most teams log 2-5x more than they ever query. Reduce volume before scaling infrastructure.

Unstructured logs: Slow to query, hard to parse. Always log structured (JSON) in modern systems.

No retention policy: Logs accumulate forever. Disk costs balloon. Set retention day 1.

Manual log parsing: Building custom parsers for every format becomes a maintenance burden. Standardize the log format across services.

Single point of failure: Direct app-to-storage without buffer. Downstream outage causes log loss or app blocking.

Over-investing in infrastructure: Spending more per month on the logging system than the observability value it delivers. Match investment to actual queries.

Sample Implementation (Go service)

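package main

import (
	"os"
	"time"

	"github.com/rs/zerolog"
)

var buildVersion = "dev" // normally injected at build time via -ldflags

func main() {
	start := time.Now()
	userID := "u123" // placeholder values so the example runs on its own
	cart := struct {
		Items []string
		Total float64
	}{[]string{"sku-1"}, 19.99}

	// Base logger: every line carries a timestamp, service name, and version.
	logger := zerolog.New(os.Stdout).With().
		Timestamp().
		Str("service", "api").
		Str("version", buildVersion).
		Logger()

	// Structured log
	logger.Info().
		Str("user_id", userID).
		Str("action", "checkout").
		Int("items", len(cart.Items)).
		Float64("total", cart.Total).
		Dur("duration", time.Since(start)).
		Msg("checkout completed")
}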

Always: timestamp, service identifier, structured fields, no PII in fields.

Never: log full credit cards, passwords, session tokens, user PII without explicit policy.
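One way to enforce the "never" list centrally, rather than trusting every call site, is a redaction hook in the logger. The sketch below uses the standard library's `log/slog` and its `HandlerOptions.ReplaceAttr` hook instead of zerolog, to stay dependency-free; the deny-list keys are illustrative assumptions, and a real list should come from your data policy.

```go
package main

import (
	"log/slog"
	"os"
	"strings"
)

// sensitiveKeys is a hypothetical deny-list of attribute names.
var sensitiveKeys = map[string]bool{
	"password":      true,
	"card_number":   true,
	"session_token": true,
	"email":         true,
}

// isSensitive reports whether an attribute key is on the deny-list, ignoring case.
func isSensitive(key string) bool {
	return sensitiveKeys[strings.ToLower(key)]
}

// redact replaces the value of any deny-listed attribute before it is written.
func redact(_ []string, a slog.Attr) slog.Attr {
	if isSensitive(a.Key) {
		return slog.String(a.Key, "[REDACTED]")
	}
	return a
}

func main() {
	handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{ReplaceAttr: redact})
	logger := slog.New(handler).With("service", "api")

	// The password value never reaches the output; user_id passes through unchanged.
	logger.Info("login", "user_id", "u123", "password", "hunter2")
}
```

A key-based deny-list only catches fields that are named honestly. It does not catch secrets embedded in free-text messages, so it complements the rule above rather than replacing it.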

Logging Patterns

Request lifecycle logging:

{"event": "request.start", "request_id": "abc", "method": "POST", "path": "/api/orders"}
{"event": "request.end", "request_id": "abc", "status": 200, "duration_ms": 145}

Use request_id to trace all logs from same request.
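In a Go HTTP service this pattern usually lives in middleware. A minimal sketch using only the standard library (`net/http` and `log/slog`); the `X-Request-ID` header name and the helper names are assumptions. Note that slog writes the event name under the `msg` key rather than `event`.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"os"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// newRequestID returns 16 hex characters of random data.
func newRequestID() string {
	b := make([]byte, 8)
	_, _ = rand.Read(b) // error ignored for brevity in this sketch
	return hex.EncodeToString(b)
}

// withRequestLog logs request.start and request.end with a shared request_id.
func withRequestLog(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		id := r.Header.Get("X-Request-ID") // reuse an upstream ID if one was sent
		if id == "" {
			id = newRequestID()
		}
		log := logger.With("request_id", id)
		log.Info("request.start", "method", r.Method, "path", r.URL.Path)

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		log.Info("request.end", "status", rec.status,
			"duration_ms", time.Since(start).Milliseconds())
	})
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	handler := withRequestLog(logger, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusCreated)
	}))

	// Exercise the middleware once without opening a network port.
	req := httptest.NewRequest(http.MethodPost, "/api/orders", nil)
	handler.ServeHTTP(httptest.NewRecorder(), req)
}
```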

Error logging:

{"level": "error", "request_id": "abc", "error": "payment.failed", "error_code": "card_declined", "retryable": true}

Structured errors enable alerting and trending.
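To get those fields consistently, the error itself has to carry them. A sketch assuming a hypothetical `CodedError` type that services return and wrap; `errorFields` unwraps it at the logging boundary using the standard `errors.As`.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os"
)

// CodedError is a hypothetical error type that carries the fields worth alerting on.
type CodedError struct {
	Kind      string // e.g. "payment.failed"
	Code      string // e.g. "card_declined"
	Retryable bool
}

func (e *CodedError) Error() string { return e.Kind + ": " + e.Code }

// errorFields extracts log fields from err, looking through any wrapping.
// Errors that are not CodedError are reported as "unknown" and not retryable.
func errorFields(err error) (kind, code string, retryable bool) {
	var ce *CodedError
	if errors.As(err, &ce) {
		return ce.Kind, ce.Code, ce.Retryable
	}
	return "unknown", "", false
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	err := fmt.Errorf("checkout: %w", &CodedError{"payment.failed", "card_declined", true})
	kind, code, retryable := errorFields(err)
	logger.Error("checkout failed",
		"request_id", "abc", "error", kind, "error_code", code, "retryable", retryable)
}
```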

Business event logging:

{"event": "user.signed_up", "user_id": "u123", "source": "google_oauth", "plan": "free"}

Business events for product analytics, not just engineering observability.

Compliance Considerations

GDPR/CCPA: Don't log PII (email, name, address) without explicit policy. Use hashed identifiers. Implement deletion (right to be forgotten).
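For hashed identifiers, a keyed hash is safer than a plain one, because low-entropy values such as email addresses can be recovered from an unkeyed hash by guessing. A minimal sketch using HMAC-SHA256 from the standard library; `pseudonymize` and the key handling are assumptions, and in practice the key would come from a secret store.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// pseudonymize returns a keyed hash of an identifier. The same key and input
// always produce the same output, so logs for one user still correlate.
func pseudonymize(key []byte, id string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(id))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	key := []byte("demo-key-load-from-a-secret-store") // placeholder, not a real key
	fmt.Println("user_ref:", pseudonymize(key, "alice@example.com"))
}
```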

HIPAA: Healthcare data requires encrypted storage, access controls, audit trails on log access itself.

PCI-DSS: Card data never logged. Tokenized references only.

SOC 2 / ISO 27001: Logs themselves are auditable. Maintain integrity (no modification), access controls, retention proof.

Cost-Benefit Reality

Logging cost: $X
Query value: 5-10x logging cost (incidents resolved, bugs prevented)

If your query value isn't 5x cost, reduce volume or switch tools. Logs are infrastructure, not a data product. Optimize for incident response time, not exhaustive history.

Startups + early growth:

  • Vector or Fluent Bit for collection
  • Kafka or Redpanda for buffer (skip for very small scale)
  • Loki for storage
  • Grafana for query/dashboards
  • Self-hosted or Grafana Cloud

Mid-size SaaS with budget:

  • Same collection but commercial backend (Datadog, New Relic Logs)
  • Trade engineering time for managed service

Enterprise:

  • Splunk or self-built on Elasticsearch
  • Multi-tier storage
  • Dedicated platform team

The right answer depends on engineering bandwidth vs cost tolerance. There's no universal correct choice.

Frequently asked questions

ELK Stack vs Loki vs commercial?

ELK (Elasticsearch+Logstash+Kibana): mature, complex, expensive to run at scale. Loki: cheaper alternative, simpler queries. Commercial (Datadog, Splunk): expensive but turnkey. Match tool to team size and budget.

What's a reasonable log volume?

Small startup: 10-100 GB/day. Mid-size SaaS: 500 GB-5 TB/day. Enterprise: 10-100+ TB/day. Cost scales linearly with volume. Most teams overspend on logs by not sampling adequately.

Should I log everything?

No. Log structured events. Avoid logging full request/response bodies (privacy, cost, noise). Use trace IDs to correlate. Most teams log 2-5x more than they ever query.

Retention policy?

Hot tier (queryable): 7-30 days. Warm tier (slower queries): 30-90 days. Cold tier (archived): 6 months-7 years for compliance. Most queries hit last 7 days of hot tier.
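Retention translates directly into storage footprint: once a retention window is full, it holds roughly daily ingest times the number of days kept. A back-of-the-envelope sketch, assuming raw volume with no compression, replication, or index overhead; `steadyStateGB` is a hypothetical helper.

```go
package main

import "fmt"

// steadyStateGB estimates how much data a retention window holds once it is full:
// daily ingest multiplied by the number of days kept.
// It ignores compression, replication, and index overhead.
func steadyStateGB(dailyGB float64, retentionDays int) float64 {
	return dailyGB * float64(retentionDays)
}

func main() {
	const dailyGB = 500.0 // example ingest rate, an assumption for illustration
	fmt.Printf("7-day window:  %.0f GB\n", steadyStateGB(dailyGB, 7))
	fmt.Printf("30-day window: %.0f GB\n", steadyStateGB(dailyGB, 30))
}
```

At 500 GB/day, going from 7 to 30 days of queryable retention takes the window from 3,500 GB to 15,000 GB, which is why the hot tier is the one to keep short.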

Cost per GB?

Varies by tool and changes often, so check current pricing for self-hosted storage, Loki/Mimir, and Datadog/Splunk. Costs add up: a large annual log bill is common at a mid-size SaaS.

Author

Alex Patel

Fitness, Sports & Outdoors Editor

Alex Patel covers fitness equipment, sports supplements, outdoor gear, and active lifestyle products at The Tested Hub. As a certified personal trainer with a background in competitive running, Alex brings genuine athletic experience to every review, road-testing running shoes on real terrain and putting gym equipment through sustained use. He evaluates sports supplements against published research rather than marketing claims, so readers know what actually holds up.