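Fixing slow log collection in Loki

We received a report that Loki falls far behind when log volume is high. With the system handling about 12,000 QPS and producing roughly 80,000 to 100,000 log entries per second, the time at which an entry was written to Loki lagged the time the service actually emitted it by as much as 40 to 60 minutes.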

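Investigation

Checking the loki-gateway entry point

Start with the loki-gateway service. It is the traffic entry point for Loki's simple scalable mode and is really just an nginx instance. Its access log shows the following: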

10.192.0.7 - - [09/Jun/2025:10:06:38 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.0.2.129 - - [09/Jun/2025:10:06:38 +0000]  429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:38 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.0.2.129 - - [09/Jun/2025:10:06:38 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:38 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:39 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:39 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.8 - - [09/Jun/2025:10:06:39 +0000]  429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:39 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.0.2.129 - - [09/Jun/2025:10:06:39 +0000]  429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:39 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:39 +0000]  429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.4 - - [09/Jun/2025:10:06:39 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:39 +0000]  429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:40 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.0.2.129 - - [09/Jun/2025:10:06:40 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:40 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.8 - - [09/Jun/2025:10:06:40 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:40 +0000]  429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:40 +0000]  429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:40 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:40 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"

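As the excerpt shows, 7 of the 22 requests were rejected with status 429.
Note: HTTP 429 means Too Many Requests, i.e. the client is being rate limited. It does not mean that the request body is too large (that would be 413).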
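To quantify how often pushes are rejected over a longer window, the access log can be tallied by status code. The following is a minimal sketch; the regular expression assumes the log format shown in the excerpt above, and the function names are hypothetical.

```python
import re
from collections import Counter

# Status code that follows the bracketed timestamp on push requests.
# The pattern is an assumption based on the access-log excerpt above.
STATUS_RE = re.compile(r'\]\s+(\d{3})\s+"POST /loki/api/v1/push')


def count_push_statuses(lines):
    """Return a Counter mapping HTTP status code -> number of push requests."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def rejected_ratio(counts):
    """Fraction of push requests that were answered with 429."""
    total = sum(counts.values())
    return counts.get("429", 0) / total if total else 0.0


sample = [
    '10.192.0.7 - - [09/Jun/2025:10:06:38 +0000]  204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"',
    '10.0.2.129 - - [09/Jun/2025:10:06:38 +0000]  429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"',
]
counts = count_push_statuses(sample)
print(dict(counts), rejected_ratio(counts))
```

In practice the lines would come from the gateway's log output rather than a hard-coded list.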

Checking the promtail logs

Because we collect service logs with a promtail sidecar, its logs can be viewed with the following command:

kubectl logs <pod-name> -c promtail-log-sidecar -n <namespace>
# Output:
level=warn ts=2025-06-09T10:09:24.993836565Z caller=client.go:349 component=client host=loki-gateway.logs:80 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded for user fake (limit: 1398101 bytes/sec) while attempting to ingest '4568' lines totaling '1048551' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"

level=warn ts=2025-06-09T10:09:25.652053072Z caller=client.go:349 component=client host=loki-gateway.logs:80 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded for user fake (limit: 1398101 bytes/sec) while attempting to ingest '4568' lines totaling '1048551' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
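The numbers that matter in these warnings are the enforced limit and the size of the rejected batch. As a rough sketch, they can be pulled out of the message with a regular expression; the pattern is written against the message text shown above and the function name is hypothetical.

```python
import re

# Pattern written against the promtail warning text shown above.
RATE_LIMIT_RE = re.compile(
    r"limit: (?P<limit>\d+) bytes/sec\) while attempting to ingest "
    r"'(?P<lines>\d+)' lines totaling '(?P<bytes>\d+)' bytes"
)


def parse_rate_limit_error(log_line):
    """Extract limit, line count, and batch size from a 429 warning, or None."""
    match = RATE_LIMIT_RE.search(log_line)
    if match is None:
        return None
    return {
        "limit_bytes_per_sec": int(match.group("limit")),
        "lines": int(match.group("lines")),
        "batch_bytes": int(match.group("bytes")),
    }


message = (
    "Ingestion rate limit exceeded for user fake (limit: 1398101 bytes/sec) "
    "while attempting to ingest '4568' lines totaling '1048551' bytes, "
    "reduce log volume or contact your Loki administrator"
)
print(parse_rate_limit_error(message))
```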

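Pinpointing the problem
The error says the ingestion rate limit was exceeded. The enforced limit is 1398101 bytes/sec (about 1.33 MB/s), while the rejected batch was 1048551 bytes (about 1 MB). A single batch is therefore smaller than the limit, but the limit applies to the total bytes accepted per second for the tenant, not to each request. Once several promtail instances send batches of this size every second, the combined rate exceeds the limit and the surplus batches are rejected with 429 and retried, which is where the delay comes from.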
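The 1398101 bytes/sec figure in the error is consistent with a 4 MB per-tenant limit split evenly across three Loki write instances. Here is a minimal sketch of that arithmetic, assuming 1 MB is counted as 1024 * 1024 bytes and that the split is even; the function names are hypothetical.

```python
MIB = 1024 * 1024  # assumption: the limit is expressed in units of 1024*1024 bytes


def per_instance_limit(ingestion_rate_mb, instances):
    """Bytes/sec each instance enforces if the tenant limit is split evenly."""
    return ingestion_rate_mb * MIB // instances


def batches_per_second(limit_bytes_per_sec, batch_bytes):
    """How many batches of the given size fit into one second of budget."""
    return limit_bytes_per_sec / batch_bytes


old_limit = per_instance_limit(4, 3)
print(old_limit)  # 1398101, matching the limit in the error message
print(batches_per_second(old_limit, 1048551))  # a bit over one batch per second
```

Under these assumptions each instance can accept only slightly more than one full-size batch per second, which is far below what several busy sidecars produce.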

Solution

auth_enabled: false
common:
  path_prefix: /var/loki
  replication_factor: 3
  storage:
    s3:
      access_key_id: woo-minio
      bucketnames: loki-chunks
      endpoint: minio-release.storage.svc.cluster.local:9000
      insecure: true
      s3: null
      s3forcepathstyle: true
      secret_access_key: <secret>
limits_config:
  enforce_metric_name: false
  max_cache_freshness_per_query: 10m
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  split_queries_by_interval: 15m

  # Raise the per-tenant ingestion rate to 100 MB/s. It was previously 4 MB/s,
  # and with three Loki write instances each one only got about 1.33 MB/s.
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 150
  per_stream_rate_limit: "100MB"
  per_stream_rate_limit_burst: "300MB"
memberlist:
  join_members:
  - loki-memberlist
ruler:
  storage:
    s3:
      bucketnames: loki-ruler
schema_config:
  configs:
  - from: "2025-01-11"
    index:
      period: 24h
      prefix: loki_index_
    object_store: s3
    schema: v12
    store: boltdb-shipper
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
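As a sanity check on the new value, the required ingestion rate can be estimated from the reported log volume and the average line size seen in the rejected batch (4568 lines totaling 1048551 bytes). This is only a back-of-the-envelope sketch: it assumes that batch is representative of all traffic and that the 80,000 to 100,000 entries per second are the total for the tenant. The function name is hypothetical.

```python
MIB = 1024 * 1024  # assumption: limits are expressed in units of 1024*1024 bytes


def required_rate_mib(lines_per_sec, batch_lines, batch_bytes):
    """Estimated ingestion rate in MiB/s, using the batch's average line size."""
    avg_line_bytes = batch_bytes / batch_lines
    return lines_per_sec * avg_line_bytes / MIB


low = required_rate_mib(80_000, 4568, 1048551)
high = required_rate_mib(100_000, 4568, 1048551)
print(f"estimated need: {low:.1f} to {high:.1f} MiB/s")

# Compare against the old and new per-tenant settings from the config above.
for configured_mb in (4, 100):
    verdict = "enough" if configured_mb > high else "too low"
    print(f"ingestion_rate_mb={configured_mb}: {verdict}")
```

Under these assumptions the estimate lands well above the old 4 MB/s setting and well below the new 100 MB/s one, so the new limit leaves headroom for bursts.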

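The promtail sidecar's resource limits also need to be adjusted to match:

limits:
  cpu: 1000m     # give it one full CPU core
  memory: 150Mi  # memory is less of a concern and does not need to be large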
