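We received feedback that under heavy log volume (the system serves about 12,000 requests per second and emits roughly 80,000 to 100,000 log lines per second), the time at which logs show up in Loki lags far behind the time the service actually emitted them, by as much as 40 to 60 minutes.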
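Investigation
Checking the loki-gateway entry point
Start with the loki-gateway service. It is the traffic entry point of Loki's simple scalable deployment mode and is really just an nginx. Its access log shows the following: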
10.192.0.7 - - [09/Jun/2025:10:06:38 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.0.2.129 - - [09/Jun/2025:10:06:38 +0000] 429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:38 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.0.2.129 - - [09/Jun/2025:10:06:38 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:38 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:39 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:39 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.8 - - [09/Jun/2025:10:06:39 +0000] 429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:39 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.0.2.129 - - [09/Jun/2025:10:06:39 +0000] 429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:39 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:39 +0000] 429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.4 - - [09/Jun/2025:10:06:39 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:39 +0000] 429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:40 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.0.2.129 - - [09/Jun/2025:10:06:40 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:40 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.8 - - [09/Jun/2025:10:06:40 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.3 - - [09/Jun/2025:10:06:40 +0000] 429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:40 +0000] 429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:40 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
10.192.0.7 - - [09/Jun/2025:10:06:40 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"
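As the log shows, 7 of these 22 requests were answered with 429.
Note: HTTP status 429 means "Too Many Requests", that is, the client is being rate limited. It does not mean the request body is too large (that would be 413).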
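Counting status codes by eye is error-prone, so here is a minimal sketch of how the tally could be automated. It assumes the access-log layout shown above, where the status code follows the bracketed timestamp; the function name `count_statuses` and the three-line sample are illustrative only.

```python
import re
from collections import Counter

# Matches the 3-digit status code that follows the bracketed timestamp,
# as in the loki-gateway access-log lines shown above.
STATUS_RE = re.compile(r'\]\s+(\d{3})\s+"')


def count_statuses(lines):
    """Return a Counter mapping HTTP status code -> number of requests."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:  # silently skip lines that do not look like access-log entries
            counts[match.group(1)] += 1
    return counts


# A short sample taken from the log above; feed the full log to get the real tally.
sample = [
    '10.192.0.7 - - [09/Jun/2025:10:06:38 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"',
    '10.0.2.129 - - [09/Jun/2025:10:06:38 +0000] 429 "POST /loki/api/v1/push HTTP/1.1" 227 "-" "promtail/2.5.0" "-"',
    '10.192.0.3 - - [09/Jun/2025:10:06:38 +0000] 204 "POST /loki/api/v1/push HTTP/1.1" 0 "-" "promtail/2.5.0" "-"',
]

counts = count_statuses(sample)
total = sum(counts.values())
print(f"429 responses: {counts['429']} of {total} ({counts['429'] / total:.0%})")
```

In practice the lines would come from the gateway's log output rather than a hard-coded list.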
Checking the promtail logs
Because we collect service logs with a promtail sidecar, use the following command to view its logs:
kubectl logs <pod-name> -c promtail-log-sidecar -n <business-namespace>
# Output:
level=warn ts=2025-06-09T10:09:24.993836565Z caller=client.go:349 component=client host=loki-gateway.logs:80 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded for user fake (limit: 1398101 bytes/sec) while attempting to ingest '4568' lines totaling '1048551' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
level=warn ts=2025-06-09T10:09:25.652053072Z caller=client.go:349 component=client host=loki-gateway.logs:80 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded for user fake (limit: 1398101 bytes/sec) while attempting to ingest '4568' lines totaling '1048551' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
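The three numbers in that message (the limit, the line count, and the batch size) are what the rest of the analysis relies on. The sketch below shows one way to pull them out. The regular expression is written against the message text seen here and may need adjusting for other Loki versions; `parse_rate_limit_error` is a hypothetical helper name.

```python
import re

# Pattern written against the error text shown above; not an official format.
ERROR_RE = re.compile(
    r"limit: (?P<limit>\d+) bytes/sec\) while attempting to ingest "
    r"'(?P<lines>\d+)' lines totaling '(?P<size>\d+)' bytes"
)


def parse_rate_limit_error(message):
    """Extract limit (bytes/sec), line count and batch size (bytes), or None."""
    match = ERROR_RE.search(message)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


message = (
    "Ingestion rate limit exceeded for user fake (limit: 1398101 bytes/sec) "
    "while attempting to ingest '4568' lines totaling '1048551' bytes"
)

info = parse_rate_limit_error(message)
avg_line_bytes = info["size"] / info["lines"]
print(info)
print(f"average line size in this batch: {avg_line_bytes:.1f} bytes")
```

The average line size is only as representative as this one batch, so treat it as a rough figure.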
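Pinning down the problem
The error says that the tenant's ingestion rate limit was exceeded, not that a single message is too large. The limit in effect is 1398101 bytes/sec (about 1.33 MB/s) per instance, and the rejected batch is 1048551 bytes (about 1 MB), so one batch on its own is below the limit. The limit, however, applies to the total number of bytes arriving per second from all promtail clients of the tenant. At our log volume more than one such batch arrives every second, so requests are rejected, promtail keeps retrying, and delivery falls further and further behind.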
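A back-of-the-envelope calculation makes the mismatch concrete. The sketch below assumes that the default tenant limit of 4 MB/s is split evenly across three write instances, that the "80,000 to 100,000 per second" figure refers to log lines, and that the rejected batch (4568 lines, 1048551 bytes) is representative of the average line size. All three are assumptions rather than measured facts.

```python
MIB = 1024 * 1024


def per_instance_limit(tenant_limit_mb, instances):
    """Bytes/sec each instance may accept if the tenant limit is split evenly."""
    return tenant_limit_mb * MIB // instances


# 4 MB/s shared by three write instances reproduces the limit in the error message.
old_limit = per_instance_limit(4, 3)

# Figures taken from the promtail error message.
batch_bytes = 1048551
batch_lines = 4568
avg_line_bytes = batch_bytes / batch_lines

# Estimated demand, assuming 80,000 to 100,000 lines/s of this average size.
demand_low = 80_000 * avg_line_bytes
demand_high = 100_000 * avg_line_bytes

print(f"per-instance limit: {old_limit} bytes/sec")
print(f"single batch fits under the limit: {batch_bytes < old_limit}")
print(f"estimated demand: {demand_low / MIB:.1f} to {demand_high / MIB:.1f} MiB/s")
print(f"tenant-wide limit: {4 * MIB / MIB:.1f} MiB/s")
```

Under these assumptions the services produce several times more data per second than the tenant is allowed to ingest, which is consistent with a backlog that keeps growing.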
Solution
auth_enabled: false
common:
  path_prefix: /var/loki
  replication_factor: 3
  storage:
    s3:
      access_key_id: woo-minio
      bucketnames: loki-chunks
      endpoint: minio-release.storage.svc.cluster.local:9000
      insecure: true
      s3: null
      s3forcepathstyle: true
      secret_access_key: <secret key>
limits_config:
  enforce_metric_name: false
  max_cache_freshness_per_query: 10m
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  split_queries_by_interval: 15m
  # The settings below give each tenant an aggregate rate of 100 MB/s.
  # It used to be 4 MB/s, which, split across the three loki-write instances,
  # left each instance only about 1.33 MB/s.
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 150
  per_stream_rate_limit: "100MB"
  per_stream_rate_limit_burst: "300MB"
memberlist:
  join_members:
    - loki-memberlist
ruler:
  storage:
    s3:
      bucketnames: loki-ruler
schema_config:
  configs:
    - from: "2025-01-11"
      index:
        period: 24h
        prefix: loki_index_
      object_store: s3
      schema: v12
      store: boltdb-shipper
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
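As a rough sanity check of the new values, the sketch below compares how many promtail batches per second each write instance could accept before and after the change. It assumes the tenant limit is still divided evenly among three instances and that batches stay near the 1048551 bytes seen in the promtail error; `batches_per_second` is an illustrative helper, not part of Loki.

```python
MIB = 1024 * 1024
BATCH_BYTES = 1048551  # batch size reported in the promtail error message
INSTANCES = 3          # assumed number of loki-write instances


def batches_per_second(ingestion_rate_mb, instances=INSTANCES, batch_bytes=BATCH_BYTES):
    """Batches/sec one instance can accept if the tenant limit is split evenly."""
    per_instance_bytes = ingestion_rate_mb * MIB // instances
    return per_instance_bytes / batch_bytes


old_rate = batches_per_second(4)    # previous ingestion_rate_mb
new_rate = batches_per_second(100)  # ingestion_rate_mb from the config above

print(f"old: about {old_rate:.2f} batches/s per instance")
print(f"new: about {new_rate:.2f} batches/s per instance")
```

The old setting leaves room for barely more than one full batch per second per instance, which fits the pattern of frequent 429 responses in the gateway log.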
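The promtail sidecar's resource settings need to be adjusted to match:
limits:
  cpu: 1000m     # give promtail a full CPU core
  memory: 150Mi  # memory is less of a concern and does not need to be large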