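1. Overview
This document gives a brief introduction to Promtail, the log collection agent in the Loki stack, and shows how to use it.
Official documentation for the metrics stage (used for alerting): https://grafana.com/docs/loki/latest/send-data/promtail/stages/metrics/
2. How it works
This section covers how some of the tools Promtail relies on work, along with Promtail's own architecture and other implementation details.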
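2.1 How __path__ matching works
A Promtail scrape configuration has a __path__ attribute, which determines the log file paths that Promtail collects. The __path__ value is resolved as a glob pattern by the https://github.com/bmatcuk/doublestar library.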
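As a rough illustration of the glob syntax that doublestar accepts, the sketch below shows a few __path__ values. The job names, label values and directories are hypothetical; only the pattern syntax ("*", "**", "{a,b}") is the point.

```yaml
scrape_configs:
  - job_name: glob-examples            # hypothetical job name
    static_configs:
      - targets: [localhost]
        labels:
          job: single-dir              # hypothetical label value
          # "*" stays within one path segment:
          # matches /var/log/app/a.log, but not /var/log/app/x/a.log
          __path__: /var/log/app/*.log
      - targets: [localhost]
        labels:
          job: recursive
          # "**" matches zero or more directories:
          # matches /var/log/app/a.log and /var/log/app/x/y/a.log
          __path__: /var/log/app/**/*.log
      - targets: [localhost]
        labels:
          job: alternatives
          # "{a,b}" matches any one of the comma-separated alternatives:
          # matches /var/log/app/error.log and /var/log/app/warn.log
          __path__: /var/log/app/{error,warn}.log
```

The brace form is the one used later in this document to pick a fixed set of log files out of a directory tree.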
3. Practice
3.1 Deploying Promtail as a DaemonSet to collect logs from the nodes of a K8S cluster
--- # Daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail-daemonset
  namespace: logs
spec:
  selector:
    matchLabels:
      name: promtail
  template:
    metadata:
      labels:
        name: promtail
    spec:
      serviceAccount: promtail-serviceaccount
      containers:
      - name: promtail-container
        image: wooring.cn/cpaas-pub/grafana/promtail:2.6.1 # image pull address used in your environment
        args:
        - -config.file=/etc/promtail/promtail.yaml
        - -config.expand-env
        env:
        - name: 'HOSTNAME' # needed when using kubernetes_sd_configs
          valueFrom:
            fieldRef:
              fieldPath: 'spec.nodeName'
        volumeMounts:
        - name: logs
          mountPath: /var/log
        - name: promtail-config
          mountPath: /etc/promtail
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
      volumes:
      - name: logs
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: promtail-config
        configMap:
          name: system-promtail-config
--- # configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: system-promtail-config
  namespace: logs
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    clients:
    - url: http://loki-gateway.logs:80/loki/api/v1/push
    positions:
      filename: /tmp/positions.yaml
    target_config:
      sync_period: 10s
    scrape_configs:
    - job_name: pod-logs
      kubernetes_sd_configs:
      - role: pod
      pipeline_stages:
      - docker: {}
      relabel_configs:
      - source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: __host__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_pod_name
        target_label: job
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
        - __meta_kubernetes_pod_uid
        - __meta_kubernetes_pod_container_name
        target_label: __path__
    - job_name: node-system-log
      static_configs:
      - targets:
        - localhost
        labels:
          node: ${HOSTNAME}
          job: node-system-log
          type: system
          __path__: /var/log/messages
--- # ClusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: promtail-clusterrole
  namespace: logs
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - pods
  verbs:
  - get
  - watch
  - list
--- # ServiceAccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: promtail-serviceaccount
  namespace: logs
--- # ClusterRoleBinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: promtail-clusterrolebinding
  namespace: logs
subjects:
- kind: ServiceAccount
  name: promtail-serviceaccount
  namespace: logs
roleRef:
  kind: ClusterRole
  name: promtail-clusterrole
  apiGroup: rbac.authorization.k8s.io
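4. The Promtail configuration file in detail
# Promtail itself runs as a server
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  # Configuration reload option added in version 2.7.2; request http://ip:9080/reload to reload the configuration file
  enable_runtime_reload: true
clients:
# Address of the Loki server to push to; several entries can be listed to push logs to multiple Loki instances at once
- url: http://loki.wooring.cn:30669/loki/api/v1/push # adjust to the address used in your environment
positions:
  # Promtail records in this file the byte offset of the logs it has already sent to Loki successfully
  filename: ./positions.yaml
target_config:
  # Interval at which the files of the scrape targets are re-synced; 10s is also the default
  sync_period: 10s
scrape_configs:
- job_name: node-system-log
  static_configs:
  - targets:
    - localhost
    labels:
      node: liuxu-node
      job: node-system-log
      type: system
      __path__: /var/log/syslog
- job_name: pod-logs # name of the scrape job; to query with job="pod-logs" in Loki, a job label must also be set (via labels or relabel_configs)
  kubernetes_sd_configs:
  - role: pod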
5. Monitoring Promtail
1. Promtail's monitoring port is the one set by http_listen_port in the server block:
server:
  http_listen_port: 9080
2. The metrics endpoint is:
http://<Promtail IP>:9080/metrics
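For a Prometheus that is configured through a plain prometheus.yml rather than through the operator, scraping this endpoint could look roughly like the fragment below. This is a minimal sketch: the job name and the target address 192.0.2.10 are placeholders, not values from this environment.

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: promtail               # hypothetical job name
    metrics_path: /metrics           # Promtail exposes its metrics on this path
    scrape_interval: 15s
    static_configs:
      - targets:
          - "192.0.2.10:9080"        # placeholder: <Promtail IP>:<http_listen_port>
```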
6. Promtail monitoring configuration
1. Create the Endpoints and Service objects for a non-containerized Promtail
---
apiVersion: v1
kind: Endpoints
metadata:
  name: promtail-vm
  namespace: logs
subsets:
- addresses:
  # IP addresses of the virtual machines running the promtail process. Note: the port must be the same on all of them (19230)
  - ip: 10.x.x.x
  - ip: 10.x.x.2
  ports:
  - name: monitor
    port: 19230
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: promtail-vm
  namespace: logs
  labels:
    app: promtail
    env: prod
spec:
  ports:
  - protocol: TCP
    port: 19230
    targetPort: 19230
    name: monitor
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: promtail-prometheus
  namespace: logs
spec:
  endpoints:
  - interval: 15s
    path: /metrics
    port: monitor
  namespaceSelector:
    matchNames:
    - logs
  selector:
    matchLabels:
      app: promtail
      env: prod
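7. Pipeline stages (configuring the log processing pipeline)
7.1 What a pipeline is
7.1.1 Explanation
A Promtail pipeline is an assembly line of operations applied to each log line. It is made up of units called stages, which fall into four types:
1.Parsing stages: parse the current log line and extract data from it; the extracted data is handed to the following stages
2.Transform stages: transform the data extracted by earlier stages
3.Action stages: take the extracted data and act on it. The possible actions are:
3.1.Add, modify or delete labels
3.2.Change the timestamp of the log line
3.3.Change the content of the log line
3.4.Create a metric from the extracted data (the one we need here)
4.Filtering stages: filter logs, dropping the entries that match a condition
The overall idea is the same as Java's Stream API or a chain of iterator adapters in Rust.
7.1.2 Stage types
7.1.2.1 regex
regex is a parsing stage. It matches the log line against a regular expression and stores the named capture groups in the extracted data. The expression must follow the RE2 syntax: https://github.com/google/re2/wiki/CplusplusAPI#syntax
An example that combines regex and metrics: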
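To make the four stage types concrete, here is a minimal sketch of a pipeline that uses one stage of each type. It assumes log lines of the hypothetical form "LEVEL message" (for example "ERROR disk full"); the extracted keys level and msg are names chosen for this illustration.

```yaml
pipeline_stages:
  # Parsing stage: split the line into the extracted keys "level" and "msg"
  - regex:
      expression: '^(?P<level>\w+) (?P<msg>.*)$'
  # Transform stage: lowercase the extracted "level" value
  - template:
      source: level
      template: '{{ ToLower .Value }}'
  # Action stage: promote the extracted "level" value to a label
  - labels:
      level:
  # Filtering stage: drop entries whose extracted "level" is exactly "debug"
  - drop:
      source: level
      value: debug
```

The stages run in order, so the drop stage sees the level value after the template stage has lowercased it.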
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  enable_runtime_reload: true
clients:
- url: http://loki.wooring.cn:30669/loki/api/v1/push # adjust to the receiving address used in your environment
positions:
  filename: ./positions.yaml
target_config:
  sync_period: 10s
scrape_configs:
- job_name: alert-log
  pipeline_stages:
  - match:
      selector: '{app="alert-service"}'
      stages:
      - regex:
          # Match lines that contain the string "Data truncation" and capture them as data_truncation
          expression: ".*(?P<data_truncation>Data truncation.*)"
      - metrics:
          # Define a counter that is incremented by 1 for every match
          data_truncation_total:
            type: Counter
            description: "Data truncation: Data too long for column"
            source: data_truncation
            config:
              action: inc
  static_configs:
  - targets:
    - localhost
    labels:
      node: liuxu-node
      job: alert-log
      app: alert-service
      type: alert
      __path__: /media/liuxu/data/component/promtail/logs/**/*.log
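7.1.2.2 metrics
The metrics stage is an action stage that defines metrics and updates their values from the extracted data. Note: the resulting metrics are not pushed to Loki. They are exposed on Promtail's /metrics endpoint, and Prometheus has to be configured to scrape them. The configuration looks like this: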
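metrics:
  http_get_total:
    type: Counter # Counter type; see the four Prometheus metric types
    description: <string> # description of the metric
    prefix: <string> # prefix of the metric name, promtail_custom_ by default, e.g. promtail_custom_http_get_total
    source: <string> # key in the extracted data to use as the source of the metric; defaults to the metric name if omitted
    max_idle_duration: <string> # how long the metric may go without updates before it is removed, which keeps stale metrics from accumulating; default 5m
    config:
      match_all: <bool> # if true, every log line is counted
      count_entry_bytes: <bool> # if true, the bytes of every log line are counted
      value: <string> # only count when the extracted value equals this string; cannot be freely combined with match_all and count_entry_bytes
      action: <string> # inc or add: inc increases the counter by 1, add increases it by the extracted value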
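As a concrete instance of the schema above, the sketch below defines two counters: one that counts every log line and one that sums the bytes of every log line. The metric names and the my_promtail_custom_ prefix are illustrative choices, not values used elsewhere in this document.

```yaml
pipeline_stages:
  - metrics:
      # Exposed as my_promtail_custom_log_lines_total: +1 for every log line
      log_lines_total:
        type: Counter
        description: "total number of log lines"
        prefix: my_promtail_custom_
        max_idle_duration: 24h
        config:
          match_all: true
          action: inc
      # Exposed as my_promtail_custom_log_bytes_total: adds the byte size of every log line
      log_bytes_total:
        type: Counter
        description: "total bytes of log lines"
        prefix: my_promtail_custom_
        max_idle_duration: 24h
        config:
          match_all: true
          count_entry_bytes: true
          action: add
```

Neither counter sets source or value, because match_all counts every line regardless of what was extracted.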
8. Real-world production log alerting configuration
8.1 Alerting on "not bound carrier channel" in the mtr module
server:
  http_listen_port: 19230
  grpc_listen_port: 0
  enable_runtime_reload: true
clients:
- url: http://10.x.x.x:32375/loki/api/v1/push
positions:
  filename: ./positions.yaml
target_config:
  sync_period: 10s
scrape_configs:
- job_name: 10.x.x.x-log
  pipeline_stages:
  - match:
      selector: '{app="mtreceiver"}'
      stages:
      - regex:
          expression: "^(?P<not_bound_carrier_channel>java.lang.RuntimeException: The biz type 0 of account .*)$"
      - labels:
          not_bound_carrier_channel:
      - metrics:
          not_bound_carrier_channel_total:
            type: Counter
            description: "not bound carrier channel"
            prefix: log_mtr_
            source: not_bound_carrier_channel
            config:
              action: inc
  static_configs:
  - targets:
    - localhost
    labels:
      job: mtreceiver-log
      env: prod
      app: mtreceiver
      host: 10.x.x.x
      __path__: /home/mosyw/mos/A-mtreceiver/logs/**/{system,warn,error,discard,remoteRequest,performance,stc_service_filter,stc_service_time,specNumTrace,forwardCpaas,login,mt,abandon,rocketmq_client,messageTrace.in}.log
- job_name: 10.x.x.x-gc-log
  static_configs:
  - targets:
    - localhost
    labels:
      job: mtreceiver-gc-log
      env: prod
      app: mtreceiver
      host: 10.x.x.x
      __path__: /home/mosyw/mos/A-mtreceiver/logs/gc.*
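The corresponding metric looks like this:
- Metric name: log_temp_total
  This is a custom counter metric. The total suffix indicates a cumulative counter that counts how often some event has occurred, here presumably log events related to "temp".
- Labels:
  app="wootest": the metric comes from the application named "wootest"
  filename: the path of the log file that produced the metric, here the error log of 18 August 2025
  job="alert-log": the metric belongs to the "alert-log" scrape job
  node="woo-node": the process runs on the node named "woo-node"
  not_temp_test: the actual error message, shown as a Java runtime exception about the business type of an account, containing errer2025@test01
  type="alert": marks this as an alert-type metric
- Metric value: 2
  Up to now, this same error event has been recorded 2 times.
In practice, collect metrics based on the error types or log lines reported by the applications or scripts in use. This makes it possible to monitor how often a specific exception occurs in a specific application, so that problems can be found and handled promptly.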
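Once Prometheus scrapes the counter, an alert can be defined on it. The sketch below assumes the Prometheus Operator is in use (as the ServiceMonitor in section 6 suggests) and targets the counter produced by the configuration in 8.1, which would be exposed as log_mtr_not_bound_carrier_channel_total (prefix plus metric name). The rule name, alert name, threshold, time windows and severity label are all hypothetical choices.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: promtail-log-alerts            # hypothetical rule name
  namespace: logs
spec:
  groups:
    - name: mtreceiver-log
      rules:
        - alert: NotBoundCarrierChannel   # hypothetical alert name
          # Fire when the counter has grown during the last 5 minutes on any host
          expr: sum by (host) (increase(log_mtr_not_bound_carrier_channel_total[5m])) > 0
          for: 1m
          labels:
            severity: warning             # assumed severity convention
          annotations:
            summary: "not bound carrier channel errors in mtreceiver logs on {{ $labels.host }}"
```

Depending on how the Prometheus Operator is installed, the PrometheusRule may need extra labels so that the operator's rule selector picks it up.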