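Introduction
- Black-box probing: focuses on symptoms, meaning what is happening right now, such as a firing alert or a business API that is misbehaving. It monitors from the user's point of view, and its purpose is to alert on failures that are currently in progress.
- White-box probing: focuses on causes, meaning the metrics a system exposes from the inside. For example, the output of redis info reporting that a Redis slave is down is an internal metric. The emphasis is on the cause: black-box monitoring may show that Redis is down, while the internal information shows that the connection to the Redis port is refused.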
Blackbox Exporter
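- Blackbox Exporter is the official black-box monitoring solution provided by the Prometheus community. It lets users probe the network over HTTP, HTTPS, DNS, TCP, and ICMP.
1. HTTP tests
- Define request headers
- Check the HTTP status, HTTP response headers, and HTTP body content
2. TCP tests
- Monitor the port status of service components
- Define and monitor application-layer protocols
3. ICMP tests
- Host liveness checks
4. POST tests
- API connectivity
5. SSL certificate expiry time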
Installation
1. Deployment YAML
version: '3'
services:
  blackbox:
    image: prom/blackbox-exporter:latest
    container_name: blackbox
    restart: always
    user: root
    ports:
      - 9115:9115
    volumes:
      - /mydata/blackbox/conf:/config
      - /etc/localtime:/etc/localtime
    environment:
      - TZ=Asia/Shanghai # no inner quotes, otherwise they become part of the value
    command:
      - '--config.file=/config/blackbox.yml'
      - '--log.level=debug' # enable debug logging
networks:
  default:
    external:
      name: prom_net
2. Configuration file
modules:
  http_2xx: # the name is arbitrary, but it must match the module referenced in prometheus.yml
    prober: http # probe protocol: http, tcp, dns, or icmp; every probe is configured as a module
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"] # HTTP versions this probe accepts
      valid_status_codes: [200,302] # list the accepted status codes explicitly so they are clear when graphing in Grafana
      method: GET # HTTP method the probe uses
      preferred_ip_protocol: "ip4" # IP protocol for the HTTP probe (ip4, ip6)
      ip_protocol_fallback: false # default = true
      no_follow_redirects: true # do not follow redirects (newer exporter releases rename this option to follow_redirects, so this would become follow_redirects: false)
      # To enforce whether an HTTP service uses SSL, set fail_if_ssl or fail_if_not_ssl.
      # With fail_if_ssl: true the probe fails if the site has SSL enabled and succeeds otherwise;
      # fail_if_not_ssl is the exact opposite.
  http_post_2xx: # HTTP POST probe module
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: POST
      preferred_ip_protocol: "ip4"
  tcp_connect: # TCP probe module
    prober: tcp
    timeout: 10s
  ping: # ICMP probe module
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
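The capabilities listed earlier (custom request headers, response body checks, and enforcing SSL) can be combined in one additional module. The following is only a minimal sketch and not part of the original setup: the module name http_2xx_strict, the Accept header, and the body pattern are hypothetical and would need to be adapted to the real service.

```yaml
modules:
  http_2xx_strict:                 # hypothetical module name
    prober: http
    timeout: 10s
    http:
      method: GET
      valid_status_codes: [200]
      headers:
        Accept: application/json   # assumed request header, for illustration only
      fail_if_not_ssl: true        # the probe fails when the target is not served over SSL
      fail_if_body_not_matches_regexp:
        - '"status"\s*:\s*"ok"'    # assumed response body pattern
      preferred_ip_protocol: "ip4"
```

If used, the module would sit under the existing modules key in blackbox.yml and be referenced by name from the params.module field of a scrape job.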
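3. The HTTP probe checks whether an application is healthy by sending it GET or POST requests. The corresponding scrape job in prometheus.yml: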
- job_name: "prd站点信息"
  metrics_path: /probe
  scrape_interval: 30s
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - www.***.com
      labels:
        group: demo1
    - targets:
        - www.***.com
      labels:
        group: demo2
    - targets:
        - www.***.com
      labels:
        group: demo3
    - targets:
        - www.***.com
      labels:
        group: demo4
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.10.10.179:9115
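The tcp_connect and ping modules defined in blackbox.yml are not used by the HTTP job. As a rough sketch, jobs for them could follow the same relabeling pattern; the job names and the target 10.10.10.10 are hypothetical, and only the exporter address 10.10.10.179:9115 is taken from the job in this post.

```yaml
scrape_configs:
  - job_name: "blackbox-tcp"        # hypothetical job name
    metrics_path: /probe
    params:
      module: [tcp_connect]         # module defined in blackbox.yml
    static_configs:
      - targets:
          - 10.10.10.10:6379        # hypothetical host:port to check
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.10.10.179:9115

  - job_name: "blackbox-icmp"       # hypothetical job name
    metrics_path: /probe
    params:
      module: [ping]                # module defined in blackbox.yml
    static_configs:
      - targets:
          - 10.10.10.10             # hypothetical host, no port for ICMP
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.10.10.179:9115
```

In each job the relabeling moves the original target into the target URL parameter, copies it to the instance label, and then points the scrape address at the exporter, so Prometheus ends up requesting /probe on the exporter with target and module as query parameters.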
4. After reloading the configuration, the monitored targets appear as follows:
5. Endpoint data
6. Alerting rules
groups:
  - name: blackbox-web端点
    rules:
      - alert: BlackboxProbeFailed
        expr: probe_success == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Blackbox probe failed (instance {{ $labels.instance }})
          description: "Probe failed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: BlackboxConfigurationReloadFailure
        expr: blackbox_exporter_config_last_reload_successful != 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Blackbox configuration reload failure (instance {{ $labels.instance }})
          description: "Blackbox configuration reload failure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: BlackboxSlowProbe
        expr: avg_over_time(probe_duration_seconds[1m]) > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Blackbox slow probe (instance {{ $labels.instance }})
          description: "Blackbox probe took more than 1s to complete\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: BlackboxProbeHttpFailure
        expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
          description: "HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: BlackboxSslCertificateWillExpireSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
          description: "SSL certificate expires in 30 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: BlackboxSslCertificateWillExpireSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
          description: "SSL certificate expires in 3 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: BlackboxSslCertificateExpired
        expr: probe_ssl_earliest_cert_expiry - time() <= 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
          description: "SSL certificate has expired already\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: BlackboxProbeSlowHttp
        expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})
          description: "HTTP request took more than 1s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: BlackboxProbeSlowPing
        expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Blackbox probe slow ping (instance {{ $labels.instance }})
          description: "Blackbox ping took more than 1s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
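For Prometheus to evaluate these rules, the file that contains them has to be listed under rule_files in prometheus.yml. A minimal sketch, assuming the rules are saved at the hypothetical path shown below (the post does not say where the file lives):

```yaml
# prometheus.yml (fragment)
rule_files:
  - /etc/prometheus/rules/blackbox.yml   # hypothetical path; use wherever the rules file is actually stored
```

Running promtool check rules against the file before reloading Prometheus should catch indentation and syntax mistakes early.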
Grafana dashboard
1. Import dashboard template 9965.