SkyWalking Alarm Examples
Introduction
Overview
This article shows how to use the alarm feature of SkyWalking.
SkyWalking can send alarm notifications through WebHook, gRPC, WeChat, DingTalk, Feishu, and other channels.
Official documentation
Alarm: https://github.com/apache/skywalking/blob/master/docs/en/setup/backend/backend-alarm.md
OAL rule syntax: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/oal.md
Scopes and fields: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/scope-definitions.md
Events: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/event.md
Configuration example
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Sample alarm rules.
rules:
  # Rule unique name, must end with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 'Service: {name}\n Metric: response time\n Detail: above 1000 ms at least 3 times in the last 10 minutes'
  service_sla_rule:
    # Metrics value must be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many matches within the period trigger the alarm
    count: 2
    # How many checks the alarm stays silent after it fires; defaults to the same value as period.
    silence-period: 3
    message: 'Service: {name}\n Metric: success rate\n Detail: below 80% at least 2 times in the last 10 minutes'
  service_resp_time_percentile_rule:
    # Metrics value must be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    # Matches when at least one condition holds: p50>1000, p75>1000, p90>1000, p95>1000, p99>1000
    message: 'Service: {name}\n Metric: response time\n Detail: a percentile exceeded 1000 ms at least 3 times in the last 10 minutes'
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: 'Instance: {name}\n Metric: response time\n Detail: above 1000 ms at least 2 times in the last 10 minutes'
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: 'Database access: {name}\n Metric: response time\n Detail: above 1000 ms at least 2 times in the last 10 minutes'
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: 'Endpoint relation: {name}\n Metric: response time\n Detail: above 1000 ms at least 2 times in the last 10 minutes'
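  instance_jvm_old_gc_count_rule:
    metrics-name: instance_jvm_old_gc_count
    threshold: 1
    op: ">"
    period: 1440
    count: 1
    message: 'Instance: {name}\n Metric: old GC count\n Detail: more than 1 old GC in one minute, at least once in the last day'
  instance_jvm_young_gc_count_rule:
    metrics-name: instance_jvm_young_gc_count
    # count is the number of matching checks inside the period, so it cannot usefully exceed period.
    # The limit of 100 therefore goes in threshold.
    threshold: 100
    op: ">"
    period: 5
    count: 1
    message: 'Instance: {name}\n Metric: young GC count\n Detail: more than 100 young GCs in one minute, at least once in the last 5 minutes'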
  # Requires adding this line to config/oal/core.oal: endpoint_abnormal = from(Endpoint.*).filter(responseCode in [404, 500, 503]).count();
  endpoint_abnormal_rule:
    metrics-name: endpoint_abnormal
    threshold: 1
    op: ">="
    period: 2
    count: 1
    message: 'Endpoint: {name}\n Metric: abnormal responses\n Detail: at least 1 in the last 2 minutes\n'
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  # endpoint_avg_rule:
  #   metrics-name: endpoint_avg
  #   op: ">"
  #   threshold: 1000
  #   period: 10
  #   count: 2
  #   silence-period: 5
  #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/
dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking alarm: \n %s"
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=<DingTalk robot access_token>
      secret: <DingTalk robot secret>
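The service_resp_time_percentile_rule in the sample above uses a comma-separated threshold, one value per percentile, and matches when at least one percentile exceeds its value. The sketch below models that comparison in Python. It is only an illustration of the idea, not SkyWalking's implementation; the function names and latency numbers are made up, and "-" is treated as "no threshold for this slot", as the field list in this article describes.

```python
from typing import List, Optional


def parse_thresholds(spec: str) -> List[Optional[float]]:
    """Parse 'v1,v2,-,v4' into numbers; '-' means no threshold for that slot."""
    return [None if part.strip() == "-" else float(part) for part in spec.split(",")]


def percentile_matches(values: List[float], spec: str) -> bool:
    """Return True when at least one value exceeds its threshold (op '>')."""
    thresholds = parse_thresholds(spec)
    return any(t is not None and v > t for v, t in zip(values, thresholds))


# p50, p75, p90, p95, p99 in milliseconds (hypothetical numbers)
latencies = [120, 300, 800, 1200, 2500]
print(percentile_matches(latencies, "1000,1000,1000,1000,1000"))  # True: p95 and p99 exceed 1000
print(percentile_matches(latencies, "-,-,1000,-,-"))              # False: only p90 is checked, and it is 800
```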
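Alarm overview
Individual rules
Apache SkyWalking alarms are driven by a set of rules.
The alarm rules are configured in config/alarm-settings.yml under the SkyWalking server installation directory.
Each rules.xxx_rule.metrics-name value in alarm-settings.yml refers to a metric defined in one of the OAL scripts under config/oal: core.oal, event.oal, java-agent.oal, and browser.oal.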
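To see which names are available for metrics-name, one option is to list the metric names defined in an OAL script. The sketch below is a rough helper, not part of SkyWalking: it assumes each metric is declared on a single line of the form name = from(...), and the sample text is a short hand-written excerpt rather than the full core.oal.

```python
import re
from typing import List

# Matches lines such as: service_resp_time = from(Service.latency).longAvg();
# Assumption: simple one-line OAL statements only.
OAL_METRIC = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*from\(")


def metric_names(oal_text: str) -> List[str]:
    """Return the metric names defined in OAL text, skipping // comment lines."""
    names = []
    for line in oal_text.splitlines():
        if line.lstrip().startswith("//"):
            continue
        match = OAL_METRIC.match(line)
        if match:
            names.append(match.group(1))
    return names


sample = """
// Service scope metrics
service_resp_time = from(Service.latency).longAvg();
service_sla = from(Service.*).percent(status == true);
endpoint_abnormal = from(Endpoint.*).filter(responseCode in [404, 500, 503]).count();
"""
print(metric_names(sample))
```

Reading a real file would only need metric_names(open(path).read()), where path points at a script under config/oal.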
Parts of an alarm rule
An alarm rule definition has three parts:
Individual rules
An individual rule has the following fields:
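Rule name: The unique name shown in the alarm message. It must end with _rule.
metrics-name: The metric name, as defined in the OAL scripts such as config/oal/core.oal. In the default configuration, metrics that can be used for alarms exist for services, instances, endpoints, service relations, instance relations, and endpoint relations. Only long, double, and int values are supported.
include-names: List of entity names the rule applies to.
exclude-names: List of entity names the rule does not apply to.
include-names-regex: A regular expression that selects entity names. If both include-names and include-names-regex are set, both take effect.
exclude-names-regex: A regular expression that excludes entity names. If both exclude-names and exclude-names-regex are set, both take effect.
include-labels: List of labels the rule applies to.
exclude-labels: List of labels the rule does not apply to.
include-labels-regex: A regular expression that selects labels. If both include-labels and include-labels-regex are set, both take effect.
exclude-labels-regex: A regular expression that excludes labels. If both exclude-labels and exclude-labels-regex are set, both take effect.
The label settings only apply to metrics stored in the meter system, which holds data from labeled platforms such as Prometheus and Micrometer. Metrics used with the four label settings above must implement the LabeledValueHolder interface.
threshold: The threshold value. For multi-value metrics such as percentile, the threshold is an array written as value1, value2, value3, value4, value5, and each value is the threshold for the corresponding value of the metric. To keep one or more values from triggering the alarm, set them to -. For example, in percentile, value1 is the threshold for P50 and value2 is the threshold for P75, so -, -, value3, value4, value5 defines a percentile rule with no threshold for P50 and P75.
op: The operator. Supported operators are >, >=, <, <=, and =.
period: The length, in minutes, of the time window over which the rule is checked. The window follows the clock of the backend deployment environment.
count: Within one period window, the alarm is sent once the number of values that match the threshold (according to op) reaches count.
only-as-condition: true or false. Specifies whether the rule can send alarms itself or only serves as a condition of a composite rule.
silence-period: After an alarm fires at time N, no further alarm is sent during N -> N + silence-period. It defaults to the same value as period, which means the same alarm (the same metric name with the same Id) fires at most once per period.
message: The notification message sent when the rule fires. It may contain variables: {name} is replaced with the name of the entity (service, instance, endpoint, and so on) that triggered the rule.
Example: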
rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 10
    message: Average response time of service [{name}] exceeded 1 second in 2 of the last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 5
    count: 2
    silence-period: 10
    message: Average response time of instance [{name}] exceeded 1 second in 2 of the last 5 minutes
  endpoint_resp_time_rule:
    metrics-name: endpoint_avg
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Average response time of endpoint [{name}] exceeded 1 second in 2 of the last 10 minutes
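The interaction of period, count, and silence-period is easier to follow with a small model. The sketch below is a deliberately simplified Python illustration, not SkyWalking's actual code: it assumes one metric value arrives per minute for a single entity, and the class name RuleWindow and the sample latencies are hypothetical.

```python
import operator
from collections import deque

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le, "=": operator.eq}


class RuleWindow:
    """Simplified model of one rule for one entity, fed one value per minute."""

    def __init__(self, op, threshold, period, count, silence_period=None):
        self.matches = OPS[op]
        self.threshold = threshold
        self.count = count
        # silence-period defaults to period
        self.silence_period = period if silence_period is None else silence_period
        self.window = deque(maxlen=period)  # keeps only the last `period` values
        self.silence_left = 0

    def add(self, value):
        """Record this minute's value and return True if an alarm would fire."""
        self.window.append(value)
        if self.silence_left > 0:
            self.silence_left -= 1
            return False
        hits = sum(1 for v in self.window if self.matches(v, self.threshold))
        if hits >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


# Mirrors service_resp_time_rule above: > 1000 ms, 2 matches within 10 minutes.
rule = RuleWindow(op=">", threshold=1000, period=10, count=2, silence_period=10)
per_minute_ms = [200, 1500, 300, 1800, 1900, 250]
fired = [rule.add(v) for v in per_minute_ms]
print(fired)  # [False, False, False, True, False, False]
```

The alarm fires on the fourth minute, when the second value above 1000 ms enters the window, and the next match is suppressed by the silence period. In this model the window never holds more than period values, so a count larger than period can never be reached.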
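Composite rules
A composite rule can only combine alarm rules that target the same entity level, for example two service-level rules: service_percent_rule && service_resp_time_percentile_rule.
Rules at different entity levels cannot be combined, for example a service-level rule with an endpoint-level rule: service_percent_rule && endpoint_percent_rule.
A composite rule has the following fields:
Rule name: The unique name shown in the alarm message. It must end with _rule.
expression: How the rules are combined. The operators &&, ||, and () are supported.
message: The notification message sent when the rule fires. It may contain variables: {name} is replaced with the name of the entity that triggered the rule.
tags: Key/value pairs attached to the alarm, used to record attributes that matter to the user.
Example: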
rules:
  # Rule unique name, must end with `_rule`.
  endpoint_percent_rule:
    # Metrics value must be long, double or int
    metrics-name: endpoint_percent
    threshold: 75
    op: "<"
    # The length of time to evaluate the metrics
    period: 10
    # How many matches within the period trigger the alarm
    count: 3
    # How many checks the alarm stays silent after it fires; defaults to the same value as period.
    silence-period: 10
    # Specify if the rule can send notifications or only acts as a condition of a composite rule
    only-as-condition: false
    tags:
      level: WARNING
  service_percent_rule:
    metrics-name: service_percent
    # [Optional] Default, match all services in this metrics
    include-names:
      - service_a
      - service_b
    exclude-names:
      - service_c
    # Single value metrics threshold.
    threshold: 85
    op: "<"
    period: 10
    count: 4
    only-as-condition: false
  service_resp_time_percentile_rule:
    # Metrics value must be long, double or int
    metrics-name: service_percentile
    op: ">"
    # Multiple value metrics threshold. Thresholds for P50, P75, P90, P95, P99.
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to at least one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
    only-as-condition: false
  meter_service_status_code_rule:
    metrics-name: meter_status_code
    exclude-labels:
      - "200"
    op: ">"
    threshold: 10
    period: 10
    count: 3
    silence-period: 5
    message: The request number of entity {name} non-200 status is more than expected.
    only-as-condition: false
composite-rules:
  comp_rule:
    # Both the percent rule and the response time rule must be satisfied
    expression: service_percent_rule && service_resp_time_percentile_rule
    message: Service {name} successful rate is less than 85% and at least one response time percentile is over 1000ms
    tags:
      level: CRITICAL
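How an expression such as the one in comp_rule combines the results of individual rules can be sketched in a few lines of Python. This is only an illustration of the &&, ||, and () semantics under the assumption that each rule name stands for "this rule matched for the entity"; the evaluate function is hypothetical and is not how SkyWalking parses expressions.

```python
import re

# One token: an operator, a parenthesis, or a rule name.
TOKEN = re.compile(r"\s*(&&|\|\||\(|\)|[A-Za-z_][A-Za-z0-9_]*)")


def evaluate(expression: str, triggered: set) -> bool:
    """Evaluate a composite expression given the names of the rules that matched."""
    expression = expression.strip()
    parts = []
    pos = 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        token = match.group(1)
        pos = match.end()
        if token == "&&":
            parts.append("and")
        elif token == "||":
            parts.append("or")
        elif token in ("(", ")"):
            parts.append(token)
        else:
            # A rule name becomes True or False depending on whether it matched.
            parts.append(str(token in triggered))
    # Only the whitelisted tokens above reach eval, and builtins are disabled.
    return bool(eval(" ".join(parts), {"__builtins__": {}}, {}))


expr = "service_percent_rule && service_resp_time_percentile_rule"
print(evaluate(expr, {"service_percent_rule"}))
print(evaluate(expr, {"service_percent_rule", "service_resp_time_percentile_rule"}))
```

With only service_percent_rule matched the expression is false; once both rules match for the same service it becomes true, which is the condition under which comp_rule would send its message.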