groups: - name: runtimeErrorAlert rules: - alert: DUI正式环境K8S节点docker磁盘空间不足 expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{job=~"base-exporter-d[123]-prod",hostname!~".*insight.*|.*bigdata.*|.*bdp.*",mountpoint=~"/var/lib/docker|/"})*100,0.01)>80.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI正式环境K8S节点docker磁盘空间不足 - alert: 五菱私有云拨测异常 expr: speech_blackbox_testing{api=~"五菱私有云|五菱私有云cinfo",mode="availability"}!=0.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 五菱拨测异常 - alert: 大数据集群minio机器磁盘空间不足 expr: round(sum by(hostname,device,mountpoint)((node_disk_usage{hostname=~"insight-minio-.*"} and on(hostname,device) (node_disk_total>1099511627776)))*100,0.01)>95.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群minio机器磁盘空间不足 - alert: MEDINGDING通用产品级403 expr: delta(gateway_product{status="403",productId!="",productId!~"279599307|279595362|279598784|279599155|279613425|279600850|279598784",env="d1-prod"}[3m])>30.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: MEDINGDING通用产品级403 - alert: 一句话服务告警 expr: counter_sentence_requests{status!="0"}-counter_sentence_requests{status!="0"} offset 1m>5.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 告警 - alert: 域名解析失败 expr: sum by(dns,domain)(duimonitor_dns)!=0.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 思必驰常用域名解析失败 - alert: 404错误(全链路) expr: delta(gateway_product{matched_route_id="00000000000000015495",status="404",productId!="278578689"}[1m])>10.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 404错误(全链路) - alert: 正式长语音大于499 expr: sum(rate(gateway_fail{env="d1-prod",host="lasr.duiopen.com",proxy_upstream_name!="cloud_lasr-task-audio_9088",status=~"499|500|502|503|504",tag="apisix"}[40s]))>0.3 for: 0m labels: severity: critical annotations: summary: "{{ 
$value }}" description: 正式长语音大于499 - alert: pod异常(线上车载服务) expr: (sum by(pod)(kube_pod_status_ready{job="kube-state-metrics-d1-prod",namespace="cloud",pod=~"lyra-webhook-.*|lyra-webapi.*|lyra-octopus.*|lyra-xq-infrared.*|lyra-external-interface-service.*|softhardware-h5.*|dds-xiandou.*",condition="true"} and on (namespace,pod) kube_pod_status_phase{phase="Running"}))==0.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: pod异常(线上车载服务) - alert: 网络设备离线告警 expr: sum by(instance)(up{job="snmp-exporter"})!=1.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 网络设备离线告警 - alert: 外网访问天琴webhook超时告警 expr: sum by (route) (histogram_quantile(0.95,rate(apisix_http_latency_bucket{route="duisys>lyra-external-interface-service>prod",type="upstream"}[1m])))>2500.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 外网访问天琴webhook超时告警 - alert: 大数据redis cluster 读写异常 expr: sum by(hostname,cluster,mode)(monitor_redis_cluster)!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据redis cluster 读写异常 - alert: (DUI)阿里云Elasticsearch 节点内存使用率 expr: aliyun_acs_elasticsearch_NodeHeapMemoryUtilization{clusterId!="es-cn-4591czyqa00011ubo"}>70.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云Elasticsearch 节点内存使用率 - alert: 阿里云ECS服务器7天后过期 expr: round((max by (instanceId, instanceName, submitter) (aispeech_aliyun_ecs_expired_timestamp) - time())/24/3600,0.1)<=7.4 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 阿里云ECS服务器7天后过期 - alert: doris nodes expr: max(doris_prod_node_info{type="be_node_num", state="alive"})<7.0 for: 60m labels: severity: critical annotations: summary: "{{ $value }}" description: doris nodes数量少于8 - alert: Pod异常(线上LASR服务) expr: (sum by (pod,k8scluster)(kube_pod_status_ready{k8scluster="d3-prod",namespace="cloud",pod=~"lasr-.*",condition="true"} and on 
(namespace,pod) kube_pod_status_phase{phase="Running"}))!=1.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod异常(线上LASR服务) - alert: 录音文件转写audioReady4VadConsumer消息堆积 expr: avg(avg_over_time(rabbitmq_queue_messages_ready{queue=~"audioReady4VadConsumer"}[1m]))>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 录音文件转写audioReady4VadConsumer消息堆积 - alert: 阿里云-连云港DUI专线主线路流量过高 expr: round(sum by(vbrid,vbrname)(duimonitor_vbroutrate{vbrname="杭州-连云港-世纪互联-DUI主线路"})/1024/1024,0.01)>350.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 阿里云-连云港DUI专线主线路流量过高 - alert: 华东lasr-online用量 expr: delta(gateway_product{matched_route_id=~"00000000000000001500|00000000000000001524",status="403",productId!~"279593784|279600424|279595943|279601347|279603827|279604013|279596911|279597576|279606758|279597614|278587791|279604555|279608596|279608520|279599307|279596358|279608468|279599235|279602043|278590408|279598784",productId!~""}[1m])>1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 华东lasr-online用量 - alert: 系统用户修改详情 expr: sum by (k8scluster,hostip,hostname,username,action)(node_systemuser_status)==1.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 系统用户修改详情(系统用户增加删除,修改密码) - alert: apisix-proxy端口syn-send状态拥堵 expr: ((sum by(hostname)(node_tcp_synsent{hostname!~".*beta.*"}) and on(hostname) node_k8s_service{podname=~"apisix-proxy-[0-9a-z]+"}) +on(hostname) group_left(podname) node_k8s_service{podname=~"apisix-[0-9a-z]+-[0-9a-z]+"})>20.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: apisix-proxy端口syn-send状态拥堵 - alert: d3-011310告警(线上识别服务) expr: sum by(mode,status,describe,detail,cluster)(duimonitor_service_exception{service="ASR",env="prod",status="011310"})>50.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 011310(单条请求时间过长)阈值50,每分钟 - 
alert: 转写-报警 expr: runtimeError{env=~"prod",serviceName=~"filetrans-service|filetrans-store"} - runtimeError{env=~"prod",serviceName=~"filetrans-service|filetrans-store"} offset 1m>30.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 转写-报警 - alert: 429或500错误码过高(DUI语音合成服务) expr: round(sum by(env,matched_route_name,upstreamStatus,host)(delta(gateway_fail{host=~"tts.dui.ai|tts.duiopen.com|dds-tts-lite-legacy.lyg-internal.prod.duiopen.com",upstreamStatus=~"429|500",env=~"d1-prod|d2-prod",matched_route_name!="duisys>apisix>prod"}[5m])))>10.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 429或500错误码过高(DUI语音合成服务) - alert: 一句话服务并发监控 expr: sum by(pod_name,pod_nodename)(gauge_n_sentence_concurrency{k8scluster="d3-prod"})>60.0 for: 2m labels: severity: warning annotations: summary: "{{ $value }}" description: 一句话服务并发 - alert: (DUI)阿里云RDS 磁盘使用率 expr: aliyun_acs_rds_dashboard_DiskUsage{instanceId!~"rm-bp1b6b4v33714735u|rm-bp11e5suu33pk624d|rm-bp145f276j7q8rr06|rm-bp1m4ks59104ovz01|rr-bp144ed50n909534y"} +on(instanceId) group_left(DBInstanceDescription) (label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))>90.0 for: 3m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云RDS 磁盘使用率 - alert: 消费者事业部专用 redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1y3wz8tbqj0lluxg.*|r-bp1o48i3md2gulte5w.*"})>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 消费者事业部专用 redis内存使用率过高 - alert: dds-vip故障 expr: gateway_fail{matched_route_id=~"00000000000000002181|00000000000000002180|00000000000000002170",status=~"403|404|502|503|504"}-gateway_fail{matched_route_id=~"00000000000000002181|00000000000000002180|00000000000000002170",status=~"403|404|502|503|504"} offset 1m>1.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: dds-vip故障 - alert: 
外呼-online-告警 expr: runtimeError{env=~"prod",serviceName=~"smart-callout-online-boot"} - runtimeError{env=~"prod",serviceName=~"smart-callout-online-boot"} offset 1m>10.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 外呼-online-告警 - alert: 地址完整性 expr: sum(runtimeError{serviceName=~"callcenter-dmctrl",env=~"prod"} - runtimeError{serviceName=~"callcenter-dmctrl",env=~"prod"} offset 1m)>2.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 地址完整性-异常报警 - alert: 录音文件转写audioReady4LasrConsumer消息堆积 expr: avg(avg_over_time(rabbitmq_queue_messages_ready{queue=~"audioReady4LasrConsumer"}[1m]))>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: audioReady4LasrConsumer消息堆积 - alert: 识别LITE服务aiuniversal-403 expr: gateway_product{matched_route_id="00000000000001244835",status="403"} - gateway_product{matched_route_id="00000000000001244835",status="403"} offset 1m>2.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 识别LITE服务aiuniversal-403 - alert: ba-gateway WebSocket连接未正常关闭,code不是1000 expr: sum(runtimeError{eventName=~"WEB_SOCKET_NON_NORMAL_CLOSE",env=~"prod|beta",serviceName=~"ba-gateway"} - runtimeError{eventName=~"WEB_SOCKET_NON_NORMAL_CLOSE",env=~"prod|beta",serviceName=~"ba-gateway"} offset 1m)>0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: ba-gateway WebSocket连接未正常关闭,code不是1000 - alert: 识别服务403 expr: gateway_fail{host=~"asr.dui.ai|asr.duiopen.com|asr-bosch-internal.duiopen.com",status="403"} - gateway_fail{host=~"asr.dui.ai|asr.duiopen.com|asr-bosch-internal.duiopen.com",status="403"} offset 1m>1.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 识别服务403 - alert: k8s节点磁盘压力 expr: kube_node_status_condition{condition="DiskPressure",job=~"kube-state-metrics-d.-prod",status="true"}!=0.0 for: 60m labels: severity: critical annotations: summary: "{{ $value }}" 
description: k8s节点磁盘压力 - alert: d3-011308告警(线上识别服务) expr: sum by(mode,status,describe,detail,cluster)(duimonitor_service_exception{service="ASR",env="prod",status="011308"})>200.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 011308(计算进程timeout)阈值200,每分钟 - alert: gi expr: sum(runtimeError{serviceName=~"gi",env=~"prod"} - runtimeError{serviceName=~"gi",env=~"prod"} offset 1m)>=5.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: gi - alert: DSKDM95值耗时大于350毫秒 expr: (service_duration{mode="pt95",instanceId="dm-dispatch-server-fullduplex",latency=~"dm"})>350.0 for: 2m labels: severity: critical annotations: summary: "{{ $value }}" description: DSKDM95值耗时大于350毫秒 - alert: 车载 webhook / webapi 第三方接口错误占比告警 expr: sum by(application,target)(increase(lyra_thirdparty_requests_error_count{job="metrics-service-d1-prod"}[3m]))*100/sum by(application,target)(increase(lyra_thirdparty_requests_count{job="metrics-service-d1-prod"}[3m]))>75.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: 车载 webhook / webapi 第三方接口错误占比告警 - alert: hb-011308告警 expr: sum(_exported_duimonitor_service_exception{service="ASR",env="prod",scriptname="monitor_casrserver_exception.sh",status="011308",cluster="hb-prod"}) by (mode,status,describe,detail,cluster)>200.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 011308(计算进程timeout)阈值200,每分钟 - alert: hive metastore qps expr: sum(bdp_prod_metastore_metastore{name=~"api.*",type="OneMinuteRate"})>300.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: hive metastore qps超过300 - alert: smart-dm-inter-boot(FengChaoService.openCab) expr: sum(runtimeError{env=~"prod",serviceName=~"smart-dm-inter-boot",eventName=~"FengChaoService.openCab"} - runtimeError{env=~"prod",serviceName=~"smart-dm-inter-boot",eventName=~"FengChaoService.openCab"} offset 1m)>0.0 for: 0m labels: severity: warning 
annotations: summary: "{{ $value }}" description: 服务告警 - alert: 专线线路丢包率过高 expr: sum by(vbrid,vbrname)(duimonitor_vbrhealthychecklossrate)>50.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 专线线路丢包率过高 - alert: AIOS服务请求异常(httpas) expr: aios_request_error{service="httpas"}>0.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: aios httpas健康检查错误 - alert: pushserver-ng服务cpu告警 expr: sum(kube_metrics_server_pods_cpu{pod_name=~"duisys-pushserver-ng.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)/1000000>0.8 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: pushserver-ng服务cpu告警 - alert: 服务可用性拨测(TTS合成服务) expr: sum by(api,mode,message,service)(speech_blackbox_testing{mode=~"availability",service=~"tts"})!=0.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务可用性拨测(TTS合成服务) - alert: beta复刻 expr: sum(rate(gateway_fail{env="d1-beta",status="404",tag="apisix",proxyUpstreamName="voice-copy-outer-service"}[40s]))>0.03 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: beta复刻 - alert: aios-apisix拨测 expr: speech_blackbox_testing{service="apisix-aios"}!=0.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 探测aios-apisix请求 - alert: customIntent expr: sum(runtimeError{serviceName=~"nlu-fusion",env=~"prod",eventName=~"customIntent"} - runtimeError{serviceName=~"nlu-fusion",env=~"prod",eventName=~"customIntent"} offset 1m)>=2.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: nlu-fusion请求定制意图存在问题,请尽快处理 - alert: apisix服务cpu占用高 expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"apisix.*"}) / (sum by(pod_name) 
(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",job="kube-state-metrics-d1-prod",pod=~"apisix.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06) / 100>0.75 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: apisix服务cpu占用高 - alert: 线上Nacos配置中心CPU使用率过高 expr: round((sum by (pod_name) (kube_metrics_server_pods_cpu{k8scluster="d1-prod", pod_namespace="default", pod_container_name="config-nacos"})/sum by (pod_name) (label_replace(kube_pod_container_resource_limits{k8scluster="d1-prod", namespace="default", container="config-nacos",resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)"))/1e7))>=75.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: CPU使用率超过75% - alert: MEDINGDING识别403 expr: sum by (matched_route_id,productId,host,proxyUpstreamName,status)(delta(gateway_product{status="403",productId!~"279599307|278584659|278573892|278586547|aimt_c_beta|279599155|279600850|279598784",productId!=""}[1m]))>10.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: d1-prod:s-short-asr - alert: 录音文件转写audioListReady4AsrConsumerWithPriority消息堆积 expr: avg(avg_over_time(rabbitmq_queue_messages_ready{queue=~"audioListReady4AsrConsumerWithPriority"}[1m]))>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: audioListReady4AsrConsumerWithPriority消息堆积 - alert: 车萝卜端口监听异常告警 expr: sum by(k8scluster,hostname,des)(probe_success{job="tcp-probestatus-cheluobo"})!=1.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 车萝卜端口监听异常告警 - alert: streaming-media内存占用率高于75% expr: round(sum by(pod_name) (kube_metrics_server_pods_mem{job="metrics-server-exporter-d3-prod",pod_name=~"streaming-media.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte",job="kube-state-metrics-d3-prod",pod=~"streaming-media.*"}, "pod_name", "$1", "pod", "(.*)")) 
/ 1024) * 100)>75.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: streaming-media内存占用率高于75% - alert: NetFilter ConnTrack 连接数使用率过高 expr: round((sum by(hostname)(node_conntrack_usage)*100),0.01)>90.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: NetFilter ConnTrack 连接数使用率过高 - alert: DUI正式环境Pod运行异常 expr: (sum (max_over_time(kube_pod_container_status_waiting_reason{k8scluster=~"d.-prod",namespace!="idata",reason=~"CreateContainerConfigError|CreateContainerError|RunContainerError"}[2m])) by (k8scluster,namespace,pod,container,reason) or sum (max_over_time(kube_pod_init_container_status_waiting_reason{k8scluster=~"d.-prod",namespace!="idata",reason=~"CreateContainerConfigError|CreateContainerError|RunContainerError"}[2m])) by (k8scluster,namespace,pod,container,reason))>0.0 for: 6m labels: severity: emergency annotations: summary: "{{ $value }}" description: '{{ $labels.k8scluster }}集群的{{ $labels.namespace }}/{{ $labels.container }}服务容器创建或运行异常超过6分钟' - alert: 物理机磁盘空间不足(线上语义服务) expr: round((sum by(hostname,device,mountpoint)(node_disk_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"cnluserver.*|olive-semantic.*-aidui.*|olive-semantic.*-bcd.*"})*100,0.01)>80.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: 物理机磁盘空间不足(线上语义服务) - alert: 顺丰全场景空语音告警 expr: sum(delta(countRoute{env=~"prod",productId=~"914009574",eventName=~"countCallSilence",cityCode=~"311|371|022|029"}[5m]))>100.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 顺丰全场景空语音数量5分钟大于100 - alert: 德邦全场景线上请求NLU告警 expr: sum(runtimeError{serviceName=~"nlu-fusion",env=~"prod",eventName=~"914006674"} - runtimeError{serviceName=~"nlu-fusion",env=~"prod",eventName=~"914006674"} offset 1m)>=2.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 德邦全场景线上请求NLU存在异常,请尽快处理 - alert: 线上Nacos配置中心MEM使用率过高 expr: round(sum by(pod_name) 
(kube_metrics_server_pods_mem{k8scluster="d1-prod", pod_namespace="default", pod_container_name="config-nacos"}) / sum by(pod_name) (label_replace(kube_pod_container_resource_limits{k8scluster="d1-prod", namespace="default", container="config-nacos",resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) * 1024 * 100)>=85.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: MEM使用率超过85% - alert: DUI集群正式环境服务器TCP_CLOSE_WAIT过高 expr: sum by(hostname)(node_tcp_count{hostname=~"d1-.*|d2-.*|d3-.*",hostname!~".*insight.*|.*bigdata.*|.*bdp.*",status="CLOSE_WAIT"})>3000.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI集群正式环境服务器TCP_CLOSE_WAIT过高 - alert: NWA云之家工单ECS到期 expr: round((max by (domain, manager) (aispeech_expired_timestamp_ecs_aliyun) - time())/24/3600,0.1)<=7.4 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: NWA云之家工单ECS到期 - alert: 国科并发预警casr-aiinteract expr: sum by(res_type)(nginx_casr_concurrent_current{job="metrics-services-d3-prod",res_type=~"aiinteract|aicar|aihome|aitv|airobot"})>100.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 国科并发预警casr-aiinteract - alert: 大数据集群物理机load5过高 expr: round(sum by (hostname)(node_load5{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"} / node_cpu_core),0.01)>3.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: 大数据集群物理机load5过高 - alert: nlu-platform-service拨测告警 expr: sum(sum_over_time(_exported_speech_blackbox_testing{service=~"nlu-platform-service"}[1m]))>=1.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: nlu-platform-service拨测告警 - alert: DSKDM 错误告警5分钟内出现500次 expr: sum(increase(DSKDM_monitor_dskdm_error{env="prod"}[5m])) by (errorId)>500.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: DSKDM 错误告警5分钟内出现500次 - alert: 车载 webhook / webapi 接口耗时告警 expr: sum 
by(application,uri)(increase(http_server_requests_seconds_sum{pod_project=~"lyra-webhook-.*|lyra-webapi.*",job="metrics-service-d1-prod",uri!~"/car.*|/notification.*|/dui/sgmwcloud.*"}[3m]))*1000/sum by(application,uri)(increase(http_server_requests_seconds_count{pod_project=~"lyra-webhook-.*|lyra-webapi.*",job="metrics-service-d1-prod",uri!~"/car.*|/notification.*|/dui/sgmwcloud.*"}[3m]))>1000.0 for: 10m labels: severity: warning annotations: summary: "{{ $value }}" description: 车载 webhook / webapi 接口耗时告警 - alert: apisix内存占用高 expr: sum(kube_metrics_server_pods_mem{job="metrics-server-exporter-d1-prod",pod_name=~"apisix.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{job="kube-state-metrics-d1-prod",pod=~"apisix.*",resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) /1024)>0.93 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: apisix内存占用高 - alert: Pod异常重启(长语音重构) expr: sum by(k8scluster,namespace,pod)(round(delta(kube_pod_container_status_restarts_total{k8scluster=~"d[123]-prod",pod=~"me-asr-onlinelong.*"}[10m])))!=0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: Pod异常重启(长语音重构) - alert: 对话中控语义错误码监控(P1)(线上对话服务) expr: sum by(env,errorId,errorMsg,moduleName,pod_name)(delta(dmdispatch_error_statistic{env="prod",moduleName=~"dm-dispatch-server|dm-dispatch-server-fullduplex",errorId=~"150600"}[1m]))>=50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 对话中控语义错误码监控(P1)(线上对话服务) - alert: kylin job max time expr: bdp_prod_kylin_not_finished_job_duration>480.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: kylin 任务构建时间超过8小时 - alert: 物理机内存Free不足2G(短语音重构) expr: round((sum by(k8scluster,hostname)(node_mem_free{k8scluster=~".*-prod"}) and on (hostname) node_k8s_service{servicename=~"me-asr-online.*"})/1024/1024/1024,0.01)<2.0 for: 3m labels: severity: warning annotations: 
summary: "{{ $value }}" description: 物理机内存Free不足2G(短语音重构) - alert: portal-alpha-400 expr: sum(runtimeError{env=~"alpha",serviceName=~"smart-admin-portal-boot"} - runtimeError{env=~"alpha",serviceName=~"smart-admin-portal-boot"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: portal-alpha环境告警 - alert: etcd集群没有leader expr: etcd_server_has_leader{job=~"etcd-servers-d.-prod"}!=1.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: etcd没有leader - alert: 物理机CPU使用率过高(线上LASR服务) expr: (sum by(hostname)(node_cpu_usage_total{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"lasr-.*"})*100>90.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机CPU使用率过高(线上LASR服务) - alert: ba-gateway调用额度用完Large expr: sum(runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"RequestCountLimiter_Forbidden_Large|WebSocketCountLimiter_Forbidden_Large"} - runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"RequestCountLimiter_Forbidden_Large|WebSocketCountLimiter_Forbidden_Large"} offset 1m)>500.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 超过当日最大调用次数/调用额度已用完/未充值无可调用次数 - alert: casr-aihome产品403 expr: delta(gateway_product{matched_route_id="00000000000000001352",status="403"}[1m])>10.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: casr-aihome产品403 - alert: dds服务可用性拨测 expr: sum by(api,mode,message,service)(speech_blackbox_testing{mode=~"availability",service=~"dds"})>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: dds服务可用性拨测 - alert: 华东lasr-offline用量 expr: 
delta(gateway_product{matched_route_id="00000000000000001380",status="403",productId!~"279593784|279600424|279595943|279601347|279603827|279604013|279596911|279597576|279606758|279597614|278587791|279604555|279608596|279608520|279598120|279604692|279598784",productId!~""}[1m])>1.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 华东lasr-offline用量 - alert: 语言知识服务机器磁盘空间不足 expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{k8scluster=~"kf-prod",hostname=~"kf-node-006.*"})*100,0.01)>90.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 语言知识服务机器磁盘空间不足 - alert: 国科网关cpu load高 expr: sum by (hostname)(node_load5{hostname=~"d3-apisix-.*|d3-kong-.*"}) / sum by (hostname)(node_cpu_core)>0.75 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 国科网关cpu load高 - alert: casr-aitv产品403 expr: delta(gateway_product{matched_route_id="00000000000000001642",status="403"}[1m])>10.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: casrserver-aitv产品403 - alert: 德邦全场景-写接口异常 expr: sum(runtimeError{eventName=~"DepponService.API.W.ERROR",env=~"prod",serviceName=~"smart-dm-inter-boot"} - runtimeError{eventName=~"DepponService.API.W.ERROR",env=~"prod",serviceName=~"smart-dm-inter-boot"} offset 5m)>5.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: json调用写数据接口(下单、催单、发短信) - alert: 丰巢呼入调用客户信息接口 expr: sum(runtimeError{serviceName=~"smart-dm-inter-boot",env=~"prod",eventName=~"FengChaoService.queryCustomerInfo",level=~"error"} - runtimeError{serviceName=~"smart-dm-inter-boot",env=~"prod",eventName=~"FengChaoService.queryCustomerInfo",level=~"error"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 丰巢呼入调用客户信息接口 - alert: 大数据物理机已离线 expr: sum 
by(hostip,instance)(up{job=~"base-exporter-.*",k8scluster!~"kf-.*",hostip=~"10.24.1.[0-9]+|10.24.14.[0-9]+|10.24.110.[0-9]+|10.20.163.[0-9]+|10.20.171.[0-9]+",hostip!~"10.24.110.16[1-3]"})!=1.0 for: 2m labels: severity: emergency annotations: summary: "{{ $value }}" description: 大数据物理机已离线 - alert: 华东DUI集群物理机TCP连接数过高 expr: sum by(hostname,status)(node_tcp_count{hostname!~".*insight.*|.*bigdata.*|.*bdp.*",status="ESTABLISHED"})>45000.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 华东DUI集群物理机TCP连接数过高 - alert: nlu-hotword服务内存告警 expr: sum by(pod_name) (kube_metrics_server_pods_mem{instance="metrics-server-exporter-hd.duiopen.com:80",pod_name=~"duisys-nlu-hotword.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte",instance="kube-state-metrics-hd.duiopen.com:80",pod=~"duisys-nlu-hotword.*"}, "pod_name", "$1", "pod", "(.*)")) / 1024)>0.8 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: qos 华东ACK nlu-hotword服务内存占用超过80% - alert: Pod CPU使用率过高(短语音重构) expr: round(sum by(k8scluster,pod_name)(kube_metrics_server_pods_cpu{k8scluster=~".*-prod",pod_name=~"me-asr-online.*"}) / (sum by(k8scluster,pod_name)(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) *1000)*100/1000000)>90.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod CPU使用率过高(短语音重构) - alert: duisys-nlu-hotword服务cpu占用高 expr: sum(kube_metrics_server_pods_cpu{pod_name=~"duisys-nlu-hotword.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",pod=~"duisys-nlu-hotword.*"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)/1000000>=0.85 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: duisys-nlu-hotword服务cpu占用高 - alert: pushserver-ng待推送消息积压告警 expr: pushserver_queue_length{env="prod", 
host_name=~"duisys-pushserver-ng-.*", name="size"}>10.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: pushserver-ng待推送消息积压大于10 - alert: DSKDM每分钟请求数过低(预警) expr: sum by(env)(delta(DSKDM_monitor_dsk_request{env="prod"}[5m]))<20.0 for: 2m labels: severity: warning annotations: summary: "{{ $value }}" description: DSKDM每分钟请求数过低 - alert: 大数据国科虚拟IP告警 expr: sum by(mode)(bdp_monitor_lb_status)!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据国科虚拟IP告警 - alert: 国科正式apisix-5xx expr: sum(rate(gateway_fail{env=~"d3-prod",status=~"500|502|503|504",tag=~"apisix",host!="prometheus-gk.aispeech.com"}[1m])) >0.15 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 国科正式apisix-5xx - alert: dm-faas expr: sum(runtimeError{env=~"prod",serviceName=~"dm-faas"} - runtimeError{env=~"prod",serviceName=~"dm-faas"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: dm-faas-error - alert: 国科并发预警casr-aidialect-mix expr: sum by (res_type) (nginx_casr_concurrent_current{job="metrics-services-d3-prod",res_type="aidialect-mix"})>50.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 国科并发预警casr-aidialect-mix - alert: aiym-id-request-duration-alert expr: histogram_quantile(0.95, http_request_duration_seconds_bucket{ handler="/v1/intent/get", pod_project="aiym-intent-detection", k8scluster="d1-prod" })>=0.3 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: aiym驰必准单意图分类模型告警 - alert: hive metastore连接数 expr: sum(bdp_prod_metastore_metastore{name="open_connections"})>300.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: hive metastore连接数超过300 - alert: smart-partner-OutboundLastModifyTimeException expr: sum(runtimeError{serviceName=~"smart-partner-module-boot",env=~"prod",eventName=~"OutboundLastModifyTimeException"} - 
runtimeError{serviceName=~"smart-partner-module-boot",env=~"prod",eventName=~"OutboundLastModifyTimeException"} offset 1m)>0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 有外呼状态近20分钟未完成 - alert: 磁盘IO负载过高(线上识别服务) expr: round(sum by(hostname,device)(node_disk_ioutil{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"casr-.*"})>90.0 for: 2m labels: severity: warning annotations: summary: "{{ $value }}" description: 磁盘IO负载过高(线上识别服务) - alert: DUI正式环境节点NotReady expr: sum(kube_node_status_condition{k8scluster=~"d.-prod",condition="Ready",status="true"}) by (k8scluster,node)==0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: '{{ $labels.k8scluster }}集群的{{ $labels.node }}节点NotReady' - alert: 国科正式apisix-499 expr: sum(rate(gateway_fail{env=~"d3-prod",status=~"499",tag=~"apisix"}[1m])) >1.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 国科正式apisix-499 - alert: file-resync2-回调失败告警 expr: sum(duimonitor_service_exception{service="resync2",mode="exception-callback"})>1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: file-resync2-回调失败告警 - alert: 博世私有云访问公有云异常 expr: gateway_fail{host=~"asr-bosch-internal.duiopen.com|tts-bosch-internal.duiopen.com",status=~"404|502|503"}-gateway_fail{host=~"asr-bosch-internal.duiopen.com|tts-bosch-internal.duiopen.com",status=~"404|502|503"} offset 1m>1.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 博世私有云访问公有云异常 - alert: nlu-fusion expr: sum(runtimeError{env=~"prod",serviceName=~"nlu-fusion",eventName=~"http_nlu_time_out_except|http_nlu_except"} - runtimeError{env=~"prod",serviceName=~"nlu-fusion",eventName=~"http_nlu_time_out_except|http_nlu_except"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: rsyslog服务器磁盘空间不足 expr: round(sum by(hostname, device, 
mountpoint)(node_disk_usage{hostname=~"d1-public-00[12]", mountpoint=~"/data/resources/logs?"} or node_disk_usage{hostname=~"d[23]-public-001", mountpoint=~"/data"}),0.01)>85.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: rsyslog服务器磁盘空间不足 - alert: 大数据机器TCP连接数过高 expr: sum by(hostname,status)(node_tcp_count{hostname=~".*insight.*|.*bigdata.*|.*bdp.*",hostname!~".*lbint.*"})>55000.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 大数据机器TCP连接数过高 - alert: 物理机内存Free不足0.5G(线上识别服务) expr: round((sum by(hostname)(node_mem_free{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"casr-.*"})/1024/1024/1024,0.01)<0.5 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机内存Free不足0.5G(线上识别服务) - alert: 研发部-语言及知识服务 redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1uapy2qcydzj22z4"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 研发部-语言及知识服务 redis内存使用率过高 - alert: 资源同步v2 expr: sum by(errorId, errorMsg) (error_dui_resource_mover{env="prod"} - error_dui_resource_mover{env="prod"} offset 1m)>0.1 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 资源同步服务告警 - alert: r0-daemon-asr-faild expr: sum(runtimeError{serviceName=~"r0-daemon",eventName=~"data_log_alarm_asr_failed",env=~"prod"} - runtimeError{serviceName=~"r0-daemon",eventName=~"data_log_alarm_asr_failed",env=~"prod"} offset 60m)>100.0 for: 15m labels: severity: warning annotations: summary: "{{ $value }}" description: r0-daemon-asr-faild - alert: boss信控监控告警 expr: sum(runtimeError{eventName=~"boss-control-status-close",env=~"prod"} - runtimeError{eventName=~"boss-control-status-close",env=~"prod"} offset 1m)>=1.0 for: 0m labels: severity: warning annotations: 
summary: "{{ $value }}" description: boss信控监控告警 - alert: 健康检查失败告警(owl服务) expr: duimonitor_apisix_unhealthz_count{env="prod",service="apisix",upstream_name="00000000000000015794"}>50.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 健康检查失败告警(owl服务) - alert: 大数据集群Mysql用户连接使用率 expr: sum by(client,hostname,user)(mysql_user_connections_percent)>85.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 大数据集群Mysql用户连接使用率 - alert: ES集群物理机load5过高(大数据) expr: round(sum by(hostname)(node_load5{hostname=~"insight-search-17.aispeech.com|insight-search-18.aispeech.com|insight-search-19.aispeech.com|insight-search-20.aispeech.com|insight-search-21.aispeech.com|insight-search-22.aispeech.com|insight-search-23.aispeech.com|insight-search-24.aispeech.com"} / node_cpu_core),0.01)>1.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: ES集群物理机load5过高(大数据) - alert: 国科云端vad服务5xx expr: sum(idelta(gateway_product{k8scluster="d3-prod",status=~"500|502|503|504",matched_route_id="00000000000000002006"}[5m]))>30.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 国科云端vad服务5xx - alert: (Adam)阿里云MongoDB 磁盘使用率 expr: aliyun_acs_mongodb_DiskUtilization{instanceId="dds-bp14a46dd206ea34"}>80.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: (Adam)阿里云MongoDB 磁盘使用率 - alert: IT系统物理机5分钟内发生重启 expr: round(sum by(hostname,hostip)(node_uptime{job="base-exporter-d0-test",instance=~"d0-test-itsystem00.*"}) / 60)<5.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: IT系统物理机5分钟内发生重启 - alert: 物理机显卡温度过高(线上LASR服务) expr: (sum by(hostname,gpu)(node_gpu_temp{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"lasr-.*"})>80.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机显卡温度过高(线上LASR服务) - alert: 研发部-对话及多模态交互 redis内存使用率过高 expr: sum 
by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1n1u6rx5jlihivo3"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 研发部-对话及多模态交互 redis内存使用率过高 - alert: pyfsesl expr: sum(runtimeError{env=~"prod",serviceName=~"pyfsesl",eventName=~"pyfsesl_errorcall"} - runtimeError{env=~"prod",serviceName=~"pyfsesl",eventName=~"pyfsesl_errorcall"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: doris qps expr: max(rate(doris_prod_doris_fe_query_total[5m]))>100.0 for: 30m labels: severity: critical annotations: summary: "{{ $value }}" description: doris qps超过100 - alert: nlu-platform-service时延过高(线上语义服务) expr: round(sum by (serviceName,pod_name,env)(histogram_quantile(0.90,rate(operation_histogram_bucket{env="prod",serviceName=~"nlu-platform-service|nlu-platform-service-fullduplex|nlu-platform-service-vip",type="nludispatch"}[1m]))))>800.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: nlu-platform-service时延过高(线上语义服务) - alert: dds全双工并发连接数高 expr: sum by (state)(metric_connections{env="prod",host_name=~"ddsserver-fullduplex.*",state="active"})>8000.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: dds全双工并发连接数高 - alert: portal-ba-asr/tts expr: histogram_quantile(0.95, rate(portal_histogram_bucket{env="prod",eventName="BaApi.getAsrApiConfigByPidAndType"}[5m]))>2.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: BA获取ASR/TTS超时报警 - alert: 异常状态码过高011312(短语音重构) expr: round(sum by(k8scluster,status)(delta(pid_status_total{status=~"011312"}[3m])))>50.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 异常状态码过高011312(短语音重构) - alert: Pod CPU使用率过高(me-tts-anti-corruption-service服务) expr: 
round(sum(kube_metrics_server_pods_cpu{k8scluster="d1-prod",pod_name=~"me-tts-anti-corruption-service.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod CPU使用率过高(me-tts-anti-corruption-service服务) - alert: 五菱私有云拨测延时高 expr: speech_blackbox_testing{api=~"五菱私有云|五菱私有云cinfo",mode="costs"}>1900.0 for: 2m labels: severity: warning annotations: summary: "{{ $value }}" description: 五菱延时高 - alert: 国科识别aicar平均末帧延时costs(线上识别服务) expr: round(avg by(k8scluster,res)(histogram_quantile(0.95,rate(pid_latency_total_bucket{k8scluster="d3-prod",pod_project=~"me-asr-onlineshort-service",pod_name!~"me-asr-onlineshort-service-gray-instance.*",res="aicar"}[3m]))))>300.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 国科识别aicar平均末帧延时costs(线上识别服务) - alert: dataclean-message(queryCallCenterError) expr: sum(runtimeError{env=~"prod",serviceName=~"dataclean-message",eventName=~"queryCallCenterError"} - runtimeError{env=~"prod",serviceName=~"dataclean-message",eventName=~"queryCallCenterError"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 调用DM异常 - alert: 专线线路入口流量过高 expr: round(sum by(vbrid,vbrname)(duimonitor_vbrinrate)/1024/1024,0.01)>200.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 专线线路入口流量过高 - alert: MEDINGDING华东tts用量 expr: sum by (matched_route_id,productId,host,proxyUpstreamName,status)(delta(gateway_product{matched_route_id=~"00000000000000001538|00000000000000001536|00000000000000001476|00000000000000001338",status="403",productId!~"279602881|279605379|279603535",productId!=""}[1m]))>=1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: d1-prod:s-short-tts - alert: odcp-nlu-platform-service错误码监控 expr: 
delta(_exported_operation_counter{type=~"150501|150500|150502|150503|150504|150505|150506", serviceName="odcp-nlu-platform-service"}[1m])>100.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: odcp-nlu-platform-service错误码监控 - alert: 物理机内存使用率过高(线上语义服务) expr: round((sum by(hostname)(node_mem_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"cnluserver.*|olive-semantic-aidui|olive-semantic-bcd"})*100,0.01)>85.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机内存使用率过高(线上语义服务) - alert: 中控对象服务感知语义置信度低告警 expr: sum by (errorId, errorMsg, env, moduleName)(delta(dmdispatch_error_statistic{host=~".*",errorId=~"150600",env=~"prod"}[1m]))>200.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 中控对象服务感知语义置信度低告警 - alert: me_cinfo-service CPU >90% expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"me-cinfo-service.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06)>90.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: me_cinfo-serviceCPU使用率90%以上 - alert: smart-dm-inter-boot expr: 
sum(runtimeError{env=~"prod",eventName=~"queryUserInfoError|sendEvaluateParseRespError|ZlyfService.queryUserInfo|callSfInterfaceError|querySlotsError|SfFullService.querySfOrderDetailError|queryTimeInPartError|querySfOrderDetailParseRespError|queryDeptMatch-2Error|queryTimeInWorkError|queryTrackDetailError|queryCallInStat|complaintOrderParseRespError|queryOrderByPhoneParseRespError|getPriceTimeError|getPriceTimeTwoError|queryDefaultAreaError|getCustomInfoError|queryPhoneTypeError|queryDepponSiteError|queryCustomTime|sendEvaluateError|queryComplaintCountError|complaintOrderError|queryOrderByPhoneError|getFmtDateError|sendRobotInfoError|updateConfigKeyError|getTimePriceRespError|queryDeptMatch-1Error|queryDeptMatchParseReqError",serviceName=~"smart-dm-inter-boot"} - runtimeError{env=~"prod",eventName=~"queryUserInfoError|sendEvaluateParseRespError|ZlyfService.queryUserInfo|callSfInterfaceError|querySlotsError|SfFullService.querySfOrderDetailError|queryTimeInPartError|querySfOrderDetailParseRespError|queryDeptMatch-2Error|queryTimeInWorkError|queryTrackDetailError|queryCallInStat|complaintOrderParseRespError|queryOrderByPhoneParseRespError|getPriceTimeError|getPriceTimeTwoError|queryDefaultAreaError|getCustomInfoError|queryPhoneTypeError|queryDepponSiteError|queryCustomTime|sendEvaluateError|queryComplaintCountError|complaintOrderError|queryOrderByPhoneError|getFmtDateError|sendRobotInfoError|updateConfigKeyError|getTimePriceRespError|queryDeptMatch-1Error|queryDeptMatchParseReqError",serviceName=~"smart-dm-inter-boot"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: (DUI)阿里云MongoDB 磁盘使用率 expr: sum by(instanceId,role)(aliyun_acs_mongodb_DiskUtilization{instanceId!~"dds-bp11d328e4e8da74|dds-bp17ead8b96974c4|dds-bp177ae7a19a9a34|dds-bp1f6f784b1a3024|dds-bp15c90ebc25a554|dds-bp15c9d449e3b214|dds-bp1911c580f6cb44|dds-bp1e9e91ed7bfe24|dds-8vb01b26c9eb1434"})>85.0 for: 3m labels: severity: emergency 
annotations: summary: "{{ $value }}" description: (DUI)阿里云MongoDB 磁盘使用率 - alert: filebeat服务内存使用过高 expr: sum by (k8scluster,node,pod) (round(label_replace(kube_metrics_server_pods_mem{k8scluster=~"d.-prod",pod_namespace="filebeat",pod_name=~"aispeech-filebeat-.*",pod_container_name="aispeech-filebeat"}, "pod", "$1", "pod_name", "(.*)") * on(k8scluster,pod) group_right kube_pod_info{k8scluster=~"d.-prod",namespace="filebeat",pod=~"aispeech-filebeat-.*"}/1024))>1024.0 for: 10m labels: severity: warning annotations: summary: "{{ $value }}" description: '{{ $labels.k8scluster }}集群{{ $labels.node }}节点filebeat的内存使用过高' - alert: 物理机ssh公钥authorized_keys文件被修改 expr: sum by (k8scluster,hostip,hostname,username)(delta(node_authorized_keys[5m]))!=0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 物理机ssh公钥authorized_keys文件被修改 - alert: aios拨测 expr: sum by(api,mode,message,service)(speech_blackbox_testing{mode=~"availability",service=~"aihome|aicar|airobot|bcd"})>=1.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: aios拨测 - alert: CPU使用率过高(短语音重构) expr: round((sum by(k8scluster,hostname)(node_cpu_usage_total{k8scluster=~".*-prod"}) and on (hostname) node_k8s_service{namespace="cloud",servicename=~"me-asr-online.*"})*100,0.01)>90.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: CPU使用率过高(短语音重构) - alert: 物理机显存使用率过高(线上识别服务) expr: round((sum by(hostname,gpu)(node_gpu_mem_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"casr-.*"})*100,0.01)>90.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机显存使用率过高(线上识别服务) - alert: 世界之树中间件服务器CPU使用率过高 expr: sum by(hostname)(node_cpu_usage_total{hostname="yggdrasil-mid-001"}*100)>80.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 世界之树中间件服务器CPU使用率过高 - alert: ba-outer expr: 
sum(runtimeError{env=~"prod",serviceName=~"ba-outer",eventName=~"ASR_CREATE_CONN_FAIL|TTS_TIMEOUT|TTS_ERROR|ASR_FAIL|GET_CONFIG_FAIL"} - runtimeError{env=~"prod",serviceName=~"ba-outer",eventName=~"ASR_CREATE_CONN_FAIL|TTS_TIMEOUT|TTS_ERROR|ASR_FAIL|GET_CONFIG_FAIL"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: nlu-train-gateway expr: sum(runtimeError{serviceName=~"nlu-train-gateway",env=~"prod",eventName=~"executeTrainMissionError"} - runtimeError{serviceName=~"nlu-train-gateway",env=~"prod",eventName=~"executeTrainMissionError"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 线上模型训练存在异常,请尽快处理 - alert: hb-011309告警 expr: sum(_exported_duimonitor_service_exception{service="ASR",env="prod",scriptname="monitor_casrserver_exception.sh",status="011309",cluster="hb-prod"}) by (mode,status,describe,detail,cluster)>100.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 011309(server busy)阈值100,每分钟 - alert: 物理机内存Free不足1.5G(dds-tts-lite服务) expr: round((sum by(k8scluster,hostname)(node_mem_free{k8scluster=~".*-prod"}) and on (hostname) node_k8s_service{servicename=~"dds-tts-lite.*"})/1024/1024/1024,0.01)<1.5 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机内存Free不足1.5G(dds-tts-lite服务) - alert: 实时长语音并发过高(线上LASR服务) expr: sum by(pod_name,k8scluster,pod_project)(gauge_n_concurrency{k8scluster="d3-prod"})>108.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 实时长语音并发过高(线上LASR服务) - alert: DUI防火墙公网IP上行流量过高 expr: round(sum by(ifAlias,ifDescr)(irate(ifHCOutOctets{ifDescr=~"GigabitEthernet0/0/6|GigabitEthernet1/0/8",instance="10.24.20.4"}[5m]))/1024/1024*8,0.01)>40.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI防火墙公网IP上行流量过高 - alert: 知识服务异常日志告警 expr: 
sum(runtimeError{serviceName=~"qa|kges|sqa|tfserving|manifold|search|QA_classification|sid|literature|baike|ltr|qa-similarity|newchat|sid-java|class-snt|gqa-p|query-match-yy|dmserver",env=~"prod"} - runtimeError{serviceName=~"qa|kges|sqa|tfserving|manifold|search|QA_classification|sid|literature|baike|ltr|qa-similarity|newchat|sid-java|class-snt|gqa-p|query-match-yy|dmserver",env=~"prod"} offset 1m)>2.0 for: 2m labels: severity: warning annotations: summary: "{{ $value }}" description: 如果有error日志,发到群里 - alert: 集群Prometheus的内存使用率超过90% expr: round(sum by(pod_name,k8scluster)(kube_metrics_server_pods_mem{pod_namespace=~"monitoring",pod_name=~"prometheus-.*"}) / (sum by(pod_name,k8scluster)(label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) /1024)*100)>90.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: 集群Prometheus的内存使用率超过90% - alert: 大数据集群物理机磁盘IO负载过高 expr: round(sum by(hostname,device)(node_disk_ioutil{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"}))>99.0 for: 30m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群物理机磁盘IO负载过高 - alert: WEBSOCKET_LIMIT_EXCEED expr: sum(runtimeError{eventName=~"WEBSOCKET_LIMIT_EXCEED"} - runtimeError{eventName=~"WEBSOCKET_LIMIT_EXCEED"} offset 1m)>0.1 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: WEBSOCKET_LIMIT_EXCEED - alert: 车载 webhook/webapi 内存占用告警 expr: sum by(pod_name)(jvm_memory_used_bytes{pod_project=~"lyra-webhook-.*|lyra-webapi.*",job="metrics-service-d1-prod",area="heap"})*100/sum by(pod_name)(jvm_memory_max_bytes{pod_project=~"lyra-webhook-.*|lyra-webapi.*",job="metrics-service-d1-prod",area="heap"})>85.0 for: 10m labels: severity: emergency annotations: summary: "{{ $value }}" description: 车载 webhook/webapi 内存占用告警 - alert: 大数据集群物理机磁盘坏道Media_Error expr: sum by(hostname,slot)(node_disk_media_error{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"})>50.0 for: 1m 
labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群物理机磁盘坏道Media_Error - alert: DUI正式环境K8S节点非docker非glusterfs磁盘空间不足 expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{job=~"base-exporter-d[123]-prod",hostname!~".*glusterfs.*",mountpoint!~"/var/lib/docker|/data/gluster/storage.*",})*100,0.01)>80.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI正式环境K8S节点非docker非glusterfs磁盘空间不足 - alert: NWA云之家工单Redis到期 expr: round((max by (domain, manager) (aispeech_expired_timestamp_redis_aliyun) - time())/24/3600,0.1)<=7.4 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: NWA云之家工单Redis到期 - alert: (DUI)阿里云RDS CPU使用率 expr: aliyun_acs_rds_dashboard_CpuUsage{instanceId!~"rm-bp1b6b4v33714735u|rm-bp11e5suu33pk624d|rm-bp145f276j7q8rr06|rm-bp1m4ks59104ovz01|rr-bp144ed50n909534y"} +on(instanceId) group_left(DBInstanceDescription) (label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))>80.0 for: 3m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云RDS CPU使用率 - alert: portal-ba-asr/tts-test expr: histogram_quantile(0.95, rate(portal_histogram_bucket{eventName="BaApi.getAsrApiConfigByPidAndType"}[5m]))>2.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: BA获取ASR/TTS超时报警 - alert: 顺丰全场景线上请求NLU告警 expr: sum(runtimeError{serviceName=~"nlu-fusion",env=~"prod",eventName=~"914009574"} - runtimeError{serviceName=~"nlu-fusion",env=~"prod",eventName=~"914009574"} offset 1m)>=2.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 顺丰全场景线上请求NLU存在异常,请尽快处理 - alert: 一句话pod内存告警 expr: sum by(pod_name) (kube_metrics_server_pods_mem{k8scluster="d3-prod",pod_name=~"me-asr-onesentence.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) / 1024)>0.9 for: 2m labels: severity: warning 
annotations: summary: "{{ $value }}" description: 一句话pod内存告警 - alert: 智能客服oss探针返回非200 expr: sum by(des,k8scluster)(probe_http_status_code{job=~"http-probestatus-.*",des=~"http://oss.talkinggenie.com/minio/health/live"})!=200.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 智能客服oss探针返回非200 - alert: 异常状态码过高(线上识别服务) expr: sum by(mode,status,env)(duimonitor_service_exception{env=~"prod",service="ASR",status!~"011307|011300|011301|011303|011304|011306|011311|011307[ws计算进程interrupt]|011300[http upload new 失败]|011301[http协议上传参数非法]|011303[http请求参数非json]|011304[http请求参数非法]|011306[ws请求参数非法]"})>20.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 异常状态码过高(线上识别服务) - alert: 物理机磁盘空间不足(线上识别服务) expr: round((sum by(hostname,device,mountpoint)(node_disk_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"casr-.*"})*100,0.01)>80.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机磁盘空间不足(线上识别服务) - alert: 顺丰全场景流量低 expr: sum(delta(countRoute{cityCode=~"311|371|022|029",eventName=~"countCallIn",env=~"prod",productId=~"914009574"}[60m]))<10.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 顺丰全场景一小时流量低于10 - alert: NWA计划任务执行结果 expr: min by(taskId,taskName,endTime) (aispeech_nwa_task_cron_status{})!=1.0 for: 5m labels: severity: emergency annotations: summary: "{{ $value }}" description: 计划任务执行结果 - alert: 外呼监控 expr: sum(call_out_cdr_counter{connectStatus="0",customerName="顺丰速运",env="alpha",eventName="countCdr",serviceName="smart-callout-boot",taskName="顺丰速运-测试"})/sum(call_out_cdr_counter{customerName="顺丰速运",env="alpha",eventName="countCdr",serviceName="smart-callout-boot",taskName="顺丰速运-测试"})>0.9 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 外呼接通率监控 - alert: oauth2 CPU监控告警 expr: round(sum(kube_metrics_server_pods_cpu{pod_namespace="odcp",pod_name=~"mscp-component-oauth2-v2.*"}) by 
(pod_name) / (sum(label_replace(kube_pod_container_resource_limits{namespace="odcp",pod=~"mscp-component-oauth2-v2.*",resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: oauth2 CPU监控告警 - alert: file-resync2-预读失败告警 expr: sum(duimonitor_service_exception{service="resync2",mode="exception-preread"})>1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: file-resync2-预读失败告警 - alert: me-asr-vad-service pod内存过高 expr: round(sum by(pod_name) (kube_metrics_server_pods_mem{k8scluster=~"d.-prod",pod_name=~"me-asr-vad-service.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) / 1024)*100,0.01)>90.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: me-asr-vad-service pod内存过高 - alert: 外呼机器人邮件发送告警 expr: sum(runtimeError{env=~"prod",module=~"smart-outbound-admin-boot",eventName=~"doSendEmail error"} - runtimeError{env=~"prod",module=~"smart-outbound-admin-boot",eventName=~"doSendEmail error"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 外呼机器人、基础充值邮件通知失败提醒 - alert: 资源同步v1-redisagent expr: sum by(errorId, errorMsg) (autosyncserver_monitor_redis_write{env="prod"} - autosyncserver_monitor_redis_write{env="prod"} offset 1m)>1.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: redisagent failed - alert: 大数据集群物理机5分钟内发生重启 expr: round(sum by(hostname,hostip)(node_uptime{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"}) / 60)<5.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群物理机5分钟内发生重启 - alert: ba-gateway调用额度用完 expr: sum(runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"RequestCountLimiter_Forbidden|WebSocketCountLimiter_Forbidden"} - 
runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"RequestCountLimiter_Forbidden|WebSocketCountLimiter_Forbidden"} offset 1m)>150.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 超过当日最大调用次数/调用额度已用完/未充值无可调用次数 - alert: 国科LB告警 expr: sum by (mode)(ops_monitor)!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 国科LB告警 - alert: 大数据iData celery队列0告警(Prod环境) expr: sum by(hostname,env,queue,status)(monitor_celery_tasks{env="prod",status="True",queue=~"idata_rsbj_ba_backend|idata_others|idata_data_engineer"})>16.0 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据iData celery队列0告警(Prod环境) - alert: smart-partner-OutboundStatusException expr: sum(runtimeError{serviceName=~"smart-partner-module-boot",env=~"prod",eventName=~"OutboundStatusException"} - runtimeError{serviceName=~"smart-partner-module-boot",env=~"prod",eventName=~"OutboundStatusException"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: (DUI)阿里云MongoDB 连接数使用率 expr: sum by(instanceId,role)(aliyun_acs_mongodb_ConnectionUtilization{instanceId!~"dds-bp11d328e4e8da74|dds-bp17ead8b96974c4|dds-bp177ae7a19a9a34|dds-bp1f6f784b1a3024|dds-bp1c769e0b14fc94|dds-bp1ab24d3ff1b734|dds-bp15c90ebc25a554|dds-bp15c9d449e3b214|dds-bp1911c580f6cb44|dds-bp1e9e91ed7bfe24"})>70.0 for: 3m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云MongoDB 连接数使用率 - alert: 物理机相对load5过高(线上声纹和其他服务) expr: round((sum by(hostname)(node_load5{hostname!~".*beta.*"}/node_cpu_core) and on (hostname) node_k8s_service{servicename=~"me-asr-vad-.*|me-vpr-.*|vpr-dp-sr-.*|vpr-lti-sr-.*|vpr-sti-sr-.*|vpr-ti-sr-.*|vpr-verify-.*|vpr-supplement"}),0.0001)>1.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机相对load5过高(线上声纹和其他服务) - alert: 正式长语音国科大于499 expr: 
sum(rate(gateway_fail{env="d3-prod",host="lasr.duiopen.com",status=~"499|500|502|503|504",tag="apisix"}[40s]))>0.03 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 正式长语音国科大于499 - alert: 401错误(一句话识别服务) expr: delta(gateway_product{matched_route_id="00000000000000404564",productId!="279606220",status="401"}[2m])>2.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 401错误(一句话识别服务) - alert: dmEngine_niuhuihua expr: sum(runtimeError{env=~"prod",serviceName=~"dm-engine|dm-runtime",skillId=~"914010874|914011463|914007699|914007697|914007698|914007701|914008232|914008367|914008380|914008677|914012447|914011536|914011532|914011745|914011746|914011850|914010856"} - runtimeError{env=~"prod",serviceName=~"dm-engine|dm-runtime",skillId=~"914010874|914011463|914007699|914007697|914007698|914007701|914008232|914008367|914008380|914008677|914012447|914011536|914011532|914011745|914011746|914011850|914010856"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: bot JSON有问题,请尽快处理 - alert: ES集群物理机SSD磁盘空间不足(大数据) expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{hostname=~"insight-search-17.aispeech.com|insight-search-18.aispeech.com|insight-search-19.aispeech.com|insight-search-20.aispeech.com|insight-search-21.aispeech.com|insight-search-22.aispeech.com|insight-search-23.aispeech.com|insight-search-24.aispeech.com",mountpoint=~"/data1|/data2"})*100,0.01)>75.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: ES集群物理机SSD磁盘空间不足(大数据) - alert: im-adapter(all) expr: sum(runtimeError{serviceName=~"im-adapter",env=~"prod"} - runtimeError{serviceName=~"im-adapter",env=~"prod"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: cdmserver错误码(P0)(线上对话服务) expr: sum 
by(env,errorId,errorMsg,moduleName,pod_name)(delta(cdmsvr2_error_statistic{env="prod",moduleName="cdmserver-v2",errorId=~"010531|010801|080020-01|080020-21"}[1m]))>=10.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: cdmserver错误码(P0)(线上对话服务) - alert: (DUI)阿里云Elasticsearch 节点CPU使用率 expr: aliyun_acs_elasticsearch_NodeCPUUtilization{}>70.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云Elasticsearch 节点CPU使用率 - alert: nlu-platform-service之cnluserver错误码高于1000(线上语义服务) expr: round(sum by(serviceName,env, type)(delta(operation_counter{env="prod",serviceName=~"nlu-platform-service|nlu-platform-service-fullduplex|odcp-nlu-platform-service|nlu-platform-service-vip",type="150501"}[2m])))>1000.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: nlu-platform-service之cnluserver错误码高于1000(线上语义服务) - alert: kylin query latency expr: avg(bdp_prod_kylin_metrics_queryduration_95thpercentile)>10000.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: kylin 查询延迟超过10秒 - alert: 录音文件转写asrReturn4AudioConsumer消息堆积 expr: avg(avg_over_time(rabbitmq_queue_messages_ready{queue=~"asrReturn4AudioConsumer"}[1m]))>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 录音文件转写asrReturn4AudioConsumer消息堆积 - alert: 阿里云ECS服务器3天后过期 expr: round((max by (instanceId, instanceName, submitter) (aispeech_aliyun_ecs_expired_timestamp) - time())/24/3600,0.1)<3.4 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 阿里云ECS服务器3天后过期 - alert: DUI物理机已离线 expr: sum by(k8scluster,hostip,instance)(up{job=~"base-exporter-.*",k8scluster!~"kf-.*",hostip!~"10.24.1.[0-9]+|10.24.14.[0-9]+|10.24.110.[0-9]+|10.20.163.[0-9]+|10.20.171.[0-9]+"} or up{job=~"base-exporter-.*",hostip=~"10.24.110.16[1-3]"})!=1.0 for: 2m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI物理机已离线 - alert: 
DUI集群物理机磁盘状态异常 expr: sum by(hostname,slot,Size)(node_disk_status{hostname!~".*insight.*|.*bigdata.*|.*bdp.*|.*kf.*"})!=1.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: DUI集群物理机磁盘状态异常 - alert: 连云港me-tts-nlu专用redisCPU过高 expr: rate(redis_cpu_user_seconds_total{instance="10.24.12.38:9122"}[5m]) * 100>70.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 连云港me-tts-nlu专用redisCPU过高 - alert: apisix-aios告警5xx expr: rate(gateway_fail{env="d1-prod",host="s.api.aispeech.com",status=~"500|502|503|504",tag="apisix"}[40s])>0.5 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: apisix-aios告警5xx - alert: dds-v3鉴权不过 expr: sum by (matched_route_id,productId)(delta(gateway_product{matched_route_id=~"00000000000000001680|00000000000000001542|00000000000000001490|00000000000000001468|00000000000000001424|00000000000000001336|00000000000000001332",productId!~"278578029|279602826|279605355|279602059|279607587|279602700|279611343",productId!~"",status="401"}[1m]))>20.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: ddsv3接口鉴权不过 - alert: bdp-lb机器TCP连接数过高 expr: sum by(hostname,status)(node_tcp_count{hostname=~".*-bdp-lbint.*"})>60000.0 for: 10m labels: severity: emergency annotations: summary: "{{ $value }}" description: bdp-lb机器TCP连接数过高 - alert: 对话中控语义错误码监控(P0)(线上对话服务) expr: sum by(env,errorId,errorMsg,moduleName,pod_name)(delta(dmdispatch_error_statistic{env="prod",moduleName=~"dm-dispatch-server|dm-dispatch-server-fullduplex",errorId=~"010405"}[1m]))>=50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 对话中控语义错误码监控(P0)(线上对话服务) - alert: MEETING_REDIS_HOST(cloud) redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1nte6duqfcb92zc2"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 
0m labels: severity: warning annotations: summary: "{{ $value }}" description: MEETING_REDIS_HOST(cloud) redis内存使用率过高 - alert: 内核服务(aihome、airobot、aitv)并发过高 expr: sum by(pod_project,pod_name)(gauge_op_concurrencies{k8scluster=~"d.-prod",pod_project=~"me-asr-online-wfst-aihome|me-asr-online-wfst-airobot|me-asr-online-wfst-aitv"})>172.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 内核服务(aihome、airobot、aitv)并发过高 - alert: 公有云转写服务任务堆积 expr: sum(runtimeError{eventName=~"Transfer_Task_Count"} - runtimeError{eventName=~"Transfer_Task_Count"} offset 1m)>100.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 公有云转写服务任务堆积 - alert: file-resync2-server pod cpu高告警 expr: round(sum(kube_metrics_server_pods_cpu{pod_name=~"file-resync2-server.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: file-resync2-server pod cpu告警 - alert: 错误码过高(lasr-task-reactor服务) expr: round(sum by(k8scluster,status,pod_project)(idelta(counter_n_errors{k8scluster=~"d[123]-prod",pod_project="lasr-task-reactor"}[3m])))>10.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 错误码过高(lasr-task-reactor服务) - alert: error code 过高(ddsserver服务)华东集群 expr: round(sum by(error_code,host_name,error_msg)(delta(metric_errorcode{env=~"prod",error_code=~"010300|010301|010302|010304|010305|010306|010307|010308|010309|010310|010311|010316|010700"}[1m])))>15.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: error code 过高(ddsserver服务)华东集群 - alert: 资源同步v2-qos-cpu expr: sum by(pod_name)(kube_metrics_server_pods_cpu{pod_name=~"duisys-resource-mover.*"}) / (sum by(pod_name)(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", 
"(.*)")) * 1000 * 1e+06)>0.8 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: qos-cpu - alert: 智能汽车事业部 redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1mb5vfwgiknw7nn3.*|r-bp1ew4pz6gugfofdvb.*"})>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 智能汽车事业部 redis内存使用率过高 - alert: k8s节点内核死锁 expr: kube_node_status_condition{condition="KernelDeadlock", job=~"kube-state-metrics-d.-prod",status="true"}!=0.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: k8s节点内核死锁 - alert: BossCustomer CPU监控告警 expr: round(sum(kube_metrics_server_pods_cpu{pod_namespace="odcp",pod_name=~"mscp-boss-customer.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{namespace="odcp",pod=~"mscp-boss-customer.*",resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: BossCustomer CPU监控告警 - alert: 物理机显存使用率过高(线上LASR服务) expr: round((sum by(hostname,gpu)(node_gpu_mem_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"lasr-.*"})*100,0.01)>90.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机显存使用率过高(线上LASR服务) - alert: filebeat服务unavailable expr: sum by (k8scluster) (kube_daemonset_status_number_unavailable{daemonset ="aispeech-filebeat", namespace ="filebeat",k8scluster=~"d.-prod"})>0.0 for: 10m labels: severity: warning annotations: summary: "{{ $value }}" description: filebeat服务unavailable超过10分钟 - alert: webhook(lua)cpu使用率过高 expr: round(sum(kube_metrics_server_pods_cpu{pod_namespace="cloud",pod_name=~"webhook-[0-9]+-smooth.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{namespace="cloud",pod=~"webhook-[0-9]+-smooth.*",resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) 
*1000)*100/1000000)>95.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: webhook(lua)cpu使用率过高 - alert: DUI集群物理机瞬时流入带宽提高10倍 expr: round(sum by (hostname,netdev)( ((node_net_ratein{hostname!~".*insight.*|.*bigdata.*|.*bdp.*"} - avg_over_time(node_net_ratein[30m]) ) / avg_over_time(node_net_ratein[30m])) and on (hostname,netdev) (avg_over_time(node_net_ratein[30m]) > 10000000) ),0.01)>10.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: DUI集群物理机瞬时流入带宽提高10倍 - alert: apisix上游熔断(线上LASR服务) expr: sum by (upstream_id) (duimonitor_apisix_unhealthz_count{upstream_id="00000000000000004872"})>10.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: apisix上游熔断(线上LASR服务) - alert: cdmserver错误码(P1)(线上对话服务) expr: sum by(env,errorId,errorMsg,moduleName,pod_name)(delta(cdmsvr2_error_statistic{env="prod",moduleName="cdmserver-v2",errorId=~"080002|080003|080015|080016|080017|080018|080019|080020-02|080020-03|080020-04|080020-22|080020-41|080020-90|080020-91|080020-92"}[1m]))>=10.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: cdmserver错误码(P1)(线上对话服务) - alert: me-asr-onesentence-service报错 expr: me_asr_onesentence_error>25.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: me-asr-onesentence-service报错 - alert: 阿里云-国科DUI专线备用线路流量过高 expr: round(sum by(vbrid,vbrname)(duimonitor_vbroutrate{vbrname="杭州-国科-网银互联-备用线路"})/1024/1024,0.01)>450.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 阿里云-国科DUI专线备用线路流量过高 - alert: ES集群物理机内存利用率过高(大数据) expr: round(sum by(hostname)(node_mem_usage{hostname=~"insight-search-17.aispeech.com|insight-search-18.aispeech.com|insight-search-19.aispeech.com|insight-search-20.aispeech.com|insight-search-21.aispeech.com|insight-search-22.aispeech.com|insight-search-23.aispeech.com|insight-search-24.aispeech.com"})*100,0.01)>90.0 for: 3m labels: severity: warning 
annotations: summary: "{{ $value }}" description: ES集群物理机内存利用率过高(大数据) - alert: 大数据集群MySQL连接数使用率 expr: mysql_global_status_threads_connected/mysql_global_variables_max_connections>0.8 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群MySQL连接数使用率 - alert: doris query error expr: sum(rate(doris_prod_doris_fe_query_err[5m]))>3.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: doris query error超过5 - alert: apisix-双活配置异常告警 expr: sum(duimonitor_apisix_dualactive_error_count)>=20.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: apisix-双活配置异常告警 - alert: 403错误(全链路服务) expr: delta(gateway_product{matched_route_id=~"00000000000000015495|00000000000000015487|00000000000002273411|00000000000002273395|00000000000001437849|00000000000002273415|00000000000000015493|00000000000000015494|00000000000000015485|00000000000002273401|00000000000000015486",status="403",productId!~"278573892|279601578|278587526|279602422|278585085|279599183|279607850|279608367|279608363"}[2m])>4.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 403错误(全链路服务) - alert: hotword错误码过高 expr: sum by (error_code) (delta(metric_hotword_errorcode{env=~"prod|lyg-prod",error_code!~"130003|130035"}[5m]))>200.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: hotword错误码过高 - alert: Pod异常重启(aiwork、aiot服务) expr: sum by(namespace,pod)(round(delta(kube_pod_container_status_restarts_total{job="kube-state-metrics-d1-prod",pod=~"aiwork.*|aiot.*"}[10m])))!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod异常重启(aiwork、aiot服务) - alert: logbus nodes expr: count(logbus_prod_logbus_cache_lag)/2<7.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: logbus nodes少于8 - alert: Pod异常重启(短语音重构) expr: sum 
by(k8scluster,namespace,pod)(round(delta(kube_pod_container_status_restarts_total{k8scluster=~"d[123]-prod",pod=~"me-asr-online-wfst.*|me-asr-onlineshort.*"}[10m])))!=0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: Pod异常重启（短语音重构） - alert: kylin nodes expr: count(bdp_prod_kylin_jvm_threadcount)<4.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: kylin nodes数量少于4 - alert: DUI集群物理机瞬时流出带宽提高10倍 expr: round(sum by (hostname,netdev)( ((node_net_rateout{hostname!~".*insight.*|.*bigdata.*|.*bdp.*"} - avg_over_time(node_net_rateout[30m]) ) / avg_over_time(node_net_rateout[30m])) and on (hostname,netdev) (avg_over_time(node_net_rateout[30m]) > 10000000) ),0.01)>10.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI集群物理机瞬时流出带宽提高10倍 - alert: 线下ies生产环境mysql异常告警 expr: mysql_global_status_threads_connected{job="yggdrasil-mid-d0-test",hostname="d0-ies-001"}==0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 线下ies生产环境mysql异常告警 - alert: 情绪识别 expr: sum(runtimeError{eventName=~"emotion_analysis",env=~"prod",serviceName=~"nlu-fusion"} - runtimeError{eventName=~"emotion_analysis",env=~"prod",serviceName=~"nlu-fusion"} offset 1m)>=2.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: nlu-fusion请求情绪识别存在异常，请尽快处理 - alert: smarthome-3rdproxy CPU占用高于85% expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"smarthome-3rdproxy.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",job="kube-state-metrics-d1-prod",pod=~"smarthome-3rdproxy.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06) >85.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: smarthome-3rdproxy CPU占用高于85% - alert: GPU使用率过高（短语音重构） expr: round((sum 
by(k8scluster,hostname,gpu)(node_gpu_usage{k8scluster=~".*-prod"}) and on (hostname) node_k8s_service{servicename=~"me-asr-online.*"}),0.01)>99.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: GPU使用率过高（短语音重构） - alert: hotword热词训练403 expr: gateway_product{matched_route_id="00000000000000471636",status="403",productId!="278572254",productId!="279600850",productId!="278573494",productId!="279595275"} - gateway_product{matched_route_id="00000000000000471636",status="403",productId!="278572254",productId!="279600850",productId!="278573494",productId!="279595275"} offset 1m>30.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: hotword热词训练403 - alert: nlu-platform-service之cnluserver错误码高于300(线上语义服务) expr: round(sum by(serviceName,env, type)(delta(operation_counter{env="prod",serviceName=~"nlu-platform-service|nlu-platform-service-fullduplex|odcp-nlu-platform-service|nlu-platform-service-vip",type="150501"}[2m])))>300.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: nlu-platform-service之cnluserver错误码高于300(线上语义服务) - alert: duisys-nlu-hotword Memory 告警 expr: round(sum(kube_metrics_server_pods_mem{pod_name=~"duisys-nlu-hotword.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_requests{resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name)/1000)*100,0.01)>90.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: duisys-nlu-hotword Memory 告警 - alert: 大数据集群K8S节点docker磁盘空间不足 expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{job=~"base-exporter-d[123]-prod",hostname=~".*insight.*|.*bigdata.*|.*bdp.*",mountpoint=~"/var/lib/docker|/"})*100,0.01)>80.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 大数据集群K8S节点docker磁盘空间不足 - alert: ms-gateway expr: sum(runtimeError{env=~"prod",serviceName=~"ms-gateway",eventName=~"ms_gateway_request_alert"} - 
runtimeError{env=~"prod",serviceName=~"ms-gateway",eventName=~"ms_gateway_request_alert"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: iot-proxy服务CPU占用率高于85% expr: sum(sum_over_time(kube_metrics_server_pods_cpu{pod_name=~"iot-proxy"}[1m]))>85.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: iot-proxy服务CPU占用率高于85% - alert: me-cinfo拨测 expr: speech_blackbox_testing{service=~"me-cinfo-service"}>0.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: me-cinfo拨测告警 - alert: 国科热词cpu load高 expr: sum by (hostname)(node_load5{hostname=~"d3-cpu-001|d3-thacp-001"}) / sum by (hostname)(node_cpu_core)>1.3 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 国科热词cpu load高 - alert: Pod异常(aiwork、aiot服务) expr: sum by (pod)(kube_pod_status_ready{job="kube-state-metrics-d1-prod",pod=~"aiwork.*|aiot.*"} and on (pod) kube_pod_status_phase{phase="Running"} )!=1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod异常(aiwork、aiot服务) - alert: smart-clean-boot expr: sum(runtimeError{serviceName=~"smart-clean-boot",env=~"prod"} - runtimeError{serviceName=~"smart-clean-boot",env=~"prod"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: 集群DNS解析异常 expr: sum by(service, reason)(duimonitor_service_exception{service="dns"})>20.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 集群DNS解析异常 - alert: 异常状态码过高011313(短语音重构) expr: round(sum by(k8scluster,status)(delta(pid_status_total{status=~"011313"}[3m])))>20.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 异常状态码过高011313(短语音重构) - alert: (DUI)阿里云Redis 内存使用率 expr: sum by 
(instanceId,nodeId)(aliyun_acs_kvstore_StandardMemoryUsage{instanceId!~"r-bp11zi6yr1o10y01t3|r-bp177cdiib1dvyc1si|r-bp1y1uzzhvbcqz0e25|r-bp19d64fe48b9384|r-bp156a0ddfcc0c74|r-bp1a0b2be6ccc644|r-bp18ce30a05fe2a4"}or aliyun_acs_kvstore_ShardingMemoryUsage)>93.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: (DUI)阿里云Redis 内存使用率 - alert: 录音文件转写speechssc消息堆积 expr: avg(avg_over_time(rabbitmq_queue_messages_ready{queue=~"smart-inspection-file-new"}[1m]))>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: rabbitmq消费堆积 - alert: aimt-external-controller接口调用超时(会议宝业务) expr: max by(moduleName, method, uri)(rate(http_server_requests_seconds_sum{env="prod",moduleName="external-controller",lang="java",uri!~".*prometheus.*|/\\*\\*|/healthz|/asr/upload/audio.*|/internal/asr/callback.*"}[15m]) / rate(http_server_requests_seconds_count{env="prod",moduleName="external-controller",lang="java",uri!~".*prometheus.*|/\\*\\*|/healthz|/asr/upload/audio.*|/internal/asr/callback.*"}[15m]))>1.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: aimt-external-controller接口调用超时(会议宝业务) - alert: 一句话服务并发过高(线上LASR服务) expr: sum by(pod_name,pod_project)(gauge_n_sentence_concurrency)>70.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 一句话服务并发过高(线上LASR服务) - alert: aiym-intent-detection服务”/v1/intent/get"接口延时过高 expr: sum by(handler,k8scluster,pod_name,method)(histogram_quantile(0.95,http_request_duration_seconds_bucket{k8scluster=~"d1-prod",handler="/v1/intent/get"})*1000)>300.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: aiym-intent-detection服务”/v1/intent/get"接口延时过高 - alert: 对话中控耗时监控(P0)(线上对话服务) expr: sum by(instanceId,latency,mode)(service_duration{mode="pt95",instanceId="dm-dispatch-server-fullduplex",latency=~"dmDispatchDuration|dm|nludispatch|ba"})>1000.0 for: 1m labels: severity: critical annotations: summary: "{{ 
$value }}" description: 对话中控耗时监控(P0)(线上对话服务) - alert: 大数据supervisor进程重启告警 expr: sum by(service,stats)(idelta(supervisor_ontime[5m]))<0.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据supervisor进程重启告警 - alert: d3-011311告警(线上识别服务) expr: sum by(mode,status,describe,detail,cluster)(duimonitor_service_exception{service="ASR",env="prod",status="011311"})>500.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 011311(热词训练失败)阈值500,每分钟 - alert: DUI集群物理机磁盘挂载点读写异常 expr: sum by(device,fs,mountpoint,hostname)(node_disk_mount{hostname!~".*beta.*|insight.*|.*bigdata.*"})!=1.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: DUI集群物理机磁盘挂载点读写异常 - alert: 网关异常(线上LASR服务) expr: round(sum by(host,status,matched_route_id)(delta(gateway_fail{env="d3-prod",host="lasr.duiopen.com",status=~"499|500|502|503|504",tag="apisix"}[1m])))>1.0 for: 3m labels: severity: critical annotations: summary: "{{ $value }}" description: 网关异常(线上LASR服务) - alert: 物理机显卡使用率过高(线上LASR服务) expr: (sum by(hostname,gpu)(node_gpu_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"lasr-.*"})>90.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机显卡使用率过高(线上LASR服务) - alert: 德邦全场景-接口异常 expr: sum(runtimeError{eventName=~"DepponService.API.RW.ERROR",env=~"prod",serviceName=~"smart-dm-inter-boot"} - runtimeError{eventName=~"DepponService.API.RW.ERROR",env=~"prod",serviceName=~"smart-dm-inter-boot"} offset 5m)>5.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: json调用接口异常 - alert: 录音文件转写internalVadReturn4AudioConsumer消息堆积 expr: avg(avg_over_time(rabbitmq_queue_messages_ready{queue=~"internalVadReturn4AudioConsumer"}[1m]))>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: internalVadReturn4AudioConsumer消息堆积 - alert: kubelet证书剩余30天到期提醒 expr: 
(apiserver_client_certificate_expiration_seconds_count{job=~"kube-apiserver-.*"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=~"kube-apiserver-.*"}[5m]))) < 30*3600*24)>0.0 for: 30m labels: severity: critical annotations: summary: "{{ $value }}" description: kubelet证书剩余30天到期提醒 - alert: 智能客服ES集群不健康(kf-alpha-es) expr: sum by(es_name,k8scluster,mode)(kf_monitor_elasticsearch{k8scluster="kf-alpha",mode="health"})!=0.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 智能客服ES集群不健康(kf-alpha-es) - alert: wechat-assistant-java expr: sum(runtimeError{env=~"prod",serviceName=~"wechat-assistant-java",eventName=~"wechat_admin_logout"} - runtimeError{env=~"prod",serviceName=~"wechat-assistant-java",eventName=~"wechat_admin_logout"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: 国科宿主机DNS请求数过高 expr: sum(rate(coredns_dns_requests_total{job=~"coredns-host-d3-prod"}[2m])) by (server)>2000.0 for: 2m labels: severity: emergency annotations: summary: "{{ $value }}" description: 国科宿主机DNS请求数过高 - alert: (DUI)阿里云MongoDB 内存使用率 expr: sum by(instanceId,role)(aliyun_acs_mongodb_MemoryUtilization{instanceId!~"dds-bp11d328e4e8da74|dds-bp17ead8b96974c4|dds-bp177ae7a19a9a34|dds-bp1f6f784b1a3024|dds-bp15c90ebc25a554|dds-bp15c9d449e3b214|dds-bp1911c580f6cb44|dds-bp1e9e91ed7bfe24"})>90.0 for: 3m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云MongoDB 内存使用率 - alert: 大数据集群物理机磁盘fstab卷丢失 expr: sum by(device,fs,mountpoint,hostname)(node_disk_mount{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"})==0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群物理机磁盘fstab卷丢失 - alert: 物理机CPU使用率过高(线上tts服务) expr: round((sum by(hostname)(node_cpu_usage_total{job="base-exporter-d1-prod"}) and on (hostname) 
node_k8s_service{namespace="cloud",servicename=~"dds-tts-lite|tts-kf"})*100,0.01)>75.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机CPU使用率过高(线上tts服务) - alert: 阿里云-连云港DUI专线备用线路流量过高 expr: round(sum by(vbrid,vbrname)(duimonitor_vbroutrate{vbrname="杭州-连云港-天地祥云-备用线路"})/1024/1024,0.01)>50.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 阿里云-连云港DUI专线备用线路流量过高 - alert: apisix/auth/proxy/stream端口syn-recv状态拥堵 expr: ((sum by(hostname)(node_tcp_synsent{hostname!~".*beta.*"}) and on(hostname) node_k8s_service{podname=~"apisix-[0-9a-z]+-[0-9a-z]+"}) +on(hostname) group_left(podname) node_k8s_service{podname=~"apisix-[0-9a-z]+-[0-9a-z]+"})>200.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: apisix/auth/proxy/stream端口syn-recv状态拥堵 - alert: Pod异常重启(线上对话服务) expr: sum by(namespace,pod)(round(delta(kube_pod_container_status_restarts_total{job="kube-state-metrics-d1-prod",pod=~"dm-dispatch-.*|dsk-dm-.*|cdmserver-.*"}[10m])))!=0.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: Pod异常重启(线上对话服务) - alert: apisix-mirror异常 expr: sum(duimonitor_apisix_mirror_error)>=8000.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: apisix-mirror异常 - alert: 大数据Airflow Queued Tasks数 expr: sum by(env,hostname)(monitor_airflow_queued_tasks_sum{env="prod"})>200.0 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据Airflow Queued Tasks数 - alert: 设备激活QPS告警(owl服务) expr: sum by(moduleName,uri)(rate(http_server_requests_seconds_count{lang="java", env="prod", moduleName="mscp-owl", uri=~".*/device/register"}[5m]))>900.0 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: 设备激活QPS告警(owl服务) - alert: 顺丰下单超时 expr: sum(runtimeError{env=~"prod",serviceName=~"dataclean"} - runtimeError{env=~"prod",serviceName=~"dataclean"} offset 1m)>2.0 for: 2m labels: severity: 
critical annotations: summary: "{{ $value }}" description: 超时告警 - alert: 错误码告警(线上声纹服务) expr: sum by(k8scluster,pod_namespace,pod_project,status)(idelta(counter_error_tasks{k8scluster="d3-prod",status!~"203|401|3|0|4"}[1m]))>5.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 错误码告警(线上声纹服务) - alert: nlu-platform-service拨测(线上语义服务) expr: sum by(api,mode,message,service)(speech_blackbox_testing{mode=~"availability",service="nlu-platform-service"})!=0.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: nlu-platform-service拨测(线上语义服务) - alert: 大数据logbus-public-edge SLB最大连接数使用率过高 expr: sum by(instanceId,vip)(aliyun_acs_slb_dashboard_InstanceMaxConnection{instanceId="lb-bp1a3wpvpq3i4iylr1dye"})/10000>130.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 大数据logbus-public-edge SLB最大连接数使用率过高 - alert: 资源同步v1-rsync expr: sum by(errorId, errorMsg) (autosyncserver_monitor_rsync{env="prod"} - autosyncserver_monitor_rsync{env="prod"} offset 1m)>1.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: rsync失败 - alert: 大数据代理 服务状态告警 expr: bdp_monitor_lb_status!=0.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据代理 服务状态告警 - alert: runtime error(ddsserver服务) expr: delta(metric_runtime_error{env="prod"}[1m])>0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: runtime error(ddsserver服务) - alert: 国科单点mongo挂了 expr: ops_mongo_nfs001_monitor==0.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 国科单点mongo挂了 - alert: 车载 webhook / webapi 信源慢-接口耗时告警 expr: sum by(application,uri)(increase(http_server_requests_seconds_sum{pod_project=~"lyra-webhook-.*|lyra-webapi.*",job="metrics-service-d1-prod",uri=~"/car.*|/notification.*|/dui/sgmwcloud.*"}[3m]))*1000/sum 
by(application,uri)(increase(http_server_requests_seconds_count{pod_project=~"lyra-webhook-.*|lyra-webapi.*",job="metrics-service-d1-prod",uri=~"/car.*|/notification.*|/dui/sgmwcloud.*"}[3m]))>3500.0 for: 10m labels: severity: warning annotations: summary: "{{ $value }}" description: 车载 webhook / webapi 信源慢-接口耗时告警 - alert: doris connections expr: max(doris_prod_doris_fe_connection_total)>=1000.0 for: 60m labels: severity: critical annotations: summary: "{{ $value }}" description: doris connections超过1000 - alert: TSDB时序数据库磁盘使用率过高 expr: sum by(tsdbid,tsdbname)(duimonitor_tsdb_diskusage)>80.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: TSDB时序数据库磁盘使用率过高 - alert: DUI阿里云SLB后端实例不健康 expr: sum by(instanceId,instance_name,instance_dept,instance_status,vip,port,url)((aliyun_acs_slb_dashboard_UnhealthyServerCount or aliyun_acs_slb_dashboard_UnhealthyServerCountWithRule) *on(instanceId) group_left(instance_name,instance_dept,instance_status) aispeech_aliyun_slb_spec_info{instance_name!~"智能客服.*|kf-.*"})!=0.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI阿里云SLB后端实例不健康 - alert: runtime error(线上识别服务) expr: duimonitor_runtimex{env="prod",service="casrserver"}>1.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: runtime error(线上识别服务) - alert: scpbackend CPU占用率高于75% expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"scpbackend.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",job="kube-state-metrics-d1-prod",pod=~"scpbackend.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06)>75.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: scpbackend CPU占用率高于75% - alert: 错误码告警(voice-copy-synthesis-v2服务) expr: sum 
by(errorId)(delta(errorCounter{env="prod",service="voice-copy-synthesis-v2",errorId!~"011000|011006|011008"}[3m]))>12.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 错误码告警(voice-copy-synthesis-v2服务) - alert: file-resync2-user-res-agent cpu高告警 expr: round(sum(kube_metrics_server_pods_cpu{pod_name=~"file-resync2-agent-user-res.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>90.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: file-resync2-user-res-agent cpu高告警 - alert: nlu-hotword服务CPU告警 expr: sum by(pod_name) (kube_metrics_server_pods_cpu{pod_name=~"duisys-nlu-hotword.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",pod=~"duisys-nlu-hotword.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000 * 1000000)>0.8 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: qos 华东ACK nlu-hotword服务cpu占用超过80% - alert: duisys-pasc-server 5xx告警 expr: sum(increase(monitor_pasc_status_code{env="prod", code=~"500|502|503|504"}[5m]))>=1.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: duisys-pasc-server 5xx状态码监控 - alert: 大数据SeaweedFS旧集群磁盘空间不足 expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{hostname=~"insight-audiofs-data-.*"})*100,0.01) >90.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据SeaweedFS旧集群磁盘空间不足 - alert: 雅迪dp-sr耗时拨测(线上声纹服务) expr: speech_blackbox_testing{cluster="vpr",message="200",api="/vpr/v3/register",mode="cost",service="dp-sr-register"}>1.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 雅迪dp-sr耗时拨测(线上声纹服务) - alert: DUI正式环境Pod镜像异常 expr: (sum 
(max_over_time(kube_pod_container_status_waiting_reason{k8scluster=~"d.-prod",namespace!="idata",reason=~"ImagePullBackOff|ErrImagePull|InvalidImageName"}[2m])) by (k8scluster,namespace,pod,container,reason) or sum (max_over_time(kube_pod_init_container_status_waiting_reason{k8scluster=~"d.-prod",namespace!="idata",reason=~"ImagePullBackOff|ErrImagePull|InvalidImageName"}[2m])) by (k8scluster,namespace,pod,container,reason))>0.0 for: 6m labels: severity: emergency annotations: summary: "{{ $value }}" description: '{{ $labels.k8scluster }}集群的{{ $labels.namespace }}/{{ $labels.container }}服务镜像拉取异常超过6分钟' - alert: (DUI)阿里云RDS 内存使用率 expr: aliyun_acs_rds_dashboard_DiskUsage{instanceId!~"rm-bp1b6b4v33714735u|rm-bp11e5suu33pk624d|rm-bp145f276j7q8rr06|rm-bp1m4ks59104ovz01|rr-bp144ed50n909534y"} +on(instanceId) group_left(DBInstanceDescription) (label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))>80.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: (DUI)阿里云RDS 内存使用率 - alert: DUI防火墙公网IP下行流量过高 expr: round(sum by(ifAlias,ifDescr)(irate(ifHCInOctets{ifDescr=~"GigabitEthernet0/0/6|GigabitEthernet1/0/8",instance="10.24.20.4"}[5m]))/1024/1024*8,0.01)>40.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI防火墙公网IP下行流量过高 - alert: 阿里云exporter离线 expr: up{job="aliyun-exporter"}!=1.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: 阿里云exporter离线 - alert: ba-gateway expr: sum(runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"request_count_overlimiter|redis_failure_alert"} - runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"request_count_overlimiter|redis_failure_alert"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: aiym-intent-detection服务Pod异常 expr: kube_pod_status_ready{k8scluster="d1-prod",pod=~"aiym-intent-detection.*",condition="true"}!=1.0 for: 2m labels: severity: warning 
annotations: summary: "{{ $value }}" description: aiym-intent-detection服务Pod异常 - alert: dcaserver CPU占用率高于75% expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"dcaserver.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",job="kube-state-metrics-d1-prod",pod=~"dcaserver.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06)>85.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: dcaserver CPU占用率高于75% - alert: 异常状态码过高其它(短语音重构) expr: sum by(k8scluster,status)(delta(pid_status_total{status!~"0|200|011312|011313"}[1m]))>30.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 异常状态码过高其它(短语音重构) - alert: 车载 webhook / webapi 超时占比告警 expr: sum by(application,target)(increase(lyra_requests_timeout_count{job="metrics-service-d1-prod"}[3m]))*100/sum by(application,target)(increase(lyra_requests_count{job="metrics-service-d1-prod"}[3m]))>20.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: 车载 webhook / webapi 超时占比告警 - alert: MEDINGDING产品级并发预警 expr: sum by(productId,service,conn,max,pod_name) (servcie_connections_ratio{env="d1-prod",productId!~"279598784|279599155|279613425|279600850"})>50.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: MEDINGDING产品级并发预警 - alert: 资源同步v1-qos-cpu expr: sum by(pod_name) (kube_metrics_server_pods_cpu{pod_name=~"auto-sync-server.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) * 1000 * 1e+06) >0.8 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: qos-cpu - alert: ES集群物理机CPU利用率过高(大数据) expr: round(sum 
by(hostname)(node_cpu_usage_total{hostname=~"insight-search-17.aispeech.com|insight-search-18.aispeech.com|insight-search-19.aispeech.com|insight-search-20.aispeech.com|insight-search-21.aispeech.com|insight-search-22.aispeech.com|insight-search-23.aispeech.com|insight-search-24.aispeech.com"})*100,0.01) >80.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: ES集群物理机CPU利用率过高(大数据) - alert: DUI正式环境Gluster机器磁盘空间不足 expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{k8scluster!~"d1-.*",hostname=~".*glusterfs.*"})*100,0.01)>94.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI正式环境Gluster机器磁盘空间不足 - alert: 华东lasr-sentence用量 expr: delta(gateway_product{matched_route_id="00000000000000001560",status="403",productId!~"279593784|279600424|279595943|279601347|279603827|279604013|279596911|279597576|279606758|279597614|278587791|279604555|279608596|279608520|279595362|279608893|278587440|279594815|278589443|279612015",productId!~""}[1m])>1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 华东lasr-sentence用量 - alert: 华东apisix上游熔断 expr: sum by (upstream_id,upstream_ip) (duimonitor_apisix_unhealthz_count{host=~"d1.*|apisix-internal-auth-648dc85d44.*|apisix-internal-24682-smooth-57648cddbd.*"})>10.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 华东apisix上游熔断 - alert: 世界之树中间件服务器磁盘使用率过高 expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{hostname="yggdrasil-mid-001"}*100),0.01)>80.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 世界之树中间件服务器磁盘使用率过高 - alert: k8s节点Pod网络不通 expr: probe_success{job=~"podnetwork-check-.*", pod_ready="true"}!=1.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: k8s节点Pod网络不通 - alert: 国科大数据kafka1.0离线告警 expr: count(sum 
by(name,hostname)(kafka_server_brokertopicmetrics{job="bdp-prod-kafka-d3-bigdata",name=~"MessagesInPerSec",type="FiveMinuteRate"}>1)) !=6.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 国科大数据kafka1.0离线告警 - alert: DSKDM 错误码告警5分钟内出现60次 expr: sum(increase(DSKDM_monitor_dskdm_error{env="prod"}[5m])) by (errorId)>60.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: DSKDM 错误码告警5分钟内出现60次 - alert: 综合业务-DUI开放平台 redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1244cq42udnug1dl.*"})>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 综合业务-DUI开放平台 redis内存使用率过高 - alert: 大数据防火墙公网IP下行流量过高 expr: round(sum by(ifAlias,ifDescr)(irate(ifHCInOctets{ifDescr=~"GigabitEthernet0/0/1|GigabitEthernet1/0/4",instance="10.24.20.4"}[5m]))/1024/1024*8,0.01)>100.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 大数据防火墙公网IP下行流量过高 - alert: 连云港me-tts-nlu专用redis内存过高 expr: 100 * (redis_memory_used_bytes{instance="10.24.12.38:9122"} / redis_memory_max_bytes{instance="10.24.12.38:9122"} )>80.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 连云港me-tts-nlu专用redis内存过高 - alert: ES集群物理机磁盘IO负载过高(大数据) expr: sum by(hostname,device)(node_disk_ioutil{hostname=~"insight-search-17.aispeech.com|insight-search-18.aispeech.com|insight-search-19.aispeech.com|insight-search-20.aispeech.com|insight-search-21.aispeech.com|insight-search-22.aispeech.com|insight-search-23.aispeech.com|insight-search-24.aispeech.com"})>90.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: ES集群物理机磁盘IO负载过高(大数据) - alert: 大数据集群物理机磁盘Inode使用率过高 expr: sum by(hostname,device,mountpoint)(node_disk_inode_usage{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"})>0.8 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群物理机磁盘Inode使用率过高 - alert: (DUI)阿里云RDS 
连接数使用率 expr: aliyun_acs_rds_dashboard_ConnectionUsage{instanceId!~"rm-bp1b6b4v33714735u|rm-bp11e5suu33pk624d|rm-bp145f276j7q8rr06|rm-bp1m4ks59104ovz01|rr-bp144ed50n909534y"} +on(instanceId) group_left(DBInstanceDescription) (label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))>70.0 for: 3m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云RDS 连接数使用率 - alert: NEW-MSCP-OWL-REDIS redis内存使用率过高 expr: sum by(nodeId,instanceId)(aliyun_acs_kvstore_ShardingMemoryUsage{instanceId=~"r-bp13qfgkrjnzqcvot6"})>90.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: NEW-MSCP-OWL-REDIS redis内存使用率过高 - alert: 小鹏产品404 expr: delta(gateway_product{matched_route_id="00000000000000001378",status=~"404|403|500|502|503|504",productId="278578689"}[1m])>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 小鹏产品404 - alert: 连云港宿主机DNS请求数过高 expr: sum(rate(coredns_dns_requests_total{job=~"coredns-host-d2-prod"}[2m])) by (server)>2000.0 for: 2m labels: severity: emergency annotations: summary: "{{ $value }}" description: 连云港宿主机DNS请求数过高 - alert: 物理机上有大于100M的盘未挂载(单位:G) expr: round(sum by(hostname,device,fstype)(node_disk_noused{device!~".*ceph.*",disktype!~"loop|swap|dm|rom|",fstype!~"swap|vfat|LVM2_member|zfs_member"})/1024/1024/1024)>0.1 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 物理机{{ $labels.hostname }}上有大于100M的盘未挂载 - alert: d3-lb-001机器TCP连接数过高 expr: sum by(hostname,status)(node_tcp_count{hostname=~"d3-lb-001"})>60000.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: d3-lb-001机器TCP连接数过高 - alert: file-resync2-server内存高告警 expr: round(sum by(pod_name) (kube_metrics_server_pods_mem{pod_name=~"file-resync2-server.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) / 1024) * 100)>80.0 for: 0m labels: 
severity: warning annotations: summary: "{{ $value }}" description: file-resync2-server内存高告警 - alert: 太仓挪车外部接口报警 expr: sum(runtimeError{serviceName=~"smart-dm-inter-boot",eventName=~" TaiCangService.loadCarOwnerInfo|TaiCangService.loadCarOwnerInfo.wd",level=~"error",env=~"prod"} - runtimeError{serviceName=~"smart-dm-inter-boot",eventName=~" TaiCangService.loadCarOwnerInfo|TaiCangService.loadCarOwnerInfo.wd",level=~"error",env=~"prod"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 太仓挪车外部接口报警 - alert: DUI正式环境Pod被驱逐 expr: sum by (k8scluster, namespace, pod) (kube_pod_status_reason{k8scluster=~"d.-prod", reason="Evicted"})>0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI正式环境Pod被驱逐 - alert: 海尔识别超限 expr: delta(gateway_product{productId="279598153",status="403"}[1m])>1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 海尔识别超限 - alert: 车载 webhook/webapi CPU占用告警 expr: sum by(pod_name)(system_cpu_usage{pod_project=~"lyra-webhook-.*|lyra-webapi.*",job="metrics-service-d1-prod"})*100>85.0 for: 10m labels: severity: emergency annotations: summary: "{{ $value }}" description: 车载 webhook/webapi CPU占用告警 - alert: 大数据集群物理机小于1T的磁盘空间不足 expr: round(sum by(hostname,device,mountpoint)((node_disk_usage{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"} and on(hostname,device) (node_disk_total<1099511627777)))*100,0.01)>85.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群物理机小于1T的磁盘空间不足 - alert: accountlink内存占用率高于75% expr: round(sum by(pod_name) (kube_metrics_server_pods_mem{job="metrics-server-exporter-d1-prod",pod_name=~"accountlink.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte",job="kube-state-metrics-d1-prod",pod=~"accountlink.*"}, "pod_name", "$1", "pod", "(.*)")) / 1024) * 100)>75.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" 
description: accountlink内存占用率高于75% - alert: 欧瑞博热词超并发 expr: sum by (productId) (delta(gateway_product{matched_route_id="00000000000000471636",status="403",productId=~"278574405|278585312"}[1m]))>100.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 欧瑞博热词超并发 - alert: gi expr: sum(runtimeError{serviceName=~"nlu-fusion",eventName=~"greeting|gi|person-name",env=~"prod"} - runtimeError{serviceName=~"nlu-fusion",eventName=~"greeting|gi|person-name",env=~"prod"} offset 1m)>=2.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: nlu-fusion请求gi or gi-server存在异常,请尽快处理 - alert: 对话中控错误码(P0)(线上对话服务) expr: sum by(env,errorId,errorMsg,moduleName,pod_name)(delta(dmdispatch_error_statistic{env="prod",moduleName=~"dm-dispatch-server|dm-dispatch-server-fullduplex",errorId=~"010415|010501|010502|080015|080016|150000|150001|150002|150003|010411"}[1m]))>0.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 对话中控错误码(P0)(线上对话服务) - alert: 大数据RabbitMQ读写告警 expr: sum without(hostip,instance,job,scriptname)(bdp_monitor_rabbitmq)!=1.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 大数据RabbitMQ读写告警 - alert: 研发部-语言及知识服务-应用服务redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_StandardMemoryUsage{instanceId=~"r-bp1uapy2qcydzj22z4"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 研发部-语言及知识服务-应用服务redis内存使用率过高 - alert: DSKDM_monitor_dskdm_error(线上对话服务) expr: sum by(env,errorId)(irate(DSKDM_monitor_dskdm_error{env="prod"}[5m]))>=10.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: DSKDM_monitor_dskdm_error(线上对话服务) - alert: DUI集群物理机磁盘坏道Other_Error expr: sum 
by(hostname,slot)(node_disk_other_error{hostname!~".*insight.*|.*bigdata.*|.*bdp.*|.*kf.*"})>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: DUI集群物理机磁盘坏道Other_Error - alert: 快递资源告警(NLU引擎) expr: sum(runtimeError{serviceName=~"nlu-fusion",env=~"prod",eventName=~"aics|aiexreminder|aiytofull|aideppon|aidate|ailetterid|ailetterid_digits|aiamount|aiexmodel"} - runtimeError{serviceName=~"nlu-fusion",env=~"prod",eventName=~"aics|aiexreminder|aiytofull|aideppon|aidate|ailetterid|ailetterid_digits|aiamount|aiexmodel"} offset 1m)>=2.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 快递资源:aics、aiexreminder、aiytofull、aiamount、aiexmodel、aideppon - alert: IT系统K8S集群节点NotReady expr: sum by(k8scluster,node)(kube_node_status_condition{node=~"d0-test-itsystem00.*",condition="Ready",status="true"})!=1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: IT系统K8S集群节点NotReady - alert: Pod重启告警(线上语义服务) expr: (sum by (pod,k8scluster)(kube_pod_status_ready{k8scluster=~"d.-prod",pod=~".*cnluserver.*|olive-semantic-aidui-.*|olive-semantic-bcd-.*|nlu-platform-service-.*",condition="true"} and on (namespace,pod) kube_pod_status_phase{phase="Running"}))!=1.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: Pod重启告警(线上语义服务) - alert: DUI集群物理机磁盘Inode使用率过高 expr: sum by(hostname,device,mountpoint)(node_disk_inode_usage{hostname!~".*insight.*|.*bigdata.*|.*bdp.*|.*kf.*"})>0.8 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: DUI集群物理机磁盘Inode使用率过高 - alert: business服务15分钟5xx错误率 > 60% expr: sum(delta(kong_http_status{uri=~"/business.*",code=~"5.*",env="prod",moduleGroup="aimt"}[15m])) / sum(delta(kong_http_status{uri=~"/business.*",env="prod",moduleGroup="aimt"}[15m]))>0.6 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 服务告警 - alert: 大数据集群物理机磁盘挂载点读写异常 expr: sum 
by(device,fs,mountpoint,hostname)(node_disk_mount{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"})!=1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群物理机磁盘挂载点读写异常 - alert: 大数据iData celery队列1告警(Prod环境) expr: sum by(hostname,env,queue,status)(monitor_celery_tasks{env="prod",status="True",queue=~"worker2"})>8.0 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据iData celery队列1告警(Prod环境) - alert: kylin内存异常 expr: max(bdp_prod_kylin_gc_memory_usage{type="used",space="G1 Old Gen"})>2.68435005E10 for: 5m labels: severity: critical annotations: summary: "{{ $value }}" description: kylin内存异常 - alert: doris latency expr: max(doris_prod_doris_fe_query_latency_ms{quantile="0.95"})>60000.0 for: 60m labels: severity: critical annotations: summary: "{{ $value }}" description: doris latency超过60秒 - alert: 阿里云-国科DUI专线主线路流量过高 expr: round(sum by(vbrid,vbrname)(duimonitor_vbroutrate{vbrname="杭州-国科-天地祥云-DUI主线路"})/1024/1024,0.01)>1300.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 阿里云-国科DUI专线主线路流量过高 - alert: DUI专线线路出口流量过高 expr: round(sum by(vbrid,vbrname)(duimonitor_vbroutrate{vbrname!="杭州-国科-网银互联-大数据主线路"})/1024/1024,0.01)>1400.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI专线线路出口流量过高 - alert: smart-partner-OutboundToManyRequest expr: sum(runtimeError{env=~"prod",eventName=~"OutboundToManyRequest",serviceName=~"smart-partner-module-boot"} - runtimeError{env=~"prod",eventName=~"OutboundToManyRequest",serviceName=~"smart-partner-module-boot"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: pushserver-ng Memory QoS expr: sum by(pod_name) (kube_metrics_server_pods_mem{instance="metrics-server-exporter-hd.duiopen.com:80",pod_name=~"duisys-pushserver-ng.*"}) / (sum by(pod_name) 
(label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte",instance="kube-state-metrics-hd.duiopen.com:80",pod=~"duisys-pushserver-ng.*"}, "pod_name", "$1", "pod", "(.*)")) / 1000)>0.92 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: pushserver-ng Memory QoS告警 - alert: accountlink CPU占用率高于75% expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"accountlink.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",job="kube-state-metrics-d1-prod",pod=~"accountlink.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06)>75.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: accountlink CPU占用率高于75% - alert: 阿里复刻训练队列满报警 expr: min(min_over_time(_exported_voice_copy_external_monitor{env=~"prod",org=~"ALI",type=~"waiting_queue_length"}[1m]))>5.0 for: 2m labels: severity: emergency annotations: summary: "{{ $value }}" description: 报警:阿里复刻训练队列连续两分钟超5个 - alert: pid不支持服务类型 expr: sum(runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"ServiceTypeValidation_Unsupported"} - runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"ServiceTypeValidation_Unsupported"} offset 1m)>5.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: pid不支持服务类型 - alert: 连云港apisixstream异常 expr: sum by(mode)(lyg_lb_apisixstream)!=0.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: 连云港apisixstream异常 - alert: etcd集群的leader频繁变更 expr: sum by (etcdusage) (rate(etcd_server_leader_changes_seen_total{job=~"etcd-servers-d.-prod"}[30m]))>3.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: etcd的leader频繁变更 - alert: 华东正式apisix-5xx expr: sum by 
(matched_route_name)(idelta(gateway_fail{env=~"d1-prod",status=~"500|502|503|504",tag=~"apisix-internal|apisix|apisix-auth|apisix-auth-internal",proxyUpstreamName!="cloud_ama-mobile-api_8080",host!~"118.178.44.129|s-test.api.aispeech.com",matched_route_name!="tts>tts-kf>prod>279594681"}[1m]))>20.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 华东正式apisix-5xx - alert: 国科入口haproxy异常 expr: sum by(mode)(ops_monitor)!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 国科入口haproxy异常 - alert: 汇置项目-dev redis内存是使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp13rmzdkqh0u90m3q"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 汇置项目-dev redis内存是使用率过高 - alert: 磁盘空间不足(短语音重构) expr: round((sum by(hostname,device,mountpoint)(node_disk_usage{k8scluster=~".*-prod"}) and on (hostname) node_k8s_service{servicename=~"me-asr-online.*"})*100,0.01)>80.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 磁盘空间不足(短语音重构) - alert: iot-proxy服务内存占用率高于85% expr: sum(sum_over_time(kube_metrics_server_pods_mem{pod_name=~"iot-proxy"}[1m]))>85.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: iot-proxy服务内存占用率高于85% - alert: dsk-dm当请求数小于3(线上对话服务) expr: sum by(env)(delta(DSKDM_monitor_dsk_request{env="prod"}[5m]))<3.0 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: dsk-dm当请求数小于3(线上对话服务) - alert: DUI集群物理机磁盘坏道Media_Error expr: sum by(hostname,slot)(node_disk_media_error{hostname!~".*insight.*|.*bigdata.*|.*bdp.*|.*kf.*"})>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: DUI集群物理机磁盘坏道Media_Error - alert: 华为识别并发告警 expr: sum(service_connecter{servicename=~"casrserver-aihome-hw.*"})>550.0 
for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 华为识别并发告警 - alert: 翻译服务403 expr: sum(rate(gateway_fail{env="d1-prod",host="translation.duiopen.com",status="403",tag="apisix"}[40s]))>0.03 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 翻译服务403 - alert: kubelet证书剩余7天到期提醒 expr: (apiserver_client_certificate_expiration_seconds_count{job=~"kube-apiserver-.*"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=~"kube-apiserver-.*"}[5m]))) < 7*3600*24)>0.0 for: 30m labels: severity: emergency annotations: summary: "{{ $value }}" description: kubelet证书剩余7天到期提醒 - alert: hotword服务内存告警 expr: sum by(pod_name) (kube_metrics_server_pods_mem{instance="metrics-server-exporter-gk.duiopen.com:80",pod_name=~"hotword.*",pod_namespace="cloud"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte",instance="kube-state-metrics-gk.duiopen.com:80",pod=~"hotword.*"}, "pod_name", "$1", "pod", "(.*)")) / 1024)>0.92 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: qos 国科hotword服务内存占用高 - alert: 中台及基础服务服务 redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1440cd176b14e4|r-bp1a4f98e1f90b74|r-bp141guxetaxewmvop|r-bp12b0d2e7a9e264|r-bp11974f56f4f24|r-bp105d6c2f728c94|r-bp1eb121dd3243a4|r-bp18952d465cf734|r-bp1eb7874765cad4|r-bp12edafa9122394|r-bp1974c9e27528a4|r-bp18c8eccfc83434|r-bpan0da3p6gb9qpl5|r-bp1qw6vs7uc08ic1dz|r-bp1qxgfw7rlwih24l4|r-bp1c4a8297e91d54|r-bp11a78f9059ec4|r-bp1173dc70134424|r-bp1oyovlpvuqu4i0ma|r-bp1pz0so0mftgkez5w|r-bp166o63ibnmyjlypk"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>95.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 中台及基础服务服务 redis内存使用率过高 - alert: TSDB时序数据库CPU使用率过高 expr: sum 
by(tsdbid,tsdbname)(duimonitor_tsdb_cpuusage)>80.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: TSDB时序数据库CPU使用率过高 - alert: ansible客户端失去连接 expr: sum by(env,instance,hostip)((ansible_status and on(hostip) up{job=~"base-exporter-d.*"}) * on(hostip) group_left(instance) up{job=~"base-exporter-d.*"} )!=1.0 for: 10m labels: severity: emergency annotations: summary: "{{ $value }}" description: ansible客户端失去连接 - alert: lasr-sentence拨测(线上LASR服务) expr: sum by(api,mode,message,service)(speech_blackbox_testing{mode=~"availability",service="lasr-sentence"})>0.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: lasr-sentence拨测(线上LASR服务) - alert: nlu-platform-service之非cnluserver错误码过高(线上语义服务) expr: sum by(serviceName,env, type)(delta(operation_counter{env="prod",serviceName=~"nlu-platform-service|nlu-platform-service-fullduplex|odcp-nlu-platform-service|nlu-platform-service-vip",type!="150501"}[5m]))>200.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: nlu-platform-service之非cnluserver错误码过高(线上语义服务) - alert: dm-runtime error报警 expr: sum(runtimeError{serviceName=~"dm-runtime",env=~"prod",eventName=~"ExceptionError"} - runtimeError{serviceName=~"dm-runtime",env=~"prod",eventName=~"ExceptionError"} offset 1m)>0.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: dm-runtime error报警 - alert: Pod异常(线上识别服务) expr: sum by (k8scluster,pod)(kube_pod_status_ready{k8scluster=~"d2-prod|d3-prod",namespace="cloud",pod=~"casr-.*|casrserver-.*",condition="true"} and on (namespace,pod) kube_pod_status_phase{phase="Running"} )!=1.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod异常(线上识别服务) - alert: SONG_CONFIG_REDIS_HOST redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1fa3201f6b9174"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, 
"instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: SONG_CONFIG_REDIS_HOST redis内存使用率过高 - alert: 质检后台业务报警 expr: runtimeError{serviceName=~"aistore-ds-transfer",env=~"prod"} - runtimeError{serviceName=~"aistore-ds-transfer",env=~"prod"} offset 1m>3.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 质检后台业务报警 - alert: DUI集群Beta环境物理机load1过高 expr: round(sum by (hostname)(node_load1{hostname=~".*beta.*"}) / sum by (hostname)(node_cpu_core),0.01)>2.0 for: 2m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI集群Beta环境物理机load1过高 - alert: 限流次数过多Large expr: sum(runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"RateLimiter_TooManyRequests_Large"} - runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"RateLimiter_TooManyRequests_Large"} offset 1m)>50.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 被限流后仍继续进行大量请求 - alert: me-cinfo-service FD告警(最大连接) expr: avg(avg_over_time(process_files_open_files{application=~"me-cinfo-service",env=~"prod"}[1m]))>10000.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: me-cinfo-service 最大连接超过10000报警 - alert: runtime error(线上语义服务) expr: sum by(service)(duimonitor_runtimex{service="nlu-platform-service",env="prod"})>0.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: runtime error(线上语义服务) - alert: 词典管理告警 expr: sum(runtimeError{serviceName=~"nlu-dict-manager",env=~"prod"} - runtimeError{serviceName=~"nlu-dict-manager",env=~"prod"} offset 1m)>0.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 词典管理服务告警 - alert: d1-beta-adam-redis和D1-PROD-ADAM-REDIS redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1eec3dd0e5db54|r-bp1dd0f12d7ed514"} +on(instanceId) group_left(InstanceName) 
(label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: d1-beta-adam-redis和D1-PROD-ADAM-REDIS redis内存使用率过高 - alert: 研发部-开发中心-终端SDK&APP redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp12lkxleo1jafo48e"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>90.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 研发部-开发中心-终端SDK&APP redis内存使用率过高 - alert: d3-011307告警(线上识别服务) expr: sum by(mode,status,describe,detail,cluster)(duimonitor_service_exception{service="ASR",env="prod",status="011307"})>2000.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 011307(计算进程中断)阈值1000,每分钟 - alert: apisix/others端口syn-recv状态拥堵 expr: sum by(hostname)(node_tcp_synrecv{hostname!~".*beta.*"} and on(hostname) node_k8s_service{podname=~"apisix-.*",podname!~"apisix-[0-9a-z]+-[0-9a-z]+"})>200.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: apisix/others端口syn-recv状态拥堵 - alert: NWA平台SLB证书3天后过期 expr: round((max by (domain, manager) (aispeech_nwa_https_expired_timestamp) - time())/24/3600,0.1)<=3.4 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: NWA平台SLB证书3天后过期 - alert: smart-clean-boot(all) expr: sum(runtimeError{serviceName=~"smart-clean-boot",env=~"prod"} - runtimeError{serviceName=~"smart-clean-boot",env=~"prod"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: 物理机CPU使用率过高(线上识别服务) expr: (sum by(hostname)(node_cpu_usage_total{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"casr-.*"})*100>90.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机CPU使用率过高(线上识别服务) - alert: eva请求semantic告警 expr: sum 
by(errorId, errorMsg) (eva_monitor_semantic{env = "prod"} - eva_monitor_semantic{env = "prod"} offset 1m)>25.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: eva请求semantic告警 - alert: 华东正式环境rabbitmqSocket满了 expr: sum by(node)(rabbitmq_sockets_used{job="rabbitmq-exporter-d1-prod"})>2000.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 华东正式环境rabbitmqSocket满了 - alert: asrcpbackend CPU占用高于85% expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"asrcpbackend.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",job="kube-state-metrics-d1-prod",pod=~"asrcpbackend.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06)>85.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: asrcpbackend CPU占用高于85% - alert: 正式环境-上手快项目 redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1r11oq6epp4w0bqd"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 正式环境-上手快项目 redis内存使用率过高 - alert: Pod异常重启(线上tts服务) expr: sum by(namespace,pod)(round(delta(kube_pod_container_status_restarts_total{k8scluster="d1-prod",pod=~"dds-tts-lite-.*|tts-kf-.*|me-tts-anti-corruption-service-.*"}[10m])))!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod异常重启(线上tts服务) - alert: scptrainer CPU占用率高于75% expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"scptrainer.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06)>75.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" 
description: scptrainer CPU占用率高于75% - alert: scptrainer内存占用率高于75% expr: round(sum by(pod_name) (kube_metrics_server_pods_mem{job="metrics-server-exporter-d1-prod",pod_name=~"scptrainer.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte",job="kube-state-metrics-d1-prod",pod=~"scptrainer.*"}, "pod_name", "$1", "pod", "(.*)")) / 1024) * 100)>75.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: scptrainer内存占用率高于75% - alert: 质检后台业务 expr: runtimeError{env=~"prod",serviceName=~"aistore-ds-api|aistore-ds-analyse|aistore-ds-storage|aistore-ds-prehandle"} - runtimeError{env=~"prod",serviceName=~"aistore-ds-api|aistore-ds-analyse|aistore-ds-storage|aistore-ds-prehandle"} offset 1m>4.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 质检后台业务报警 - alert: baike expr: sum(runtimeError{env=~"prod",module=~"baike",eventName=~"sys_crf|sys_ltp_pattern|sys_input_output|sys_es_request|threadPool_is_full|sys_mongo_search|sys_rest_kgdb"} - runtimeError{env=~"prod",module=~"baike",eventName=~"sys_crf|sys_ltp_pattern|sys_input_output|sys_es_request|threadPool_is_full|sys_mongo_search|sys_rest_kgdb"} offset 1m)>30.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 服务告警,请实时处理 - alert: 错误码过高(tts-kf服务) expr: sum by(errorId,service)(idelta(errorCounter{env="prod"}[5m]))>50.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 错误码过高(tts-kf服务) - alert: vad-asr错误告警 expr: sum by(k8scluster,pod_name,status)(increase(counter_cloudvad_asr_status_requests{status!="0"}[1m]))>5.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: vad-asr错误告警 - alert: d3-011302告警(线上识别服务) expr: sum by(mode,status,describe,detail,cluster)(duimonitor_service_exception{service="ASR",env="prod",status="011302"})>5.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 
011302(http协议new worker失败)阈值5,每分钟 - alert: pod异常重启(me-cinfo-service服务) expr: sum by(namespace,pod)(round(delta(kube_pod_container_status_restarts_total{job="kube-state-metrics-d1-prod",pod=~"me-cinfo-service.*"}[10m])))!=0.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: pod异常重启(me-cinfo-service服务) - alert: 对话和全双工拨测(线上对话服务) expr: sum by(api,mode,message,service)(speech_blackbox_testing{mode=~"availability",service=~"dm|dskdm|dm-dispatch-server-fullduplex"})>0.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 对话和全双工拨测(线上对话服务) - alert: 限流次数过多 expr: sum(runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"RateLimiter_TooManyRequests"} - runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"RateLimiter_TooManyRequests"} offset 1m)>160.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 被限流后仍继续进行大量请求 - alert: 磁盘IO负载过高(短语音重构) expr: round(sum by(k8scluster,hostname,device)(node_disk_ioutil{k8scluster=~".*-prod"}) and on (hostname) node_k8s_service{servicename=~"me-asr-online.*"})>95.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 磁盘IO负载过高(短语音重构) - alert: 实时话者分离拨测 expr: speech_blackbox_testing{cluster="ssc",service="me-lasr-plus-service",message="200",api="/lasrplus/",mode="availability"}!=0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 实时话者分离拨测 - alert: 翻译服务-qos-cpu expr: sum by(pod_name) (kube_metrics_server_pods_cpu{instance="metrics-server-exporter-hd.duiopen.com:80",pod_name=~"cloud-translation-server.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",instance="kube-state-metrics-hd.duiopen.com:80",pod=~"cloud-translation-server.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000 * 1e+06) >0.8 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: qos-cpu - alert: file-resync2-dcopy内存高 
expr: round(sum by(pod_name) (kube_metrics_server_pods_mem{pod_name=~"file-resync2-dcopy.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) / 1024) * 100)>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: file-resync2-dcopy内存高 - alert: 物理机内存Free不足2G(tts-kf服务) expr: round((sum by(hostname)(node_mem_free{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"tts-kf"})/1024/1024/1024,0.01)<2.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机内存Free不足2G(tts-kf服务) - alert: (DUI)阿里云Redis CPU使用率 expr: sum by (instanceId,nodeId)(aliyun_acs_kvstore_StandardCpuUsage{instanceId!~"r-bp11zi6yr1o10y01t3|r-bp177cdiib1dvyc1si|r-bp1y1uzzhvbcqz0e25|r-bp19d64fe48b9384|r-bp156a0ddfcc0c74|r-bp1a0b2be6ccc644|r-bp18ce30a05fe2a4|r-bp115974f56f4f24"} or aliyun_acs_kvstore_ShardingCpuUsage) >90.0 for: 3m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云Redis CPU使用率 - alert: nginx-ingress-controller服务Pod异常重启 expr: sum by(job,pod)(idelta(kube_pod_container_status_restarts_total{job=~"kube-state-metrics-d.*-prod", namespace="default", pod=~"nginx-ingress-controller.*"}[5m]))!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: nginx-ingress-controller服务Pod异常重启 - alert: aimt-kong-business服务15分钟5xx错误率>60%(会议魔方业务) expr: sum(delta(kong_http_status{uri=~"/business.*",code=~"5.*",env="prod",moduleGroup="aimt"}[15m])) / sum(delta(kong_http_status{uri=~"/business.*",env="prod",moduleGroup="aimt"}[15m]))>0.6 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: aimt-kong-business服务15分钟5xx错误率>60%(会议魔方业务) - alert: 华东hotword用量 expr: delta(gateway_product{matched_route_id="00000000000000001610",status="403",productId!~"278572254"}[1m])>10.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 
华东hotword用量 - alert: dm相关服务Pod的CPU使用率过高(线上对话服务) expr: round(sum(kube_metrics_server_pods_cpu{pod_name=~"dsk-dm-.*|cdmserver-v2-.*|dm-dispatch-server-.*|dm-dispatch-server-fullduplex-.*|ba-simulator-server-.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>80.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: dm相关服务Pod的CPU使用率过高(线上对话服务) - alert: 集群Prometheus离线 expr: (up{job=~"prometheus-.*",job!~".*kf-.*"} or up{job=~"federate-.*",job!~".*kf-.*"})!=1.0 for: 5m labels: severity: emergency annotations: summary: "{{ $value }}" description: 集群Prometheus离线 - alert: 智能家居控制成功率比例过低 expr: sum(rate(DSKDM_monitor_smarthome_error{env="prod", code="OPT-0"}[5m]))/sum(rate(DSKDM_monitor_smarthome_error{env="prod"}[5m]))<0.1 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: 智能家居 控制请求成功比例 - alert: filebeat服务CPU使用率过高-连云港 expr: sum by (k8scluster,node,pod) (round(label_replace(kube_metrics_server_pods_cpu{k8scluster="d2-prod",pod_namespace="filebeat",pod_name=~"aispeech-filebeat-.*",pod_container_name="aispeech-filebeat"}, "pod", "$1", "pod_name", "(.*)") * on(k8scluster,pod) group_right kube_pod_info{k8scluster="d2-prod",namespace="filebeat",pod=~"aispeech-filebeat-.*"}/1e6))>4000.0 for: 10m labels: severity: warning annotations: summary: "{{ $value }}" description: '{{ $labels.k8scluster }}集群{{ $labels.node }}节点filebeat的CPU使用率过高' - alert: 物理机磁盘空间不足(线上LASR服务) expr: round((sum by(hostname,device,mountpoint)(node_disk_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"lasr-.*"})*100,0.01)>80.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机磁盘空间不足(线上LASR服务) - alert: eva请求asr告警 expr: sum by(errorId, errorMsg) (_exported_eva_monitor_asr{env = "prod"} - _exported_eva_monitor_asr{env = "prod"} offset 1m)>25.0 for: 0m labels: severity: 
warning annotations: summary: "{{ $value }}" description: eva请求asr告警 - alert: 汇置项目-prod redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1jgk1fjewcqb8lgw"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 汇置项目-prod redis内存使用率过高 - alert: BossBilling CPU预警 expr: round(sum(kube_metrics_server_pods_cpu{pod_namespace="odcp",pod_name=~"mscp-boss-billing.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{namespace="odcp",pod=~"mscp-boss-billing.*",resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: BossBilling CPU预警 - alert: buffcache两分钟内下降超过5G(线上识别服务) expr: round((sum by(hostname)(abs(delta(node_mem_buffcache{hostname!~".*beta.*"}[2m]))) and on (hostname) node_k8s_service{servicename=~"casr-.*"})/1024/1024/1024,0.01)>5.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: buffcache两分钟内下降超过5G(线上识别服务) - alert: 物理机磁盘空间不足(线上声纹和其他服务) expr: round((sum by(hostname,device,mountpoint)(node_disk_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"me-asr-vad-.*|me-vpr-.*|vpr-dp-sr-.*|vpr-lti-sr-.*|vpr-sti-sr-.*|vpr-ti-sr-.*|vpr-verify-.*|vpr-supplement"})*100,0.01)>90.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机磁盘空间不足(线上声纹和其他服务) - alert: 国科Mongo集群挂了 expr: ops_mongo_monitor!=0.0 for: 3m labels: severity: emergency annotations: summary: "{{ $value }}" description: 国科Mongo集群挂了 - alert: (DUI)阿里云Elasticsearch 节点磁盘使用率 expr: aliyun_acs_elasticsearch_NodeDiskUtilization{}>85.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云Elasticsearch 节点磁盘使用率 - alert: GPU温度过高(短语音重构) expr: 
round((sum by(k8scluster,hostname,gpu)(node_gpu_temp{k8scluster=~".*-prod"}) and on (hostname) node_k8s_service{servicename=~"me-asr-online.*"}),0.01)>80.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: GPU温度过高(短语音重构) - alert: kylin job error num expr: sum(bdp_prod_kylin_error_job)>40.0 for: 30m labels: severity: critical annotations: summary: "{{ $value }}" description: kylin 任务错误数量 - alert: wechat-platform-authorize(all) expr: sum(runtimeError{serviceName=~"wechat-platform-authorize",env=~"prod"} - runtimeError{serviceName=~"wechat-platform-authorize",env=~"prod"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: 国科apisix上游熔断 expr: sum by (upstream_id,upstream_ip) (duimonitor_apisix_unhealthz_count{host=~"d3.*"})>10.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 国科apisix上游熔断 - alert: dds-tts-lite服务总并发数超过120(语音合成服务) expr: sum by(servicename)(service_connecter{servicename="dds-tts-lite"})>100.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: dds-tts-lite服务总并发数超过120(语音合成服务) - alert: 离线转写实时率高于0.3(线上LASR服务) expr: avg(summary_task_real_rate_sum{env="prod"}/summary_task_real_rate_count)>0.3 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 离线转写实时率高于0.3(线上LASR服务) - alert: DUI集群正式环境物理机load1过高 expr: round(sum by (hostname,hostip)(node_load1{hostname!~"kf-.*|.*beta.*|.*bdp.*|.*bigdata.*|insight.*|carrobot.*"} / node_cpu_core),0.01)>1.5 for: 2m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI集群正式环境物理机load1过高 - alert: 五菱私有云合成并发高 expr: sum(service_connecter_private{env="sgmw", servicename="dds-tts-lite"}) by (servicename)>45.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 五菱私有云合成并发超过45 - alert: dm_DepponExpress_warining expr: 
      sum(runtimeError{env=~"prod",serviceName=~"dm-runtime|dm-engine",skillId=~"914006674"} - runtimeError{env=~"prod",serviceName=~"dm-runtime|dm-engine",skillId=~"914006674"} offset 1m)>0.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: The Deppon bot JSON configuration has a problem; please fix it as soon as possible
  - alert: dcaserver 内存占用率高于75%
    expr: round(sum by(pod_name) (kube_metrics_server_pods_mem{job="metrics-server-exporter-d1-prod",pod_name=~"dcaserver.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte",job="kube-state-metrics-d1-prod",pod=~"dcaserver.*"}, "pod_name", "$1", "pod", "(.*)")) / 1024) * 100)>75.0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: dcaserver memory usage above 75%
  - alert: 外呼单通
    expr: sum(runtimeError{eventName=~"callout-legB-error",instance=~"exporter-prod"} - runtimeError{eventName=~"callout-legB-error",instance=~"exporter-prod"} offset 1m)>0.0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: Outbound call one-way audio
  - alert: ba-gateway(Websocket_Err)
    expr: sum(runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"Websocket_Err"} - runtimeError{env=~"prod",serviceName=~"ba-gateway",eventName=~"Websocket_Err"} offset 1m)>5.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Service alert
  - alert: error code 过高(hotword服务)
    expr: sum by(error_code,job)(delta(metric_hotword_errorcode{env="prod",error_code=~"130009|130010"}[1m]))>10.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Error code count too high (hotword service)
  - alert: 系统用户被修改
    expr: sum by (k8scluster,hostip,hostname,username)(delta(node_shadow_md5[5m]))!=0.0
    for: 0m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: System user modified (/etc/shadow changed, user created or deleted, or password changed)
  - alert: 质检-统一nlu
    expr: runtimeError{serviceName=~"aistore-analyse",env=~"prod"} - runtimeError{serviceName=~"aistore-analyse",env=~"prod"} offset 1m>5.0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: Quality inspection - unified nlu
  - alert: 集群Prometheus的CPU使用率超过80%
    expr: round(sum by(pod_name,k8scluster)(kube_metrics_server_pods_cpu{pod_namespace=~"monitoring",pod_name=~"prometheus-.*"}) / (sum by(pod_name,k8scluster)(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) *1000)*100/1000000)>90.0
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Cluster Prometheus CPU usage exceeds 80%
  - alert: smart-inspection
    expr: sum(runtimeError{env=~"prod",serviceName=~"smart-inspection",eventName=~"voiceTransferDataForDevPart3_callback_error|voiceTransferDataForSpeechvi_callback_error|ASR response an incomplete result|asr proxy establish connection to ASR|create_task_for_devPart3_error"} - runtimeError{env=~"prod",serviceName=~"smart-inspection",eventName=~"voiceTransferDataForDevPart3_callback_error|voiceTransferDataForSpeechvi_callback_error|ASR response an incomplete result|asr proxy establish connection to ASR|create_task_for_devPart3_error"} offset 1m)>0.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Service alert
  - alert: aimt-kong-流媒体服务15分钟5xx错误率>60%(会议魔方业务)
    expr: sum(delta(kong_http_status{uri=~"/audio.*",code=~"5.*",env="prod",moduleGroup="aimt"}[15m])) / sum(delta(kong_http_status{uri=~"/audio.*",env="prod",moduleGroup="aimt"}[15m]))>0.6
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: aimt-kong streaming media service 15-minute 5xx error rate >60% (会议魔方 business)
  - alert: 训练意图
    expr: sum(runtimeError{eventName=~"nlu_train_gateway",env=~"prod",serviceName=~"nlu-fusion"} - runtimeError{eventName=~"nlu_train_gateway",env=~"prod",serviceName=~"nlu-fusion"} offset 1m)>=2.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: nlu-fusion requests to nlu_train_gateway are failing; please handle as soon as possible
  - alert: traffic-limit告警
    expr: delta(metric_errorcode{error_code="010605",env="prod",host_name=~"ddsserver-.*"}[1m])>100.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: QPS rate-limiting alert
  - alert: hbase master响应时间
    expr: max(bdp_prod_hbase_master{sub="IPC",type="TotalCallTime_99th_percentile"})>300000.0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: hbase master response time exceeds 5 minutes
  - alert: filebeat日志采集发生积压
    expr: count by (k8scluster, node_name, pod_namespace, pod_name, container_name) (aispeech_filebeat_pods_stdout_bytes{})>3.0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: More than 3 stdout log files are currently being collected for this container
  - alert: 单个Pod并发数使用率过高(线上识别服务)
    expr: (sum by(servicename,export_host,pod_name)(service_connecter{env=~"d3-prod",servicename=~"casr.*"}) / (sum by (servicename,export_host,pod_name)(service_connecter{env=~"d3-prod",servicename=~"casr.*"}) + sum by (servicename,export_host,pod_name)(service_connecter_without{env=~"d3-prod",servicename=~"casr.*"}) ) )* 100>90.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Per-Pod concurrency utilization too high (production recognition service)
  - alert: DSKDM请求耗时告警95值(线上对话服务)
    expr: (service_duration{mode="pt95",instanceId="dm-dispatch-server-fullduplex",latency=~"dm"})>1000.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: DSKDM 95th-percentile latency above 1 second (production dialogue service)
  - alert: dds-tts-lite服务机器load5过高
    expr: round((sum by (hostname)(node_load5{k8scluster=~".*-prod"}) / sum by (hostname)(node_cpu_core{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{podname=~"dds-tts-lite.*"}),0.01)>1.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: load5 too high on dds-tts-lite service machines
  - alert: 地址服务
    expr: sum(runtimeError{serviceName=~"dm-common-service",env=~"prod"} - runtimeError{serviceName=~"dm-common-service",env=~"prod"} offset 1m)>2.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Address service - exception alert
  - alert: casr-airobot产品403
    expr: delta(gateway_product{matched_route_id="00000000000000001364",status="403"}[1m])>10.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: casr-airobot product 403
  - alert: unimrcp
    expr: sum(runtimeError{env=~"prod",serviceName=~"uniMrcp",eventName=~"MRCP_CONNECT_TTS_SERVER_FAIL|unexcept_close_to_asr"} - runtimeError{env=~"prod",serviceName=~"uniMrcp",eventName=~"MRCP_CONNECT_TTS_SERVER_FAIL|unexcept_close_to_asr"} offset 1m)>0.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Service alert
  - alert: 国科预发布apisix-5xx
    expr: sum(rate(gateway_fail{env=~"d3-beta",status=~"502|503|504",tag=~"apisix"}[40s]))>1.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: 国科 pre-release apisix 5xx
  - alert: 智能客服ES集群不健康(kf-prod-es)
    expr: sum by(es_name,k8scluster,mode)(kf_monitor_elasticsearch{k8scluster="kf-prod",mode="health"})!=0.0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: Intelligent customer service ES cluster unhealthy (kf-prod-es)
  - alert: aimt-asr-switchCPU占用率高于85%
    expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"aimt-asr-switch.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",job="kube-state-metrics-d1-prod",pod=~"aimt-asr-switch.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06)>85.0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: aimt-asr-switch CPU usage above 85%
  - alert: Pod异常重启(线上一句话)
    expr: sum by(namespace,pod)(round(delta(kube_pod_container_status_restarts_total{k8scluster=~"d3-prod"}[10m])) and on(pod) (kube_pod_info{pod=~"me-asr-onesentence.*|cuda-.*"} and on(node) kube_pod_info{pod=~"me-asr-onesentence.*"}))!=0.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Pod restarted abnormally (production one-sentence recognition)
  - alert: 合众
    expr: delta(gateway_product{matched_route_id=~"00000000000000001332|00000000000000001656",status=~"404|401|403|500|502|503|504",productId=~"279601198|279605355"}[1m])>1.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: 合众
  - alert: cinfo服务cpu占用高
    expr: sum(kube_metrics_server_pods_cpu{pod_name=~"cinfo.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",pod=~"cinfo.*"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000*1000000)>0.75
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: cinfo service CPU usage high
  - alert: aiym-id-status-code-alert
    expr: count( sum by (job) ( http_request_duration_seconds_bucket{ handler="/v1/intent/get", pod_project="aiym-intent-detection", status!="2xx", k8scluster="d1-prod" } ) )>=3.0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: 驰必准 single-intent classification status code alert.
  - alert: file-resync2-swift-s3 pod cpu高告警
    expr: round(sum(kube_metrics_server_pods_cpu{pod_name=~"file-resync2-swift-s3.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: file-resync2-swift-s3 pod high CPU alert
  - alert: 物理机CPU使用率过高(线上声纹服务)
    expr: round((sum by(hostname)(node_cpu_usage_total{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"me-vpr-.*|vpr-dp-sr-.*|vpr-lti-sr-.*|vpr-sti-sr-.*|vpr-ti-sr-.*|vpr-verify-.*|vpr-supplement"})*100,0.01)>75.0
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Physical machine CPU usage too high (production voiceprint service)
  - alert: 国科大数据kafka2.0离线告警
    expr: count(sum by(name,hostname)(kafka_server_brokertopicmetrics{job="bdp-prod-kafka2-d3-bigdata",name=~"MessagesInPerSec",type="FiveMinuteRate"}>1))!=8.0
    for: 0m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: 国科 big data kafka2.0 offline alert
  - alert: NWA平台SLB证书7天后过期
    expr: round((max by (domain, manager) (aispeech_expired_timestamp_nwa_https) - time())/24/3600,0.1)<=7.4
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description:
        NWA platform SLB certificate expires in 7 days
  - alert: 连云港识别aicar平均末帧延时costs(线上识别服务)
    expr: round(avg by(k8scluster,res)(histogram_quantile(0.85,rate(pid_latency_total_bucket{k8scluster="d2-prod",pod_project=~"me-asr-onlineshort-service",pod_name!~"me-asr-onlineshort-service-gray-instance.*",res="aicar"}[3m]))))>300.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Lianyungang recognition aicar average last-frame latency costs (production recognition service)
  - alert: 物理机CPU使用率过高(线上语义服务odcp)
    expr: round((sum by(hostname)(node_cpu_usage_total{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{namespace="odcp",servicename=~".*cnluserver.*|.*olive-semantic.*aidui"})*100,0.01)>90.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Physical machine CPU usage too high (production semantic service odcp)
  - alert: DUI集群物理机5分钟内发生重启
    expr: round(sum by(hostname,hostip)(node_uptime{hostname=~"d[1-3]-.*"}) / 60)<5.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: DUI cluster physical machine rebooted within the last 5 minutes (check together with the physical machine offline alert)
  - alert: 错误码过高(长语音重构服务)
    expr: round(sum by(k8scluster,status,pod_project)(delta(counter_n_errors{k8scluster=~"d[1,2,3]-prod",pod_project="me-asr-onlinelong-service"}[3m])))>10.0
    for: 0m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Error code count too high (long-speech refactor service)
  - alert: me-asr-vad-service并发过高
    expr: sum by(k8scluster,pod_name,pod_project)(gauge_cloudvad_connected_concurrency)>400.0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: me-asr-vad-service concurrency too high
  - alert: 内容服务网关告警
    expr: sum(rate(gateway_fail{host=~"apis.duiopen.com|apis.dui.ai|d.api.aispeech.com|dcmp-api.iot.aispeech.com",proxyUpstreamName=~"adam-kong-8000|adam-adam-9000|adam-adam-dcmp-search-9090",status=~"500|502|503|504"}[40s]))>0.03
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Content service gateway alert
  - alert: https域名证书7天内到期
    expr: round(sum by(des)((probe_ssl_last_chain_expiry_timestamp_seconds-time())/3600/24),0.01)<7.0
    for: 540m
    labels:
      severity: emergency
    annotations:
      summary:
        "{{ $value }}"
      description: HTTPS domain certificate expires within 7 days
  - alert: freeswitch
    expr: sum(runtimeError{env=~"prod",serviceName=~"freeswitch",eventName=~"Freeswitch_Recognize_Failed|Freeswitch_GetScene_Error|Freeswitch_SpeechChannelAdd_Error|Freeswitch_SpeechChannelStop_ChannelError|Freeswitch_SpeakInProcess_Timeout|Freeswitch_Lua_ConfigError|Freeswitch_GetTTS_Error|Freeswitch_AudioQueue_Overflow|Freeswitch_OpenMRCPSession_Timeout|Freeswitch_StopMRCPSession_Timeout|Freeswitch_DefineGrammar_Timeout|Freeswitch_TerminateMRCPSession_Timeout|Freeswitch_RecogChannelStart_Timeout|FreeswitchFreeswitch_Speak_InvalidTTSModule|Freeswitch_CleanupMRCPSession_Timeout|Freeswitch_FILEURL_unreachable"} - runtimeError{env=~"prod",serviceName=~"freeswitch",eventName=~"Freeswitch_Recognize_Failed|Freeswitch_GetScene_Error|Freeswitch_SpeechChannelAdd_Error|Freeswitch_SpeechChannelStop_ChannelError|Freeswitch_SpeakInProcess_Timeout|Freeswitch_Lua_ConfigError|Freeswitch_GetTTS_Error|Freeswitch_AudioQueue_Overflow|Freeswitch_OpenMRCPSession_Timeout|Freeswitch_StopMRCPSession_Timeout|Freeswitch_DefineGrammar_Timeout|Freeswitch_TerminateMRCPSession_Timeout|Freeswitch_RecogChannelStart_Timeout|FreeswitchFreeswitch_Speak_InvalidTTSModule|Freeswitch_CleanupMRCPSession_Timeout|Freeswitch_FILEURL_unreachable"} offset 1m)>0.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Service alert
  - alert: 线上Nacos配置中心UP状态异常
    expr: sum by (k8scluster, pod_namespace, pod_name) (up{k8scluster="d1-prod",pod_project="config-nacos",pod_namespace="default"})!=1.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: UP status abnormal alert
  - alert: nlu-fusion请求ba-bnluserver告警
    expr: sum(runtimeError{eventName=~"aihiveboxsec|aidate_festival",env=~"prod",serviceName=~"nlu-fusion"} - runtimeError{eventName=~"aihiveboxsec|aidate_festival",env=~"prod",serviceName=~"nlu-fusion"} offset 1m)>=2.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description:
        Production nlu-fusion requests to ba-bnluserver are failing; please handle as soon as possible
  - alert: 账号Account CPU预警
    expr: round(sum(kube_metrics_server_pods_cpu{pod_namespace="odcp",pod_name=~"mscp-account.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{namespace="odcp",pod=~"mscp-account.*",resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Account CPU early warning
  - alert: 物理机显卡使用率过高(线上识别服务)
    expr: (sum by(hostname,gpu)(node_gpu_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"casr-.*"})>90.0
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Physical machine GPU utilization too high (production recognition service)
  - alert: smart-dm-inter-boot(YtoFullSceneService_consultOrder)
    expr: sum(runtimeError{env=~"prod",eventName=~"YtoFullSceneService_consultOrder",serviceName=~"smart-dm-inter-boot"} - runtimeError{env=~"prod",eventName=~"YtoFullSceneService_consultOrder",serviceName=~"smart-dm-inter-boot"} offset 1m)>0.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Service alert
  - alert: 预发布并发预警me-cinfo-service
    expr: sum by (env,service) (servcie_connections{env="d1-beta",service="me-cinfo-service"})>5.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Pre-release concurrency early warning for me-cinfo-service
  - alert: nlu-platform-service-vip之cnluserver错误码告警
    expr: sum(sum(_exported_operation_counter{serviceName="nlu-platform-service-vip",instance=~"nlu-platform-service-vip.*",type="150501",scripthost="d1-kafka-007"}) BY (serviceName, instance, env, type) - sum(_exported_operation_counter{serviceName="nlu-platform-service-vip",instance=~"nlu-platform-service-vip.*",type="150501",scripthost="d1-kafka-007"} OFFSET 5m) BY (serviceName, instance, env, type))WITHOUT (instance)>200.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: cnluserver error code alert for nlu-platform-service-vip
  - alert: 国科热词内存占用高
    expr: sum
      by(pod_name) (kube_metrics_server_pods_mem{job="metrics-server-exporter-d3-prod",pod_name=~"hotword.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte",job="kube-state-metrics-d3-prod",pod=~"hotword.*"}, "pod_name", "$1", "pod", "(.*)")) / 1024)>0.9
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: 国科 hotword memory usage high
  - alert: Pod 内存使用率过高(短语音重构)
    expr: round(sum by(k8scluster,pod_name)(kube_metrics_server_pods_mem{k8scluster=~".*-prod",pod_name=~"me-asr-online.*"}) / (sum by(k8scluster,pod_name)(label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) / 1024) * 100)>95.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Pod memory usage too high (short-speech refactor)
  - alert: 大数据redis_cluster_slots_fail
    expr: sum without(hostip,instance,job,scriptname)(redis_cluster_slots_fail)!=0.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Big data redis_cluster_slots_fail
  - alert: HTTP探针返回非200(blackbox-exporter)
    expr: sum by(des,k8scluster)(probe_http_status_code{job=~"http-probestatus-d.*",des!~"https://api-platform.oss.dui.ai|https://console.oss.dui.ai|https://dds.dui.ai|https://res.download.dui.ai|https://res.download.duiopen.com|https://ais.aispeech.com.cn|https://mis.aispeech.com.cn"})!=200.0
    for: 3m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: HTTP probe returned non-200 (blackbox-exporter)
  - alert: apisix-proxy并发连接数过高
    expr: sum by (pod_project) (apisix_nginx_http_current_connections{pod_project="apisix-proxy",state="active"})>55000.0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: apisix-proxy concurrent connections too high
  - alert: kylin query success
    expr: _exported_bdp_prod_kylin_access<0.95
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: kylin query success below 95%
  - alert: FreeSwitch_HttpRequest_Error
    expr:
      sum(runtimeError{eventName=~"FreeSwitch_HttpRequest_Error"} - runtimeError{eventName=~"FreeSwitch_HttpRequest_Error"} offset 5m)>10.0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: FreeSwitch_HttpRequest_Error
  - alert: 大数据防火墙公网IP上行流量过高
    expr: round(sum by(ifAlias,ifDescr)(irate(ifHCOutOctets{ifDescr=~"GigabitEthernet0/0/1|GigabitEthernet1/0/4",instance="10.24.20.4"}[5m]))/1024/1024*8,0.01)>80.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Big data firewall public IP uplink traffic too high
  - alert: 连云港大数据kafka1.0离线告警
    expr: count(sum by(name,hostname)(kafka_server_brokertopicmetrics{job="bdp-prod-kafka-d2-bigdata",name=~"MessagesInPerSec",type="FiveMinuteRate"}>1))!=3.0
    for: 0m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Lianyungang big data kafka1.0 offline alert
  - alert: 鉴授权owl CPU预警
    expr: round(sum(kube_metrics_server_pods_cpu{pod_namespace="odcp",pod_name=~"mscp-owl.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{namespace="odcp",pod=~"mscp-owl.*",resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Authentication/authorization owl CPU early warning
  - alert: DUI正式环境非K8S节点磁盘空间不足
    expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{job=~"base-exporter-d[123]-others",hostname!~".*insight.*|.*bigdata.*|.*bdp.*|.*glusterfs.*"})*100,0.01)>90.0
    for: 3m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Disk space low on DUI production non-K8S nodes
  - alert: 华东nlp用量
    expr: delta(gateway_product{matched_route_id="00000000000000001614",status="403"}[1m])>1.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: East China nlp usage
  - alert: Redis健康检查(CDC)
    expr: sum by(hostname,bind,vip)(health_redis)!=1.0
    for: 2m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Redis health check (CDC)
  - alert: DUI正式环境Pod频繁崩溃
    expr:
      (sum(max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", k8scluster=~"d.-prod"}[2m])) by (k8scluster,pod,container) or sum(max_over_time(kube_pod_init_container_status_waiting_reason{reason="CrashLoopBackOff", k8scluster=~"d.-prod"}[2m])) by (k8scluster,pod,container))>0.0
    for: 6m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: 'The {{ $labels.container }} service in cluster {{ $labels.k8scluster }} has been in CrashLoopBackOff for more than 6 minutes'
  - alert: ACK正式环境rabbitmq内存满了
    expr: sum by(node)(rabbitmq_node_mem_used{ job="rabbitmq-exporter-d1-prod"}) /1024/1024>900.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: ACK production rabbitmq memory is full
  - alert: cinfo服务内存占用高
    expr: sum(kube_metrics_server_pods_mem{pod_name=~"cinfoserver.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_requests{resource="memory",unit="byte",pod=~"cinfo.*"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name)/1000)>0.95
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: cinfo service memory usage high
  - alert: sfSendOrderError
    expr: sum(runtimeError{serviceName=~"smart-dm-inter-boot",env=~"beta",eventName=~"sfSendOrderError"} - runtimeError{serviceName=~"smart-dm-inter-boot",env=~"beta",eventName=~"sfSendOrderError"} offset 1m)>0.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: SF Express full-scenario order placement error
  - alert: ES集群物理机HDD磁盘空间不足(大数据)
    expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{hostname=~"insight-search-17.aispeech.com|insight-search-18.aispeech.com|insight-search-19.aispeech.com|insight-search-20.aispeech.com|insight-search-21.aispeech.com|insight-search-22.aispeech.com|insight-search-23.aispeech.com|insight-search-24.aispeech.com",mountpoint=~"/data3|/data4|/data5"})*100,0.01)>80.0
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: ES cluster physical machine HDD disk space low (big data)
  - alert: file-resync2-swift-s3内存高
    expr: round(sum by(pod_name)
      (kube_metrics_server_pods_mem{pod_name=~"file-resync2-swift-s3.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="memory",unit="byte"}, "pod_name", "$1", "pod", "(.*)")) / 1024) * 100)>80.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: file-resync2-swift-s3 memory high
  - alert: 英文意图
    expr: sum(runtimeError{eventName=~"english_nlu_engine",env=~"prod",serviceName=~"nlu-fusion"} - runtimeError{eventName=~"english_nlu_engine",env=~"prod",serviceName=~"nlu-fusion"} offset 1m)>=2.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: nlu-fusion requests to the English intent engine are failing; please handle as soon as possible
  - alert: Pod异常重启(线上语义服务)
    expr: sum by(k8scluster,namespace,pod)(round(delta(kube_pod_container_status_restarts_total{k8scluster=~"d.-prod",pod=~".*cnluserver.*|olive-semantic-aidui-.*|olive-semantic-bcd-.*|nlu-platform-service-.*"}[10m])))!=0.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Pod restarted abnormally (production semantic service)
  - alert: 国科预发布ASR-5xx
    expr: sum(rate(gateway_fail{env="d3-beta",status=~"499|500|502|503|504",host=~"lasr.beta.duiopen.com|asr.beta.dui.ai"}[40s]))>10.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: 国科 pre-release ASR 5xx
  - alert: tts-kf并发告警
    expr: sum by (servicename)(service_connecter{servicename="tts-kf"})>150.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: tts-kf concurrency alert
  - alert: 日外呼量监控
    expr: sum(call_out_cdr_counter{customerName="顺丰速运",env="alpha",eventName="countCdr",serviceName="smart-callout-boot"})<5.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Daily outbound call volume monitoring for customer SF Express
  - alert: 大数据Airflow 定时任务跨周期 (Prod环境)
    expr: sum by(env,hostname,count,dag_name)(monitor_airflow_delay_tasks{env="prod"})>20.0
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Big data Airflow scheduled task overran its period (Prod environment)
  - alert: 五菱私有云NAS可用空间不足10GB
    expr:
      round(nas_disk_free{device="09358dd0-l07m.cn-zhangjiakou.extreme.nas.aliyuncs.com:/share"}/1024/1024/1024,0.01)<10.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Wuling private cloud NAS free space below 10GB
  - alert: 线上Nacos配置中心Pod状态NotReady
    expr: sum by (k8scluster, namespace, pod) (kube_pod_status_ready{k8scluster="d1-prod", namespace="default", pod=~"config-nacos-.", condition="true"})!=1.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Pod status NotReady alert
  - alert: 大数据集群物理机磁盘状态异常
    expr: sum by(hostname,slot,Size)(node_disk_status{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"})!=1.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Big data cluster physical machine disk status abnormal
  - alert: 合成tts鉴权不过
    expr: gateway_product{matched_route_id=~"00000000000000816311|00000000000000816315",status="401",productId!="279594681"} - gateway_product{matched_route_id=~"00000000000000816311|00000000000000816315",status="401",productId!="279594681"} offset 1m>20.0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: Synthesis service authentication failing
  - alert: 质检服务异常告警
    expr: sum(runtimeError{serviceName=~"nlu-task-center|nlu-quality-check|ie-gateway|short-text-similarity",env=~"prod"} - runtimeError{serviceName=~"nlu-task-center|nlu-quality-check|ie-gateway|short-text-similarity",env=~"prod"} offset 1m)>2.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Quality inspection service exception
  - alert: 大数据集群物理机TCP连接数过高
    expr: sum by(hostname,status)(node_tcp_count{hostname=~".*insight.*|.*bigdata.*|.*bdp.*",status="ESTABLISHED"})>40000.0
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Big data cluster physical machine TCP connection count too high
  - alert: 大数据集群Mysql-mha进程异常
    expr: sum by(hostname)(mysql_mha_process_status)!=1.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Big data cluster Mysql-mha process abnormal
  - alert: 长语音403
    expr: delta(gateway_product{matched_route_id=~"00000000000000404560|00000000000000404561|00000000000000404564|00000000000000001524|00000000000000001500",status="403",productId!~"279593784|279600424|279595943|279601347|279603827|279604013|279596911|279597576|279606758|279597614|278587791|279604555|279608596|279608520|279598784"}[1m])>3.5
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Long-speech 403
  - alert: 大数据集群Mysql健康检查
    expr: sum by(bind,hostname,vip)(health_mysql)!=1.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Big data cluster Mysql health check
  - alert: k8s节点内存压力
    expr: kube_node_status_condition{condition="MemoryPressure",job=~"kube-state-metrics-d.*",status="true"}==1.0
    for: 60m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: k8s node memory pressure
  - alert: d3-011312告警(线上识别服务)
    expr: sum by(mode,status,describe,detail,cluster)(duimonitor_service_exception{service="ASR",env="prod",status="011312"})>50.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: 011312 (audio decoding failed), threshold 50 per minute
  - alert: prod复刻
    expr: sum(rate(gateway_fail{env="d1-prod",status="404",tag="apisix",proxyUpstreamName="voice-copy-outer-service"}[40s]))>0.03
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: prod voice cloning
  - alert: metric-agent内存监控
    expr: sum(process_vm_rss{serviceName="metric-agent"})by(hostname,namespace,podName,containerName,serviceName,job)/1024>150.0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: "{{$labels.hostname}}"
  - alert: DUI非正式环境节点磁盘空间不足
    expr: round(sum by(hostname,device,mountpoint)(node_disk_usage{k8scluster!~"d[1-3]-prod",hostname!~"daoker|.*insight.*|.*bigdata.*|.*bdp.*|kf-.*"})*100,0.01)>92.0
    for: 3m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Disk space low on DUI non-production nodes
  - alert: 物理机显卡温度过高(线上识别服务)
    expr: (sum by(hostname,gpu)(node_gpu_temp{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"casr-.*"})>80.0
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Physical machine GPU temperature too high (production recognition service)
  - alert: 连云港apisix上游熔断
    expr: sum by (upstream_id,upstream_ip) (duimonitor_apisix_unhealthz_count{host=~"d2.*|apisix-internal-59dd5bc9c.*"})>1.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Lianyungang apisix upstream circuit breaker tripped
  - alert: 资源部署服务
    expr: max(max_over_time(gauge_me_model_publish_service_not_finish_pod_count{application=~"me_model_publish_service"}[2m]))>0.0
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Unfinished plan exception alert
  - alert: BossControl CPU预警
    expr: round(sum(kube_metrics_server_pods_cpu{pod_namespace="odcp",pod_name=~"mscp-boss-control.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{namespace="odcp",pod=~"mscp-boss-control.*",resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: BossControl CPU early warning
  - alert: glusterfs机器cpu负载高告警
    expr: sum by (hostname)(node_load5{hostname=~"d[1-4]-glusterfs.*"}) / sum by (hostname)(node_cpu_core)>0.95
    for: 0m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: glusterfs machine high load alert
  - alert: hive metastore latency
    expr: avg(bdp_prod_metastore_metastore{name=~"api.*",type="99thPercentile"})>30000.0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: hive metastore latency exceeds 60 seconds
  - alert: nlu-fusion请求nlu-task-processor告警
    expr: sum(runtimeError{env=~"prod",serviceName=~"nlu-fusion",eventName=~"aiproperty|aioutcall|aicocs_insurance|aifactor_number|aifactor_entity|aicocs_zhengquan|ailife|aicocs|aifactor|aihivebox|aiexobd"} - runtimeError{env=~"prod",serviceName=~"nlu-fusion",eventName=~"aiproperty|aioutcall|aicocs_insurance|aifactor_number|aifactor_entity|aicocs_zhengquan|ailife|aicocs|aifactor|aihivebox|aiexobd"} offset 1m)>=2.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description:
        Monitored resources are aifactor, aicocs, ailife, aioutcall and aiproperty
  - alert: d3-500告警(线上识别服务)
    expr: sum by(mode,status,describe,detail,cluster)(duimonitor_service_exception{service="ASR",env="prod",status="500"})>20.0
    for: 0m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: 500 (recognition error), threshold 20 per minute
  - alert: r0-daemon-queue-overstock
    expr: runtimeError{serviceName=~"r0-daemon",eventName=~"data_log_alarm_push_queue_exceed_queue",env=~"prod"} - runtimeError{serviceName=~"r0-daemon",eventName=~"data_log_alarm_push_queue_exceed_queue",env=~"prod"} offset 1m>=1.0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: r0-daemon queue backlog
  - alert: d3-011309告警(线上是识别服务)
    expr: sum by(mode,status,describe,detail,cluster)(service_exception{service="ASR",env="prod",status="011309",cluster="gk-prod"})>100.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: 011309 (server busy), threshold 100 per minute
  - alert: IT系统机器离线
    expr: sum by(k8scluster,instance,hostip)(up{job="base-exporter-d0-test",instance=~"d0-test-itsystem00.*"})!=1.0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: IT system machine offline
  - alert: prod复刻401
    expr: sum(rate(gateway_fail{env="d1-prod",status="401",proxyUpstreamName="voice-copy-outer-service"}[40s]))>0.03
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: prod voice cloning 401
  - alert: 大数据集群物理机磁盘坏道Other_Error
    expr: sum by(hostname,slot)(node_disk_other_error{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"})>50.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Big data cluster physical machine disk bad sectors Other_Error
  - alert: 大数据集群Mysql-mha虚拟IP报警
    expr: sum by(bind,hostname,vip)(mysql_mha_vip_status)!=1.0
    for: 1m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Big data cluster Mysql-mha virtual IP alert
  - alert: 五菱私有云识别并发高(线上识别服务)
    expr: sum(service_connecter_private{env="sgmw", servicename="casrserver-aicar"} ) by (servicename)>100.0
    for: 0m
    labels:
      severity:
        warning
    annotations:
      summary: "{{ $value }}"
      description: Wuling private cloud recognition concurrency exceeds 100
  - alert: streaming-media CPU占用率高于75%
    expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d3-prod",pod_name=~"streaming-media.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",job="kube-state-metrics-d3-prod",pod=~"streaming-media.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06)>75.0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $value }}"
      description: Streaming media service CPU usage above 75%
  - alert: 域名即将过期
    expr: sum(dns_expire_date{}) by (domain)<30.0
    for: 0m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: Domain is about to expire; please renew in advance
  - alert: filebeat服务misscheduled
    expr: sum by (k8scluster) (kube_daemonset_status_number_misscheduled{daemonset ="aispeech-filebeat", namespace ="filebeat",k8scluster=~"d.-prod"})>0.0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: filebeat service misscheduled for more than 10 minutes
  - alert: ICMP机器层面丢包率超过50%
    expr: round(sum by(job,des,prober)((1-avg_over_time(probe_success{job=~"ping-probestatus-.*"}[1m]))*100),0.01)>50.0
    for: 0m
    labels:
      severity: emergency
    annotations:
      summary: "{{ $value }}"
      description: ICMP machine probe packet loss
  - alert: apisix后端服务不健康数量过高V2
    expr: sum by(upstream_id)(duimonitor_apisix_unhealthz_count)>60.0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }}"
      description: Too many unhealthy apisix backend services
  - alert: 华东dds用量
    expr: delta(gateway_product{matched_route_id=~"00000000000000001680|00000000000000001656|00000000000000001614|00000000000000001584|00000000000000001574|00000000000000001542|00000000000000001490|00000000000000001468|00000000000000001424|00000000000000001386|00000000000000001378|00000000000000001336|00000000000000001332",status="403",productId!~"278573892|279601578|278587526|279602422|278585085|279599183|279607850|279608367|279608363|279601067|279599155|279600850",productId!~""}[1m])>1.0
    for: 0m
    labels:
      severity: warning
annotations: summary: "{{ $value }}" description: 华东dds用量 - alert: aimt-kong-流媒体服务15分钟5xx错误率>95%(会议魔方业务) expr: sum(delta(kong_http_status{uri=~"/audio.*",code=~"5.*",env="prod",moduleGroup="aimt"}[15m])) / sum(delta(kong_http_status{uri=~"/audio.*",env="prod",moduleGroup="aimt"}[15m]))>0.95 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: aimt-kong-流媒体服务15分钟5xx错误率>95%(会议魔方业务) - alert: coredump error(线上识别服务) expr: sum by(service,env)(duimonitor_coredumpx{service=~"casrserver"})>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: coredump error(线上识别服务) - alert: DSKDM最近一分钟没有收到请求 expr: sum(delta(DSKDM_monitor_dsk_request{env="prod"}[1m]))<1.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: DSKDM最近一分钟没有收到请求 - alert: ba-outer-portal-timeout expr: sum(runtimeError{eventName=~"GET_CONFIG_FAIL",env=~"prod",serviceName=~"ba-outer"} - runtimeError{eventName=~"GET_CONFIG_FAIL",env=~"prod",serviceName=~"ba-outer"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: ba-outer请求portal接口超时 - alert: 华东translation用量 expr: delta(gateway_product{matched_route_id="00000000000000001566",status="403",productId!~"278586547|279596337"}[1m])>1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 华东translation用量 - alert: 连云港入口机器5分钟内发生重启 expr: round(sum by(hostname,hostip)(node_uptime{hostname=~"d2-lb-001|d2-lb-002"}) / 60)<5.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 连云港入口机器5分钟内发生重启 - alert: asrplus服务告警 expr: gateway_product{matched_route_id=~"00000000000000668038|00000000000000014988|00000000000000014987",status=~"403|404|499|500|502|503|504"} - gateway_product{matched_route_id=~"00000000000000668038|00000000000000014988|00000000000000014987",status=~"403|404|499|500|502|503|504"} offset 1m>5.0 for: 1m labels: severity: critical annotations: summary: "{{ $value 
}}" description: asrplus服务告警 - alert: casr-aicar产品403 expr: delta(gateway_product{matched_route_id="00000000000000001456",status="403"}[1m]) >10.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: casrserver-aicar产品403 - alert: sfComplaintOrderError expr: sum(runtimeError{eventName=~"sfComplaintOrderError",serviceName=~"smart-dm-inter-boot"} - runtimeError{eventName=~"sfComplaintOrderError",serviceName=~"smart-dm-inter-boot"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 顺丰全场景催单异常 - alert: filebeat服务CPU使用率过高 expr: sum by (k8scluster,node,pod) (round(label_replace(kube_metrics_server_pods_cpu{k8scluster=~"d(1|3)-prod",pod_namespace="filebeat",pod_name=~"aispeech-filebeat-.*",pod_container_name="aispeech-filebeat"}, "pod", "$1", "pod_name", "(.*)") * on(k8scluster,pod) group_right kube_pod_info{k8scluster=~"d(1|3)-prod",namespace="filebeat",pod=~"aispeech-filebeat-.*"}/1e6))>3000.0 for: 10m labels: severity: warning annotations: summary: "{{ $value }}" description: '{{ $labels.k8scluster }}集群{{ $labels.node }}节点filebeat的CPU使用率过高' - alert: apisix服务发现pod异常 expr: sum by (k8scluster,pod)(kube_pod_status_ready{k8scluster=~"d2-prod|d3-prod|d1-prod",namespace="cloud",pod=~"apisix-ingress-.*",condition="true"} and on (namespace,pod) kube_pod_status_phase{phase="Running"} )!=1.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: apisix服务发现pod异常 - alert: 大数据iData celery队列告警(Prod环境) expr: sum by(hostname,env,queue,status)(monitor_celery_tasks{env="prod",status="True",queue=~"celery_data_engineer|celery_others|celery_rsbj_ba_backend"})>70.0 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据iData celery队列告警(Prod环境) - alert: dds全双工cpu占用高 expr: sum(kube_metrics_server_pods_cpu{pod_name=~"ddsserver.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{pod=~"ddsserver.*",resource="cpu",unit="core"}, 
"pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000*1000000)>0.8 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: dds全双工cpu占用高 - alert: 智能客服offline通知 expr: sum(runtimeError{serviceName=~"smart-voice-base-boot|smart-control-admin-boot",env=~"prod"} - runtimeError{serviceName=~"smart-voice-base-boot|smart-control-admin-boot",env=~"prod"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 智能客服offline通知 - alert: 五菱私有云情感合成 expr: sum(service_connecter_private{env="sgmw", servicename="qg-ttsserver"}) by (servicename)>15.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 情感合成并发数 - alert: 异常状态码过高011309(短语音重构) expr: round(sum by(k8scluster,status)(delta(pid_status_total{status=~"011309"}[3m])))>5.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 异常状态码过高011309(短语音重构),服务server busy告警 - alert: 外呼-consumers/smart-provider-boot/smart-partner-module-boot/xxjob-告警 expr: sum(runtimeError{env=~"prod",serviceName=~"smart-consumer-boot|smart-provider-boot|smart-partner-module-boot|smart-extend-admin-boot"} - runtimeError{env=~"prod",serviceName=~"smart-consumer-boot|smart-provider-boot|smart-partner-module-boot|smart-extend-admin-boot"} offset 1m)>20.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 外呼-consumers/smart-provider-boot/smart-partner-module-boot/xxjob-告警 - alert: IT系统域名证书7天内到期 expr: round(sum by(k8scluster,des)((probe_ssl_last_chain_expiry_timestamp_seconds{des=~"https://ais.aispeech.com.cn|https://mis.aispeech.com.cn"}-time())/3600/24),0.01)<7.0 for: 540m labels: severity: warning annotations: summary: "{{ $value }}" description: IT系统域名证书7天内到期 - alert: 阿里云-国科大数据专线主线路流量过高 expr: round(sum by(vbrid,vbrname)(duimonitor_vbroutrate{vbrname="杭州-国科-网银互联-大数据主线路"})/1024/1024,0.01)>950.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 阿里云-国科大数据专线主线路流量过高 
- alert: casr-aienglish-mix产品403 expr: delta(gateway_product{matched_route_id="00000000000000001390",status="403",productId!="279599307"}[1m])>10.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: casrserver-aienglish-mix产品403 - alert: apisix/others端口syn-send状态拥堵 expr: sum by(hostname)(node_tcp_synsent{hostname!~".*beta.*"} and on(hostname) node_k8s_service{podname=~"apisix-.*",podname!~"apisix-[0-9a-z]+-[0-9a-z]+"})>250.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: apisix/others端口syn-send状态拥堵 - alert: 大数据集群物理机内存使用率过高 expr: sum by(hostname)(node_mem_usage{hostname=~".*insight.*|.*bigdata.*|.*bdp.*"})>90.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群物理机内存使用率过高 - alert: 大数据集群入口流量超过180M/s expr: round(sum by(hostname,netdev)(node_net_ratein{hostname=~"insight-service-app-1|insight-service-app-2"}/1024/1024),0.01)>180.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群入口流量超过180M/s - alert: logbus latency expr: histogram_quantile(0.99, sum by (le) (rate(logbus_prod_echo_http_request_duration_seconds_bucket{}[5m])))>10.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: logbus latency大于10秒 - alert: apisix监听端口accept队列拥堵 expr: (sum by(hostname)(node_tcp_recvq) and on(hostname) node_k8s_service{servicename=~"apisix-[0-9].*|apisix-proxy|apisix-auth"})>1000.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: apisix监听端口accept队列拥堵 - alert: hotword服务CPU告警 expr: round(sum by(pod_name) (kube_metrics_server_pods_cpu{job="metrics-server-exporter-d3-prod",pod_name=~"hotword.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core",job="kube-state-metrics-d3-prod",pod=~"hotword.*"}, "pod_name", "$1", "pod", "(.*)")) * 1000) * 100 / 1e+06) / 100>0.9 for: 0m labels: severity: critical annotations: summary: "{{ 
$value }}" description: 国科hotword服务cpu占用高 - alert: vad服务告警 expr: gateway_product{matched_route_id=~"00000000000000025138",status=~"401|403|404|499|500|502|503|504"} - gateway_product{matched_route_id=~"00000000000000025138",status=~"401|403|404|499|500|502|503|504"} offset 1m>5.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 监控生产环境的服务状况,发现问题及时修复 - alert: hive metastore nodes expr: count(bdp_prod_metastore_metastore{name="init_total_count_dbs"})<=1.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: hive metastore nodes数量少于2 - alert: 连云港入口物理机已离线 expr: sum by(k8scluster,instance,hostip)(up{job=~"base-exporter.*",instance=~"d2-lb-001|d2-lb-002"})!=1.0 for: 2m labels: severity: emergency annotations: summary: "{{ $value }}" description: 连云港入口物理机已离线 - alert: d3-011305告警(线上识别服务) expr: sum by(mode,status,describe,detail,cluster)(duimonitor_service_exception{service="ASR",env="prod",status="011305"})>5.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 011305(http协议计算进程出错)阈值5,每分钟 - alert: smart-fs&mrcp-告警 expr: sum(runtimeError{env=~"prod",serviceName=~"uniMrcp|nginx-mrcp|freeswitch"} - runtimeError{env=~"prod",serviceName=~"uniMrcp|nginx-mrcp|freeswitch"} offset 1m)>100.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: smart-fs&mrcp-告警 - alert: duisys-pushserver-ng pod连接数告警 expr: pushserver_active_connections{env="prod", host_name=~"duisys-pushserver-ng-.*"}>40000.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 单pod连接数超过40000 - alert: 大数据集群物理机大于1T的磁盘空间不足 expr: round(sum by(hostname,device,mountpoint)((node_disk_usage{hostname=~".*insight.*|.*bigdata.*|.*bdp.*",hostname!~"insight-minio-.*"} and on(hostname,device) (node_disk_total>1099511627776)))*100,0.01)>90.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据集群物理机大于1T的磁盘空间不足 - alert: Pod异常重启(owl服务) 
expr: sum by(namespace,pod)(round(delta(kube_pod_container_status_restarts_total{job="kube-state-metrics-d1-prod",pod=~"mscp-owl-.*"}[10m])))!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod异常重启(owl服务) - alert: 物理机CPU使用率过高(线上语义服务cloud) expr: round((sum by(hostname)(avg_over_time(node_cpu_usage_total{hostname!~".*beta.*"}[3m])) and on (hostname) node_k8s_service{namespace="cloud",servicename=~".*cnluserver.*|olive-semantic-aidui|olive-semantic-bcd"})*100,0.01)>45.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机CPU使用率过高(线上语义服务cloud) - alert: gi-model-serving expr: sum(runtimeError{env=~"prod",serviceName=~"gi-model-serving"} - runtimeError{env=~"prod",serviceName=~"gi-model-serving"} offset 1m)>=5.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: gi-model-serving - alert: 国科集群404 expr: delta(gateway_fail{env="d3-prod",tag="apisix",host!~"casrserver-aiwechat.gk-internal.prod.duiopen.com|127.0.0.1|www.qq.com|apisix-gk-public.duiopen.com|passport.baidu.com|10.24.10.97|azenv.net|prometheus-gk.aispeech.com|222.92.117.21|222.92.117.24|casrserver-&res&.gk-internal.prod.duiopen.com|casrserver-aitransoff.gk-internal.prod.duiopen.com|mail.94me.cn|94me.cn|api.94me.cn|www.94me.cn|ww.94me.cn|webmail.222.92.117.24|_",host!="",proxyUpstreamName!~"duisys_hotword_random|cloud_ezmt-pyapi-analysis_8080|cloud_streaming-media_29000|cloud_file-resync2-server-hotword_28002",status="404",upstreamStatus!="404"}[1m])>4.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 国科集群404 - alert: runtime error(线上对话服务) expr: sum by(service)(duimonitor_runtimex{service=~"dm-dispatch-server|cdmserver|dsk-dm|ba-simulator-server|dm-dispatch-server-fullduplex|cdmserver-v2",env="prod"})>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: runtime error(线上对话服务) - alert: duisys-nlu-hotword runtime error告警 expr: 
sum(increase(metric_hotword_errorcode{env="prod",pod_name=~"duisys-nlu-hotword.*",error_code=~"130030|130031"}[5m]))>0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: duisys-nlu-hotword runtime error告警 - alert: nlu-platform-service-vip之非cnluserver错误码告警 expr: sum(sum(_exported_operation_counter{serviceName="nlu-platform-service-vip",instance=~"nlu-platform-service-vip.*",type!="150501",scripthost="d1-kafka-007"}) BY (serviceName, instance, env, type) - sum(_exported_operation_counter{serviceName="nlu-platform-service-vip",instance=~"nlu-platform-service-vip.*",type!="150501",scripthost="d1-kafka-007"} OFFSET 5m) BY (serviceName, instance, env, type))WITHOUT (instance)>50.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: nlu-platform-service-vip之非cnluserver错误码告警 - alert: k8s节点docker异常 expr: kube_node_status_condition{condition=~"DockerOffline|DockerUnhealthy",job=~"kube-state-metrics-d.-prod",status="true"}==1.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: k8s节点docker异常 - alert: logbus error qps expr: sum(rate(logbus_prod_echo_http_requests_total{status!="2xx"}[5m]))>5000.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: logbus error qps大于5000 - alert: 爬虫redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp11e2b03805c0d4"} +on(instanceId) group_left(InstanceName) (label_replace(aliyun_meta_redis_info, "instanceId", "$1", "InstanceId", "(.*)")))>80.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 爬虫redis内存使用率过高 - alert: apisix/auth/proxy/stream端口syn-send状态拥堵 expr: ((sum by(hostname)(node_tcp_synsent{hostname!~".*beta.*"}) and on(hostname) node_k8s_service{podname=~"apisix-[0-9a-z]+-[0-9a-z]+"}) +on(hostname) group_left(podname) node_k8s_service{podname=~"apisix-[0-9a-z]+-[0-9a-z]+"})>250.0 for: 0m labels: severity: emergency
annotations: summary: "{{ $value }}" description: apisix/auth/proxy/stream端口syn-send状态拥堵 - alert: 智能客服阿里云SLB后端实例不健康 expr: sum by(instanceId,instance_name,instance_dept,instance_status,vip,port)(aliyun_acs_slb_dashboard_UnhealthyServerCount *on(instanceId) group_left(instance_name,instance_dept,instance_status) aispeech_aliyun_slb_spec_info{instance_name=~"智能客服.*"})!=0.0 for: 1m labels: severity: emergency annotations: summary: "{{ $value }}" description: 智能客服阿里云SLB后端实例不健康 - alert: 流控服务-qos-cpu expr: sum by(pod_name) (kube_metrics_server_pods_cpu{pod_name=~"trafficlimit-server.*"}) / (sum by(pod_name) (label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) * 1000 * 1e+06)>0.8 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: qos-cpu - alert: smart-callout-boot expr: sum(call_out_cdr_counter{connectStatus="0",customerName="小金测试",env="test",eventName="countCdr",serviceName="smart-callout-boot",taskName="顺丰速运-测试prometheus告警"})/sum(call_out_cdr_counter{customerName="小金测试",env="test",eventName="countCdr",serviceName="smart-callout-boot",taskName="顺丰速运-测试prometheus告警"})>=1.5 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: CDR话单接通率告警 - alert: pushCallInCountError expr: sum(runtimeError{eventName=~"pushCallInCountError",module=~"smart-dm-inter-boot"} - runtimeError{eventName=~"pushCallInCountError",module=~"smart-dm-inter-boot"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 推送德邦进线量失败 - alert: DUI集群正式环境物理机磁盘fstab卷丢失 expr: sum by(device,mountpoint,hostname)(node_disk_volume_loss{hostname!~"kf-.*|.*beta.*|.*bigdata.*|insight.*"})==0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI集群正式环境物理机磁盘fstab卷丢失 - alert: idata2任务调度异常 expr: airflow2_executor_running_tasks==0.0 for: 20m labels: severity: emergency annotations: summary: "{{ $value }}" description: 
idata2任务调度异常 - alert: tts-lite并发告警 expr: sum(sum(sum(_exported_service_connecter{export_host=~"d1-dui-100.*",pod_name=~"dds-tts.*",servicename=~".*tts.*"}) BY (export_host, pod_name, pod_ip, servicename))) WITHOUT (export_host)>=120.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: tts-lite并发告警 - alert: 物理机内存Free不足2G(线上LASR服务) expr: round((sum by(hostname)(node_mem_free{job="base-exporter-d3-prod"}) and on (hostname) node_k8s_service{servicename=~"lasr-.*"})/1024/1024/1024,0.01)<2.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机内存Free不足2G(线上LASR服务) - alert: logbus qps expr: sum(rate(logbus_prod_echo_http_requests_total{}[2m]))>30000.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: logbus qps大于30000 - alert: BossMetering CPU预警 expr: round(sum(kube_metrics_server_pods_cpu{pod_namespace="odcp",pod_name=~"mscp-boss-metering.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{namespace="odcp",pod=~"mscp-boss-metering.*",resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: BossMetering CPU预警 - alert: hb-011312告警 expr: sum(_exported_duimonitor_service_exception{service="ASR",env="prod",scriptname="monitor_casrserver_exception.sh",status="011312",cluster="hb-prod"}) by (mode,status,describe,detail,cluster)>50.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 011312(音频解码失败)阈值50,每分钟 - alert: (DUI)阿里云Redis 连接数使用率 expr: sum by (instanceId,nodeId)(aliyun_acs_kvstore_StandardConnectionUsage{instanceId!~"r-bp11zi6yr1o10y01t3|r-bp177cdiib1dvyc1si|r-bp1y1uzzhvbcqz0e25|r-bp19d64fe48b9384|r-bp156a0ddfcc0c74|r-bp1a0b2be6ccc644|r-bp18ce30a05fe2a4|r-bp115974f56f4f24"} or aliyun_acs_kvstore_ShardingConnectionUsage)>60.0 for: 3m labels: severity: critical annotations: summary: "{{
$value }}" description: (DUI)阿里云Redis 连接数使用率 - alert: 产品/技能发布失败(mscp-product-v2服务) expr: sum by(bizName,level,message,env,flow,value,errorCode,moduleName)(delta(_exported_errorCounter_total{env="prod",flow=~"productPublish|skillPublish|demoTest"}[1m]))>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 产品/技能发布失败(mscp-product-v2服务) - alert: PodCPU使用率大于85%(aiwork、aiot服务) expr: round(sum(kube_metrics_server_pods_cpu{job="metrics-server-exporter-d1-prod",pod_name=~"aiwork.*|aiot.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>85.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: PodCPU使用率大于85%(aiwork、aiot服务) - alert: 显存使用率过高(cuda-triton-tts服务) expr: round((sum by(hostname,gpu)(node_gpu_mem_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"cuda-triton-tts"})*100,0.01)>85.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 显存使用率过高(cuda-triton-tts服务) - alert: (DUI)阿里云MongoDB CPU使用率 expr: sum by(instanceId,role)(aliyun_acs_mongodb_CPUUtilization{instanceId!~"dds-bp11d328e4e8da74|dds-bp17ead8b96974c4|dds-bp177ae7a19a9a34|dds-bp1f6f784b1a3024|dds-bp15c90ebc25a554|dds-bp15c9d449e3b214|dds-bp1911c580f6cb44|dds-bp1e9e91ed7bfe24|dds-bp1c7e05e97ab864"})>80.0 for: 3m labels: severity: critical annotations: summary: "{{ $value }}" description: (DUI)阿里云MongoDB CPU使用率 - alert: 训练平台中控告警 expr: sum(runtimeError{serviceName=~"nlu-trian-manager"} - runtimeError{serviceName=~"nlu-trian-manager"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: error告警 - alert: 网关正式环境408 expr: sum(rate(gateway_fail{env=~"d1-prod|d3-prod",status="408",proxyUpstreamName=""}[40s]))>0.3 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: 网关正式环境408 - alert: 大数据Elrond日志异常 
expr: sum by(hostname,taskName)(number_exception{taskName!~"DUIServiceBackend_Mysql_SkillProduct_0_Import|BA_Kafka_BAZipkin2_115_Import|BA_Kafka_BAZipkin_115_Import",type=~"error"})>0.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 大数据Elrond日志异常 - alert: DUI机器TCP连接数过高 expr: sum by(hostname,status)(node_tcp_count{hostname!~".*insight.*|.*bigdata.*|.*bdp.*|d3-lb-001"})>45000.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI机器TCP连接数过高 - alert: 拨测异常告警(线上识别服务) expr: sum by(api,message,mode,service)(speech_blackbox_testing{service="asrLite"})!=0.0 for: 2m labels: severity: warning annotations: summary: "{{ $value }}" description: 拨测异常告警(线上识别服务) - alert: fs-callout 外呼超时报错 expr: sum(runtimeError{eventName=~"callout-route_error|callout-timeout_error",env=~"prod"} - runtimeError{eventName=~"callout-route_error|callout-timeout_error",env=~"prod"} offset 3m)>3.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: fs-callout 外呼超时报错 - alert: aistore-vts-报警 expr: runtimeError{serviceName=~"aistore-vts-server",env=~"prod"} - runtimeError{serviceName=~"aistore-vts-server",env=~"prod"} offset 1m>5.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: aistore-vts-报警 - alert: DUI正式环境Pod调度异常 expr: (sum by (k8scluster, namespace, pod, phase) (kube_pod_status_phase{k8scluster=~"d.*-prod", phase=~"Pending|Unknown|UnexpectedAdmissionError",namespace!~"idata"}) - on (k8scluster, namespace, pod) group_left sum by (k8scluster, namespace, pod) (kube_pod_container_status_waiting_reason{k8scluster=~"d.-prod",namespace!="idata",reason=~"ImagePullBackOff|ErrImagePull|InvalidImageName"}))>0.0 for: 10m labels: severity: emergency annotations: summary: "{{ $value }}" description: '{{ $labels.k8scluster }}集群的服务{{ $labels.phase }}超过10分钟' - alert: 大数据Yarn队列告警(Prod) expr: sum by(env,queue)(monitor_yarn_queues{env="prod",queue="etl"})>90.0 for: 10m labels:
severity: warning annotations: summary: "{{ $value }}" description: 大数据Yarn队列告警(Prod) - alert: 对话中控错误码(P1)(线上对话服务) expr: sum by(env,errorId,errorMsg,moduleName,pod_name)(delta(dmdispatch_error_statistic{env="prod",moduleName=~"dm-dispatch-server|dm-dispatch-server-fullduplex",errorId=~"010410|010412|010414|080015"}[1m]))>=10.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 对话中控错误码(P1)(线上对话服务) - alert: 质检发布相似度报警 expr: sum(runtimeError{env=~"prod",clientName=~"exception",callerMethod=~"aspectConsumer"} - runtimeError{env=~"prod",clientName=~"exception",callerMethod=~"aspectConsumer"} offset 1m)>=1.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 质检发布相似度报警 - alert: MEDINGDING服务副本数异常告警(IES) expr: max by(k8scluster,deployment,namespace)(kube_deployment_spec_replicas{k8scluster!~"d1-beta|d3-alpha|kf-alpha"}) - max by(k8scluster,deployment,namespace)(kube_deployment_status_replicas_available{k8scluster!~"d1-beta|d3-alpha|kf-alpha"})>0.0 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: MEDINGDING服务副本数异常告警(IES) - alert: nlu-quality-checking 质检任务处理失败 expr: sum(runtimeError{env=~"prod",serviceName=~"nlu-quality-checking",eventName=~"task error"} - runtimeError{env=~"prod",serviceName=~"nlu-quality-checking",eventName=~"task error"} offset 1m)>0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: nlu-quality-checking 质检任务处理失败 - alert: 内核服务(aicar、cn-16k)并发过高 expr: sum by(k8scluster,pod_project,pod_name)(gauge_op_concurrencies{k8scluster=~"d.-prod",pod_project=~"me-asr-online-wfst.*|me-asr-onlinelong-wfst-cn-16k"})>144.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 内核服务(aicar、cn-16k)并发过高 - alert: file-resync2-dcopy pod cpu高告警 expr: round(sum(kube_metrics_server_pods_cpu{pod_name=~"file-resync2-dcopy.*"}) by (pod_name) / 
(sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: file-resync2-dcopy pod cpu高告警 - alert: Pod异常重启(线上声纹服务) expr: sum by(namespace,pod)(round(delta(kube_pod_container_status_restarts_total{k8scluster="d3-prod"}[10m])) and on(pod) (kube_pod_info{pod=~"me-vpr-plus-service-.*|vpr-dp-sr-.*|vpr-sti-sr-.*|cuda-.*"} and on(node) kube_pod_info{pod=~"me-vpr-plus-service-.*|vpr-dp-sr-.*|vpr-sti-sr-.*"}))!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod异常重启(线上声纹服务) - alert: Pod异常重启(线上cuda服务) expr: sum by(pod)(round(delta(kube_pod_container_status_restarts_total{ job="kube-state-metrics-d3-prod",pod=~"cuda-.*"}[10m])))!=0.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: Pod异常重启(线上cuda服务) - alert: 大数据集群Mysql服务异常(虚IP10.24.1.118) expr: sum by(hostname)(mysql_local_status)!=1.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 大数据集群Mysql服务异常(虚IP10.24.1.118) - alert: nlu-platform-service PodCPU用率过高 expr: round(sum(kube_metrics_server_pods_cpu{k8scluster="d1-prod",pod_name=~"nlu-platform-service.*"}) by (pod_name) / (sum(label_replace(kube_pod_container_resource_limits{resource="cpu",unit="core"}, "pod_name", "$1", "pod", "(.*)")) by (pod_name) *1000)*100/1000000)>75.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: nlu-platform-service PodCPU用率过高 - alert: 鉴权网关-请求响应耗时95值告警 expr: round(max by(uri)(histogram_quantile(0.95,rate(http_server_requests_seconds_bucket{env="prod",lang="java",pod_project="mscp-owl",moduleName="mscp-owl",uri="/auth/device/register",status="200"}[5m]))*1000))>300.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: 鉴权网关-请求响应耗时95值告警 - alert: aimt-kong-协同服务15分钟5xx错误率>60%(会议魔方业务) expr: 
sum(delta(kong_http_status{uri=~"/ot.*",code=~"5.*",env="prod",moduleGroup="aimt"}[15m])) / sum(delta(kong_http_status{uri=~"/ot.*",env="prod",moduleGroup="aimt"}[15m]))>0.6 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: aimt-kong-协同服务15分钟5xx错误率>60%(会议魔方业务) - alert: 拨测异常告警(长语音重构服务) expr: sum by(api,message,mode,service)(speech_blackbox_testing{service="me-asr-onlinelong-service"})!=0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: 拨测异常告警(长语音重构服务) - alert: r0-daemon-d0nlp-queue expr: runtimeError{serviceName=~"r0-daemon",eventName=~"data_log_alarm_d0nlp_queue_exceed_queue",env=~"prod"} - runtimeError{serviceName=~"r0-daemon",eventName=~"data_log_alarm_d0nlp_queue_exceed_queue",env=~"prod"} offset 35m>=1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: d0-nlp文本消费队列积压 - alert: IT系统HTTP探针非200异常 expr: sum by(k8scluster,des)(probe_http_status_code{job=~"http-probestatus-.*",des=~"https://mis.aispeech.com.cn|https://ais.aispeech.com.cn/healthz"})!=200.0 for: 1m labels: severity: warning annotations: summary: "{{ $value }}" description: IT系统HTTP探针非200异常 - alert: 华东clonevoice用量 expr: delta(gateway_product{matched_route_id="00000000000000001478",status="403"}[1m])>1.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 华东clonevoice用量 - alert: DUI集群Beta物理机磁盘fstab卷丢失 expr: sum by(device,mountpoint,hostname)(node_disk_volume_loss{hostname=~".*beta.*"})==0.0 for: 0m labels: severity: emergency annotations: summary: "{{ $value }}" description: DUI集群Beta物理机磁盘fstab卷丢失 - alert: 物理机内存使用率过高(线上声纹和其他服务) expr: round((sum by(hostname)(node_mem_usage{hostname!~".*beta.*"}) and on (hostname) node_k8s_service{servicename=~"me-vpr-.*|vpr-dp-sr-.*|vpr-lti-sr-.*|vpr-sti-sr-.*|vpr-ti-sr-.*|vpr-verify-.*|vpr-supplement"})*100,0.01)>90.0 for: 3m labels: severity: warning annotations: summary: "{{ $value }}" description: 物理机内存使用率过高(线上声纹和其他服务) - alert: 
DSKDM 错误码告警5分钟内出现50次 expr: sum(increase(DSKDM_monitor_dskdm_error{env="prod"}[5m])) by (errorId)>50.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: DSKDM 错误码告警5分钟内出现50次 - alert: 情感意图、半截句、形象定制 expr: sum(runtimeError{eventName=~"sys_gi_bnlu|sys_chat_rank|sys_sqa_rank|sys_input_output|sys_es_request|threadPool_is_full|sentence_convert|http_request_error",env=~"prod",serviceName=~"sqa"} - runtimeError{eventName=~"sys_gi_bnlu|sys_chat_rank|sys_sqa_rank|sys_input_output|sys_es_request|threadPool_is_full|sentence_convert|http_request_error",env=~"prod",serviceName=~"sqa"} offset 1m)>500.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 服务告警 - alert: smart-dataflow expr: runtimeError{env=~"prod",serviceName=~"smart-dataflow"} - runtimeError{env=~"prod",serviceName=~"smart-dataflow"} offset 1m>1.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: smart-dataflow服务有error报错 - alert: APISIX专用 redis内存使用率过高 expr: sum by(InstanceName,instanceId)(aliyun_acs_kvstore_MemoryUsage{instanceId=~"r-bp1qw6vs7uc08ic1dz.*"})>90.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: APISIX专用 redis内存使用率过高 - alert: freeswitch 健康检查报警 expr: sum(runtimeError{eventName=~"Healthz",env=~"prod"} - runtimeError{eventName=~"Healthz",env=~"prod"} offset 1m)>0.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: freeswitch 健康检查报警 - alert: 网关正式环境404 expr: sum(rate(gateway_fail{env=~"d1-prod|d3-prod",status="404",proxyUpstreamName=~""}[40s]))>3.0 for: 1m labels: severity: critical annotations: summary: "{{ $value }}" description: 网关正式环境404 - alert: 标注下载音频请求延时告警 expr: round(sum by (method,path)(rate(leo_api_request_duration_sum{path="/rest/leo/resource/9V0q7CaR",type="internal",status="200"}[2m])),0.01)>5.0 for: 2m labels: severity: warning annotations: summary: "{{ $value }}" description: 服务告警 - alert: 翻译服务告警 expr: sum by(errorId,
errorMsg) (translation_monitor_statistic{env="prod"} - _exported_translation_monitor_statistic{env="prod"} offset 1m)>20.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: 翻译服务告警 - alert: eva请求cloud-sds告警 expr: sum by(errorId, errorMsg) (_exported_eva_monitor_sds{env = "prod"} - _exported_eva_monitor_sds{env = "prod"} offset 1m)>25.0 for: 0m labels: severity: warning annotations: summary: "{{ $value }}" description: eva请求cloud-sds告警 - alert: 华东正式res服务5XX expr: sum(rate(gateway_fail{host="res.download.dui.ai",status=~"502|503|504"}[1m]))>1.0 for: 0m labels: severity: critical annotations: summary: "{{ $value }}" description: res服务5XX - alert: kylin job pending expr: sum(bdp_prod_kylin_pending_job+bdp_prod_kylin_error_job+bdp_prod_kylin_running_job)>200.0 for: 10m labels: severity: critical annotations: summary: "{{ $value }}" description: kylin 堆积任务超过200 - name: dataAbsent rules: - alert: pod异常(线上车载服务)-nodata expr: absent((sum by(pod)(kube_pod_status_ready{job="kube-state-metrics-d1-prod",namespace="cloud",pod=~"lyra-webhook-.*|lyra-webapi.*|lyra-octopus.*|lyra-xq-infrared.*|lyra-external-interface-service.*|softhardware-h5.*|dds-xiandou.*",condition="true"} and on (namespace,pod) kube_pod_status_phase{phase="Running"})))==1 for: 5m labels: severity: critical annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: hive metastore qps-nodata expr: absent(sum(bdp_prod_metastore_metastore{name=~"api.*",type="OneMinuteRate"}))==1 for: 5m labels: severity: critical annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 专线线路丢包率过高-nodata expr: absent(sum by(vbrid,vbrname)(duimonitor_vbrhealthychecklossrate))==1 for: 5m labels: severity: emergency annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: AIOS服务请求异常(httpas)-nodata expr: absent(aios_request_error{service="httpas"})==1 for: 5m labels: severity: critical annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 
aios-apisix拨测-nodata expr: absent(speech_blackbox_testing{service="apisix-aios"})==1 for: 5m labels: severity: critical annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: NWA云之家工单ECS到期-nodata expr: absent(round((max by (domain, manager) (aispeech_expired_timestamp_ecs_aliyun) - time())/24/3600,0.1))==1 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: hive metastore连接数-nodata expr: absent(sum(bdp_prod_metastore_metastore{name="open_connections"}))==1 for: 5m labels: severity: critical annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 专线线路入口流量过高-nodata expr: absent(round(sum by(vbrid,vbrname)(duimonitor_vbrinrate)/1024/1024,0.01))==1 for: 5m labels: severity: emergency annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 国科LB告警-nodata expr: absent(sum by (mode)(ops_monitor))==1 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 国科单点mongo挂了-nodata expr: absent(ops_mongo_nfs001_monitor)==1 for: 5m labels: severity: emergency annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 国科入口haproxy异常-nodata expr: absent(sum by(mode)(ops_monitor))==1 for: 5m labels: severity: warning annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 华东正式环境rabbitmqSocket满了-nodata expr: absent(sum by(node)(rabbitmq_sockets_used{job="rabbitmq-exporter-d1-prod"}))==1 for: 5m labels: severity: emergency annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 国科Mongo集群挂了-nodata expr: absent(ops_mongo_monitor)==1 for: 5m labels: severity: emergency annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: ACK正式环境rabbitmq内存满了-nodata expr: absent(sum by(node)(rabbitmq_node_mem_used{ job="rabbitmq-exporter-d1-prod"}) /1024/1024)==1 for: 5m labels: severity: emergency annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: hive metastore latency-nodata expr: 
absent(avg(bdp_prod_metastore_metastore{name=~"api.*",type="99thPercentile"}))==1 for: 5m labels: severity: critical annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 域名即将过期-nodata expr: absent(sum(dns_expire_date{}) by (domain))==1 for: 5m labels: severity: emergency annotations: summary: "{{ $value }}" description: 该告警监控数据缺失 - alert: 五菱私有云情感合成-nodata expr: absent(sum(service_connecter_private{env="sgmw", servicename="qg-ttsserver"}) by (servicename))==1 for: 5m labels: severity: critical annotations: summary: "{{ $value }}" description: 该告警监控数据缺失