apiVersion: v1
data:
  rules-common.yml: |-
    groups:
    - name: common
      rules:
      - alert: Domain registration expires in less than 30 days
        expr: round(sum by (company, domain_name) ((expiretime - time()) / 3600 / 24)) < 30
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Domain registration expires in less than 30 days
      - alert: SSL certificate expires in less than 7 days
        expr: round(sum by (env, job, service, url) ((probe_ssl_earliest_cert_expiry - time()) / 3600 / 24)) < 7
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: SSL certificate expires in less than 7 days
      - alert: Domain endpoint returned an abnormal HTTP status code
        expr: sum by (job, env, service, url) (probe_http_status_code != 200 and probe_http_status_code != 404 and probe_http_status_code != 403)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Domain endpoint returned an abnormal HTTP status code
      - alert: Physical host disk status abnormal
        expr: sum by (hostname, hostip, slot, Size) (node_disk_status) != 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Physical host disk status abnormal
      - alert: Disk media error count too high
        expr: sum by (hostname, hostip, slot) (node_disk_media_error) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Disk media error count too high
      - alert: Disk other error count too high
        expr: sum by (hostname, hostip, slot) (node_disk_other_error) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Disk other error count too high
      - alert: System disk mount point read/write failure
        expr: sum by (device, fs, mountpoint, hostname, hostip) (node_disk_mount) == 0
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: System disk mount point read/write failure
      - alert: System disk fstab volume missing
        expr: sum by (device, mountpoint, hostname, hostip) (node_disk_volume_loss) == 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: System disk fstab volume missing
      - alert: System disk I/O utilization too high
        expr: sum by (hostname, hostip, device) (node_disk_ioutil{hostip!~"192.168.31.*"}) > 99
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: System disk I/O utilization too high
      - alert: System user modified
        expr: sum by (hostip, hostname, username) (delta(node_shadow_md5[5m])) != 0
        for: 0m
        labels:
          severity: emergency
        annotations:
          summary: "{{ $value }}"
          description: System user modified
      - alert: System user modification details
        expr: sum by (hostip, hostname, username, action) (node_systemuser_status) == 1
        for: 0m
        labels:
          severity: emergency
        annotations:
          summary: "{{ $value }}"
          description: System user modification details
      - alert: Physical host offline
        expr: sum by (hostip, instance) (up{job=~"base-exporter-.*"}) == 0
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Physical host offline
      - alert: System rebooted within the last 5 minutes
        expr: round(sum by (hostname, hostip) (node_uptime) / 60) < 5.0
        for: 1m
        labels:
          severity: emergency
        annotations:
          summary: "{{ $value }}"
          description: System rebooted within the last 5 minutes (check together with the host offline alert)
      - alert: Physical host CPU usage too high
        expr: sum by (hostname, hostip) (node_cpu_usage_total{hostip!~"192.168.31.*"}) * 100 > 80
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Physical host CPU usage too high
      - alert: 24h average memory usage above 95%
        expr: round(sum by (hostname, hostip) (avg_over_time(node_mem_usage[1d])) * 100, 0.01) > 95
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 24h average memory usage above 95%
      - alert: Physical host disk larger than 1 TiB has less than 5% free space
        expr: round(sum by (hostname, hostip, device, mountpoint) ((node_disk_usage and on (hostname, device) (node_disk_total > 1099511627776))) * 100, 0.01) > 95
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Physical host disk larger than 1 TiB has less than 5% free space
      - alert: Physical host disk of 1 TiB or smaller has less than 10% free space
        expr: round(sum by (hostname, hostip, device, mountpoint) ((node_disk_usage and on (hostname, device) (node_disk_total < 1099511627777))) * 100, 0.01) > 90
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Physical host disk of 1 TiB or smaller has less than 10% free space
      - alert: System load5 too high
        expr: round(sum by (hostname, hostip) (node_load5 / node_cpu_core), 0.01) > 1.5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: System load5 too high
  rules-k8s.yml: |-
    groups:
    - name: k8s
      rules:
      - alert: K8s cluster node NotReady
        expr: sum by (job, node) (kube_node_status_condition{condition="Ready",status="true"}) == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: K8s cluster node NotReady
      - alert: K8s cluster resource shortage
        expr: kube_node_status_condition{condition=~"OutOfDisk|MemoryPressure|DiskPressure",status!="false"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: K8s cluster resource shortage
      - alert: K8s cluster PVC has less than 20% free space
        expr: round(sum by (namespace, persistentvolumeclaim, job) (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100, 0.01) < 20
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: K8s cluster PVC has less than 20% free space
      - alert: K8s cluster Pod restarted within the last 15 minutes
        expr: sum by (container, k8scluster, namespace, pod) (changes(kube_pod_container_status_restarts_total{pod!~"kuboard-pv-browser.*"}[15m])) > 0
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: K8s cluster Pod restarted within the last 15 minutes
      - alert: K8s cluster StatefulSet replicas abnormal
        expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: K8s cluster StatefulSet replicas abnormal
  rules-network.yml: |-
    groups:
    - name: network
      rules:
      - alert: Network device interface unexpectedly down
        expr: sum by (ifIndex, ifName, instance, project) (ifOperStatus and on (ifIndex, ifName, instance, project) avg_over_time(ifOperStatus[24h]) == 1) != 1
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Network device interface unexpectedly down
      - alert: Network device power supply status abnormal
        expr: sum by (hh3cDevMPowerNum, instance, project) (hh3cDevMPowerStatus) != 1
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Network device power supply status abnormal
      - alert: Network device fan status abnormal
        expr: sum by (hh3cDevMFanNum, instance, project) (hh3cDevMFanStatus) != 1
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Network device fan status abnormal
      - alert: Network device outbound traffic above 120 MiB/s
        expr: round(sum by (ifIndex, ifName, instance) (irate(ifHCOutOctets[5m]) / 1024 / 1024), 0.01) > 120
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Network device outbound traffic above 120 MiB/s
      - alert: Network device inbound traffic above 120 MiB/s
        expr: round(sum by (ifIndex, ifName, instance) (irate(ifHCInOctets[5m]) / 1024 / 1024), 0.01) > 120
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Network device inbound traffic above 120 MiB/s
      - alert: Network device rebooted within the last 5 minutes
        expr: round(sum by (instance, project) (sysUpTime / 100 / 60)) < 5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: Network device rebooted within the last 5 minutes (check together with the device offline alert)
  rules-uenpay.yml: |-
    groups:
    - name: uenpay
      rules:
      - alert: 刷新支付 service port not listening
        expr: sum by (env, project, service, ip, port) (probe_success{project="sxzf"}) != 1
        for: 0s
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 刷新支付 service port not listening
  rules-vmvare.yml: |-
    groups:
    - name: vmware
      rules:
      - alert: ESXi host offline
        expr: sum by (dc_name, host_name) (vmware_host_power_state) != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: ESXi host offline for more than 5 minutes
      - alert: ESXi host CPU usage
        expr: sum by (dc_name, host_name) (vmware_host_cpu_usage / vmware_host_cpu_max) * 100 > 95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: ESXi host CPU usage above 95%
      - alert: ESXi host memory usage
        expr: sum by (dc_name, host_name) (vmware_host_memory_usage / vmware_host_memory_max) * 100 > 99
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: ESXi host memory usage above 99%
      - alert: ESXi datastore capacity usage
        expr: sum by (dc_name, ds_name) ((vmware_datastore_capacity_size - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100 > 99
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: ESXi datastore capacity usage above 99%
  rules-wfm.yml: |-
    groups:
    - name: "wfm"
      rules:
      - alert: 微付猫 host offline
        expr: sum by (hostip, instance) (up{job="base-exporter-d1-others",project="wfm"}) != 1
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 微付猫 host offline
      - alert: 微付猫 host disk has less than 20% free space
        expr: round(sum by (hostname, hostip, device, mountpoint) (node_disk_usage{project="wfm"}) * 100, 0.01) > 80
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 微付猫 host disk has less than 20% free space
      - alert: 微付猫 host 24h average memory usage above 90%
        expr: round(sum by (hostip, hostname) (avg_over_time(node_mem_usage{project="wfm"}[1d])) * 100, 0.01) > 90
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 微付猫 host 24h average memory usage above 90%
      - alert: 微付猫 service port not listening
        expr: sum by (env, project, service, ip, port) (probe_success{project="wfm"}) != 1
        for: 0s
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 微付猫 service port not listening
      - alert: 微付猫 MongoDB connection count below 10
        expr: sum by (env, state) (mongodb_connections{env="mongo-wfm-prod",state="current"}) < 10
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 微付猫 MongoDB connection count below 10
  rules-xsf.yml: |-
    groups:
    - name: "xsf"
      rules:
      - alert: 新闪付 host offline
        expr: sum by (hostip, instance) (up{job=~"base-exporter-d3.*"}) != 1
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 新闪付 host offline
      - alert: 新闪付 host CPU usage above 70%
        expr: round(sum by (hostname, hostip) (node_cpu_usage{job=~"base-exporter-d3.*"}) * 100, 0.01) > 70
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 新闪付 host CPU usage above 70%
      - alert: 新闪付 host 24h average memory usage above 90%
        expr: round(sum by (hostip, hostname) (avg_over_time(node_mem_usage{job=~"base-exporter-d3.*"}[1d])) * 100, 0.01) > 90
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 新闪付 host 24h average memory usage above 90%
      - alert: 新闪付 host disk has less than 20% free space
        expr: round(sum by (hostname, hostip, device, mountpoint) (node_disk_usage{job=~"base-exporter-d3.*"} * 100), 0.01) > 80
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 新闪付 host disk has less than 20% free space
      - alert: 新闪付 service Pod not running normally
        expr: (sum by (pod) (kube_pod_status_ready{job="kube-state-metrics-d3-prod",namespace="xsf",condition="true"} and on (namespace, pod) kube_pod_status_phase{phase="Running"})) == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 新闪付 service Pod not running normally
      - alert: 新闪付 service Pod restarted unexpectedly
        expr: sum by (namespace, pod) (round(delta(kube_pod_container_status_restarts_total{job="kube-state-metrics-d3-prod",namespace=~"xsf"}[10m]))) != 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }}"
          description: 新闪付 service Pod restarted unexpectedly
kind: ConfigMap
metadata:
  name: rules-config
  namespace: kube-public