6.2 Prometheus Metrics in Detail

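Section 6.1 covered how metrics are exposed. This section groups them by network, Actor, RPC, Gate, pools and NATS, Go runtime, and per-handler. Metric names follow the strings actually registered in zhenyi/zmetrics/framework.go and handler.go; newer versions may add or remove a few.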

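6.2.1 Metric Categories

zhenyi organizes its metrics by layer:

Network layer (Conn*)         → connections, traffic
Actor layer (Actor*)          → message handling, queues, restarts
RPC layer (RPC*)              → remote calls
Gate layer (Gate*)            → routing, online users, RTT
Object pool (MsgPool*)        → memory safety
NATS layer (Nats*)            → cross-process communication
Go runtime (Go*/go_*)         → memory, GC, goroutines
Handler level (per-handler)   → latency down to each msgId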

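6.2.2 Network Layer Metrics

Metric	Type	Meaning
`zhenyi_conn_active`	Gauge	Number of currently active connections
`zhenyi_conn_accepted_total`	Counter	Total connections accepted
`zhenyi_conn_rejected_total`	Counter	Connections rejected by rate limiting
`zhenyi_bytes_recv_total`	Counter	Total bytes received
`zhenyi_bytes_sent_total`	Counter	Total bytes sent
`zhenyi_conn_errors_total`	Counter	Connection errors (read/write or parse failures)
`zhenyi_conn_heartbeat_timeout_total`	Counter	Connections closed due to heartbeat timeout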

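Typical usage:

# Connection rate
rate(zhenyi_conn_accepted_total[5m])

# Currently online connections
zhenyi_conn_active

# Error ratio (connection errors per accepted connection)
rate(zhenyi_conn_errors_total[5m]) / rate(zhenyi_conn_accepted_total[5m])

A sudden spike in conn_rejected_total suggests the rate-limit configuration may need adjusting.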

6.2.3 Actor Layer Metrics

Metric	Type	Meaning
`zhenyi_actor_msg_recv_total`	Counter	Total messages received by Actors (Push)
`zhenyi_actor_msg_handled_total`	Counter	Total messages fully handled by Actors
`zhenyi_actor_msg_dropped_total`	Counter	Messages dropped (no handler registered)
`zhenyi_actor_tick_total`	Counter	Number of Tick invocations
`zhenyi_actor_tick_latency_ms`	Histogram	Tick handling latency
`zhenyi_actor_panic_total`	Counter	Number of recovered Actor panics
`zhenyi_actor_restarts_total`	Counter	Number of Actor restarts by the supervisor
`zhenyi_actor_queue_depth`	Gauge	Mailbox queue depth (sampled)
`zhenyi_actor_msg_latency_ms`	Histogram	Message handling latency
`zhenyi_actor_workerpool_running`	Gauge	Goroutines currently running in the worker pool
`zhenyi_actor_workerpool_capacity`	Gauge	Worker pool capacity
`zhenyi_actor_blocked_total`	Counter	Blocking events detected by the Watchdog

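Key alert rules:

# Actor restarting frequently
rate(zhenyi_actor_restarts_total[5m]) > 0.1
→ An Actor keeps panicking; look for a bug in its handlers

# Queue backlog
zhenyi_actor_queue_depth > 1000
→ The Actor cannot keep up with the message arrival rate; consider scaling out or optimizing the handler

# Message handling latency spike
histogram_quantile(0.99, rate(zhenyi_actor_msg_latency_ms_bucket[5m])) > 100
→ P99 latency exceeds 100 ms, which affects user experience

# Messages dropped
rate(zhenyi_actor_msg_dropped_total[5m]) > 0
→ Some msgId has no registered handler; check the routing configuration

If actor_workerpool_running stays equal to capacity, the goroutine pool is saturated and may need to be enlarged via hot reload.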

6.2.4 RPC Layer Metrics

Metric	Type	Meaning
`zhenyi_rpc_sent_total`	Counter	Total RPC requests sent
`zhenyi_rpc_success_total`	Counter	Total successful RPCs
`zhenyi_rpc_timeout_total`	Counter	Total RPC timeouts
`zhenyi_rpc_circuit_breaker_tripped_total`	Counter	Circuit breaker trips (tracked inside each sending Actor and bucketed by target `actorId`; not a single process-wide switch)
`zhenyi_rpc_latency_ms`	Histogram	RPC round-trip latency

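# RPC timeout ratio
rate(zhenyi_rpc_timeout_total[5m]) / rate(zhenyi_rpc_sent_total[5m])

# RPC P99 latency
histogram_quantile(0.99, rate(zhenyi_rpc_latency_ms_bucket[5m]))

# Circuit breaker trip frequency
rate(zhenyi_rpc_circuit_breaker_tripped_total[5m])

A rising timeout ratio suggests the target Actor is overloaded or the network is having problems. A circuit breaker trip means the target service is unavailable and traffic to it has been cut off automatically.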
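
To illustrate what "bucketed by target actorId inside the sending Actor" means for the circuit breaker counter, here is a minimal Go sketch. Everything in it is an assumption made for illustration: the `targetBreaker` type, its methods, and the trip policy (a fixed number of consecutive timeouts) are hypothetical and do not describe zhenyi's actual breaker. Recovery logic such as a cooldown or half-open probing is omitted.

```go
package main

import "fmt"

// targetBreaker is a hypothetical breaker owned by one sending Actor.
// State is kept per target actorId, so one failing target does not block
// calls to the others. It is not safe for concurrent use; the sketch assumes
// the owning Actor touches it from a single goroutine.
type targetBreaker struct {
	threshold int             // consecutive timeouts before tripping (assumed policy)
	timeouts  map[uint64]int  // consecutive timeouts per target actorId
	open      map[uint64]bool // targets whose breaker is currently open
	tripped   uint64          // what a *_tripped_total counter would count
}

func newTargetBreaker(threshold int) *targetBreaker {
	return &targetBreaker{
		threshold: threshold,
		timeouts:  make(map[uint64]int),
		open:      make(map[uint64]bool),
	}
}

// Allow reports whether a call to target may be sent.
func (b *targetBreaker) Allow(target uint64) bool { return !b.open[target] }

// OnTimeout records one timeout and reports whether it tripped the breaker.
func (b *targetBreaker) OnTimeout(target uint64) bool {
	if b.open[target] {
		return false // already open; each trip is counted only once
	}
	b.timeouts[target]++
	if b.timeouts[target] < b.threshold {
		return false
	}
	b.open[target] = true
	b.tripped++
	return true
}

// OnSuccess clears the failure streak and closes the breaker for target.
func (b *targetBreaker) OnSuccess(target uint64) {
	delete(b.timeouts, target)
	delete(b.open, target)
}

func main() {
	b := newTargetBreaker(3)
	for i := 0; i < 3; i++ {
		b.OnTimeout(1001)
	}
	fmt.Println("target 1001 allowed:", b.Allow(1001))
	fmt.Println("target 1002 allowed:", b.Allow(1002))
	fmt.Println("trips:", b.tripped)
}
```

With per-target state like this, a nonzero trip rate says that at least one sender has given up on at least one target. It does not say that all RPC traffic in the process has stopped.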

6.2.5 Gate and Traffic Metrics

Metric	Type	Meaning
`zhenyi_gate_online_users`	Gauge	Online users as seen by the Gate (implemented in `ServerMetrics`)
`zhenyi_gate_recv_qps` / `zhenyi_gate_sent_qps`	Gauge	Receive/send QPS (sampled)
`zhenyi_gate_rtt_ms`	Histogram	Client RTT
`zhenyi_gate_route_gate_self_total`	Counter	Messages handled by the Gate itself
`zhenyi_gate_route_local_total`	Counter	Messages routed within the local process
`zhenyi_gate_route_remote_total`	Counter	Messages routed to remote processes
`zhenyi_gate_route_no_route_total`	Counter	Messages with no route
`zhenyi_gate_route_remote_fail_total`	Counter	Remote routing failures
`zhenyi_gate_route_remote_candidates`	Gauge	Number of remote candidate nodes (most recent sample)
`zhenyi_gate_route_remote_try_total`	Counter	Total remote routing attempts (including retries)
`zhenyi_gate_route_remote_fallback_total`	Counter	Remote routing fallbacks

# Route hit ratio
(sum(rate(zhenyi_gate_route_gate_self_total[5m])) + sum(rate(zhenyi_gate_route_local_total[5m])) + sum(rate(zhenyi_gate_route_remote_total[5m])))
/ sum(rate(zhenyi_actor_msg_recv_total[5m]))

# No-route ratio (should be 0)
sum(rate(zhenyi_gate_route_no_route_total[5m])) / sum(rate(zhenyi_actor_msg_recv_total[5m]))

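The routing distribution is important operational information:

Gate self high → many messages are handled at the Gate layer (for example heartbeats and authentication)
Local high → most messages are handled in the local process (good performance)
Remote high → many messages have to cross processes (higher latency)
NoRoute > 0 → there is a bug; some msgId has no corresponding handler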
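
The distribution described above can be turned into concrete shares from the increase of each route counter over one window. The Go sketch below is illustrative only: the `routeDeltas` type and `shares` method are hypothetical, and it assumes the four route counters together account for all routed messages, which the source does not state explicitly.

```go
package main

import "fmt"

// routeDeltas holds the increase of each zhenyi_gate_route_*_total counter
// over one observation window. The struct itself is hypothetical.
type routeDeltas struct {
	GateSelf float64
	Local    float64
	Remote   float64
	NoRoute  float64
}

// shares returns each route's fraction of the window's total. All fractions
// are 0 when nothing was routed, to avoid dividing by zero.
func (d routeDeltas) shares() (gateSelf, local, remote, noRoute float64) {
	total := d.GateSelf + d.Local + d.Remote + d.NoRoute
	if total == 0 {
		return 0, 0, 0, 0
	}
	return d.GateSelf / total, d.Local / total, d.Remote / total, d.NoRoute / total
}

func main() {
	d := routeDeltas{GateSelf: 500, Local: 3000, Remote: 1500, NoRoute: 0}
	g, l, r, n := d.shares()
	fmt.Printf("gate=%.0f%% local=%.0f%% remote=%.0f%% no-route=%.0f%%\n",
		g*100, l*100, r*100, n*100)
	if n > 0 {
		fmt.Println("warning: some msgId has no route")
	}
}
```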

6.2.6 Handler-Level Metrics (per-handler)

These are the finest-grained metrics, down to each (actorId, msgId) combination:

zhenyi_handler_total{handler="1001",actor_id="1",actor_type="2"} 12345
zhenyi_handler_slow_total{handler="1001",actor_id="1",actor_type="2"} 50
zhenyi_handler_latency_ms_bucket{handler="1001",actor_id="1",actor_type="2",le="10"} 12000

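Metric	Type	Meaning
`zhenyi_handler_total`	Counter	Number of invocations of this handler
`zhenyi_handler_slow_total`	Counter	Number of invocations that exceeded the 10 ms threshold
`zhenyi_handler_latency_ms`	Histogram	Handling latency of this handler

Performance overhead: the source comments put it at roughly 8 ns (Inc + ObserveDuration). The actual cost depends on the CPU, contention, and the histogram implementation, so rely on benchmarks and profiles.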

Slow Message Troubleshooting

# Which handlers are slowest?
topk(5, histogram_quantile(0.99, sum by (handler, le) (rate(zhenyi_handler_latency_ms_bucket[5m]))))

# Share of slow messages
rate(zhenyi_handler_slow_total[5m]) / rate(zhenyi_handler_total[5m])

The handler label pinpoints a performance bottleneck to a specific msgId.

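6.2.7 Object Pool and NATS Metrics

Metric	Meaning	Alert
`zhenyi_msgpool_double_release_total`	Number of times a message object was released twice	Alert on any value > 0; it indicates a bug
`zhenyi_nats_publish_total`	Total NATS publishes	-
`zhenyi_nats_publish_errors_total`	NATS publish failures	Alert when > 0
`zhenyi_nats_request_total`	Total NATS requests	-
`zhenyi_nats_request_errors_total`	NATS request failures	Alert when > 0
`zhenyi_nats_request_latency_ms`	NATS request latency	Alert when P99 > 50 ms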
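
The double-release counter in the table above guards against a pooled message being returned to the pool twice, which could hand the same object to two owners. The Go sketch below shows one way such detection could work, using a flag flipped with compare-and-swap. The `message` type and the `acquire` / `release` functions are hypothetical and are not zhenyi's pool API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// message is a hypothetical pooled object with a released flag.
type message struct {
	payload  []byte
	released atomic.Bool
}

var (
	pool = sync.Pool{New: func() any { return new(message) }}
	// doubleRelease plays the role of a *_double_release_total counter.
	doubleRelease atomic.Uint64
)

// acquire takes a message from the pool and marks it as in use.
func acquire() *message {
	m := pool.Get().(*message)
	m.released.Store(false)
	return m
}

// release returns m to the pool. A second release of the same message is
// counted and ignored instead of putting the object into the pool twice.
func release(m *message) bool {
	if !m.released.CompareAndSwap(false, true) {
		doubleRelease.Add(1)
		return false
	}
	m.payload = m.payload[:0]
	pool.Put(m)
	return true
}

func main() {
	m := acquire()
	m.payload = append(m.payload, "hello"...)
	fmt.Println("first release ok:", release(m))
	fmt.Println("second release ok:", release(m))
	fmt.Println("double releases:", doubleRelease.Load())
}
```

A flag like this only catches a second release that happens before the object is handed out again. Once another caller has acquired it, a stale release looks legitimate, which is one reason any nonzero count deserves immediate attention.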

6.2.8 Go Runtime Metrics

zhenyi collects Go runtime metrics automatically, with names aligned to NewGoCollector in prometheus/client_golang:

Metric	Meaning
`go_memstats_alloc_bytes`	Bytes currently allocated on the heap
`go_memstats_sys_bytes`	Total bytes obtained from the OS
`go_memstats_heap_inuse_bytes`	Heap bytes in use
`go_memstats_heap_idle_bytes`	Idle heap bytes
`go_memstats_heap_released_bytes`	Bytes returned to the OS
`go_goroutines`	Current number of goroutines
`go_threads`	Current number of OS threads
`go_gc_cycles_total`	Total GC cycles
`go_gc_last_pause_ns`	Duration of the most recent GC pause
`go_gc_pause_total_ns`	Cumulative GC pause time
`go_gc_gogc_percent`	Current GOGC value

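# GC frequency (cycles per second)
rate(go_gc_cycles_total[1m])

# GC pause share (fraction of wall-clock time spent in STW pauses)
rate(go_gc_pause_total_ns[5m]) / 1e9

# Goroutine leak detection
go_goroutines > 10000  # adjust the threshold to your workload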
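Tuning GOGC can lower GC frequency.

zhenyi's RingBuffer, message object pool, and buffer pool all exist to reduce heap allocations and relieve GC pressure. If GC is still frequent, consider raising GOGC (default 100, to 200 or higher), at the cost of a larger heap.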

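Collection Strategy

StartRuntimeCollector (started through the Enable path) uses a default interval of 30s (configurable). On each tick it calls runtime.ReadMemStats to update most of the gauge/counter mirrors. GOGC is read via debug.SetGCPercent(-1), which returns the previous setting but also changes the global value as it does so. The read is therefore deliberately placed on a low-frequency branch that runs about once a minute, to limit the side effects of reading and writing a global GC parameter (see the comments in zmetrics/runtime.go).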

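6.2.9 Grafana Dashboard Suggestions

A practical production Grafana dashboard should include:

Row 1: System overview

Online connections (zhenyi_conn_active)
Receive/send QPS (Gate layer)
Goroutine count
GC pause time

Row 2: Message handling

Message receive rate (zhenyi_actor_msg_recv_total)
Message handling P50/P99 latency (zhenyi_actor_msg_latency_ms)
Dropped messages (zhenyi_actor_msg_dropped_total)
Queue depth (zhenyi_actor_queue_depth)

Row 3: Routing distribution

Local / remote / no-route shares (pie chart)
Remote routing failure rate
Remote candidate node count

Row 4: RPC and NATS

RPC success ratio / timeout ratio
RPC P99 latency
NATS publish/request error counts
NATS request latency

Row 5: Memory

Heap allocation (go_memstats_alloc_bytes)
GC frequency and pause time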

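6.2.10 Key Takeaways

Naming: the zhenyi_* / go_* names in framework.go are authoritative; there may also be zhenyi_zpool_*, zhenyi_monitor_snapshot, and others (see the Enable comments and MONITORING_METRICS.md).
Gate: beyond the routing counters, watch online users, QPS, RTT, and remote fallback.
Handler: zhenyi_handler_* carries three labels: handler / actor_id / actor_type.
Runtime: collected at a configurable interval; GOGC is read at low frequency.
Dashboards: layering by connections → Actor → routing/RPC → NATS → memory is sufficient.

Section 6.3: SetTraceHooks and message trace fields.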
