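CAT Performance Optimization: Practice and Reflections

1. CAT Performance Optimization: Practice and Reflections - Liang Jinhua (梁锦华), Senior Technical Expert, Ctrip

2. About Me

3. About Me - Previously at Alibaba, Baidu, and Dianping - 10+ years of framework and middleware R&D - Ctrip: message middleware and application monitoring

4. Agenda - CAT at Ctrip - CAT performance optimization cases - Summary and reflections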

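6. CAT at Ctrip - Rolled out at the end of 2014 - Monitoring data has grown exponentially - [Chart: data volume and machine count, 2015 to 2019] - https://github.com/dianping/cat - 2019 figures: 70,000+ clients - Message trees: 800 billion messages/day, 900 TB/day, peak traffic 30 million messages/s - Logs: 4 trillion lines/day, 900 TB/day, peak traffic 150 million lines/s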

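7. Agenda - CAT at Ctrip - CAT performance optimization cases - Summary and reflections

8. CAT's computation model - [Diagram: the CAT Server receives MessageTrees and passes each one to every Report Analyzer (Transaction Report Analyzer, Event Report Analyzer, ..., RPC Report Analyzer); each analyzer maintains its own report (Transaction Report, Event Report, ..., RPC Report)] - Real-time analysis, in memory

9. Report example

10. Common problems on the CAT server - CPU saturated - Frequent GC

11. #1 Thread model optimization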

12. CAT thread model - [Diagram: each MessageTree is routed by hash(AppId + IP); every Report Analyzer (Report1, Report2, Report3) has its own set of Analyzer Threads, and each Analyzer Thread owns its own Report Data] - Data is bound to a thread, so updates need no locks
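
The routing on this slide can be sketched in a few lines of Java. This is a minimal illustration, not CAT's actual code: the class `PartitionedQueues`, its methods, and the bounded per-partition queue are all assumptions made for the example. It shows why updates can be lock-free (one owner thread per partition) and also why a skewed application can fill a single queue while others sit idle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: class and method names are illustrative, not CAT's own.
final class PartitionedQueues {
    private final List<BlockingQueue<String>> queues = new ArrayList<>();

    PartitionedQueues(int partitions, int capacityPerQueue) {
        for (int i = 0; i < partitions; i++) {
            queues.add(new ArrayBlockingQueue<>(capacityPerQueue));
        }
    }

    // The same (appId, ip) pair always lands in the same partition, so the
    // single thread that owns the partition can update its data without locks.
    int partitionOf(String appId, String ip) {
        return Math.floorMod((appId + ip).hashCode(), queues.size());
    }

    // Non-blocking enqueue; false means the owning queue is full and the
    // message would be dropped.
    boolean offer(String appId, String ip, String message) {
        return queues.get(partitionOf(appId, ip)).offer(message);
    }

    // Intended to be called only by the thread that owns the partition;
    // returns null when the queue is empty.
    String poll(int partition) {
        return queues.get(partition).poll();
    }
}
```

With a capacity of 2, a third message from the same application and IP is rejected even if every other partition is empty, which is the backlog scenario the next slide describes.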

13. Problem encountered - Uneven traffic across applications causes some queues to back up - Spreading the data: add more queues and threads - After several rounds of tuning, processing capacity actually went down - [Diagram: Report1 Analyzer, where hash(AppId + IP) fans out to a growing number of Analyzer Threads, each with its own Report Data]

14. Analysis - Why processing capacity dropped: too many threads, frequent context switches - Fix: decouple queues from threads - [Diagram: Report1 Analyzer, hash(AppId + IP), Analyzer Threads each bound to one Report Data]

15. Analysis - [Diagram: an analogy with network I/O. BIO: one Processor Thread per Socket Channel, like one Analyzer Thread per Report Data queue. NIO: one Selector plus a Thread Pool serving many Socket Channels, like one Selector plus a Thread Pool serving many Report Data queues in Report1 Analyzer]

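16. New thread model - Selector: watches the queues for data and applies the scheduling policy - Scheduling policy: make full use of the thread pool; never update the same queue's data concurrently - [Diagram: Report1 Analyzer, hash(AppId + IP) -> Report Data queues -> Selector -> Thread Pool]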
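
One way to realize the two scheduling rules on this slide is a per-queue busy flag: the selector hands a queue to the pool only if it is non-empty and not already being drained. The sketch below is an assumption-laden illustration, not CAT's implementation; `SelectorScheduler`, `Slot`, the batch size of 1000, and the string-counting "report" are all invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: names and the batch size are illustrative, not CAT's own.
final class SelectorScheduler {
    private static final int BATCH = 1000;

    private static final class Slot {
        final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        final AtomicBoolean busy = new AtomicBoolean(false);
        // Plain HashMap: only touched by the one task that currently holds `busy`.
        final Map<String, Integer> report = new HashMap<>();
    }

    private final Slot[] slots;
    private final ExecutorService pool;

    SelectorScheduler(int queueCount, int threads) {
        slots = new Slot[queueCount];
        for (int i = 0; i < queueCount; i++) {
            slots[i] = new Slot();
        }
        pool = Executors.newFixedThreadPool(threads);
    }

    private Slot slotOf(String key) {
        return slots[Math.floorMod(key.hashCode(), slots.length)];
    }

    void offer(String key) {
        slotOf(key).queue.offer(key);
    }

    // One selector pass: hand every non-empty, idle queue to the pool.
    int selectOnce() {
        int scheduled = 0;
        for (Slot s : slots) {
            if (!s.queue.isEmpty() && s.busy.compareAndSet(false, true)) {
                pool.execute(() -> drain(s));
                scheduled++;
            }
        }
        return scheduled;
    }

    // Process at most BATCH items so one hot queue cannot monopolize a thread.
    private static void drain(Slot s) {
        try {
            for (int i = 0; i < BATCH; i++) {
                String key = s.queue.poll();
                if (key == null) {
                    break;
                }
                s.report.merge(key, 1, Integer::sum);
            }
        } finally {
            s.busy.set(false); // release so the selector can schedule this queue again
        }
    }

    // Run selector passes until every queue is drained, then stop the pool.
    void drainAndShutdown() {
        boolean pending = true;
        while (pending) {
            selectOnce();
            pending = false;
            for (Slot s : slots) {
                if (!s.queue.isEmpty() || s.busy.get()) {
                    pending = true;
                }
            }
            Thread.yield();
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Intended to be called after drainAndShutdown().
    int count(String key) {
        return slotOf(key).report.getOrDefault(key, 0);
    }
}
```

The number of queues and the number of threads are now independent parameters: the queue count can grow to spread skewed traffic while the pool stays near the CPU core count.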

17. New thread model - [Diagram: inside the CAT Server, each Report Analyzer (Report1, Report2) has its own hash(AppId + IP) -> Report Data queues -> Selector -> Thread Pool]

18. Summary - Decouple queues from threads - Multiple queues - Balanced data - Fewer context switches - More flexible scheduling policies - As many threads as CPU cores - Less lock contention on any single queue - Before (one Analyzer Thread bound to each Report Data): CPU > 90%, 5% data loss - After (Report Data queues -> Selector -> Thread Pool): CPU ~70%, no data loss

19. #2 Client-side computation

20. Problem encountered - CPU saturated again, data lost

21. Analysis - Not enough capacity: add machines - Optimize to save CPU - Move part of the server-side computation to the client

22. Analysis - Which computations are suited to the client? - Logic that does not change - Logic that uses a lot of server CPU - Answer: the Transaction and Event Reports

23. CPU usage of the Transaction/Event Reports - [Chart: 5.3 cores and 2.2 cores] - The two Reports together use 7.5 cores, 23% of CPU resources!

24. Transaction/Event Report computation - Server-side computation: the client sends every MessageTree; the server (1) traverses each MessageTree and (2) updates the Transaction/Event metrics - Client-side computation: the client (1) merges the statistics of multiple MessageTrees and (2) sends them in a single message (Transaction 1 metrics, Transaction 2 metrics, Event 1 metrics, Event 2 metrics) - [Diagram: Client -> Server, many MessageTrees versus one merged metrics list]

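25. Client-side computation for the Transaction/Event reports - [Diagram: on the client, a Transaction Metrics Aggregator and an Event Metrics Aggregator build a MetricsList that is sent on a timer, with priority, to the server's Transaction Report Analyzer and Event Report Analyzer; MessageTrees still go to Report 1 Analyzer and Report 2 Analyzer]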
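
A client-side aggregator of the kind drawn here could look roughly like the following. This is a sketch under assumptions: the class names, the metric fields (count, failure count, summed duration), and the `flush()` hand-off are illustrative and are not taken from the CAT client.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and fields are illustrative, not the CAT client's own.
final class TransactionMetric {
    final String type;
    final String name;
    long count;
    long failCount;
    long sumDurationMs;

    TransactionMetric(String type, String name) {
        this.type = type;
        this.name = name;
    }
}

final class TransactionMetricsAggregator {
    // type -> name -> metric accumulated during the current interval
    private Map<String, Map<String, TransactionMetric>> current = new HashMap<>();

    // Called once per completed transaction instead of shipping it for counting.
    synchronized void record(String type, String name, long durationMs, boolean success) {
        TransactionMetric m = current
                .computeIfAbsent(type, t -> new HashMap<>())
                .computeIfAbsent(name, n -> new TransactionMetric(type, name));
        m.count++;
        m.sumDurationMs += durationMs;
        if (!success) {
            m.failCount++;
        }
    }

    // Called by a timer: returns everything accumulated so far as one batch
    // and starts a fresh interval.
    synchronized List<TransactionMetric> flush() {
        List<TransactionMetric> batch = new ArrayList<>();
        for (Map<String, TransactionMetric> byName : current.values()) {
            batch.addAll(byName.values());
        }
        current = new HashMap<>();
        return batch;
    }
}
```

However many transactions run during an interval, the server receives one entry per distinct type/name pair per flush.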

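26. Results

27. Results

28. Impact on the client - Memory: under 10 MB - CPU: under 0.1%

29. Summary - Simple, rarely changing logic is a candidate for client-side computation - Client-side computation is less complex than server-side computation - Server-side load then depends only on the number of clients and the reporting interval

30. #3 Report double buffering

31. Problem encountered - Data is lost during the first few minutes of every hour

32. CAT server memory usage - Data arriving from the network - The current hour's Report - Traffic is steady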

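33. Lifecycle of a CAT Report - The current hour's Report stays resident in memory - At each hour boundary a new Report is created and the previous hour's Report is persisted to storage - [Diagram: a Tx/Event Report is a tree of nested Map<String, Map>: AppId -> IPs/IP -> Types/Type -> Names/Name -> Minutes/Minute -> Metrics] - Lower-level Maps are created continuously, and each Map keeps resizing - Objects move from the Young generation to the Old generation: many useless Young GCs, and the Old generation keeps growing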

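34. Analysis - Problems: constant resizing; lower-level Maps created continuously; useless Young GCs; Old generation keeps growing - Fixes: clone the Report for use in the next hour; keep two Reports in memory and use them in rotation - [Diagram: Young and Old generations. Before: Report_Current plus Report_14:00, Report_15:00, Report_16:00, Report_17:00. After: only Report_Current and Report_LastHour]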
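
The rotation idea can be sketched as two report objects that swap roles at the hour boundary, with the older one cleared in place so its maps and arrays are reused rather than reallocated. This is an illustration under assumptions, not CAT's code: `HourlyReport` is flattened to a single name-to-minute-counters map, and `DoubleBufferedReport` and its methods are invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a flattened stand-in for CAT's nested report maps.
final class HourlyReport {
    final Map<String, long[]> countsByName = new HashMap<>();

    void add(String name, int minute) {
        countsByName.computeIfAbsent(name, k -> new long[60])[minute]++;
    }

    long total(String name) {
        long[] perMinute = countsByName.get(name);
        if (perMinute == null) {
            return 0;
        }
        long sum = 0;
        for (long c : perMinute) {
            sum += c;
        }
        return sum;
    }

    // Zero the counters but keep the keys, the arrays, and the map's table,
    // so the next hour starts with an already-sized structure.
    void reset() {
        for (long[] perMinute : countsByName.values()) {
            Arrays.fill(perMinute, 0L);
        }
    }
}

final class DoubleBufferedReport {
    private HourlyReport current = new HourlyReport();
    private HourlyReport lastHour = new HourlyReport();

    HourlyReport current() {
        return current;
    }

    // Called at the hour boundary. Returns the finished report, which must be
    // persisted before the next rotate() clears it for reuse.
    HourlyReport rotate() {
        HourlyReport finished = current;
        lastHour.reset();   // two hours old by now, assumed already persisted
        current = lastHour;
        lastHour = finished;
        return finished;
    }
}
```

Both objects stay alive in the Old generation for the life of the process, so hourly rollover stops producing a fresh tree of maps to be filled, resized, and promoted.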

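35. Results - Before: 20 Full GCs per day - After: 3 Full GCs per day

36. Summary - For GC problems: - Allocate as little memory as possible - Ask whether memory can be reused

37. #4 Strings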

39. MessageTree transport - [Diagram: the client serializes a MessageTree (Transaction 1, Event 1, Transaction 2, Transaction 3, Event 2, ...) into a byte[]; the server deserializes it, calling new String(byte[], start, len) for the type, name, status, and data of every Transaction/Event, then passes the MessageTree to Report1 Analyzer, Report2 Analyzer, ...]

40. byte[] -> String - One call to new String(byte[], start, len, "UTF-8") costs: - 2 char[] allocations - 1 charset decode - It consumes both memory and CPU

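41. Is byte[] -> String necessary? - type/name - status: most Reports only care whether the status is success, so special-case success and decode other statuses lazily, on demand - data: not every Report needs it, so decode it lazily, on demand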
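
The "special-case success, decode the rest lazily" idea for the status field might be sketched as below. Two assumptions are baked in and should be checked against the real wire format: that a success status is the single ASCII byte '0', and that the field is available as a range in the received buffer. The class `LazyStatus` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: assumes the success status is the single ASCII byte '0'.
final class LazyStatus {
    private static final byte SUCCESS_BYTE = (byte) '0';

    private final byte[] buf;
    private final int start;
    private final int len;
    private String decoded; // created only if some report asks for the text

    LazyStatus(byte[] buf, int start, int len) {
        this.buf = buf;
        this.start = start;
        this.len = len;
    }

    // The common question, answered by comparing one byte: no allocation, no decode.
    boolean isSuccess() {
        return len == 1 && buf[start] == SUCCESS_BYTE;
    }

    // The rare path: decode once, on first use, and cache the result.
    String asString() {
        if (decoded == null) {
            decoded = new String(buf, start, len, StandardCharsets.UTF_8);
        }
        return decoded;
    }
}
```

The same pattern applies to the data field: keep the offsets, and build the String only for the reports that read it. The view is only valid while the underlying buffer is not reused.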

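42. Do type/name need byte[] -> String? - [Diagram: server side. The MessageTree arrives as a byte[] holding AppId, IP, Type, Name, Metrics, ... (for example AppId/IP, TX A, TX A.1, TX A.2, TX B). Each field costs 2 x new char[] and 1 x cd.decode(bb, cb, true) to become a String, which is used once for Map.get on the Report's nested Map<String, Map> (AppId -> IPs/IP -> Types/Type -> Names/Name -> Minutes/Minute -> Metrics) and then discarded]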

43. Do type/name need byte[] -> String? - public class BytesWrapper { byte[] bytes; int start; int len; } ? - [Diagram: server side. Instead of new char[len] and cd.decode(bb, cb, true) to build a String, wrap the byte range in a BytesWrapper and use it as the key for Map.get; the Report's Map<String, Map> becomes Map<BytesWrapper, Map> (AppId -> IPs/IP -> Types/Type -> Names/Name -> Minutes/Minute -> Metrics); the wrapper is still discarded after the lookup]
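
For a `BytesWrapper` to work as a `HashMap` key it needs content-based `equals` and `hashCode` over its byte range. The slide shows only the three fields, so everything beyond them here is an assumption for illustration, including the `copy()` method used when a wrapper is stored as a long-lived key.

```java
import java.util.Arrays;

// Hypothetical sketch: only the three fields come from the slide.
final class BytesWrapper {
    final byte[] bytes;
    final int start;
    final int len;

    BytesWrapper(byte[] bytes, int start, int len) {
        this.bytes = bytes;
        this.start = start;
        this.len = len;
    }

    // A key stored in the map must own its bytes, because the network
    // buffer it was viewing will be reused for later messages.
    BytesWrapper copy() {
        return new BytesWrapper(Arrays.copyOfRange(bytes, start, start + len), 0, len);
    }

    @Override
    public int hashCode() {
        int h = 1;
        for (int i = start; i < start + len; i++) {
            h = 31 * h + bytes[i];
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof BytesWrapper)) {
            return false;
        }
        BytesWrapper other = (BytesWrapper) o;
        if (len != other.len) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (bytes[start + i] != other.bytes[other.start + i]) {
                return false;
            }
        }
        return true;
    }
}
```

A lookup then wraps a range of the incoming buffer without decoding it, but it still allocates one short-lived wrapper object per lookup.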

44. Further thoughts - public class BytesWrapper { byte[] bytes; int start; int len; } - Object layout of BytesWrapper:
    SIZE  TYPE    DESCRIPTION
       4          (object header)
       4          (object header)
       4          (object header)
       4          (object header)
       4  int     BytesWrapper.start
       4  int     BytesWrapper.len
       4  byte[]  BytesWrapper.bytes
       4          (object alignment)
    Instance size: 32 bytes
  - [Diagram: server side. The byte[] (AppId, IP, Type, Name, Metrics, ...) is wrapped in a BytesWrapper for Map.get on Map<BytesWrapper, Map> (AppId -> IPs/IP -> Types/Type -> Names/Name -> Minutes/Minute -> Metrics), and the wrapper is then discarded] - Map.get(Object key) forces a key object per lookup; the goal is get(byte[], int start, int len)

45. Further thoughts - public class BytesHashMap<V> implements Map<byte[], V> { public V get(byte[] bytes, int start, int len) } - [Diagram: server side. The byte[] (AppId, IP, Type, Name, Metrics, ...) is passed straight to BytesHashMap.get; the Report's Map<String, Map> becomes BytesHashMap<Map> (AppId -> IPs/IP -> Types/Type -> Names/Name -> Minutes/Minute -> Metrics)] - Lookups compare directly against the byte array that arrived from the network - No String needs to be created - Nothing to discard
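
A map with the `get(byte[], int start, int len)` signature from this slide can be sketched as a small separate-chaining hash table. This is not CAT's `BytesHashMap`: it deliberately does not implement `java.util.Map`, and the `put` signature, hash function, and resize policy are assumptions chosen to keep the example short.

```java
import java.util.Arrays;

// Hypothetical sketch: only the get(byte[], int, int) signature comes from the slide.
final class BytesHashMap<V> {
    private static final class Entry<V> {
        final byte[] key; // private copy, owned by the map
        final int hash;
        V value;
        Entry<V> next;

        Entry(byte[] key, int hash, V value, Entry<V> next) {
            this.key = key;
            this.hash = hash;
            this.value = value;
            this.next = next;
        }
    }

    private Entry<V>[] table;
    private int size;

    @SuppressWarnings("unchecked")
    BytesHashMap() {
        table = (Entry<V>[]) new Entry[16];
    }

    private static int hash(byte[] b, int start, int len) {
        int h = 1;
        for (int i = start; i < start + len; i++) {
            h = 31 * h + b[i];
        }
        return h ^ (h >>> 16);
    }

    private static boolean matches(Entry<?> e, int hash, byte[] b, int start, int len) {
        if (e.hash != hash || e.key.length != len) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (e.key[i] != b[start + i]) {
                return false;
            }
        }
        return true;
    }

    // Lookup straight from the network buffer: no String, no wrapper object.
    public V get(byte[] bytes, int start, int len) {
        int h = hash(bytes, start, len);
        for (Entry<V> e = table[h & (table.length - 1)]; e != null; e = e.next) {
            if (matches(e, h, bytes, start, len)) {
                return e.value;
            }
        }
        return null;
    }

    // Copies the key bytes only when a new entry is created.
    public V put(byte[] bytes, int start, int len, V value) {
        int h = hash(bytes, start, len);
        int idx = h & (table.length - 1);
        for (Entry<V> e = table[idx]; e != null; e = e.next) {
            if (matches(e, h, bytes, start, len)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        table[idx] = new Entry<>(Arrays.copyOfRange(bytes, start, start + len), h, value, table[idx]);
        if (++size > table.length * 3 / 4) {
            resize();
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        Entry<V>[] old = table;
        table = (Entry<V>[]) new Entry[old.length * 2];
        for (Entry<V> head : old) {
            while (head != null) {
                Entry<V> next = head.next;
                int idx = head.hash & (table.length - 1);
                head.next = table[idx];
                table[idx] = head;
                head = next;
            }
        }
    }

    public int size() {
        return size;
    }
}
```

On the hot path, where the key already exists, a lookup allocates nothing; bytes are copied only the first time a key is inserted.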

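46. Is byte[] -> String necessary? - type/name: no, once the Reports use BytesHashMap - status: most Reports only care whether the status is success, so special-case success and decode other statuses lazily, on demand - data: not every Report needs it, so decode it lazily, on demand - Result: Young GCs reduced by 40%

47. Summary - Watch the creation of objects that are used in large numbers - new String(byte[], start, len) is expensive - Ask whether the byte[] can be used directly

48. Agenda - CAT at Ctrip - CAT performance optimization cases - Summary and reflections

49. Optimization approach: CPU - Reduce overhead - Optimize the thread model: the cost of context switches and lock contention - Reduce unnecessary work - Construct fewer Strings: avoid unnecessary decoding - Move server-side computation to the client

50. Optimization approach: GC - Reduce unnecessary object creation - Create fewer Strings; use the byte[] that already exists - Reduce repeated creation, filling, and resizing of Reports - Reuse memory - Double-buffer and reuse Report memory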
