1. BIGO's Performance Optimization Practice in High-throughput Catch-up Read Scenarios Based on Pulsar
Zhanpeng Wu
2. Contents
• Background
• Measurement Study
• Read-ahead (RA) System Design
• Read Acceleration Architecture
• Evaluation
• Conclusion
3. Background
Difference between TAILING read & CATCHUP read?
What do our catchup read scenarios look like?
Why does catchup read hurt system performance?
4. Tailing Read & Catchup Read
5. Data in Specified Time Range
6. Performance Comparison
7. Performance Loss in Catchup Read
8. Measurement Study
How to build a performance monitoring system?
What is the most time-consuming stage in a read request?
9. Dataflow under Multi-layer Cache
10. Measurement Metrics
BP-44: https://github.com/apache/bookkeeper/issues/2834
11. Results
12. System Design
Why do we need a whole new read-ahead system?
What should an asynchronous read-ahead system look like under ideal conditions?
13. Current Read-ahead Mechanism
14. Principles in Read-ahead Mode
• When should the read-ahead be triggered?
• Sequential read behavior should trigger read-ahead;
• Reading only a single, isolated entry from disk should not trigger read-ahead;
• When should read-ahead locations be recorded?
• When all levels of cache fail to hit the target entry, a disk read must be triggered. Before returning the entry, put the position of `entry+1` into the `pending_ra_map`;
• When the asynchronous read-ahead task completes, put the `pre_ra_pos` position (by default the position of the entry at the 75th percentile of the read-ahead entry list) into the `pending_ra_map`;
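The recording rules above can be sketched as follows. This is a minimal illustration, not the actual BookKeeper implementation: the class, the packed key, and the 75th-percentile arithmetic are all assumptions made for clarity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of when read-ahead trigger positions are
// recorded in pending_ra_map (all names are hypothetical).
public class PendingRaMap {
    // Maps a packed (ledgerId, entryId) key to a planned read-ahead start.
    private final Map<Long, Long> pendingRaMap = new ConcurrentHashMap<>();

    private static long key(long ledgerId, long entryId) {
        return (ledgerId << 32) | (entryId & 0xffffffffL);
    }

    // Rule 1: every level of cache missed and a disk read was issued.
    // Mark entry+1 so a sequential follow-up read triggers read-ahead,
    // while a one-off random read never does.
    public void onCacheMissDiskRead(long ledgerId, long entryId) {
        pendingRaMap.put(key(ledgerId, entryId + 1), entryId + 1);
    }

    // Rule 2: an async read-ahead task finished. Record pre_ra_pos,
    // by default the entry at the 75th percentile of the window just
    // read, so the next window is submitted before this one drains.
    public void onReadAheadCompleted(long ledgerId, long firstEntry, int windowSize) {
        long preRaPos = firstEntry + (long) (windowSize * 0.75);
        pendingRaMap.put(key(ledgerId, preRaPos), preRaPos);
    }

    // A read for this entry should submit a read-ahead task iff the
    // position was previously recorded; remove it so it fires once.
    public boolean shouldTriggerReadAhead(long ledgerId, long entryId) {
        return pendingRaMap.remove(key(ledgerId, entryId)) != null;
    }
}
```

The single-shot `remove` semantics mirror the principle that one recorded position triggers exactly one read-ahead submission.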
15. Principles in Read-ahead Mode
• When should the read-ahead task actually be submitted?
• When the target entry exists in `pending_ra_map`, the read-ahead task is submitted asynchronously in the background.
• How does the read-ahead window change?
• Currently it is a fixed size, and the relevant parameters are configurable.
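A minimal sketch of the submission path, assuming a fixed, configurable window: the foreground read returns immediately while a background executor loads the window. The class name, executor sizing, and the `readIntoCache` helper are illustrative, not the real implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of asynchronous read-ahead submission with a fixed window
// (hypothetical names; the real system's window is configurable).
public class ReadAheadSubmitter {
    private final ExecutorService raExecutor = Executors.newFixedThreadPool(2);
    private final int windowSize; // fixed size per task

    public ReadAheadSubmitter(int windowSize) {
        this.windowSize = windowSize;
    }

    // Called when the target entry was found in pending_ra_map: submit
    // the whole window in the background and return a handle to it.
    public Future<Long> submit(long ledgerId, long startEntry) {
        return raExecutor.submit(() -> {
            long last = startEntry;
            for (long e = startEntry; e < startEntry + windowSize; e++) {
                readIntoCache(ledgerId, e); // assumed: disk read -> ReadCache
                last = e;
            }
            return last; // last entry loaded by this window
        });
    }

    // Stand-in for the real disk-to-ReadCache load.
    void readIntoCache(long ledgerId, long entryId) { }

    public void shutdown() {
        raExecutor.shutdown();
    }
}
```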
16. Principles in Read-ahead Mode
• How to read an entry whose corresponding read-ahead task has not yet completed?
• If the target entry belongs to an uncompleted read-ahead task, the read blocks until the task completes, then returns the entry data.
• Sub-question: what is the granularity of blocking?
• At present, the blocking granularity is the window of a read-ahead task, not each individual entry, so as to avoid creating too many locks.
• Where is the read-ahead data stored?
• The data generated by the read-ahead task is stored in `org.apache.bookkeeper.bookie.storage.ldb.ReadCache`. The existing cache structure is reused, and the data lives in off-heap space.
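The per-window blocking described above can be sketched with one future per in-flight window, so a reader waits on the window rather than on a per-entry lock. All names here are illustrative assumptions, not the BP-49 implementation.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of window-granularity blocking: one CompletableFuture per
// in-flight read-ahead window instead of one lock per entry.
public class WindowBlocking {
    private final int windowSize;
    // window start entry -> future completed once the window is cached
    private final Map<Long, CompletableFuture<Void>> inFlight = new ConcurrentHashMap<>();

    public WindowBlocking(int windowSize) {
        this.windowSize = windowSize;
    }

    private long windowStart(long entryId) {
        return entryId - (entryId % windowSize);
    }

    // The read-ahead task registers its window before reading disk...
    public CompletableFuture<Void> beginWindow(long startEntry) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        inFlight.put(startEntry, f);
        return f;
    }

    // ...and completes it once every entry landed in the ReadCache.
    public void completeWindow(long startEntry) {
        CompletableFuture<Void> f = inFlight.remove(startEntry);
        if (f != null) {
            f.complete(null);
        }
    }

    // A reader hitting an entry of an uncompleted window blocks here,
    // then retries the ReadCache, which now holds the entry.
    public void awaitIfInFlight(long entryId) throws Exception {
        CompletableFuture<Void> f = inFlight.get(windowStart(entryId));
        if (f != null) {
            f.get();
        }
    }
}
```

With this shape the number of synchronization objects is bounded by the number of concurrent read-ahead tasks, matching the "avoid too many locks" principle.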
17. Acceleration Architecture
Detailed implementation of the read-ahead system
18. Overview
BP-49: https://github.com/apache/bookkeeper/issues/3085
19. Dataflow in Async Read-ahead
20. Detailed Implementation
21. Detailed Implementation
22. Detailed Implementation
23. Evaluation
Evaluation metrics design
The actual effects shown by the evaluation results
24. Metrics
• Summary
• read-ahead total time
• read-ahead async queue time
• read-ahead async execution time
• read entry blocking time
• Counter
• hit ReadCache count
• miss ReadCache count
• read-ahead entries count
• read-ahead bytes count
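The counters above can be sketched with plain `LongAdder`s wrapped around the read path; in the real system they would be exported through the bookie's metrics framework, and the summaries (total / queue / execution / blocking time) would be latency histograms rather than running totals. This class and its method names are illustrative only.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch of the evaluation counters (hypothetical names).
public class ReadAheadStats {
    final LongAdder hitReadCache = new LongAdder();
    final LongAdder missReadCache = new LongAdder();
    final LongAdder readAheadEntries = new LongAdder();
    final LongAdder readAheadBytes = new LongAdder();
    // Summaries would be histograms in practice; a total suffices here.
    final LongAdder readAheadTotalNanos = new LongAdder();

    // Bump hit or miss on every ReadCache lookup.
    void recordCacheLookup(boolean hit) {
        (hit ? hitReadCache : missReadCache).increment();
    }

    // Account one completed read-ahead task.
    void recordReadAhead(int entries, long bytes, long nanos) {
        readAheadEntries.add(entries);
        readAheadBytes.add(bytes);
        readAheadTotalNanos.add(nanos);
    }
}
```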
25. Evaluation Results
test-env: Hit / Miss Count (~50MB/s/node writes)
26. Evaluation Results
test-env: Cluster/Bookie-wide P99 & AVG Read Time (~50MB/s/node writes)
27. Evaluation Results
prod-env: Bookie-wide P50 & P99 & AVG Read Time (2~3GB/s/cluster reads)
28. Evaluation Results
prod-env: Bookie-wide P50 & P99 & AVG Read Time | SSD rocksdb
29. Conclusion
A brief summary of our optimizations
30. Conclusion
• This talk proposes a new asynchronous read-ahead system that effectively improves the efficiency of catchup reads.
• The work adds many performance metrics for the read-ahead system to the original monitoring system, laying a solid foundation for performance analysis of read latency.
• The new system has been running stably within BIGO for several months and fully serves the machine learning platform; the training jobs on it have achieved lower read latency.
31. Thanks