Apache Flink in Practice at Netease
1. How Netease Uses Apache Flink
2. Contents
01 The evolution of business and scale
02 Flink platformization
03 Case study
04 Future development and thinking
3. 01 The evolution of business and scale
4. The evolution of stream computing in Netease
Milestones: 2012 · 2017.2 · 2019.1 · 2019.7
5. The business scale based on stream computing
1000+ tasks
20,000+ vcores
80+ TB memory
6. Business scenarios
Ad
Data analysis
Risk management
ETL
Big screen
Monitoring
Recommendation
Live streaming
7. 02 Flink platformization
8. The evolution of platform architecture – Sloth 0.x
Sloth proxy routes requests to per-version servers
Self-developed codegen
Hard to operate and maintain
Deployment: one server + monitor pair per version (Sloth 0.9 on Flink 1.5, Sloth 0.8 on Flink 1.3, lower Sloth versions on lower Flink versions), each submitting to the Yarn clusters (Bingjiang, Jiande, Dongguan)
9. The evolution of platform architecture – Sloth 1.0
Pluggable support for different Flink versions
Parent and child process architecture: the server (with scheduler and monitor) launches one kernel subprocess per engine version (Flink 1.5, Flink 1.7, Blink, Sloth 0.9)
Easy to operate and maintain
Supports Jar task development
Supports Blink SQL development
Monitoring via Grafana
Deployment targets: Yarn clusters (Bingjiang, Jiande, Dongguan)
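To make the parent/child process model concrete, here is a toy sketch of a server launching a version-specific kernel subprocess; the jar naming and directory layout are invented for illustration, not the platform's actual layout:

```java
import java.io.File;
import java.io.IOException;

// A toy sketch of the parent/child process architecture: the server process
// launches one kernel subprocess per engine version. Paths are hypothetical.
public class KernelLauncher {
    public static Process launch(String version) throws IOException {
        return new ProcessBuilder("java", "-jar", "sloth-kernel-" + version + ".jar")
                .directory(new File("/opt/sloth/kernels/" + version))
                .redirectErrorStream(true)   // merge stderr into stdout for log shipping
                .start();
    }
}
```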
10. The evolution of platform architecture – Sloth 2.0
Exposes the platform interface to different webs (sloth-web, magina-web) through nginx
Supports distributed deployment of servers for high availability of the service
Each Sloth server hosts the engine kernels (Flink 1.5.0, Flink 1.7.2, Flink 1.9.1, Blink) and submits to the Yarn clusters (Bingjiang, Jiande, Dongguan)
11. The modules of the platform
Web entry: sloth-web and magina-web behind nginx
Dependencies: HDFS, ZooKeeper, TSDB, Kafka, ES
12. Event management
An event covers the two operations of task start and stop, which are completed by three modules: Server, Kernel, and Admin.
Server: the initiator of event execution; accepts the event request, performs data verification and assembly, and sends the event to the Kernel for execution.
Kernel: the executor of the event's specific logic; sends instructions to the cluster according to the request (shell-script mode).
Admin: the confirmer of the event execution result; obtains the final result of the event according to the event type, to ensure the correctness of the result.
Flow: 1. create node → 2. watch node → 3. send message → 4. return result → 5. update starting → 6. action result → 7. write data → 8. get data → 9. update running/stopped → 10. delete node (coordinated through ZooKeeper and the DB).
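As a rough illustration of the node-based handoff, the sketch below uses Apache Curator to create, watch, and delete an event node in ZooKeeper; the /sloth/events path, the payload, and the watcher wiring are assumptions for illustration, not the platform's actual layout:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.api.CuratorWatcher;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class EventNodeSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        String events = "/sloth/events";                       // hypothetical path layout
        if (zk.checkExists().forPath(events) == null) {
            zk.create().creatingParentsIfNeeded().forPath(events);
        }

        // Admin side (step 2): watch the events path to pick up new event nodes.
        zk.getChildren().usingWatcher((CuratorWatcher) e ->
                System.out.println("event node change: " + e.getPath())
        ).forPath(events);

        // Server side (step 1): create a node to kick off a start event.
        zk.create().forPath(events + "/task-42-start",
                "{\"op\":\"start\",\"taskId\":42}".getBytes());

        // ... steps 3-9: Kernel executes, Server and Admin update the DB ...

        // Admin side (step 10): delete the node once the result is confirmed.
        zk.delete().forPath(events + "/task-42-start");
        zk.close();
    }
}
```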
13. Kernel scheduler
On each call, sloth-server checks the kernelMap and puts a new kernel entry into it when one is missing.
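A minimal sketch of that check/put pattern, assuming a map keyed by engine version; the KernelProcess type and startKernel helper are hypothetical stand-ins:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class KernelScheduler {
    private final Map<String, KernelProcess> kernelMap = new ConcurrentHashMap<>();

    public KernelProcess kernelFor(String flinkVersion) {
        // check the map; start and put a kernel entry only if one is missing
        return kernelMap.computeIfAbsent(flinkVersion, this::startKernel);
    }

    private KernelProcess startKernel(String flinkVersion) {
        System.out.println("starting kernel subprocess for Flink " + flinkVersion);
        return new KernelProcess(flinkVersion);
    }

    static class KernelProcess {
        final String version;
        KernelProcess(String version) { this.version = version; }
    }
}
```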
14. The statechart of the platform
15. Task development
Supports task debugging
Supports tab pages for tasks
Supports syntax checking
Supports tags for tasks
Supports metadata management
Supports user resource file management
Supports task replication
16. Blink SQL
Expanded and improved support for dim table joins and sinks.
Source: Kafka
Dim tables: Redis, HBase
Sinks: HDFS, Kafka, MySQL, Redis, HBase, Kudu, ES, Oracle, TSDB
Compute platform: the sloth server drives the Blink kernel, which submits jobs to the Yarn cluster.
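For flavor, a minimal sketch of what a dim table join looks like in Blink-style SQL, embedded the way a SQL task would carry it; the orders and user_dim tables, their columns, and the surrounding DDL are hypothetical:

```java
// Blink-style dim table (temporal) join. Names are hypothetical; DDL omitted.
public class DimJoinSql {
    static final String ENRICH_QUERY =
        "SELECT o.order_id, o.amount, u.city " +
        "FROM orders AS o " +                                    // Kafka source table
        "JOIN user_dim FOR SYSTEM_TIME AS OF o.proctime AS u " + // dim table, e.g. HBase/Redis
        "ON o.user_id = u.user_id";
}
```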
17. SQL task debug
SQL tasks support debugging by uploading different CSV files as inputs, one per source table and dim table.
Debug execution uses an assigned kernel; the sloth server is responsible for packaging the request, invoking the kernel, returning results, and retrieving logs.
Step 1: the web clicks the debug button → sloth-server asks the permanent sloth-kernel for the debug tables → returns the source and dim tables.
Step 2: the web uploads the CSV files and starts the run → sloth-server sends the debug object to a temporary sloth-kernel → returns the debug result.
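As a sketch of the idea, an uploaded CSV can be registered in place of the production source so the task's SQL runs unchanged; the path, schema, and table name below are hypothetical:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.sources.CsvTableSource;

public class DebugSourceSketch {
    public static void main(String[] args) {
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .useBlinkPlanner().inStreamingMode().build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Swap the production Kafka source for an uploaded CSV file.
        CsvTableSource debugSource = CsvTableSource.builder()
                .path("/tmp/sloth-debug/orders.csv")   // hypothetical upload location
                .field("order_id", Types.STRING)
                .field("amount", Types.DOUBLE)
                .build();
        tEnv.registerTableSource("orders", debugSource);

        // The task's SQL then runs unchanged against the debug table.
        tEnv.sqlQuery("SELECT order_id, amount FROM orders").printSchema();
    }
}
```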
18. Log retrieval
Filebeat on the Yarn cluster nodes ships task logs to the Kafka cluster; the Logstash cluster consumes Kafka (input → filter → output) and writes to the ES cluster (ES node0, node2, node3); logs are then searchable from Kibana, the sloth server, and sloth web.
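A rough sketch of how a log lookup against ES might look from the server side, using the Elasticsearch high-level REST client; the index name and field names are assumptions:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class TaskLogSearchSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("es-node0", 9200, "http")))) {
            SearchRequest req = new SearchRequest("sloth-task-logs-*"); // hypothetical index
            req.source(new SearchSourceBuilder().query(
                    QueryBuilders.boolQuery()
                            .must(QueryBuilders.termQuery("taskId", "task-42"))
                            .must(QueryBuilders.matchQuery("message", "exception"))));
            SearchResponse resp = es.search(req, RequestOptions.DEFAULT);
            resp.getHits().forEach(hit -> System.out.println(hit.getSourceAsString()));
        }
    }
}
```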
19. Monitor
Metrics are reported with the InfluxDB metric reporter.
Storage uses ntsdb, a self-developed time series database (a node1/node2/node3 cluster).
Pipeline: Flink job → InfluxDB metric reporter → ntsdb cluster → Grafana.
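For reference, wiring a job up to an InfluxDB-compatible endpoint with Flink's InfluxDB reporter could look like the following; the ntsdb host and database name are assumptions, while the metrics.reporter.* keys are Flink's standard reporter options:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MetricsReporterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Flink's InfluxDB reporter (flink-metrics-influxdb on the classpath);
        // ntsdb is assumed to speak the InfluxDB protocol here.
        conf.setString("metrics.reporter.influxdb.class",
                "org.apache.flink.metrics.influxdb.InfluxdbReporter");
        conf.setString("metrics.reporter.influxdb.host", "ntsdb-node1"); // hypothetical host
        conf.setString("metrics.reporter.influxdb.port", "8086");
        conf.setString("metrics.reporter.influxdb.db", "flink_metrics"); // hypothetical db

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);
        env.fromElements(1, 2, 3).print();
        env.execute("metrics sketch");
    }
}
```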
20. Monitor
21. Alarm
Users can define their own alarm rules.
Multiple alarm styles are supported.
22. 03 Case study
23. Realtime data synchronization
AI smart-talk service: the client enters knowledge data in the frontend, it is processed in real time by Sloth, and finally written into ES for user queries.
Pipeline: web input → MySQL → Binlog → NDC → Kafka → Sloth → ES → web query.
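A compressed sketch of the Sloth leg of that pipeline, assuming the binlog change records arrive on a Kafka topic as strings; the topic, broker address, and the print stand-in for the ES sink are illustrative only:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class BinlogToEsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // hypothetical broker
        props.setProperty("group.id", "sloth-knowledge-sync");

        env.addSource(new FlinkKafkaConsumer<>(
                        "ndc-binlog-knowledge",                 // hypothetical topic
                        new SimpleStringSchema(), props))
           .map(record -> record.trim())                        // stand-in for real cleaning
           .print();                                            // stand-in for the ES sink
        env.execute("binlog to es sketch");
    }
}
```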
24. Realtime warehouse
Application layer: live stream, realtime recommendation, operational analysis
Service layer
Storage layer: Kudu, Redis, MySQL, HBase, Kafka
Compute layer: Sloth (based on Flink), Kafka
Access layer: NDC (data transfer), Datastream (log crawler)
25. E-commerce application – data analysis
Business scenario: real-time activity analysis, home page resource analysis, flow funnel, real-time gross profit calculation.
Brief logic: collect the user's access logs from Hubble and push them to Kafka; use Sloth to clean out the detail layer and write it back to Kafka; then use a Sloth task to associate dimensions and write the result to Kudu in real time. The data landing in the Kudu table serves two purposes: real-time queries for business parties and analysts, and query-and-summarize jobs over the real-time Kudu table that feed data applications.
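As an illustration of the cleaning step, the detail layer can be produced by a SQL task of roughly this shape; the table and column names are hypothetical and the DDL is omitted:

```java
// Detail-layer cleaning as a SQL task: raw Hubble access logs in,
// cleaned detail records out to another Kafka table. Names are hypothetical.
public class DetailLayerSql {
    static final String CLEAN_QUERY =
        "INSERT INTO dwd_access_log " +         // detail-layer Kafka table
        "SELECT user_id, page_id, event_time " +
        "FROM ods_access_log " +                // raw access-log topic
        "WHERE user_id IS NOT NULL";
}
```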
26. E-commerce application – search and recommendation
Business scenario: real-time user footprints, real-time user features, real-time product features, real-time CTR/CVR sample generation, and real-time UV/PV statistics such as the home page area-A carousel and area-B activity selection.
Brief logic: use Sloth to read the application logs (exposure, click, add-to-cart, purchase), clean the data and split dimensions, and write into Kafka; then use Sloth to read the Kafka data, compute real-time PV and UV of multi-dimensional features over 5 min, 30 min, and 1 h windows, and write them into Redis for online engineering to calculate CTR and CVR and optimize the search and recommendation results.
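A toy sketch of the windowed statistics step: PV per feature key over a 30-minute window sliding every 5 minutes, with print standing in for the Redis sink; the source elements and key names are made up:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PvWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("homepage_a", "area_b", "homepage_a") // stand-in for Kafka clicks
           .map(page -> Tuple2.of(page, 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))     // keep tuple type info
           .keyBy(t -> t.f0)
           .window(SlidingProcessingTimeWindows.of(Time.minutes(30), Time.minutes(5)))
           .sum(1)                                             // PV count per key
           .print();                                           // stand-in for the Redis sink
        env.execute("pv window sketch");
    }
}
```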
27. 04 Future development and thinking
28. Future development and thinking
Support running real-time stream computing tasks on K8s.
Auto-configuration of tasks: the platform automatically configures memory and parallelism according to the business type and traffic, which both guarantees the business SLA and improves the resource utilization of the computing cluster.
Smart diagnosis: stream tasks that use UDFs or are built from code are hard to debug, and task failures exhaust both the business and the platform side. Smart diagnosis points out the problem directly from the various task metrics, reducing the time both sides spend locating it; it should also warn in advance and give tuning advice.
Follow the SQL support in the Flink 1.9+ series and the unification of batch and streaming.
Participate more in the Flink community.
29. THANKS