Apache Flink in Practice at Netease
1. How Netease Uses Apache Flink
2. Contents
01 The evolution of business and scale
02 Flink platformization
03 Case study
04 Future development and thinking
3. 01 The evolution of business and scale
4. The evolution of stream computing in Netease
Milestones: 2012 · 2017.2 · 2019.1 · 2019.7
5. The business scale based on stream computing
1000+ tasks
20,000+ vcores
80+ TB memory
6. Business scenarios
Ad
Data analysis
Risk management
ETL
Big screen
Monitoring
Recommendation
Live streaming
7. 02 Flink platformization
8. The evolution of platform architecture – Sloth 0.x
Sloth proxy routes requests to per-version servers
Self-developed codegen
Hard to operate and maintain
Deployment: one server + monitor pair per version (Sloth 0.9 on Flink 1.5, Sloth 0.8 on Flink 1.3, lower Sloth versions on lower Flink versions), each submitting to the Yarn clusters (Bingjiang, Jiande, Dongguan)
9. The evolution of platform architecture – Sloth 1.0
Pluggable support for different Flink versions
Parent and child process architecture: the server (with scheduler and monitor) launches one kernel subprocess per engine version (Flink 1.5, Flink 1.7, Blink, Sloth 0.9)
Easy to operate and maintain
Supports Jar task development
Supports Blink SQL development
Monitoring via Grafana
Deployment targets: Yarn clusters (Bingjiang, Jiande, Dongguan)
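To make the parent/child process model concrete, here is a toy sketch of a server launching a version-specific kernel subprocess; the jar naming and directory layout are invented for illustration, not the platform's actual layout:

```java
import java.io.File;
import java.io.IOException;

// A toy sketch of the parent/child process architecture: the server process
// launches one kernel subprocess per engine version. Paths are hypothetical.
public class KernelLauncher {
    public static Process launch(String version) throws IOException {
        return new ProcessBuilder("java", "-jar", "sloth-kernel-" + version + ".jar")
                .directory(new File("/opt/sloth/kernels/" + version))
                .redirectErrorStream(true)   // merge stderr into stdout for log shipping
                .start();
    }
}
```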
10. The evolution of platform architecture – Sloth 2.0
Exposes the platform interface to different webs (sloth-web, magina-web) through nginx
Supports distributed deployment of servers for high availability of the service
Each Sloth server hosts the engine kernels (Flink 1.5.0, Flink 1.7.2, Flink 1.9.1, Blink) and submits to the Yarn clusters (Bingjiang, Jiande, Dongguan)
11. The modules of the platform
Web entry: sloth-web and magina-web behind nginx
Dependencies: HDFS, ZooKeeper, TSDB, Kafka, ES
12. Event management
An event covers the two operations of task start and stop, which are completed by three modules: Server, Kernel, and Admin.
Server: the initiator of event execution; accepts the event request, performs data verification and assembly, and sends the event to the Kernel for execution.
Kernel: the executor of the event's specific logic; sends instructions to the cluster according to the request (shell-script mode).
Admin: the confirmer of the event execution result; obtains the final result of the event according to the event type, to ensure the correctness of the result.
Flow: 1. create node → 2. watch node → 3. send message → 4. return result → 5. update starting → 6. action result → 7. write data → 8. get data → 9. update running/stopped → 10. delete node (coordinated through ZooKeeper and the DB).
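As a rough illustration of the node-based handoff, the sketch below uses Apache Curator to create, watch, and delete an event node in ZooKeeper; the /sloth/events path, the payload, and the watcher wiring are assumptions for illustration, not the platform's actual layout:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.api.CuratorWatcher;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class EventNodeSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        String events = "/sloth/events";                       // hypothetical path layout
        if (zk.checkExists().forPath(events) == null) {
            zk.create().creatingParentsIfNeeded().forPath(events);
        }

        // Admin side (step 2): watch the events path to pick up new event nodes.
        zk.getChildren().usingWatcher((CuratorWatcher) e ->
                System.out.println("event node change: " + e.getPath())
        ).forPath(events);

        // Server side (step 1): create a node to kick off a start event.
        zk.create().forPath(events + "/task-42-start",
                "{\"op\":\"start\",\"taskId\":42}".getBytes());

        // ... steps 3-9: Kernel executes, Server and Admin update the DB ...

        // Admin side (step 10): delete the node once the result is confirmed.
        zk.delete().forPath(events + "/task-42-start");
        zk.close();
    }
}
```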
13. Kernel scheduler
On each call, sloth-server checks the kernelMap and puts a new kernel entry into it when one is missing.
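A minimal sketch of that check/put pattern, assuming a map keyed by engine version; the KernelProcess type and startKernel helper are hypothetical stand-ins:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class KernelScheduler {
    private final Map<String, KernelProcess> kernelMap = new ConcurrentHashMap<>();

    public KernelProcess kernelFor(String flinkVersion) {
        // check the map; start and put a kernel entry only if one is missing
        return kernelMap.computeIfAbsent(flinkVersion, this::startKernel);
    }

    private KernelProcess startKernel(String flinkVersion) {
        System.out.println("starting kernel subprocess for Flink " + flinkVersion);
        return new KernelProcess(flinkVersion);
    }

    static class KernelProcess {
        final String version;
        KernelProcess(String version) { this.version = version; }
    }
}
```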
14. The statechart of the platform
15. Task development
Supports task debugging
Supports tab pages for tasks
Supports syntax checking
Supports tags for tasks
Supports metadata management
Supports user resource file management
Supports task replication
16. Blink SQL
Expanded and improved support for dim table joins and sinks.
Source: Kafka
Dim tables: Redis, HBase
Sinks: HDFS, Kafka, MySQL, Redis, HBase, Kudu, ES, Oracle, TSDB
Compute platform: the sloth server drives the Blink kernel, which submits jobs to the Yarn cluster.
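For flavor, a minimal sketch of what a dim table join looks like in Blink-style SQL, embedded the way a SQL task would carry it; the orders and user_dim tables, their columns, and the surrounding DDL are hypothetical:

```java
// Blink-style dim table (temporal) join. Names are hypothetical; DDL omitted.
public class DimJoinSql {
    static final String ENRICH_QUERY =
        "SELECT o.order_id, o.amount, u.city " +
        "FROM orders AS o " +                                    // Kafka source table
        "JOIN user_dim FOR SYSTEM_TIME AS OF o.proctime AS u " + // dim table, e.g. HBase/Redis
        "ON o.user_id = u.user_id";
}
```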
17. SQL task debug
SQL tasks support debugging by uploading different CSV files as inputs, one per source table and dim table.
Debug execution uses an assigned kernel; the sloth server is responsible for packaging the request, invoking the kernel, returning results, and retrieving logs.
Step 1: the web clicks the debug button → sloth-server asks the permanent sloth-kernel for the debug tables → returns the source and dim tables.
Step 2: the web uploads the CSV files and starts the run → sloth-server sends the debug object to a temporary sloth-kernel → returns the debug result.
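As a sketch of the idea, an uploaded CSV can be registered in place of the production source so the task's SQL runs unchanged; the path, schema, and table name below are hypothetical:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.sources.CsvTableSource;

public class DebugSourceSketch {
    public static void main(String[] args) {
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .useBlinkPlanner().inStreamingMode().build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Swap the production Kafka source for an uploaded CSV file.
        CsvTableSource debugSource = CsvTableSource.builder()
                .path("/tmp/sloth-debug/orders.csv")   // hypothetical upload location
                .field("order_id", Types.STRING)
                .field("amount", Types.DOUBLE)
                .build();
        tEnv.registerTableSource("orders", debugSource);

        // The task's SQL then runs unchanged against the debug table.
        tEnv.sqlQuery("SELECT order_id, amount FROM orders").printSchema();
    }
}
```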
18. Log retrieval
Filebeat on the Yarn cluster nodes ships task logs to the Kafka cluster; the Logstash cluster consumes Kafka (input → filter → output) and writes to the ES cluster (ES node0, node2, node3); logs are then searchable from Kibana, the sloth server, and sloth web.
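A rough sketch of how a log lookup against ES might look from the server side, using the Elasticsearch high-level REST client; the index name and field names are assumptions:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class TaskLogSearchSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("es-node0", 9200, "http")))) {
            SearchRequest req = new SearchRequest("sloth-task-logs-*"); // hypothetical index
            req.source(new SearchSourceBuilder().query(
                    QueryBuilders.boolQuery()
                            .must(QueryBuilders.termQuery("taskId", "task-42"))
                            .must(QueryBuilders.matchQuery("message", "exception"))));
            SearchResponse resp = es.search(req, RequestOptions.DEFAULT);
            resp.getHits().forEach(hit -> System.out.println(hit.getSourceAsString()));
        }
    }
}
```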
19. Monitor
Metrics are reported with the InfluxDB metric reporter.
Storage uses ntsdb, a self-developed time series database (a node1/node2/node3 cluster).
Pipeline: Flink job → InfluxDB metric reporter → ntsdb cluster → Grafana.
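For reference, wiring a job up to an InfluxDB-compatible endpoint with Flink's InfluxDB reporter could look like the following; the ntsdb host and database name are assumptions, while the metrics.reporter.* keys are Flink's standard reporter options:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MetricsReporterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Flink's InfluxDB reporter (flink-metrics-influxdb on the classpath);
        // ntsdb is assumed to speak the InfluxDB protocol here.
        conf.setString("metrics.reporter.influxdb.class",
                "org.apache.flink.metrics.influxdb.InfluxdbReporter");
        conf.setString("metrics.reporter.influxdb.host", "ntsdb-node1"); // hypothetical host
        conf.setString("metrics.reporter.influxdb.port", "8086");
        conf.setString("metrics.reporter.influxdb.db", "flink_metrics"); // hypothetical db

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);
        env.fromElements(1, 2, 3).print();
        env.execute("metrics sketch");
    }
}
```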
20. Monitor
21. Alarm
Users can define their own alarm rules.
Multiple alarm styles are supported.
22. 03 Case study
23. Realtime data synchronization
AI smart-talk service: the client enters knowledge data in the frontend, it is processed in real time by Sloth, and finally written into ES for user queries.
Pipeline: web input → MySQL → Binlog → NDC → Kafka → Sloth → ES → web query.
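A compressed sketch of the Sloth leg of that pipeline, assuming the binlog change records arrive on a Kafka topic as strings; the topic, broker address, and the print stand-in for the ES sink are illustrative only:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class BinlogToEsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // hypothetical broker
        props.setProperty("group.id", "sloth-knowledge-sync");

        env.addSource(new FlinkKafkaConsumer<>(
                        "ndc-binlog-knowledge",                 // hypothetical topic
                        new SimpleStringSchema(), props))
           .map(record -> record.trim())                        // stand-in for real cleaning
           .print();                                            // stand-in for the ES sink
        env.execute("binlog to es sketch");
    }
}
```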
24. Realtime warehouse
Application layer: live stream, realtime recommendation, operational analysis
Service layer
Storage layer: Kudu, Redis, MySQL, HBase, Kafka
Compute layer: Sloth (based on Flink), Kafka
Access layer: NDC (data transfer), Datastream (log crawler)
25. E-commerce application – data analysis
Business scenario: real-time activity analysis, home page resource analysis, flow funnel, real-time gross profit calculation.
Brief logic: collect the user's access logs from Hubble and push them to Kafka; use Sloth to clean out the detail layer and write it back to Kafka; then use a Sloth task to associate dimensions and write the result to Kudu in real time. The data landing in the Kudu table serves two purposes: real-time queries for business parties and analysts, and query-and-summarize jobs over the real-time Kudu table that feed data applications.
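As an illustration of the cleaning step, the detail layer can be produced by a SQL task of roughly this shape; the table and column names are hypothetical and the DDL is omitted:

```java
// Detail-layer cleaning as a SQL task: raw Hubble access logs in,
// cleaned detail records out to another Kafka table. Names are hypothetical.
public class DetailLayerSql {
    static final String CLEAN_QUERY =
        "INSERT INTO dwd_access_log " +         // detail-layer Kafka table
        "SELECT user_id, page_id, event_time " +
        "FROM ods_access_log " +                // raw access-log topic
        "WHERE user_id IS NOT NULL";
}
```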
26. E-commerce application – search and recommendation
Business scenario: real-time user footprints, real-time user features, real-time product features, real-time CTR/CVR sample generation, and real-time UV/PV statistics such as the home page area-A carousel and area-B activity selection.
Brief logic: use Sloth to read the application logs (exposure, click, add-to-cart, purchase), clean the data and split dimensions, and write into Kafka; then use Sloth to read the Kafka data, compute real-time PV and UV of multi-dimensional features over 5 min, 30 min, and 1 h windows, and write them into Redis for online engineering to calculate CTR and CVR and optimize the search and recommendation results.
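A toy sketch of the windowed statistics step: PV per feature key over a 30-minute window sliding every 5 minutes, with print standing in for the Redis sink; the source elements and key names are made up:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PvWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("homepage_a", "area_b", "homepage_a") // stand-in for Kafka clicks
           .map(page -> Tuple2.of(page, 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))     // keep tuple type info
           .keyBy(t -> t.f0)
           .window(SlidingProcessingTimeWindows.of(Time.minutes(30), Time.minutes(5)))
           .sum(1)                                             // PV count per key
           .print();                                           // stand-in for the Redis sink
        env.execute("pv window sketch");
    }
}
```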
27. 04 Future development and thinking
28. Future development and thinking
Support running real-time stream computing tasks on K8s.
Auto-configuration of tasks: the platform automatically configures memory and parallelism according to the business type and traffic, which both guarantees the business SLA and improves the resource utilization of the computing cluster.
Smart diagnosis: stream tasks that use UDFs or are built from code are hard to debug, and task failures exhaust both the business and the platform side. Smart diagnosis points out the problem directly from the various task metrics, reducing the time both sides spend locating it; it should also warn in advance and give tuning advice.
Follow the SQL support in the Flink 1.9+ series and the unification of batch and streaming.
Participate more in the Flink community.
29. THANKS