END-TO-END LOAD TESTING AT SCALE

1. END-TO-END LOAD TESTING AT SCALE
OLIWIA ZAREMBA
CONTINUOUS TESTING MEETUP, 10-09-2019

2. WHY LOAD TESTING?

3. BLACK FRIDAY
Source: giphy.com

4. BLACK FRIDAY
● Auto-scaling is not a solution for huge spikes → To handle a huge spike of traffic, all services need to be scaled up beforehand
● Scaling into infinity costs infinite $$$ → The scaling configuration needs to be frugal

5. BLACK FRIDAY
Problem statement: For each service S_i, find the minimum value of the scaling parameter k_i at which the service can handle the expected load L.
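
The problem statement can be read as a search for the smallest k_i that passes a check. Below is a minimal sketch of that search for a single service. It assumes that capacity grows monotonically with the scaling parameter, and `can_handle` is a hypothetical predicate that in practice would correspond to a full load test run, not a cheap function call.

```python
from typing import Callable


def minimal_scaling(can_handle: Callable[[int], bool], k_max: int) -> int:
    """Return the smallest k in [1, k_max] for which can_handle(k) is True.

    Assumes monotonicity: if the service copes at k, it also copes at k + 1.
    """
    if not can_handle(k_max):
        raise ValueError("expected load is not handled even at k_max")
    lo, hi = 1, k_max
    while lo < hi:
        mid = (lo + hi) // 2
        if can_handle(mid):
            hi = mid          # mid is enough, try something smaller
        else:
            lo = mid + 1      # mid is too small
    return lo


if __name__ == "__main__":
    # Toy model with made-up numbers: each instance serves 300 RPS, L = 2000 RPS.
    expected_load = 2000
    print(minimal_scaling(lambda k: k * 300 >= expected_load, k_max=64))
```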

6. WHY END-TO-END?

7. WHY NOT MAKE IT THIS SIMPLE?
curl --get https://www.zalando.de/

8. WHY NOT MAKE IT THIS SIMPLE?
● Real customers perform different actions: browsing, filtering, checking out, …
● These actions are served by many different services

9. MICROSERVICES
Source: https://thenewstack.io/history-service-mesh/

10. END-TO-END APPROACH WITH MICROSERVICES
Definition of the task: Generate a realistic, steadily increasing load to identify bottlenecks in the microservices.

11. AT WHAT SCALE?

12. BLACK FRIDAY 2017 NUMBERS
● 2,000 orders per minute (1,500 in 2016)
● 100,000 new customers
Source: https://corporate.zalando.com/en/newsroom/en/stories/zalando-celebrates-successful-black-friday-2017

13. BLACK FRIDAY 2017 NUMBERS
● 2,000 orders per minute (1,500 in 2016)
● 100,000 new customers
… and expectations for bigger numbers in 2018

14. HOW DO WE WANT TO ACHIEVE IT?

15. SIMULATE REAL USERS

16. SIMULATE REAL USERS

17. SIMULATE REAL USERS - WITH PUPPETEER?

18. SIMULATE REAL USERS - WITH PUPPETEER?

19. SIMULATE REAL USERS - WITH PUPPETEER? $ $

20. SIMULATE REAL USERS… AND MAKE IT CHEAP
Diagram: team → record once → my_browser_session

21. SIMULATE REAL USERS… AND MAKE IT CHEAP
Diagram: my_browser_session → TRANSFORM → load test scenario → replay many times on thin agents (Thin agent, Thin agent, ..., Thin agent)

22. IMPLEMENTATION

23. RECORDING BY THE TEAM + REPLAYING BY THE LOAD TEST RUNNER
Diagram labels: session, load test scenario, load test runner

24. RECORDING BY THE TEAM + REPLAYING BY LOCUST
Diagram labels: session, locustfile.py

25. TRANSFORMER + LOCUST
Diagram labels: session.har, locustfile.py

26. TRANSFORMER + ZELT + LOCUST
Diagram labels: session.har, locustfile.py

27. TRANSFORMER + ZELT + LOCUST
Diagram labels: session.har, locustfile.py

28. TRANSFORMER + ZELT + LOCUST + KUBERNETES
Diagram labels: session.har, locustfile.py

29. CHOOSING THE TECHNOLOGY
Source: locust.io

30. CHOOSING THE TECHNOLOGY
Source: locust.io

31. CHOOSING THE TECHNOLOGY
Source: github.com/locustio/locust


33. HOW IT WORKS: 1. RECORDING THE USER BEHAVIOUR
● HAR - HTTP ARchive
● File extension: .har
● Format: JSON

34. HOW IT WORKS: 1. RECORDING THE USER BEHAVIOUR
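
Because a HAR file is plain JSON, the recorded requests can be inspected with the standard library alone. The sketch below reads a hand-written, heavily trimmed HAR document; the second URL is a made-up placeholder, and a real browser recording carries many more fields per entry.

```python
import json

# Hand-written, trimmed HAR document for illustration only.
SAMPLE_HAR = """
{
  "log": {
    "version": "1.2",
    "entries": [
      {"request": {"method": "GET", "url": "https://www.zalando.de/", "headers": []}},
      {"request": {"method": "GET", "url": "https://www.zalando.de/example-category/", "headers": []}}
    ]
  }
}
"""


def list_requests(har_text):
    """Return (method, url) pairs for every recorded entry, in recording order."""
    har = json.loads(har_text)
    return [(e["request"]["method"], e["request"]["url"]) for e in har["log"]["entries"]]


if __name__ == "__main__":
    for method, url in list_requests(SAMPLE_HAR):
        print(method, url)
```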

35. HOW IT WORKS: 2. TRANSFORMING HAR INTO LOCUSTFILE
my_browser_session.har → locustfile.py
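
To make the idea of the transformation concrete, here is a toy converter that turns HAR entries into locustfile source code. It is not the Transformer implementation and does not reflect its output format; the class name `RecordedUser` is made up, and the generated code targets the current Locust `HttpUser` API, which is newer than this talk.

```python
import json


def har_to_locustfile(har_text: str) -> str:
    """Generate locustfile source that replays every recorded request in order."""
    entries = json.loads(har_text)["log"]["entries"]
    lines = [
        "from locust import HttpUser, task",
        "",
        "",
        "class RecordedUser(HttpUser):",
        "    @task",
        "    def replay(self):",
    ]
    for entry in entries:
        req = entry["request"]
        # repr() yields valid, safely quoted Python string literals.
        lines.append(f"        self.client.request({req['method']!r}, {req['url']!r})")
    if not entries:
        lines.append("        pass")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    har = json.dumps(
        {"log": {"entries": [{"request": {"method": "GET", "url": "https://www.zalando.de/"}}]}}
    )
    print(har_to_locustfile(har))
```

The generated file still needs a base host when it is run, for example through Locust's `--host` option; recorded absolute URLs are then requested as they are.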

36. HOW IT WORKS: 2. TRANSFORMING SCENARIOS WITH WEIGHTS
● browsing_items_scenario.har: 83%
● checkout_scenario.har: 17%

37. HOW IT WORKS: 2. TRANSFORMING SCENARIOS WITH WEIGHTS
browsing_items_scenario.har + checkout_scenario.har → locustfile.py
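
The effect of scenario weights can be illustrated with a weighted random choice: each simulated user picks a scenario with probability proportional to its weight. This is only a sketch of the idea using the 83/17 split from the slide. Locust itself accepts a weight argument on its `@task` decorator; how Transformer maps scenario weights onto Locust is not shown in these slides.

```python
import random

# Weights taken from the slide; the dictionary layout is an assumption.
SCENARIO_WEIGHTS = {
    "browsing_items_scenario": 83,
    "checkout_scenario": 17,
}


def pick_scenario(rng):
    """Pick one scenario name with probability proportional to its weight."""
    names = list(SCENARIO_WEIGHTS)
    weights = [SCENARIO_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [pick_scenario(rng) for _ in range(10_000)]
    share = picks.count("browsing_items_scenario") / len(picks)
    print(f"browsing share: {share:.2%}")
```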

38. HOW IT WORKS: 3. EXECUTING THE END-TO-END LOAD TESTS
Step: Plan and record the scenarios
Input: number of requests per second (RPS) for each major service
Output: HAR files with the scenarios recorded

39. HOW IT WORKS: 3. EXECUTING THE END-TO-END LOAD TESTS
Steps: Plan and record the scenarios; Announce the load test

40. HOW IT WORKS: 3. EXECUTING THE END-TO-END LOAD TESTS
Steps: Plan and record the scenarios; Announce the load test; Execute the test increasing the load

41. HOW IT WORKS: 3. EXECUTING THE END-TO-END LOAD TESTS
Steps: Plan and record the scenarios; Announce the load test; Execute the test increasing the load; Identify the first component to go down

42. HOW IT WORKS: 3. EXECUTING THE END-TO-END LOAD TESTS
Steps: Plan and record the scenarios; Announce the load test; Execute the test increasing the load; Identify the first component to go down; Wait some time until the issue is addressed

43. HOW IT WORKS: 3. EXECUTING THE END-TO-END LOAD TESTS
Steps: Plan and record the scenarios; Announce the load test; Execute the test increasing the load; Identify the first component to go down; Wait some time until the issue is addressed; Share the journal & the next load test date
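
The "execute the test increasing the load" and "identify the first component to go down" steps can be sketched as a simple ramp loop. Everything here is hypothetical: `run_step` stands in for a real load test run plus monitoring, and the component names and capacities in the toy model are invented.

```python
def find_first_bottleneck(load_steps, run_step):
    """Raise the load step by step and stop at the first failure.

    run_step(rps) must return a list of failing component names
    (empty when everything survived), weakest component first.
    Returns (rps, component) or None if no step caused a failure.
    """
    for rps in load_steps:
        failed = run_step(rps)
        if failed:
            return rps, failed[0]
    return None


# Toy stand-in for a real test run: invented per-component capacities in RPS.
CAPACITY = {"catalog": 5000, "checkout": 3000}


def simulated_step(rps):
    """Return components whose capacity is exceeded, weakest first."""
    overloaded = [name for name, cap in CAPACITY.items() if cap < rps]
    return sorted(overloaded, key=CAPACITY.get)


if __name__ == "__main__":
    print(find_first_bottleneck(range(1000, 7000, 1000), simulated_step))
```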

44. SOME OBSTACLES DOWN THE ROAD

45. OBSTACLE 1: SECURITY SYSTEM BLOCKED US

46. OBSTACLE 1: SECURITY SYSTEM BLOCKED US
● An end-to-end load test is, in effect, a well-intentioned DoS attack

47. OBSTACLE 1: SECURITY SYSTEM BLOCKED US
● Solution: mark all requests coming from Zelt so that the security system can easily identify them
● Analytics, machine learning models, A/B tests need to filter out Zelt traffic too!
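
One common way to make synthetic traffic identifiable is a dedicated request header. The slides do not say how Zelt traffic was marked, so the header name and value below are purely hypothetical; the sketch shows both the marking side and the filtering side that analytics or A/B test pipelines would need.

```python
# Hypothetical marker; the real mechanism is not described in the slides.
LOAD_TEST_HEADER = "X-Load-Test"
LOAD_TEST_VALUE = "zelt"


def mark_request(headers):
    """Return a copy of the headers with the load test marker added."""
    marked = dict(headers)
    marked[LOAD_TEST_HEADER] = LOAD_TEST_VALUE
    return marked


def is_load_test(headers):
    """Detect the marker; HTTP header names are compared case-insensitively."""
    return any(
        name.lower() == LOAD_TEST_HEADER.lower() and value == LOAD_TEST_VALUE
        for name, value in headers.items()
    )


if __name__ == "__main__":
    headers = mark_request({"Accept": "text/html"})
    print(headers, is_load_test(headers))
```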

48. OBSTACLE 2: COOKIES RECORDED IN THE HAR FILE ARE NOT VALID WHEN REPLAYING
● Solution: don’t process the cookies as recorded. Instead, let the cookies be set by response headers in the replay mode
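
A minimal sketch of the first half of that solution: strip the recorded cookies from a HAR request entry before it is turned into replay code. The function name is an assumption, not part of Transformer. On the replay side, Locust's HTTP client builds on a `requests` session, which stores cookies received via `Set-Cookie` response headers and sends them on later requests.

```python
import copy


def drop_recorded_cookies(request):
    """Return a copy of a HAR request without its recorded cookies.

    Removes both the Cookie header and the parsed "cookies" list,
    leaving the original recording untouched.
    """
    cleaned = copy.deepcopy(request)
    cleaned["headers"] = [
        header for header in cleaned.get("headers", [])
        if header["name"].lower() != "cookie"
    ]
    cleaned["cookies"] = []
    return cleaned


if __name__ == "__main__":
    recorded = {
        "method": "GET",
        "url": "https://www.zalando.de/",
        "headers": [
            {"name": "Cookie", "value": "sid=abc"},
            {"name": "Accept", "value": "*/*"},
        ],
        "cookies": [{"name": "sid", "value": "abc"}],
    }
    print(drop_recorded_cookies(recorded))
```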

49. OBSTACLE 3: WE CAN’T KEEP USING THE SAME TEST CUSTOMER ACCOUNT
● Solution: override the customer credentials in the registration/login step with test accounts
● Parameterize the scenarios: for each execution, choose a random account from a defined set
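
Parameterizing a login step can be as simple as swapping the recorded credentials for a randomly chosen test account on each execution. The account list and the payload field names (`email`, `password`) are placeholders invented for this sketch.

```python
import random

# Placeholder accounts; a real set would be provisioned for the test.
TEST_ACCOUNTS = [
    {"email": "loadtest-1@example.com", "password": "placeholder-1"},
    {"email": "loadtest-2@example.com", "password": "placeholder-2"},
    {"email": "loadtest-3@example.com", "password": "placeholder-3"},
]


def with_random_account(login_payload, rng=random):
    """Return a copy of the recorded login payload with test credentials."""
    account = rng.choice(TEST_ACCOUNTS)
    payload = dict(login_payload)
    payload["email"] = account["email"]
    payload["password"] = account["password"]
    return payload


if __name__ == "__main__":
    recorded = {"email": "recorded@example.com", "password": "recorded", "remember": "true"}
    print(with_random_account(recorded))
```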

50. OBSTACLE 4: WE ONLY WANT TO TARGET ZALANDO, NOT GOOGLE ANALYTICS ENDPOINTS
● Solution: provide a blacklisting mechanism for automatic filtering of the recorded requests
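
A blacklist of this kind can be sketched as a hostname filter over the recorded URLs. The blacklist entry and the matching rule (exact domain or any subdomain) are assumptions made for illustration; the actual mechanism in Transformer may match differently.

```python
from urllib.parse import urlsplit

# Example entry only; a real blacklist would be configured per project.
BLACKLIST = ["google-analytics.com"]


def is_blacklisted(url, blacklist=BLACKLIST):
    """True if the URL's host equals a blacklisted domain or is a subdomain of it."""
    host = urlsplit(url).hostname or ""
    return any(host == domain or host.endswith("." + domain) for domain in blacklist)


def filter_requests(urls, blacklist=BLACKLIST):
    """Keep only the URLs whose host is not blacklisted."""
    return [url for url in urls if not is_blacklisted(url, blacklist)]


if __name__ == "__main__":
    recorded = [
        "https://www.zalando.de/",
        "https://www.google-analytics.com/collect?v=1",
    ]
    print(filter_requests(recorded))
```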

51. OBSTACLE 5: MORE AND MORE ZALANDO-SPECIFIC MECHANISMS NEED TO BE ADDRESSED
● Solution: introduce a system of plugins for Transformer
● Implement each Zalando-specific solution as a plugin
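
The plugin idea can be illustrated as a chain of small functions, each taking a recorded request and returning a modified one. This is a generic sketch of the pattern, not Transformer's actual plugin interface; the two example plugins and the `name` field are invented.

```python
from typing import Callable, Dict, List

Request = Dict[str, str]
Plugin = Callable[[Request], Request]


def apply_plugins(requests: List[Request], plugins: List[Plugin]) -> List[Request]:
    """Run every request through all plugins, in the order given."""
    result = []
    for request in requests:
        for plugin in plugins:
            request = plugin(request)
        result.append(request)
    return result


def normalize_method(request: Request) -> Request:
    """Example plugin: upper-case the HTTP method."""
    return {**request, "method": request["method"].upper()}


def add_name(request: Request) -> Request:
    """Example plugin: attach a readable name for reporting."""
    return {**request, "name": f"{request['method']} {request['url']}"}


if __name__ == "__main__":
    recorded = [{"method": "get", "url": "https://www.zalando.de/"}]
    print(apply_plugins(recorded, [normalize_method, add_name]))
```

Keeping each organisation-specific fix in its own function means the core transformation stays generic and the fixes can be enabled or reordered independently.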

52. HOW DID WE DO?

53. FINAL CONFIGURATION
● 5 stacks
● 300 Locust workers per stack
● 130,000 RPS

54. OFFICIAL RESULTS OF THE BLACK FRIDAY CAMPAIGN
Source: corporate.zalando.com

55. ONE MORE THING...
● github.com/zalando-incubator/transformer
● github.com/zalando-incubator/zelt

56. OLIWIA ZAREMBA, SOFTWARE ENGINEER
oliwia.zaremba@zalando.de
twitter.com/tortilato