Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data Platform

by Binbing Hou, Stephanie Vezich Tamayo, Xiao Chen, Liang Tian, Troy Ristow, Haoyuan Wang, Snehal Chennuru, Pawan Dixit

This is the first in a series on our work at Netflix on leveraging data insights and Machine Learning (ML) to improve operational automation around the performance and cost efficiency of big data jobs. Operational automation (including, but not limited to, auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing) is key to the success of modern data platforms. In this blog post, we present our project on Auto Remediation, which integrates the currently used rule-based classifier with an ML service and aims to automatically remediate failed jobs without human intervention. We have deployed Auto Remediation in production to handle memory configuration errors and unclassified errors of Spark jobs, and have observed both its efficiency and effectiveness (e.g., automatically remediating 56% of memory configuration errors and saving 50% of the monetary costs caused by all errors) and great potential for further improvement.

Introduction

At Netflix, hundreds of thousands of workflows and millions of jobs run every day across multiple layers of the big data platform. Given the extensive scope and intricate complexity inherent to such a distributed, large-scale system, diagnosing and remediating job failures can impose a considerable operational burden, even if the failed jobs account for only a tiny portion of the total workload.

Fo...
