Scaling Pinterest's ML Infrastructure with Ray: From Training to End-to-End ML Pipelines

Goal: Accelerate large-scale ML workloads by optimizing Ray’s data processing capabilities.

How:

Optimized Ray’s underlying data structure for faster access and processing.
Introduced UDF-level and feature-conversion-level optimizations for large datasets using Ray Data.

(This will be covered in detail in a future blog post, but we mention it here as an enabler of our approach.)

We have put significant effort into optimizing data processing and achieved a 2–3x speedup across different pipelines. The optimizations fall into three categories: Ray Data, feature conversion, and UDF efficiency.

1. Ray Data

a. Remove block slicing: Ray enables block slicing internally by setting the target_max_block_size attribute of DataContext, which keeps blocks from growing excessively large. The slicing itself incurs significant CPU and memory overhead.
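
The text does not say how block slicing was removed. As a rough sketch only, one way to make slicing far less likely is to raise target_max_block_size well above any block the pipeline produces. The 8 GiB value below is an illustrative assumption, not a setting taken from this post.

```python
from ray.data import DataContext

# Fetch the DataContext that Ray Data consults when planning a dataset.
ctx = DataContext.get_current()

# Assumption: a limit far larger than any real block, so blocks are
# rarely (if ever) sliced. The exact value here is illustrative only.
ctx.target_max_block_size = 8 * 1024**3  # 8 GiB
```

The setting has to be applied before the dataset is created, and larger blocks mean higher peak memory per task, so the limit should be chosen with worker memory in mind.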

b. Remove combine_chunks: The combine_chunks function within the batcher can cause unnecessary data copying. It was originally a workaround to prevent slowness in subsequent operations, because a single contiguous chunk can perform better than several discrete chunks. In our pipeline, chunks are combined only when necessary, and many of our operations already work on one chunk at a time, so we can operate on the list of chunks without combining them.

2. Feature conversion

a. Deduplication by Request ID: Within a given batch of training data, certain features share the same value because of their common origin. This duplication presents an opportunity for optimization. The primary trade-off lies in balancing the computational cost of deduplication against the potential ...
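
The idea can be sketched in plain Python. This is an illustration under assumptions, not the pipeline's actual code: `convert_with_dedup`, `expensive_convert`, and the sample data are hypothetical names, and the sketch assumes that rows with the same request ID carry identical values for the feature being converted.

```python
def convert_with_dedup(request_ids, features, convert):
    """Apply `convert` once per unique request ID, then fan results back out."""
    slot_of = {}          # request ID -> index into unique_features
    unique_features = []  # one representative feature value per request ID
    inverse = []          # for each row, the index of its representative
    for rid, feat in zip(request_ids, features):
        if rid not in slot_of:
            slot_of[rid] = len(unique_features)
            unique_features.append(feat)
        inverse.append(slot_of[rid])
    # The expensive conversion runs only on the deduplicated values.
    converted = [convert(f) for f in unique_features]
    # Restore the original row order and batch size.
    return [converted[i] for i in inverse]


calls = []  # records each conversion call, to show how many were made


def expensive_convert(feature):
    calls.append(feature)
    return [x * 2 for x in feature]


request_ids = ["r1", "r1", "r2", "r1", "r2"]
user_features = [[1, 2], [1, 2], [3, 4], [1, 2], [3, 4]]
out = convert_with_dedup(request_ids, user_features, expensive_convert)
```

In this example five rows trigger only two conversions. The bookkeeping (building the index and scattering results back) is the deduplication cost that the trade-off above refers to.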