Logarithm: A logging engine for AI training workflows and services

  • Systems and application logs play a key role in operations, observability, and debugging workflows at Meta.
  • Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, that consumes and indexes these logs and provides an interactive query interface to retrieve and view logs.
  • In this post, we present the design behind Logarithm, and show how it powers AI training debugging use cases.

Logarithm indexes 100+ GB/s of logs in real time and serves thousands of queries a second. We designed the system to support service-level guarantees on log freshness, completeness, durability, query latency, and query result completeness. Users can emit logs using their choice of logging library (the common library at Meta is the Google Logging Library [glog]). Users can query with regular expressions on log lines, filter on arbitrary metadata fields attached to logs, and search across the log files of multiple hosts and services.
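To make the query model concrete, here is a minimal sketch of a query that combines a regular expression over the line text with exact-match filters on metadata, applied across records from several hosts. It is not Logarithm's interface: `LogRecord`, `Query`, `matches`, and `run_query` are hypothetical names invented for illustration, and a linear scan stands in for whatever indexing the real service uses.

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record: one log line, the host it came from, and its
// metadata key-value pairs. Not Logarithm's actual schema.
struct LogRecord {
  std::string host;
  std::string text;
  std::map<std::string, std::string> metadata;
};

// Hypothetical query: a regex over the line text plus exact-match
// constraints on metadata fields.
struct Query {
  std::regex pattern;
  std::map<std::string, std::string> required_metadata;
};

// A record matches when every required metadata pair is present and the
// regex matches somewhere in the line text.
bool matches(const LogRecord& rec, const Query& q) {
  for (const auto& [key, value] : q.required_metadata) {
    auto it = rec.metadata.find(key);
    if (it == rec.metadata.end() || it->second != value) {
      return false;
    }
  }
  return std::regex_search(rec.text, q.pattern);
}

// Linear scan over records from any number of hosts (assumption: a real
// system would consult an index instead of scanning).
std::vector<LogRecord> run_query(const std::vector<LogRecord>& records,
                                 const Query& q) {
  std::vector<LogRecord> out;
  for (const auto& rec : records) {
    if (matches(rec, q)) {
      out.push_back(rec);
    }
  }
  return out;
}
```

Checking the cheap metadata filters before running the regex is a design choice of this sketch, not something the post describes.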

Logarithm is written in C++20, and the codebase follows modern C++ patterns, including coroutines and async execution. This has supported both performance and maintainability, and helped the team move fast: Logarithm was developed in just three years.

Logarithm’s data model

Logarithm represents logs as a named log stream of (host-local) time-ordered sequences of immutable unstructured text, corresponding to a single log file. A process can emit multiple log streams (stdout, stderr, and custom log files). Each log line can have zero or more metadata key-value pairs attached to it. A common example of metadata is rank ID in machine learning (ML) training, when multiple sequences of log lines are ...
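The part of the data model described so far (a named stream, host-local time ordering, immutable text lines, optional metadata) can be sketched roughly as follows. This is an illustration only: `LogLine`, `LogStream`, and `Metadata` are hypothetical names, and rejecting out-of-order appends is an assumption made here to keep the ordering invariant visible, not documented Logarithm behavior.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Zero or more key-value pairs attached to a line, e.g. {"rank", "0"}.
using Metadata = std::map<std::string, std::string>;

// Hypothetical representation of a single log line.
struct LogLine {
  std::chrono::system_clock::time_point host_time;  // host-local timestamp
  std::string text;                                 // unstructured text
  Metadata metadata;                                // may be empty
};

// Hypothetical named stream corresponding to one log file
// (for example "stdout", "stderr", or a custom file).
class LogStream {
 public:
  explicit LogStream(std::string name) : name_(std::move(name)) {}

  // Append-only. Assumption of this sketch: a line older than the last
  // stored line is rejected so the sequence stays time-ordered.
  bool append(LogLine line) {
    if (!lines_.empty() && line.host_time < lines_.back().host_time) {
      return false;
    }
    lines_.push_back(std::move(line));
    return true;
  }

  const std::string& name() const { return name_; }

  // Read-only view: stored lines cannot be modified through this API.
  const std::vector<LogLine>& lines() const { return lines_; }

 private:
  std::string name_;
  std::vector<LogLine> lines_;
};
```

A process that writes to stdout, stderr, and a custom file would own three such `LogStream` objects, one per file.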
