Before jumping into the details of the migration story, we’d like to take a step back and try to explain the motivation and rationale for this migration.
Over time, the data platform has morphed into various forms as the needs of the company have grown. What was a modest data platform in the early days has grown into an exabyte-scale platform. Some systems serving a smaller scale began showing signs of being insufficient for the increased demands that were placed on them. Most notably, we’ve run into some concrete reliability and efficiency issues related to data (de)serialization, which has made us rethink the way we log data and revisit the ideas from first principles to address these pressing issues.
Logger is at the heart of the data platform. The system is used to log analytical and operational data to Scuba, Hive, and stream processing pipelines via Scribe. Every product and data platform team interacts with logging. The data format for logging was either Hive Text Delimited or JSON, for legacy reasons. The limitations of these formats are described in our previous article on Tulip.