How to Export Billion-Scale Graphs on a Transactional Graph Database
eBay's graph database, NuGraph, helps many internal eBay teams make real-time business decisions through relationship analysis. But as the graph dataset grows, it becomes increasingly challenging to validate graph data quality, check the relationship topology, and gain insight into the graph. For example, eBay's largest internal graph has more than 15 billion vertices and 24 billion edges. Furthermore, NuGraph customers expect to export the whole graph to the Hadoop Distributed File System (HDFS) for further processing.
To address those challenges, we proposed a solution that uses the Disaster Recovery (DR) copy of the backend storage for a full scan. We built a NuGraph analytics plugin on top of the open-source graph database JanusGraph. The plugin performs a parallel full scan against the DR backend store and writes the exported graph to HDFS. For the largest graph at eBay, the export takes 3 hours on a Spark cluster with 380 CPU cores and 3.7 TB of memory.
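To put the quoted figures in perspective, the following is a back-of-the-envelope calculation of the implied export throughput. It is only a rough sketch: it assumes every vertex and edge is exported exactly once, that the full 3 hours is spent scanning and writing, and that "3.7 TB" means decimal terabytes. The function and variable names are illustrative and do not come from NuGraph.

```python
# Figures quoted in the text for eBay's largest graph.
# The text says "more than", so these are lower bounds.
VERTICES = 15_000_000_000
EDGES = 24_000_000_000
EXPORT_HOURS = 3
CPU_CORES = 380
MEMORY_TB = 3.7


def export_throughput(elements, hours, cores):
    """Return (elements per second, elements per second per core)."""
    seconds = hours * 3600
    per_second = elements / seconds
    return per_second, per_second / cores


total_elements = VERTICES + EDGES
per_second, per_core = export_throughput(total_elements, EXPORT_HOURS, CPU_CORES)

# Assumption: decimal terabytes (1 TB = 1000 GB).
memory_per_core_gb = MEMORY_TB * 1000 / CPU_CORES

print(f"elements exported per second: {per_second:,.0f}")
print(f"elements per second per core: {per_core:,.0f}")
print(f"memory per core (GB): {memory_per_core_gb:.1f}")
```

Under these assumptions the cluster sustains roughly 3.6 million graph elements per second overall, or about 9,500 per core per second, with a little under 10 GB of memory available per core.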
In setting up DR and developing the NuGraph analytics plugin, we applied various techniques and improvements to the open-source graph database, including separating offline graph export from online transactional query traffic, handling super nodes in graphs, and managing JVM memory on a huge graph.
1. Motivation and Challenges
NuGraph is a cloud-native, scalable, and performant graph database platform developed at eBay. It is built upon the open-source graph database JanusGraph [1], with FoundationDB [2] as the backend storage for graph elements and indexes. FoundationDB is a distri...