Structured DataStore (SDS): Multi-model Data Management With a Unified Serving Stack

Authors: Alberto Ordonez Pereira; Senior Staff Software Engineer | Lianghong Xu; Senior Manager, Engineering |

Part 1: HBase Deprecation at Pinterest & Part 2: TiDB Adoption at Pinterest

In this blog, we will show how the team transitioned from supporting multiple query serving stacks to provide different data models to a brand new data serving platform with a unified multi model query serving stack called Structured DataStore (SDS).

As a data serving platform, SDS is responsible for various aspects of the data lifecycle management, including:

Online Query Serving, which offers multiple data models such as graph, table, document, etc. at 1–2 digit ms p99 latency and with high availability (99.99+) to access and store data
Offline Query Serving, which is responsible for offering means to run analytical queries on the online tables
Streaming (Change Data Capture), which is responsible for streaming changes to the online tables and making them available to external consumers through PubSub
Cost Attribution and Analysis, which provides visibility and predictability on the cost of the different use cases served by the system. It also provides insights on potential cost improvement opportunities, query analysis, etc.
Catalog/Metadata Management, which is responsible for storing and serving table metadata, including ownership, schemas, SLAs, etc., and also providing runtime metadata about the platform
Security and Auditability, which ensures GDPR compliance, advanced levels of auditability, security, AAA, etc.
Tooling (libraries, user interface, etc.), which provides ways to facilitate the use of the platform to the clients

Due to length constraints, this blog will specifically focus on the Online Query Serving component, which can also be referred to interchangeably as “SDS” or “SDS Online Query Serving.” Future blog entries will cover the other components.

Motivation

As discussed briefly in earlier blog posts in this series, the Storage and Caching team owned multiple online query services built on top of HBase that provided various data models and functionalities to our customers. These services all provide a Thrift interface, with the exception of Sparrow, which is a library. Some examples of these services were:

Zen, our graph service. It included specific components like Caching, Routing, Global ID generation, etc. that were designed specifically for this data model and service.
Ixia, our tabular service with near-real time secondary indexing support.. The Ixia service also included internal components like Routing, Caching, etc. again, specifically developed for this service.
Sparrow, our transactional management system that exposes a similar interface to HBase. Since it was offered as a library, our customers needed to develop their own components of routing, caching or rate limiting in their services.
UMS, our wide-column data store. It had the same set of requirements such as caching, routing, rate limiting, etc. that were, again, implemented specifically for this service.

All these services were developed independently at different moments throughout Pinterest’s lifespan, so we ended up having to maintain multiple service stacks with many common requirements. The following picture depicts the situation:

Figure 1: Example of four different services with no overlapping on common modules using two datastore technologies

There are several problems with this approach:

Maintenance-wise, each service is assembled differently, so there is a large cognitive overhead to become an expert in all of them. This leads to knowledge silos within the team, which can cause unnecessary friction.
Each service has been developed with a particular datastore technology in mind, so they are pretty coupled in general. This implies that changing one datastore technology to another requires a non-trivial effort. Such tight coupling made it difficult to transparently migrate a use case to a more suitable datastore technology
APIs are coupled to services. For instance, the graph API is served by the graph service, which couples internal modules to the structures defined in the API, and as mentioned before, to datastores.
Each service has to reimplement most of the common modules, as there is a direct dependency on the API where requests are coming from, the datastore the requests are going to, or both.
Clients cannot easily use different interfaces to access data. For instance, if a client is using a KV interface and eventually needs to transition to a tabular interface with secondary indexing support, that generally implies an API migration rather than an internal datastore switch and the partial use of a tabular interface for the new queries.

Where all these problems stem from is simple:

Internal modules for common operations are implemented based on the structures defined at the API level (hard dependency on where queries are coming from).
Internal modules have a tight coupling with the datastore technology of choice, which makes it nearly impossible to change the datastore without reimplementing the service.

The question to answer is then how to implement those common functionalities so that they can be shared by multiple APIs and datastore technologies. Or in other words, how can we achieve the following:

Figure 2: Condensing four different query serving stacks into a single stack with multiple APIs

Our answer to that is the following:

Define a structure that represents data independently of the datastore technology. We opted to represent data as Tables (relational Table, potentially wide table as well). A physical table (or schema) is never directly exposed to any client. Clients only deal with virtual tables (schemas defined at the service layer) so physical tables can be moved around
Define structures that can effectively query those tables (query language) and structures to convey the results of those queries (query language result format). The query language should be easy to understand and process
Ensure that API implementations compose queries and process query results using the abstractions defined in step 2. I.e. a graph API, instead of writing queries on say, SQL, it will write queries in a new query language that is easy to parse than SQL
Provide translators from these new structures to and from the different datastore protocols used to actually access the data.
Define the common modules such as Caching, Authorization, Rate Limiting, Routing, ID generation, Encryption, etc. based on the newly defined query and query result abstractions.

API requests will then be served as follows:

A request is received via any of the APIs exposed by SDS
The API endpoint transforms the query into one or more internal queries using the newly defined language
The query is then sent to a chain of transformations/modules (authorization, routing, rate limiting, caching, etc.)
One of those transformations translates the query following a corresponding datastore protocol and sends it to the datastore
The datastore returns the result, which is then transformed into a result structure independent of the datastore
The query result structure goes through the same sequence of transformations but in the reverse order as the query did (i.e. data may need to be decrypted)
The result is transformed to the API response format, and sent back to the client

Natural questions that arise from this proposal:

Why do we need a new query language instead of using SQL directly? First of all, we need to distinguish between exposing SQL through a public API (that clients can use) and the use of SQL to define the internal modules. The former is completely possible. The latter, even though possible, is not very convenient. SQL is not very easy to parse. Having to process a SQL string to simply identify which tables are being targeted is not trivial. Also, extracting different fields of a query to handle caching strategies is far from ideal. It would be much easier to use a Thrift-based language (Thrift as IDL), especially given that clients can still use any query language to access their data (we just need to map API queries to this new language).
Will the translation overhead be too high for an online service? In general, frontends don’t really need to do any translation, as they simply need to write the queries in one language or another. Translation to and from the different datastore protocols do introduce extra latency, but it is generally negligible because, as mentioned before, typically queries in an online query service are not too complicated, and the number of entities returned are limited. Additionally, query caching and other optimizations could be applied to mitigate such overhead.

Next, we will focus on the SDS Online Query Serving architecture.

SDS Online Query Serving Architecture

SDS considers that all data is organized as tables, mostly with schemas, but it is flexible enough to accommodate schemaless data as well. Tables are grouped together in namespaces, and no physical table location is ever exposed to the clients: there is a level of indirection between what the client sees and the physical tables. In fact, tables exposed to the clients can be seen as views, as multiple public tables can map to the same physical table. This capability allows us to move data seamlessly without impacting clients, a feature highly valued by our customers.

The SDS Online Query Service stack is illustrated below:

Figure 3: SDS Online Query Serving Architecture

Next, we will describe each of the different components in more detail.

Unified Query Language (UQL)

The UQL is a query language defined using Apache Thrift (™). The fact that queries are constructed using an in-memory structured language significantly eases any type of query transformation. Unlike SQL, it does not require the parsing of complex string structures.

In terms of functionality, there are several types of UQL queries:

Control: Used to alter the behavior of a query or query plan. This includes the following primitives:- Transaction management- Sync vs Async execution commands

- Parallel execution commands
Read: Used to retrieve data from tables. Structure-wise, it is quite similar to SQL:

struct Read {
1: required Locator locator,
2: required list<Selector> selectors,
3: optional Filter filter,
4: optional Ranker ranker,
5: optional GroupBy groupBy,
6: optional Paginator paginator,
}

Insert: Used to insert rows in a table
Delete: Allows to delete rows from a table using complex filter criterias
Update: Allows to update rows from a table

A sequence of UQL queries with an additional query context structure is called a QueryPlan.

Unified Result Format (URF)

The URF is introduced as a generic form of representing table query results. In practice, the URF is a Thrift structure with the following fields:

Status: to flag the status of the operation
Description (optional): to add human readable context to the result, specially useful in failure scenarios
Container: which is, as of today, a dense table with two components:
- Schema: which contains the name, type, and order of the columns of the table
- Data: a table that adheres to the schema definition

A sequence of URFs with an additional result context structure is called a QueryPlanResult.

SDS Frontends

An SDS frontend is simply an API that offers a particular data model or service. Examples of frontend could be the graph frontend (exposes a graph data model), the table frontend (exposes a tabular or relational data model), document, Key-Value, or any other service not particularly bound to any data model.

Frontends are essentially responsible for:

Executing API requests by assembling one or more QueryPlans that the middleware layer will execute
Sending the queries to the middleware layer for execution
Converting the QueryPlanResult returned by the middleware layer to the format expected by the API clients

As an example, a getNodes endpoint in a graph API may look like this:

List<Node> getNodes(List<Integer> ids) {
QueryPlan uqls = translateGetNodesToUQL(ids);
QueryPlanResult urfs = getMiddleware().execute(uqls);
return translateGetNodesFromURF(urfs);
}

SDS Middleware

The SDS Middleware is responsible for receiving a QueryPlan, executing it, and returning the results as a QueryPlanResult. There are two key abstractions that define the SDS Middleware layer: Backends and Profiles.

Backends

A backend is simply a query and query result transformation unit. It can be chained with other backends and its mainly responsible for:

Receiving a query plan (a sequence of UQL queries + some contextual information) and optionally transforming it (i.e. generate IDs for insert operations, add TTL constraints, encrypt data, etc.)
If the backend is part of a chain, it will call the next backend on the chain; otherwise it will short-circuit and assemble and return a QueryPlanResult.
Receiving a QueryPlanResult (a sequence of URFs + some contextual information) returned by the next backend in the chain and optionally transforming it before returning.

Figure 4: Internal Backend Structure

Examples of actual backends being used in production by the Storage & Caching team are:

Authorization guarantees that only authorized entities can access the tables referenced in the queries and for the purpose indicated (control, read, insert, update, remove). Note that not all datastores offer RBAC, which generally results in having to manage multiple ways to offer auth with an organization. By moving Authorization to the service layer and defining it over datastore agnostic structures (virtual tables or schemas), only one Authorization module is necessary. Additionally, virtual schemas may contain only a subset of the physical columns in a physical table, allowing the transparent definition of views, and columnar auth granularity
Routing determines which datastore will be responsible for executing the queries in a query plan. This is a key abstraction to separate physical (datastore tables) from virtual tables (views exposed to the clients). It also transparently manages failovers and other forms of complex routing (multi region read-write strategies etc.)
ID generation generates IDs for newly inserted rows. Note that offloading the ID generation to the underneath datastores will prevent transparent data migrations as clients are effectively being bound to a datastore through the ID generation algorithm.
Cache reduces the burden of the datastore layer and improves latency. It is a declarative module that uses the metadata and the query structure to determine what needs to be cached and how. There is some internal configuration that defines which operations need to be cached, how cache is invalidated, etc. This module, as any other, is independent of the API and datastore, hence shared by all
TTL enhances queries with Time-To-Live constraints.
Instrumentation emits different metrics about queries and results.
Rate Limiting supports different types of rate limiting (concurrency, qps, throughput, etc.). Note that rate limits are often defined at API level (which we do as well in some cases), but this type of rate limit is more focussed on protecting datastores than services (as single API requests can fan out to multiple datastore queries)
Mirror forks queries or query plans to other services.
Execution translates UQL queries to the datastore protocol determined by the Routing layer. It also transforms the results returned by the datastore to URF.
Encryption encrypts and decrypts data to and from datastores.

Backends are a great way to encapsulate functionalities and are very easy to understand as they have a quite simple API (query in, result out). When an engineer has to work on a given backend, they know pretty well which code should and should not be modified. Additionally, they don’t need to worry about where that backend will be used, as that is determined by the query Profiles.

Profiles

Figure 5: Example of two profiles. Profile A will execute the Authorization, Routing, Cache, ID gen, and Execution backends in that order. Profile B will execute the Authorization, Routing, and Execution backends.

A profile is a named sequence of backends that are declaratively defined in a configuration file and that the service loads at bootstrap time. There are two ways profiles can be used:

Statically there is a static configuration that specifies which profiles should be used in each case. For instance, there are rules such as “query plans addressing this namespace should use profile X.” There is always a default profile that will be used in case no specific mapping is available.
Dynamically clients can potentially and explicitly specify which profile to use through the API (when supported)

Profiles have a few major advantages:

They allow you to dynamically change how queries are executed, which transformations are applied, etc. without making any code changes.
They allow you to declaratively define the behavior of the system: just by looking at the configuration, engineers know how a query is executed, the order of transformations, etc. which helps to reduce the cognitive overhead of having to go to the source code to see how the execution is carried out.

In general, we found that the combination of Backends and profiles gave us a great deal of flexibility. A common question we received though, is why allowing only sequences of profiles and not DAGs. For the vast majority of use cases, a linear execution of steps is more than enough, but there is nothing preventing us from actually implementing a DAG: a backend can always fork a call and invoke another profile, hence ending up with a DAG-like execution.

Datastores

A datastore is simply an entity that allows you to store and retrieve data, such as a database or a service. Datastores can speak any protocol, and for them to be supported on SDS the only requirements are the following:

Implement the UQL and URF translations to and from the datastore protocol.
Implement how connections from SDS to the Datastore fleet are managed

Translation implementations are generally easy to implement and not so complicated to optimize performance, especially the UQL to datastore protocol phase. Managing connections from SDS to a datastore falls into the category of how to optimize connection management between two services, and as you may have heard, it is not trivial but doable. On the other end, letting SDS and not clients manage connections to the datastore is another way of protecting the datastore.

SDS Service

As shown in Figure 3, a single SDS deployment (a cluster) can serve multiple frontends within the same physical process by serving each frontend on a different port. In general, frontends share the middleware layer, and the rationale to do so is simple: if clients are accessing the same resource through two different APIs, setting the rate limit at the API level becomes extremely complicated.

Metadata Service

The SDS metadata service is responsible for storing, managing, and serving SDS table’s and runtime metadata.

The table’s metadata is used to represent metadata about the tables served by SDS. This includes:

Name: fully qualified table name
Schema: includes all the columns (with names, data types, different categories of properties such as isPII, primary keys, indices, etc.)
Ownership: project owner, DRIs, tiering, etc.
Documentation: design documents, wikis, etc. that support the table existence
Dashboards: help understand how the table is used, potential issues, etc.
- Sinks: used for data lineage and non-online execution process:
- Offline dumps: configuration to take snapshots of the table to S3
- Streaming (CDC): configuration regarding CDC pipelines, pub-sub topics where data is being written, etc.
- CopyOvers: flag that a table is copied to another location

The runtime metadata provides information about the state of each cluster or deployment. It includes information such as:

APIs or frontends served
Tables available
Profiles available, which includes the backends and the corresponding configuration for each of the backends
Tables being served

All the SDS metadata structures are Thrift structures that are defined as YAML configuration files and load and are consumed by each of the SDS components. In addition to them, all the SDS metadata can be accessed via a Thrift service which helps connect SDS with Pinteret’s metadata ecosystem.

SDS Clients

Each SDS frontend has their corresponding client implementations. The philosophy regarding client’s implementation is to keep them as simple as possible due to the amplification factor of having to maintain versions in different implementation languages.

Conclusion

Wins

Maintenance-wise we have replaced 3–4 services with a single service with multiple APIs. All the core modules are shared and maintained by all the team members, which means everyone is involved as all of them are used by all the services.
All the optimizations to a particular module produce benefits to all the services/APIs, which is a win-win for the team and the organization.
Separating Virtual or Public Tables from Physical Tables allows us to move data around without having any implication for the clients. This has so far unlocked several performance and cost optimizations.
Allowing clients to use different APIs to access different or the same table allows the platform to become a one-stop-shop for Online Query Serving.
Not coupling the service to any datastore technology allows for a free evolution of the storage stack with no implications for the clients: as long as the SLA is kept, the storage platform and clients can evolve independently.
The performance overhead introduced by the framework itself is in the order of microseconds to milliseconds.

Learns

Managing connections to the different datastores may become complicated if not done carefully.
SDS is a multi-tenant service, so achieving complete workload isolation is non trivial. We have made some progress in that direction, but more work is still ahead of us.

Acknowledgements

HBase deprecation, TiDB adoption and SDS productionization would not have been possible without the diligent and innovative work from the Storage and Caching team engineers including Alberto Ordonez Pereira, Ankita Girish Wagh, Gabriel Raphael Garcia Montoya, Ke Chen, Liqi Yi, Mark Liu, Sangeetha Pradeep and Vivian Huang. We would like to thank cross-team partners James Fraser, Aneesh Nelavelly, Pankaj Choudhary, Zhanyong Wan, Wenjie Zhang for their close collaboration and all our customer teams for their support on the migration. Special thanks to our leadership Bo Liu, Chunyan Wang, David Chaiken and Dave Burgess for their guidance and sponsorship on this initiative. Last but not least, thanks to PingCap for helping along the way introduce TiDB into the Pinterest tech stack from initial prototyping to productionization at scale.