Building Scalable, Real-Time Chat to Improve Customer Experience

Uber is a global business with a customer base spread throughout the world, divided into many user personas: predominantly riders, drivers, eaters, couriers, and merchants. Being a global business, Uber's customers also expect support at a global scale. Customers reach out to us through various live (chat, phone) and non-live (in-app messaging) channels, and expect swift resolutions to their issues. With millions of support interactions (known internally as contacts) raised by Uber customers every week, our goal is to resolve these contacts within a predefined service level agreement (SLA). Contacts created by customers are resolved either via automation or with help from a customer support agent.

For agent-handled contacts, the cost of resolution plays an important role in how Uber structures its support channels and distributes volume across the different live and non-live channels. Cost-per-contact (CPC) and first contact resolution (FCR) for the chat channel are the best among the live channels, as chat allows agents to handle multiple contacts concurrently while maintaining a lower average cost than channels like phone. Chat is in a sweet spot for Uber: it has a good CSAT score (customer satisfaction rating, measured on a scale of 1 to 5) while generally reducing CPC. The channel allows for a higher automation rate, higher staffing efficiency (as agents can work on multiple chats at the same time), and high FCR, all of which benefit Uber while providing quality support for customers.

Historically, from 2019 to early 2023, 1% of all contacts were served via the live chat channel, 58% via the in-app messaging channel (a non-live channel), and the rest via other live channels such as phone. To achieve higher CSAT and FCR, the engineering team needed to scale the chat infrastructure to meet the demands of Uber's growing business, and to facilitate the migration of a large volume of in-app messaging and phone contacts to chat. This article focuses on the chat live channel.

To scale the chat channel to support 40+% of the Uber contact volume that is routed to agents, these were some of the major challenges the team faced:

  1. Reliability issues with delivering messages from backend systems to an agent’s browser

    1. 46% of events originating from a customer trying to reach an agent were not delivered in time, resulting in delays for customers and wasted agent bandwidth. Note that 46% refers to the overall number of events across all chat contacts, not the number of unique contacts.
  2. Missing Insights

    1. There was no observability into the health of chat contacts.
    2. Since agents were idle for large amounts of time while queues were not empty, Ops was left wondering whether they were overstaffed or whether a tech issue was causing disproportionate volumes (referred to as a tech-vs.-staffing question).

Our legacy architecture was built using the WAMP protocol that was used primarily for message passing and PubSub over WebSockets to relay contact information to the agent’s machine.


Figure 1: Describes the previous high-level flow of the chat contact from being created to being routed to an agent on the front end.

Note: This is different from the data path involving the exchange of chat messages between the customer and Uber Support, facilitated through HTTP Server-Sent Events (SSE). For this purpose, Uber utilizes Ramen as an internal service, serving a dual role in the control and data paths. In the control path, Ramen provides bi-directional support for client-to-mobile use cases, allowing effective communication. Simultaneously, in the data path, Ramen offers SSE capabilities for client-to-web use cases.

However, a noteworthy distinction arises in the data path for client-to-web use cases, where Ramen has a successful delivery rate of 94.5%. SSE operates in a unidirectional manner, which necessitated new control flows to detect and manage situations where the client is no longer responsive. In this blog, we cover the new control flow for delivering events from the backend to the agent's browser (web) so the agent can send the first reply.

The team launched the end-to-end architecture in production and started to see issues. Not immediately, but as traffic scaled beyond the trickle of initial tickets, the team realized that the architecture could not easily scale beyond its initial capabilities and that managing it in production was not straightforward. Listed below are some of these core issues:

We were facing reliability issues at 1.5X the original traffic, with up to 46% of events from the backend not being delivered to the agent's browser. This added to the customer's wait time to speak to an agent.

Beyond a low rate of roughly 10 RPS, the system's ability to deliver contacts from the backend deteriorated significantly due to high memory usage and file descriptor leaks. Horizontal scaling was not supported due to limitations in the older versions of the WAMP library in use, and upgrading it would have been a huge effort.

The following were the major issues related to Observability:

  1. It was difficult to track the health of chat contacts, i.e., whether contacts were missing the SLA due to engineering-related concerns or staffing-related concerns.
  2. Chat contacts were not onboarded onto the queue-based architecture, resulting in over 8% of chat volume never being routed to any agent because of the agent attribute-matching flow.
  3. The WAMP protocol and libraries (eg1, eg2) in use were deprecated and offered little insight into their inner workings, making debugging much more difficult. Furthermore, we had not implemented end-to-end chat contact lifecycle debugging, and we were unable to accurately detect chat SLA misses on the platform overall.

The services were stateful, complicating maintenance and restarts, which caused spikes in message delivery time and message loss. A WebSocket proxy had been added to perform authorization and to work around the statefulness of the services; however, it increased latency tremendously, and the double socket hop caused issues whenever either side disconnected.

The following were some of the goals the tech team was working towards:

  1. Scale up Chat traffic from 1% to 40% of the overall contact volume by the end of 2023 (1.5 million tickets per week)

    1. Onboard and scale chat traffic on queues to unlock queue-related insights
    2. Scale to handle over 80% of Uber’s overall contact volume by the end of 2024 (3 million tickets per week).
  2. Reservation (connecting a customer to an agent on the first try after an agent has been identified) success via the proxy pipeline (known as the Push Pipeline) should be >= 99.5%

  3. Build observability and debuggability over the entire Chat flow, end to end.

  4. Stateless services that would not need recalibration if they horizontally scaled or if instances went down for any reason

The new architecture needed to be simple, both to improve transparency into its inner workings and so the team could easily scale it. The team decided to go ahead with the Push Pipeline: a simple, non-redundant WebSocket server that agent machines connect to, sending and receiving messages through one generic socket channel.
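
To make the "one generic socket channel" idea concrete, here is a minimal sketch of how a client might fan messages from that single channel out to handlers by message type. The envelope shape and type names are illustrative assumptions, not Uber's actual wire format.

```typescript
// Hypothetical envelope for every frame on the single Push Pipeline socket.
type Envelope = {
  type: string;        // e.g. "contact.reserved", "heartbeat" (assumed names)
  contactId?: string;  // present for contact events
  payload?: unknown;
};

type Handler = (env: Envelope) => void;

class PushChannel {
  private handlers = new Map<string, Handler[]>();

  // Register a handler for one message type.
  on(type: string, handler: Handler): void {
    const list = this.handlers.get(type) ?? [];
    list.push(handler);
    this.handlers.set(type, list);
  }

  // Called for every raw frame arriving on the socket;
  // returns how many handlers received the message.
  dispatch(raw: string): number {
    const env = JSON.parse(raw) as Envelope;
    const list = this.handlers.get(env.type) ?? [];
    list.forEach((h) => h(env));
    return list.length;
  }
}
```

One generic channel keeps the server simple: new push use cases only add a message type and a client-side handler, not a new socket.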

The new architecture as it exists today is showcased below:


Figure 2: Describes the new high-level flow following the journey of the chat contact being created through being routed to an agent on the front end.

The architecture has the following parts:

Front End UI is used by agents to interact with customers. Widgets and different actions are available to agents to investigate and take appropriate actions for the customer.

Router is the service that finds the most appropriate match between the agent and contact. Upon finding the most suitable contact for an agent, the contact is pushed into a reserved state for the agent.

Upon successful reservation of the contact for the agent, the matched information is published to Apache Kafka®. On receiving this information through the socket via GraphQL subscriptions, Front End loads the contact for the agent along with all necessary widgets and actions enabling the agent to respond to the user.
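
A sketch of what the front end might do when a reservation event arrives over the subscription: add the contact to the agent's workspace and queue an acknowledgment to send back. The field and function names are assumptions for illustration.

```typescript
// Illustrative shapes; not Uber's actual schema.
interface ReservationEvent {
  contactId: string;
  agentId: string;
  queue: string;
}

interface AgentWorkspace {
  agentId: string;
  openContacts: string[]; // contacts currently loaded in the UI
  pendingAcks: string[];  // acknowledgments to send back over HTTP
}

function applyReservation(ws: AgentWorkspace, ev: ReservationEvent): AgentWorkspace {
  // Ignore events meant for other agents, and duplicates of an open contact.
  if (ev.agentId !== ws.agentId || ws.openContacts.includes(ev.contactId)) {
    return ws;
  }
  return {
    ...ws,
    openContacts: [...ws.openContacts, ev.contactId],
    pendingAcks: [...ws.pendingAcks, ev.contactId],
  };
}
```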

Any agent who wants to start working goes online via a toggle on the front end, which, when triggered, updates the Agent State service with that agent's new state.

Any connection between the client browser and backend services happens via the Edge Proxy which safeguards Uber services as a firewall and proxy layer.

The following are the important points:

  1. Onboarded chat traffic onto queues, where subscribed agents receive contacts based on the concurrency set on the agent's profile. Concurrency defines the number of chat contacts an agent can work on simultaneously.
  2. Agent staffing to queues becomes deterministic in nature, and features such as SLA-based routing (prioritizing chat contacts based on queue SLA), sticky routing (routing reopened contacts back to the same agent), and priority routing (prioritizing based on different rules defined on the queues) were supported by default.
  3. With queue onboarding, dashboards were repurposed and enhanced for Ops teams to view chat queue SLAs and agent availability and real-time status, including contact lifecycle states, queue inflow/outflow, agents' session counts, etc.
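
The concurrency check above can be sketched as follows: an agent is eligible for a new chat contact only while online and below the concurrency set on their profile. This is an illustrative model, not the production router, and the tie-breaking rule (most spare capacity) is one possible choice; the real router also applies SLA, sticky, and priority rules.

```typescript
interface AgentProfile {
  id: string;
  online: boolean;
  concurrency: number;    // max simultaneous chat contacts
  activeContacts: number; // chats the agent is currently working on
}

// Agents who can accept one more chat contact right now.
function eligibleAgents(agents: AgentProfile[]): AgentProfile[] {
  return agents.filter((a) => a.online && a.activeContacts < a.concurrency);
}

// Pick the eligible agent with the most spare capacity.
function pickAgent(agents: AgentProfile[]): AgentProfile | undefined {
  return eligibleAgents(agents).sort(
    (a, b) =>
      (b.concurrency - b.activeContacts) - (a.concurrency - a.activeContacts)
  )[0];
}
```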

The major highlights related to the GraphQL subscriptions are:

We have enabled ping/pong on the GraphQL subscription socket to ensure the socket is disconnected automatically in the case of an unreliable connection. When the socket is disconnected, the respective agent becomes ineligible to receive new contacts. WebSocket reconnection is reattempted automatically, and upon successful reconnection, all reserved/assigned contacts are fetched so the agent can accept them.
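
The ping/pong behavior can be sketched as a small watchdog: if no pong is observed within a timeout of the last one, the socket is treated as dead and torn down (making the agent ineligible for new contacts until reconnection). Timestamps are passed in explicitly here so the logic stays deterministic and testable; the timeout value is an assumption.

```typescript
class HeartbeatWatchdog {
  private lastPongAt: number;

  constructor(private timeoutMs: number, nowMs: number) {
    this.lastPongAt = nowMs;
  }

  // Record a pong received from the peer.
  onPong(nowMs: number): void {
    this.lastPongAt = nowMs;
  }

  // True when the connection should be torn down and reconnected.
  isDead(nowMs: number): boolean {
    return nowMs - this.lastPongAt > this.timeoutMs;
  }
}
```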

If the front end does not send back an acknowledgment to the chat service for a contact reserved for a given agent, we try to reserve the same contact for another available agent. We check that both the WebSocket and HTTP protocols are working properly for the agent's browser by sending a heartbeat over the GraphQL subscription; the response is sent via an HTTP API call from the agent's browser, confirming that the agent is online.
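
The acknowledgment check above can be sketched as a sweep over outstanding reservations: any reservation that is not acknowledged within the timeout is released so the contact can be reserved for another agent. All names and the timeout are illustrative assumptions.

```typescript
interface Reservation {
  contactId: string;
  agentId: string;
  reservedAtMs: number; // when the contact was reserved
  acked: boolean;       // front end confirmed it loaded the contact
}

// Contact IDs that should be re-reserved for other available agents.
function expiredReservations(
  reservations: Reservation[],
  nowMs: number,
  ackTimeoutMs: number
): string[] {
  return reservations
    .filter((r) => !r.acked && nowMs - r.reservedAtMs > ackTimeoutMs)
    .map((r) => r.contactId);
}
```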

Outlined below are some of the tech choices we made to improve the reliability and robustness of the chat system, while also considering the end-to-end latency impact on the user's perceived wait time. For this, we needed to keep the system simple while enabling select product enhancements.

The front-end team already uses GraphQL extensively for HTTP calls in its front-end services. This led the team to select GraphQL subscriptions for pushing data from the server to the client. The client sends messages to the server via subscription requests, and the server, on matching queries, sends messages back to the agent machines. More about GraphQL subscriptions is covered in the sections below.

Any new services would need to be stateless so they could scale horizontally without needing to rebalance every now and then.

Since the system required bidirectional communication between agent machines and the proxy layer, an HTTP fallback would not have made any real difference to the system's SLAs. Hence, the team focused on increasing the availability of socket connections to the proxy via:

  1. Bidirectional ping pong messages to prevent hung sockets
  2. Backed off reconnects after disconnects to prevent concurrent reconnects from overwhelming the service.
  3. Single proxy to connect sockets without any handover 
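
The backed-off reconnect in point 2 can be sketched as capped exponential backoff. A random jitter would normally be added to spread simultaneous reconnects apart; it is injectable here so the schedule stays deterministic to test. The constants are illustrative assumptions, not Uber's values.

```typescript
// Delay before the (attempt+1)-th reconnect try, in milliseconds.
function reconnectDelayMs(
  attempt: number,                 // 0 for the first retry
  baseMs = 1000,
  maxMs = 30000,
  jitter: () => number = () => 0  // e.g. () => Math.random() * 500
): number {
  const exp = Math.min(baseMs * 2 ** attempt, maxMs);
  return exp + jitter();
}
```

Capping the delay keeps a long outage from pushing reconnects arbitrarily far out, while jitter prevents a thundering herd when many agents reconnect at once.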

Contact messages already flowed through various services via Kafka before reaching the proxy layer. We decided to continue and extend the use of Kafka, as it was reliable, fast, and supported broadcasting (PubSub).

We performed both functional and non-functional tests to ensure both customers and agents get the best experience end to end. To predict performance, a few of the tests done before launch were:

Roughly 10K socket connections could be established from a single machine, which can be scaled further horizontally by adding more machines. We successfully tested pushing events at 20X the throughput of the old stack.

Existing traffic was directed through both the old system and the new pipeline to test its capacity with 40,000 contacts and 2,000 agents daily. This process revealed no problems, and data metrics showed that latency and availability were satisfactory and met the desired thresholds.

Existing traffic was directed through the new system with the old user interface for agents, serving as a crucial reliability test. This was the initial use of the new system, and it successfully managed the traffic while maintaining latency within the defined SLAs.

As we went along, we encountered unique system and agent behavior issues and did some fixes to increase reliability and reduce latency on the pipeline overall. Some of the major issues were:

When browser cookies were cleared, auth issues and subsequent API failures prevented pushed events from being acted upon by the front end. In such cases, agents remained online without working on any contacts.

Agents were not logged out because of issues such as out-of-order or missing events: agents who finished their work for the day remained online in the system if they simply closed their tabs. This increased customer wait times as the pipeline tried to push events to agents who weren't actually online. We then started automatically logging agents out based on recent acknowledgment misses, and tracing logouts to their root causes, to improve confidence in the system.
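
The automatic logout described above can be sketched as a consecutive-miss counter: any successful acknowledgment resets the streak, and a streak of misses past the threshold triggers a logout. The threshold of 3 is an illustrative choice, not the production value.

```typescript
class AckTracker {
  private misses = 0;

  constructor(private maxConsecutiveMisses = 3) {}

  // Any successful acknowledgment resets the streak.
  recordAck(): void {
    this.misses = 0;
  }

  // Returns true when this miss should trigger an automatic logout.
  recordMiss(): boolean {
    this.misses += 1;
    return this.misses >= this.maxConsecutiveMisses;
  }
}
```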

The chat channel has scaled to about 36% of the overall Uber contact volume routed to agents, with more coming in the months ahead. The team has regained trust in scaling the chat channel, while also improving the overall customer experience around it. Reliability improved massively as well: the error rate for delivering a contact dropped from around 46% in the old stack to roughly 0.45% in the new one. Each failed delivery bounced the customer's ticket back with a 30-second delay before the delivery was retried, so bringing this number below 0.45% at scale was a massive win for the customer and agent experience overall.
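
As a back-of-the-envelope model of those numbers: if each failed delivery adds a ~30-second delay before a retry and each attempt fails independently with probability p, the expected added wait is 30 · p / (1 − p) seconds (geometric retries). This is a simplifying assumption for illustration, not a measured figure.

```typescript
// Expected extra wait due to retried deliveries, in seconds.
// failureRate is the per-attempt failure probability (0 <= p < 1).
function expectedExtraWaitSeconds(failureRate: number, retryDelaySec = 30): number {
  return (retryDelaySec * failureRate) / (1 - failureRate);
}
```

Under this model, a 46% failure rate adds roughly 25.6 seconds of expected wait per contact, while 0.45% adds about 0.14 seconds.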

We’ve also had other wins in this area, with perhaps the best one being simplicity. The new architecture has fewer services, fewer protocols, and better observability built into the system for visibility into contact delivery metrics, delay within the system, and end-to-end latency.

The new push pipeline enables the team to onboard other push use cases and opens the door to improving the user experience by providing real-time information for agents to act upon. Some use cases relating to Greenlight appointments and agent work overlaps on contacts will soon move onto this new stack as part of the next phase.

We will continue to improve the user experience for the chat channel as a whole, focusing on both product enhancements and system architecture adjustments, based on learnings from expanding the product and on issues reported by customers and agents.

