Beyond Query Optimization: Aurora Postgres Connection Pooling with SQLAlchemy & RDSProxy

The main limitation of using an external proxy with transaction pooling is that Postgres’ session-based features will not work correctly. This means any kind of “SET XXXX” will not behave as expected. Historically, prepared statements have also been a problem; however, AWS added enhancements to RDSProxy in late 2023 to support prepared statements, and we never ran into any issues with them. We did encounter several other issues when rolling out RDSProxy in our test/prod environment:

Statement timeout

When connecting with the database using the proxy, we received the following error:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "aurora.reader.proxy.com" (10.0.146.4), port 5432 failed: FATAL: Feature not supported: RDS Proxy currently doesn't support command-line options.

The reason for this was the following configuration in our settings files:

SQLALCHEMY_ENGINE_OPTIONS = {
    'connect_args': {
        'options': '-c statement_timeout={timeout}'.format(
            timeout=str_env('POSTGRES_STATEMENT_TIMEOUT')
        )
    }
}

To overcome this issue, we set statement_timeout at the user level instead and removed the command-line options from SQLALCHEMY_ENGINE_OPTIONS:

ALTER ROLE scott SET statement_timeout = '5s';
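
For completeness, here is a minimal sketch of what the engine setup can look like once the command-line options are gone; the connection URL and pool settings below are illustrative assumptions rather than actual values, and the role-level default can be verified with a plain SHOW statement_timeout:

from sqlalchemy import create_engine, text

# Engine options with the '-c statement_timeout' command-line option removed.
# The pool settings here are placeholders for illustration.
SQLALCHEMY_ENGINE_OPTIONS = {
    'pool_pre_ping': True,
    'pool_recycle': 300,
}

engine = create_engine(
    'postgresql+psycopg2://scott@aurora.reader.proxy.com:5432/appdb',  # hypothetical URL
    **SQLALCHEMY_ENGINE_OPTIONS,
)

# The role-level default applies to every new connection, so nothing needs to
# be passed through the proxy at connect time.
with engine.connect() as conn:
    print(conn.execute(text('SHOW statement_timeout')).scalar())  # expected: 5s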

Session pinning

This is probably the most important issue to be mindful of when moving to RDSProxy. Normally, RDSProxy optimizes database connections by multiplexing multiple client requests over fewer actual database connections. However, some database session states require a persistent connection to a specific backend instance, forcing RDSProxy to “pin” that session. The pinning of sessions reduces how many connections are available for reuse. To get the most benefit from connection pooling, it is worth minimizing behaviors that trigger session pinning.

Session pinning: How to catch it?

Conveniently, RDSProxy charts all occurrences of session pinning in the proxy’s monitoring tab, under a metric called “DatabaseConnectionsCurrentlySessionPinned.”

Session pinning metrics from AWS CloudWatch
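
The same metric can also be pulled programmatically; a hedged boto3 sketch, assuming RDSProxy metrics are published under the AWS/RDS namespace keyed by ProxyName (the proxy name below is a placeholder):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Assumption: RDS Proxy metrics live in the AWS/RDS namespace with a ProxyName dimension.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="DatabaseConnectionsCurrentlySessionPinned",
    Dimensions=[{"Name": "ProxyName", "Value": "my-proxy"}],  # placeholder name
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])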

In addition, it logs the cause of the session pinning as well. These logs can be found in CloudWatch at a WARN level and include the text, “The client session was pinned to the database connection . . .”.
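
To pull the pinning causes out of CloudWatch without clicking through the console, a filter on that warning text works; a minimal boto3 sketch, assuming the proxy writes to a log group along the lines of /aws/rds/proxy/<proxy-name> (the group name below is hypothetical):

import boto3

logs = boto3.client("logs")

# The log group name is an assumption; adjust it to wherever the proxy's
# logs are delivered in your account.
response = logs.filter_log_events(
    logGroupName="/aws/rds/proxy/my-proxy",
    filterPattern='"The client session was pinned to the database connection"',
)

for event in response.get("events", []):
    print(event["timestamp"], event["message"])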

We did encounter some rare queries that were session pinned because the SQL statement exceeded the 16,384-byte limit, and we were never able to find the source of this session pinning:

Example session pinning CloudWatch log

RDSProxy does not log these SQL statements even with Enhanced Logging enabled. We attempted to find the query on the database side, but we could never find it in pg_stat_statements, probably because pg_stat_statements stores a parameterized version of the query that can be much shorter than the actual text, especially for a query with an IN clause containing a large number of values. The only other alternative is logging every statement on the database to try and find them, but given how rare these session pinning incidents are, we have not attempted that as of yet.
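
One heuristic for hunting candidates is to sort pg_stat_statements by the length of the stored (normalized) query text; this is only a sketch, since, as noted above, the normalized text can be far shorter than the 16 KB original, and db.session here is the same Flask-SQLAlchemy session used elsewhere in this post:

from sqlalchemy import text

# Longest normalized statements first; run against the database directly,
# not through the proxy.
longest = db.session.execute(text("""
    SELECT queryid,
           length(query) AS query_length,
           left(query, 200) AS query_preview
    FROM pg_stat_statements
    ORDER BY length(query) DESC
    LIMIT 10
"""))

for row in longest:
    print(row.queryid, row.query_length, row.query_preview)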

Apart from this instance, we went ahead and addressed each of the following session pinning issues:

Advisory lock_timeout

We had several flows in our application using advisory locks with varying timeouts.

db.session.execute('SET lock_timeout=3000');

This changes the state of the session and requires session pinning. One option was to set lock_timeout at the user level. However, because each flow had a different timeout requirement, we opted to use SET LOCAL, which is transaction-scoped and does not cause session pinning.

db.session.execute('SET LOCAL lock_timeout=3000');
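
The SET LOCAL has to run inside the same transaction as the advisory-lock work to have any effect. A hedged sketch of that pattern, with an illustrative lock key and pg_advisory_xact_lock chosen so the lock is released with the transaction (the actual lock calls in an application may differ):

from sqlalchemy import text

# Both the lock_timeout and the advisory lock are scoped to this transaction,
# so no session state survives the commit to cause pinning.
db.session.execute(text('SET LOCAL lock_timeout = 3000'))
db.session.execute(text('SELECT pg_advisory_xact_lock(:key)'), {'key': 42})
# ... do the work that needs the lock ...
db.session.commit()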

Sequence numbers

We had some legacy flows in our write APIs that relied on explicit sequence number generation. We had to refactor the code to rely on the auto-incremented value generated when the row is written to the DB, rather than generating the ID up front with nextval().
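
A hedged sketch of that refactor, using a hypothetical Order model and sequence name: instead of fetching an ID with nextval() and assigning it ourselves, the INSERT populates the auto-incremented primary key, which is read back after flush.

# Before: fetching the ID explicitly before the insert (sequence name and
# model are hypothetical, shown only for contrast).
# next_id = db.session.execute(text("SELECT nextval('orders_id_seq')")).scalar()
# order = Order(id=next_id, customer_id=123)

# After: let the INSERT assign the auto-incremented primary key.
order = Order(customer_id=123)
db.session.add(order)
db.session.flush()   # the INSERT ... RETURNING populates order.id
print(order.id)
db.session.commit()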

There are many other reasons why session pinning occurs; a full list is available in the official AWS documentation.

Query latency

Due to the extra network hop added by the proxy, some latency is introduced. For our application, most of the APIs did not see any significant increase in latency. There was one latency-sensitive API for which we chose to connect directly to the reader instead of going through the proxy. We will evaluate this API in the future to see if we can change its query patterns.

Sample query for which the p95 latency increased due to the proxy rollout
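
A hedged sketch of how that split can look with plain SQLAlchemy engines (both endpoints and the query are hypothetical): the default engine goes through the proxy, while the one latency-sensitive read path gets its own engine pointed straight at the reader endpoint.

from sqlalchemy import create_engine, text

# Default path: through RDS Proxy (hypothetical endpoint; credentials omitted).
proxy_engine = create_engine('postgresql+psycopg2://app@aurora.reader.proxy.com:5432/appdb')

# Latency-sensitive path: straight to the Aurora reader endpoint (hypothetical
# hostname), skipping the proxy's extra network hop.
reader_engine = create_engine(
    'postgresql+psycopg2://app@my-cluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com:5432/appdb'
)

def latency_sensitive_lookup(item_id):
    # Hypothetical query, used only to show routing this one path to the
    # direct engine while everything else keeps using proxy_engine.
    with reader_engine.connect() as conn:
        return conn.execute(
            text('SELECT payload FROM items WHERE id = :id'), {'id': item_id}
        ).scalar()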
