Description
We found that when `connectionTimeoutMillis` is hit, the following things happen:
- A connection is opened on the Postgres side, and is never actually utilized
- `pg-pool` continues tracking the connection but never marks it as idle
- The connection isn't cleaned up until the pool itself is closed (or the Node.js process dies)
We were able to track these connections in Postgres by running the following query:
select * from pg_stat_activity
where query_start is NULL and state = 'idle';
What's unusual is that the connection "times out" on the pool side, but an idle connection remains open on the Postgres side without ever being marked as idle in the pool. This might suggest a race condition between the timeout and the client connection.
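To make that suspicion concrete, here is a rough, self-contained sketch of the ordering we think happens; the helper names and structure are illustrative only, not the actual pg-pool implementation:

```js
// Illustrative sketch of the suspected race. Not the real pg-pool code.
function acquireWithTimeout(client, connectionTimeoutMillis, callback) {
  let timedOut = false;

  const timer = setTimeout(() => {
    // 1. The timer fires first: the caller's acquire is rejected with
    //    "timeout exceeded when trying to connect".
    timedOut = true;
    callback(new Error('timeout exceeded when trying to connect'));
  }, connectionTimeoutMillis);

  client.connect((err) => {
    clearTimeout(timer);
    if (timedOut) {
      // 2. The handshake that was already in flight can still complete after
      //    the timeout. The caller was never handed this client, so nothing
      //    ever releases or ends it, and the server-side session sits "idle".
      return;
    }
    callback(err, client);
  });
}
```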
Reproduction script
/* eslint-disable prettier/prettier */
import Pool from 'pg-pool';
import Client from 'pg-native';

const pool = new Pool({
  port: 5432,
  // Setting this to 1 will cause everything to crash.
  max: 2,
  connectionTimeoutMillis: 1000,
  // Comment this out to try with `pg`
  // Client,
  log: (msg) => {
    console.log(msg);
  }
});

// Delay the first physical connect past connectionTimeoutMillis (1000ms)
// so the pool's timeout always fires before the first connection completes.
let connection = 0;
const baseConnect = pool.Client.prototype.connect;
pool.Client.prototype.connect = function (cb) {
  if (connection === 0) {
    connection += 1;
    return setTimeout(() =>
      baseConnect.call(this, cb),
      1100
    );
  }
  connection += 1;
  return baseConnect.call(this, cb);
};

let acquireCount = 0;
pool.on('acquire', function (client) {
  acquireCount++;
});
let connectCount = 0;
pool.on('connect', function () {
  connectCount++;
});
let releaseCount = 0;
pool.on('release', function () {
  releaseCount++;
});
let removeCount = 0;
pool.on('remove', function () {
  removeCount++;
});
let errorCount = 0;
pool.on('error', function () {
  errorCount++;
});

await pool.connect();

let zombiesCount = 0;
const res = await pool.query(
  'select count(*) from pg_stat_activity where query_start is NULL and state = $1::text;',
  ['idle']
);
if (res.rows) {
  zombiesCount = res.rows[0]['count'];
} else {
  zombiesCount = res[0]['count'];
}

console.log("-----------------");
console.log("Total connections", connectCount);
console.log("Total acquisitions: ", acquireCount);
console.log("Total releases: ", releaseCount);
console.log("Total removes: ", removeCount);
console.log("Total errors: ", errorCount);
console.log("Total leaked connections: ", zombiesCount);
console.log("Total current pool size: ", pool.totalCount);
console.log("Total current idle pool size: ", pool.idleCount);
console.log("-----------------");
pool.end();
Without exhausting the pool
Here you can see the issue in action, where a connection is held open after the initial failed `connect` call and we end with a pool of two connections, only one of which is idle.
With `pg` as client
➜ node test.mjs
checking client timeout
connecting new client
ending client due to timeout
new client connected
checking client timeout
connecting new client
new client connected
dispatching query
query dispatched
pulse queue
no queued requests
-----------------
Total connections 2
Total acquisitions: 2
Total releases: 1
Total removes: 0
Total errors: 0
Total leaked connections: 1
Total current pool size: 2
Total current idle pool size: 1
-----------------
pulse queue
no queued requests
ending
pulse queue
pulse queue on ending
With `pg-native` as client
➜ node test.mjs
checking client timeout
connecting new client
ending client due to timeout
new client connected
checking client timeout
connecting new client
new client connected
dispatching query
query dispatched
-----------------
Total connections 2
Total acquisitions: 2
Total releases: 1
Total removes: 0
Total errors: 0
Total leaked connections: 1
Total current pool size: 1
Total current idle pool size: 0
-----------------
ending
pulse queue
pulse queue on ending
pulse queue
pulse queue ended
pulse queue
pulse queue ended
With exhausting the pool
You can test this by setting `max: 1` in the reproduction script's config. This illustrates that leaked connections don't get cleaned up when a new connection is requested, meaning it's possible to exhaust the pool with enough connection timeouts.
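Concretely, the only change is the pool config from the reproduction script above (everything else stays the same):

```js
const pool = new Pool({
  port: 5432,
  // With only one slot, the single leaked connection permanently occupies
  // the pool, so later acquires can only wait and then time out themselves.
  max: 1,
  connectionTimeoutMillis: 1000,
  log: (msg) => {
    console.log(msg);
  }
});
```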
With `pg` as client
➜ node test.mjs
checking client timeout
connecting new client
ending client due to timeout
new client connected
/Users/nick/node-postgres/packages/pg-pool/index.js:45
Error.captureStackTrace(err)
^
Error: timeout exceeded when trying to connect
at /Users/nick/node-postgres/packages/pg-pool/index.js:45:11
at async file:///Users/nick/node-postgres/packages/pg-pool/test.mjs:67:13
Node.js v20.17.0
With `pg-native` as client
`pg-native` has even more unusual behavior here, as it totally crashes the script and exits with code 13. No error is thrown.
➜ node test.mjs
checking client timeout
connecting new client
ending client due to timeout
new client connected
If there is another Promise being awaited, then no crash occurs and we resolve as usual (you can test this by creating an unresolved promise before the leaked-connection query and awaiting it after the query is fired off).
const promise = new Promise((resolve) => setTimeout(resolve, 2000));
const res = await pool.query(
  'select count(*) from pg_stat_activity where query_start is NULL and state = $1::text;',
  ['idle']
);
await promise;
The output is as follows:
➜ node test.mjs
checking client timeout
connecting new client
ending client due to timeout
new client connected
checking client timeout
connecting new client
new client connected
dispatching query
query dispatched
pulse queue
no queued requests
-----------------
Total connections 2
Total acquisitions: 2
Total releases: 1
Total removes: 1
Total errors: 0
Total leaked connections: 1
Total current pool size: 1
Total current idle pool size: 0
-----------------
ending
pulse queue
pulse queue on ending
pulse queue
pulse queue ended
Possible fix for `pg`
I was able to resolve this for `pg` specifically with the following change:
diff --git a/packages/pg-pool/index.js b/packages/pg-pool/index.js
index 3e505f79..59c258c6 100644
--- a/packages/pg-pool/index.js
+++ b/packages/pg-pool/index.js
@@ -244,7 +244,7 @@ class Pool extends EventEmitter {
         this.log('ending client due to timeout')
         timeoutHit = true
         // force kill the node driver, and let libpq do its teardown
-        client.connection ? client.connection.stream.destroy() : client.end()
+        client.connection ? client.connection.end() : client.end()
       }, this.options.connectionTimeoutMillis)
     }
This resulted in the following output:
➜ node test.mjs
checking client timeout
connecting new client
ending client due to timeout
client failed to connect
pulse queue
no queued requests
checking client timeout
connecting new client
new client connected
dispatching query
query dispatched
pulse queue
no queued requests
-----------------
Total connections 1
Total acquisitions: 1
Total releases: 1
Total removes: 0
Total errors: 0
Total leaked connections: 0
Total current pool size: 1
Total current idle pool size: 1
-----------------
ending
pulse queue
pulse queue on ending
I have no idea what the ramifications of that change would be, but thought it was worth mentioning.
How did we run into this?
TypeORM's Postgres driver uses the `connect` API here. During a timeout, `pg-pool` doesn't return any sort of functional release callback (which makes sense, given that the connection "timed out"), so there's no way for TypeORM or other consumers to clean up these leaked connections.
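For context, the consumer side looks roughly like the following (a simplified sketch, not TypeORM's actual driver code): when `connect()` rejects on timeout, the caller never receives a client or a release callback, so the orphaned server-side session is invisible to it.

```js
// Simplified sketch of a consumer acquiring a client from pg-pool
// (illustrative only, not TypeORM's actual code).
let client;
try {
  client = await pool.connect();
  await client.query('select 1');
} catch (err) {
  // On connectionTimeoutMillis this rejects before any client is handed out,
  // so there is nothing here to release() or end(); the half-open
  // server-side connection can only be cleaned up by the pool itself.
} finally {
  if (client) {
    client.release();
  }
}
```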