Multi-server pubsub healthcheck endpoint #1730
Conversation
@@ -0,0 +1,121 @@
import {GristServer} from 'app/server/lib/GristServer';
Could you add a doc-comment to this file that describes how these work, and why it's done this way? In particular, I assume it's a deliberate choice to rely on redis rather than, say, call each server's /status endpoint -- is that because we don't want to assume that servers can communicate with each other directly? That also means that this check doesn't reflect the ability for the servers to communicate with each other, right?
Sure, I'll write something.
How would servers know what other servers exist without Redis?
And since we're already using Redis to communicate between doc workers, it seemed like a natural extension to communicate with all Grist instances.
Does that make sense?
Well... this isn't entirely convincing. But I might not fully understand the goal of this check. Is it about checking whether a multi-server setup works correctly, or is it more narrow than that, like how many home servers or doc-workers are running?
> How would servers know what other servers exist without Redis?
Good question, even without the "Redis" part. Is this check intended to verify that servers know about each other for the purpose of their functionality, or only for the purpose of the health check?
> we're already using Redis to communicate between doc workers
We use Redis to coordinate assignments of docs to workers, and for some other things, like sessions and notifications, but for most traffic between home servers and doc workers, I think we rely on them being able to make HTTP requests to one another.
I am asking these questions to understand the purpose and make sure we are on the same page, not to say this should be part of this healthcheck -- I can imagine that when setting up, there is value to know that, say, 3 doc-workers are running and healthy, even if the networking part of communicating to them isn't yet working. Is that the goal?
One more question: if I restart some servers, do we want this check to tell me how many got successfully restarted? Just knowing how many are healthy doesn't tell me if I have a mix of old and new ones.
The ultimate goal here is to be able to restart all the instances. That will be the next step. The restart is a button on the admin, but currently it only restarts whatever server gets hit with the request to restart.
After restarting all the instances, how does the web browser know when it's time to reload the page and get the newly restarted Grist page?
I decided it should be when all of the servers were restarted and ready to serve requests. That's how I arrived at this design. If we reloaded the page without all servers being restarted, we might be in some ambiguous state where some servers aren't ready yet.
> I can imagine that when setting up, there is value to know that, say, 3 doc-workers are running and healthy, even if the networking part of communicating to them isn't yet working. Is that the goal?

Not explicitly. This is using the `ready` status of an instance, which is a little different from `healthy`. The `healthy` state just means that the `status` endpoint is ready, which happens very early during Grist initialisation. This new endpoint instead checks that all instances that are registered on Redis have hit the `ready` status, which happens at the end of the initialisation of an instance, one of the last things that `MergedServer` does. The `ready` status should mean that networking is also ready. Of course, if you have some networking problem between your servers where they can't even reach each other or Redis, this check won't catch that.
As to the choice of using Redis instead of internal http requests between the Grist instances, it just seemed easier to me to use a shared Redis channel since Redis seemed inevitable for discovery to begin with.
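To make that concrete, here is a rough sketch of what such a pubsub-based "all registered instances are ready" check could look like. This is not the PR's actual `HealthChecker` code: it assumes the ioredis client, and the registry key and channel names are invented purely for illustration.

```typescript
// Sketch only: broadcast a readiness check over Redis pubsub and wait for every
// registered instance to reply, with a timeout. Key/channel names are hypothetical.
import Redis from 'ioredis';

const REGISTRY_KEY = 'grist-instances';      // hypothetical set of registered instance ids
const CHECK_CHANNEL = 'grist-ready-check';   // hypothetical broadcast channel
const REPLY_CHANNEL = 'grist-ready-replies'; // hypothetical reply channel

export async function checkAllServersReady(redisUrl: string, timeoutMs = 10_000): Promise<boolean> {
  const pub = new Redis(redisUrl);
  const sub = new Redis(redisUrl);
  try {
    const expected = new Set(await pub.smembers(REGISTRY_KEY));
    if (expected.size === 0) { return false; }   // nothing registered: treat as not ready
    const seen = new Set<string>();

    const allReplied = new Promise<boolean>((resolve) => {
      sub.on('message', (_channel, message) => {
        const {id, ready} = JSON.parse(message) as {id: string, ready: boolean};
        if (!ready) { return resolve(false); }
        seen.add(id);
        if ([...expected].every(e => seen.has(e))) { resolve(true); }
      });
    });
    await sub.subscribe(REPLY_CHANNEL);

    // Ask every instance to report its readiness, then wait for replies or time out.
    await pub.publish(CHECK_CHANNEL, JSON.stringify({replyTo: REPLY_CHANNEL}));
    const timedOut = new Promise<boolean>((resolve) => setTimeout(() => resolve(false), timeoutMs));
    return await Promise.race([allReplied, timedOut]);
  } finally {
    pub.disconnect();
    sub.disconnect();
  }
}
```

In this sketch, each instance would in turn subscribe to the check channel and publish its own `{id, ready}` reply on the requested reply channel once it has reached the `ready` state.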
If restart is the main goal, then I think the last question from my previous comment might be relevant?
Right, so it seems to me like the problem of an instance spontaneously dying already requires infrastructure not directly shipped with Grist Core. Maybe we can offer advice on how to handle this problem or supply this infrastructure outside of this PR.
What I can add here is an extra endpoint to instruct a server to clean up the registry of instances on Redis.
I think you are right, and this comment comes closest to explaining why I feel this whole approach isn't the right one for checking health of a multi-server installation. The issue is that not one part of this knows anything about which servers are expected to be a part of the installation. That knowledge is presumably in some external infrastructure, but then only with the help of that external infrastructure can you know the health of the installation.
E.g. let's say you start a cluster with 3 home servers and 10 doc workers, and you forgot to configure Redis for doc-workers. Then home servers register with Redis, doc-workers don't, and the healthcheck done this way will report that everything is healthy, when it's not even close. Or let's say everything is up and connected to Redis, and you updated some config and told everything to restart, but half your servers didn't react to your restart request, and are still running with the old config (e.g. wrong activation key, or wrong version, etc). Still, this healthcheck will report healthy. If you messed up and got two home servers running that point to different Redis URLs, whichever one the LB happens to ask will independently report that it is healthy, though the setup is broken.
All this approach does is check whether servers that successfully connected to Redis and registered themselves are still connected. Even for the narrow situation when everything is up and clean and registered in Redis, and there is nothing stale, does this help with restarts? As soon as you tell all servers to restart, they'll unregister themselves before exiting, and the healthcheck will return "all clear" even if they don't start up again, since it only checks servers that have successfully started and registered themselves.
I'm just making up scenarios, but fundamentally, it seems like anyone running an actual multi-server installation must have some external infrastructure to know what the servers are intended to be running. If we try to build tools to manage a cluster (even e.g. turn on "enterprise") without at all being aware of this infrastructure, or knowing which servers the admin intends to be affected, I think we might make the job of the administrator harder rather than easier.
> If we try to build tools to manage a cluster (even e.g. turn on "enterprise") without at all being aware of this infrastructure, or knowing which servers the admin intends to be affected
Tracing the problem back, "turning on enterprise" converted to a "reboot all instances" problem because of when configuration is loaded in the code. One could imagine changing that. But for an admin panel, if we expect in the future to offer ways to change configuration other than the enterprise toggle, the ability to reload the node process everywhere does feel like a good investment, so we don't have to be policing every PR that reads configuration. And being able to display some state about the instances also seems obviously useful.
But it is true that knowing the set of live instances in order to coordinate with them is awkward. Registering/deregistering is brittle. Heartbeats on a channel that e.g. home servers listen to and maintain some state for are an alternative, but would also have some failure modes.
For a configuration change, you could imagine the following process:
- All servers broadcast an `{id: ..., version: ...}` heartbeat from time to time on a `heartbeat` channel.
- Home servers listen to that channel and store the last heartbeat per server, with timestamp and expiry after N minutes, in a `heartbeats` object.
- When a home server wishes to initiate a config change, it broadcasts a `{action: "restart", version: ...}` message on a `chatter` channel everyone listens to. The version code is invented, to represent the config change.
- Servers honor this action request once they see it, and change the version in their heartbeats to match once the config change is incorporated.
- Home servers accumulate heartbeats as normal.
Rather than thinking about health, you can think about whether all servers issuing heartbeats are known to be using the latest config.
Needs fleshing out and has problems, just a different direction I was thinking about. But a reasonable alternative could also be to just back off from supporting an enterprise toggle in multi-server installations (but how do you know they are multi-server? use existence of redis as a hint?) and just give instructions.
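As a very rough sketch of the heartbeat/"chatter" process described above (assuming the ioredis client; the intervals, message shapes, and helper names are all invented):

```typescript
// Sketch of the heartbeat/"chatter" idea: not Grist code, just an illustration.
import Redis from 'ioredis';

interface Heartbeat { id: string; version: string; }

// Every server periodically announces itself and the config version it is running.
export function startHeartbeats(redisUrl: string, id: string, getVersion: () => string) {
  const pub = new Redis(redisUrl);
  const timer = setInterval(() => {
    const beat: Heartbeat = {id, version: getVersion()};
    pub.publish('heartbeat', JSON.stringify(beat)).catch(() => {/* ignore transient errors */});
  }, 15_000);
  return () => { clearInterval(timer); pub.disconnect(); };
}

// Home servers keep the last heartbeat per server, dropping entries older than expiryMs.
export async function watchHeartbeats(redisUrl: string, expiryMs = 5 * 60_000) {
  const sub = new Redis(redisUrl);
  const heartbeats = new Map<string, {beat: Heartbeat, at: number}>();
  sub.on('message', (_channel, message) => {
    const beat = JSON.parse(message) as Heartbeat;
    heartbeats.set(beat.id, {beat, at: Date.now()});
    for (const [id, entry] of heartbeats) {
      if (Date.now() - entry.at > expiryMs) { heartbeats.delete(id); }
    }
  });
  await sub.subscribe('heartbeat');
  return heartbeats;
}

// A home server initiates a config change; everyone listening on "chatter" restarts
// and then reports the new version in its subsequent heartbeats.
export async function requestConfigChange(redisUrl: string, newVersion: string) {
  const pub = new Redis(redisUrl);
  await pub.publish('chatter', JSON.stringify({action: 'restart', version: newVersion}));
  pub.disconnect();
}
```

The config change would then be considered complete once every entry in `heartbeats` reports the new version.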
Yes, it makes a lot more sense if we step away from the "is the cluster healthy" question, and instead ask something like: "are all servers that are connected to Redis on the expected version of config?"
Also, by exposing more information than a boolean (e.g. "list of all servers connected to Redis"), we'd make it possible for an admin to judge health by comparing that list to what they know about how many servers they expect to be up.
Am I right that this is the first foray into adding tools within Grist to make it easier for self-hosters to run multi-server installations? It would be good if it was a start that made further improvements easier. We know there are a number of difficult things about multi-server, including getting servers to be able to talk to each other (configuring `APP_DOC_INTERNAL_URL` and the like) and unregistering crashed doc-workers.

What seems useful is to have all servers (doc-workers and home servers) report their info. Maybe instead of broadcasting a heartbeat on Redis pubsub, they can periodically update their info in Redis: including start time, version of Grist, version of config (whatever that means), internal URL, status bits for things a server knows how to check for in its usual `/status` endpoint (like DB connectivity), and the timestamp of this update. This would be enough information to answer a lot of questions, particularly in combination with other info (like knowledge of expected servers; results of pings to internal URLs; contents of the `workers-available` key holding doc-worker registrations, etc.)
For the restart task, a home server requesting a restart would only have to poll this info for up to a timeout, until it sees that everyone is reporting the expected version of config.
Setting data in Redis has the advantage over broadcasting that a server (or even a separate tool) can do a check even if it hasn't been subscribed to pubsub since the right point in time. E.g. if a home server requesting a restart is itself restarting, or if a separate page load of the Admin Panel is asking a different home server whether the config change has taken effect.
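For concreteness, a minimal sketch of that idea, again assuming the ioredis client; the key prefix, TTL, and exact info fields are invented, and a real implementation would likely prefer SCAN over KEYS:

```typescript
// Sketch: each server periodically writes its own info key with a TTL, so crashed
// servers disappear on their own; anyone can read the current picture at any time.
import Redis from 'ioredis';

interface ServerInfo {
  id: string;
  startedAt: number;      // start time
  gristVersion: string;   // version of Grist
  configVersion: string;  // version of config, whatever that ends up meaning
  internalUrl: string;
  statusOk: boolean;      // status bits a server knows how to check (e.g. DB connectivity)
  updatedAt: number;      // timestamp of this update
}

const PREFIX = 'grist-server-info:';  // hypothetical key prefix

export async function publishServerInfo(redis: Redis, info: ServerInfo, ttlSec = 60) {
  await redis.set(PREFIX + info.id, JSON.stringify({...info, updatedAt: Date.now()}), 'EX', ttlSec);
}

// Readable without having been subscribed to pubsub at the right point in time.
export async function listServerInfo(redis: Redis): Promise<ServerInfo[]> {
  const keys = await redis.keys(PREFIX + '*');
  if (keys.length === 0) { return []; }
  const values = await redis.mget(...keys);
  return values.filter((v): v is string => v !== null).map(v => JSON.parse(v) as ServerInfo);
}

// For the restart task: poll until every listed server reports the expected config
// version, or give up after a timeout.
export async function waitForConfigVersion(redis: Redis, expectedVersion: string, timeoutMs = 60_000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const infos = await listServerInfo(redis);
    if (infos.length > 0 && infos.every(info => info.configVersion === expectedVersion)) { return true; }
    await new Promise(resolve => setTimeout(resolve, 1_000));
  }
  return false;
}
```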
> Maybe instead of broadcasting a heartbeat on Redis pubsub, they can periodically update their info in Redis
That makes sense!
> Am I right that this is the first foray into adding tools within Grist to make it easier for self-hosters to run multi-server installations? It would be good if it was a start that made further improvements easier.
Exactly, that was my hope. Until now, multi-server installations have been ignored.
It will be convenient to allow this to be set outside of the Flex Server, so let's allow it.
There is already logic in here for defining a client or not, depending on whether a Redis connection is available, so let's reuse it and expose the publishing client, which is available.
This class can verify that every Grist instance responds as "ready" via Redis pubsub.
This new option uses the `HealthChecker` class from before.
This works across all instances instead of just reporting if the instance serving the request is ready.
This connection forwarder, specifically tailored for Redis, will be useful for other multi-server Redis-related tests, so let's factor it out.
So, should I throw out this PR and start over with a register-and-poll restart implementation?
Context

In order to properly restart multi-instance Grist installs, I need to know when all of the instances (home servers, doc workers) are up and ready, and restarted with the new Grist parameters.

This adds a new option called `allServersReady` to the `/status` health check endpoint, to make sure that all the servers are up and ready to serve requests. In the absence of Redis, the same endpoint also works with the in-memory dummy pubsub provided by the `PubSubManager`.

Proposed solution

This adds to `FlexServer` a new option handled by a `HealthChecker` class that uses Redis pubsub to broadcast and handle requests to check each instance's `ready` status. Each server registers and de-registers itself on Redis during startup and shutdown, respectively. During a request to check whether all servers are okay, the server handling the request from the web browser reports back once all registered servers respond one way or another, with a default timeout of 10 seconds.

This also adds a new optional `GRIST_INSTANCE_ID` parameter, which defaults to a combination of the instance's host and port. This ID is used to identify registered instances, as well as to supply the listening channels unique to each instance for pubsub.
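As a hypothetical usage example from a deployment script: the exact query-parameter syntax and the "non-200 means not ready" behaviour are assumptions based on the description above, not necessarily the final API.

```typescript
// Ask any home server whether the whole cluster is ready, retrying until a timeout.
export async function waitForCluster(baseUrl: string, timeoutMs = 60_000): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const resp = await fetch(`${baseUrl}/status?ready=1&allServersReady=1`);
      if (resp.ok) { return true; }   // assumed: 200 only when all instances report ready
    } catch {
      // server not reachable yet; keep polling
    }
    await new Promise(resolve => setTimeout(resolve, 2_000));
  }
  return false;
}
```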
Related issues

Follow-up to #1697
Has this been tested?