Failover Mender server
Description
Activity

Kristian Amlie, August 13, 2018 at 8:30 AM
This implementation is done, and can be found in this PR (branch). Since there is an associated integration test task (MEN-2043), the implementation will live on that feature branch until the test is ready. However, this task can be closed, since the work here is done and all the existing tests have passed. So unless you have anything more to add, you can close this!

Kristian Amlie, August 13, 2018 at 8:18 AM
: I created .

Kristian Amlie, August 7, 2018 at 8:52 AM
Write integration tests for testing failover mechanism
You're right, let's treat it as a separate task.
There is no task for this, right? Shall I create it?

Kristian Amlie, June 29, 2018 at 10:23 AM
Add and parse a configuration for two (multiple) servers in the mender.conf
This is literally one line. I don't think it deserves its own task. Maybe a better task is to add support for this option in Yocto?
Add a retry mechanism before switching over to the recovery server
I think this is not necessary, and in fact I think it's better not to. With this feature, if the primary server is down, you guarantee that every device will wait out the entire retry period (all retries plus exponential backoff, which might be hours, even days) before connecting to the backup server. I think it's much better to attempt the backup server immediately, and only then go to the retry state, like we do now. This would ensure a smooth and quick migration.
Write integration tests for testing failover mechanism
You're right, let's treat it as a separate task.
what if we have a hosted mender server as the first one and the on-premise as the other one; we will need a way to tell the Mender client that the tenant token belongs to the specific server (maybe we will need some nested JSON)
Ah good point. Although it's unlikely that many customers will migrate from Hosted Mender to a different multi-tenancy server, it's not impossible. I agree, nested JSON may be a good way to solve it.
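As a rough illustration of the nested JSON idea, the sketch below parses a server list in which each entry may carry its own tenant token. The key names (`Servers`, `ServerURL`, `TenantToken`), the Go type names, and the example values are assumptions made for this sketch, not a settled mender.conf format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serverEntry is a hypothetical per-server configuration entry.
// TenantToken is optional, so an on-premise server can omit it.
type serverEntry struct {
	ServerURL   string `json:"ServerURL"`
	TenantToken string `json:"TenantToken,omitempty"`
}

// clientConfig holds only the part of the configuration relevant here.
type clientConfig struct {
	Servers []serverEntry `json:"Servers"`
}

// parseConfig decodes the JSON and rejects a configuration with no servers.
func parseConfig(data []byte) (clientConfig, error) {
	var cfg clientConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return clientConfig{}, err
	}
	if len(cfg.Servers) == 0 {
		return clientConfig{}, fmt.Errorf("config lists no servers")
	}
	return cfg, nil
}

// exampleConf uses placeholder URLs and a placeholder token.
const exampleConf = `{
  "Servers": [
    {"ServerURL": "https://hosted.example.com", "TenantToken": "example-token"},
    {"ServerURL": "https://onprem.example.com"}
  ]
}`

func main() {
	cfg, err := parseConfig([]byte(exampleConf))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range cfg.Servers {
		fmt.Printf("%s (tenant token set: %t)\n", s.ServerURL, s.TenantToken != "")
	}
}
```

Keeping the token inside each entry means the client never has to guess which server a token belongs to when it fails over.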
how to handle custom or multiple certificates if we have some; we will need to verify the TLS connection against all to see if there is at least one matching (this can be a separate task even)
I think we get this for free due to OpenSSL's ability to read several concatenated certificates from one file. The user just needs to add all relevant certificates to one file.
how many times we should retry (if any) in case we are getting some error code from the primary server; here probably we can handle different return codes differently (in some cases we have 409 Conflict indicating that the deployment was maybe canceled, or the server does not know about it - once we switched to the recovery one; this will be quite a tricky case to solve)
retrying itself might be a separate task
As I wrote above, I don't think explicit retrying between different server attempts is a good idea. But nevertheless we should go through all return codes, and maybe each of the client facing APIs, just to make sure. My gut feeling is that we can stick to all 4xx and 5xx status codes, but best to make sure.

Marcin Pasinski, June 29, 2018 at 9:18 AM (edited)
I think we can indeed start with the unit tests, which will be pretty simple, but in the long term a failover integration test would be nice. I am speaking from experience with the CFEngine HA feature: we still test that one manually, and it is a pain in the neck. Here the concept is simple, but there are a lot of paths we will need to cover, and it will only get worse over time as we add more features.
Next, maybe we can split this one into multiple tasks:
Add and parse a configuration for two (multiple) servers in the mender.conf
Handle communication with the failover server when the main is not responding/returning an error code
Add a retry mechanism before switching over to the recovery server
Write integration tests for testing failover mechanism
There are also a few details we should discuss:
what if we have a hosted mender server as the first one and the on-premise as the other one; we will need a way to tell the Mender client that the tenant token belongs to the specific server (maybe we will need some nested JSON)
how to handle custom or multiple certificates if we have some; we will need to verify the TLS connection against all to see if there is at least one matching (this can be a separate task even)
how many times we should retry (if any) in case we are getting some error code from the primary server; here probably we can handle different return codes differently (in some cases we have 409 Conflict indicating that the deployment was maybe canceled, or the server does not know about it - once we switched to the recovery one; this will be quite a tricky case to solve)
retrying itself might be a separate task
In order to migrate between Mender servers (see parent Epic), we need support in the client for a failover server (e.g. in mender.conf), that will be tried if the primary server fails to respond.
Acceptance criteria
It is possible to specify more than one server in mender.conf
If an operation (in particular reporting update status) fails on the first server, the next server in the list is tried. Failover triggers on timeouts, TLS errors, and 4XX and 5XX HTTP status codes.