Mender client not rolling back when there is no internet connection

Description

Mender client 2.0.0 does not roll back after an update when there is no internet connection.

The configuration used is:

$ cat /etc/mender/mender.conf
{
  "ClientProtocol": "https",
  "RootfsPartA": "/dev/mmcblk0p2",
  "RootfsPartB": "/dev/mmcblk0p3",
  "UpdatePollIntervalSeconds": 1800,
  "InventoryPollIntervalSeconds": 28800,
  "RetryPollIntervalSeconds": 120,
  "ServerURL": "https://mender-development.XXXXX.com:YYYY",
  "ArtifactVerifyKey": "/etc/mender/mender-artifact-verify-key.pem"
}

The Mender client log can be found in the attached file.

Affects versions

Environment

None

Attachments (2)
  • 05 Aug 2019, 01:15 PM
  • 30 Jul 2019, 09:11 AM

Activity

Ole Petter Orhagen, September 12, 2019 at 11:38 AM

eystein.maloy.stenberg, August 5, 2019 at 5:43 PM

Will do, thanks!

Kristian Amlie, August 5, 2019 at 1:34 PM

Can we backlog this one too? This one is pretty easy to hit, and I think it is more serious than https://northerntech.atlassian.net/browse/MEN-2643#icft=MEN-2643. Luckily it's not hard to fix if we just tweak the retry count in the algorithm.

Kristian Amlie, August 5, 2019 at 1:30 PM

An easy fix for this is to change the retry count formula to something that is bounded and quite low (I haven't looked at the exact numbers, but I'd say definitely not more than 10). This should perhaps be considered for usability reasons as well, since the current formula is a bit suboptimal: it will always retry for at least the entire UpdatePollIntervalSeconds period, which can be quite long.

For existing users, a workaround for this problem is to ensure that the UpdatePollIntervalSeconds / RetryPollIntervalSeconds formula doesn't produce a high number.
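
As a rough sketch, a bounded version of the formula could look something like the Go snippet below. The function and constant names are made up for illustration, and the cap of 10 is just the number suggested above; this is not the actual client code.

package main

import "fmt"

// Illustrative cap on commit retries; the real value would be chosen by the
// Mender team (the suggestion above is "definitely not more than 10").
const maxCommitRetries = 10

// boundedCommitRetries is a hypothetical helper, not the real client code.
// It keeps the existing UpdatePollIntervalSeconds / RetryPollIntervalSeconds
// formula but clamps the result so the retry count can never exceed the
// client's state-transition limit.
func boundedCommitRetries(updatePollIntervalSeconds, retryPollIntervalSeconds int) int {
	if retryPollIntervalSeconds <= 0 {
		return 1
	}
	retries := updatePollIntervalSeconds / retryPollIntervalSeconds
	if retries < 1 {
		retries = 1
	}
	if retries > maxCommitRetries {
		retries = maxCommitRetries
	}
	return retries
}

func main() {
	// With the reporter's configuration: 1800 / 120 = 15, clamped to 10.
	fmt.Println(boundedCommitRetries(1800, 120))
}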

Kristian Amlie, August 5, 2019 at 1:26 PM

I managed to reproduce this by starting an update, and then killing the backend components while the client was restarting. The settings I used were:

RetryPollIntervalSeconds = 10
UpdatePollIntervalSeconds = 300

I have uploaded the log.txt file.

The interesting line is the one that says "State transition loop detected in state update-commit: Forcefully aborting update. The system is likely to be in an inconsistent state after this". This is a protection mechanism meant to guard against update modules that get stuck in a loop between the ArtifactRollbackReboot and ArtifactVerifyRollbackReboot states, which would otherwise leave the client hanging forever.

What happens here, however, is that the number of retries is calculated to be quite high, because the formula is UpdatePollIntervalSeconds / RetryPollIntervalSeconds, and this number of retries exceeds the limit the client has set on the number of state transitions. At that point the client assumes that the update module (rootfs updates are also update modules in this context) is broken, and instead of rolling back, which is the update module's responsibility, it aborts the whole update, including the rollback. After this there is no more update to perform, and hence the client is stuck in the final report procedure forever.
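
For reference, with the reproduction settings above this formula gives 300 / 10 = 30 retry attempts, and the configuration in the description gives 1800 / 120 = 15, which was evidently also enough to exceed the state transition limit.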

Due to the bootloader integration, the client should still recover if the device is rebooted, but this requires manual intervention by a user.

Details

Resolution: Fixed

Days in progress: 0
Backlog: yes

Zendesk Support

Created July 30, 2019 at 9:12 AM
Updated June 25, 2024 at 12:02 PM
Resolved September 17, 2019 at 6:59 AM
