Discuss whether to make install scripts support most of Mender's install states

Description

They need some kind of notification that an update is in progress, so that their logic can adapt, for example by not disabling wifi while an update is downloading. They didn't say it, but I suspect a number of different states need to be reported here.

Debian packages have a similar concept, where a package's scripts are called with a state parameter that describes which stage of the install the package manager is currently in. Pre- and postinstall are two of these stages, but there are many more. I think this generalization makes a lot of sense for Mender as well.

Acceptance criteria for this task:

Figure out if such a state based scripting mechanism makes sense for Mender.
- Example: When entering download, call this script: ./install-script download-state
- Then, when we are about to reboot, call this script: ./install-script reboot-state
Turn our current Mender states into some sort of stable API.
- We can't change this once we publish it.
- It doesn't have to be a 1-to-1 relationship, we can have more internal states than we publish in the API.
The proposal should support states that need to be confirmed, such as an ok-to-reboot state, where the script, if it exists, needs to approve the reboot. Obviously there needs to be a way to query this script repeatedly in an efficient manner. This also relates to possible DBUS support, but that may be outside the scope of this ticket.

Current proposal

Architecture

State hooks

Each state has an enter, a leave, and an error hook.
All hooks have a configurable timeout; if a hook exceeds it, the update will fail.
The order of hooks for each state is:
- Leave old state
- If an error happens while executing the action inside a given state, the error hook is called (this is also true for errors that happen while executing the enter and leave scripts)
- Enter new state
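
The hook order above can be sketched roughly as follows. This is only an illustration, not the client implementation: `run_hook` is a hypothetical callback, and the choice of which state's error hook handles a failed enter is an assumption, since the proposal does not spell it out.

```python
from typing import Callable, List

# Hypothetical callback: runs every script for the given state/event pair and
# returns True on success, False on failure or timeout.
HookRunner = Callable[[str, str], bool]


def transition(old_state: str, new_state: str, run_hook: HookRunner) -> List[str]:
    """Run Leave for the old state, then Enter for the new one.

    Returns the hooks that were invoked, in order. If a hook fails, the Error
    hook of the state that hook belongs to is run and the transition stops.
    (Assumption: the proposal does not say which state's Error hook is used
    when Enter of the new state fails.)
    """
    invoked: List[str] = []
    for state, event in ((old_state, "Leave"), (new_state, "Enter")):
        invoked.append(f"{state}_{event}")
        if not run_hook(state, event):
            invoked.append(f"{state}_Error")
            run_hook(state, "Error")  # result ignored in this sketch
            break
    return invoked
```

For example, `transition("Sync", "Download", run_hook)` with a `run_hook` that always succeeds invokes `Sync_Leave` and then `Download_Enter`.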

The states

Like an API, not necessarily 1-to-1 mapping with Mender client implementation
Deliberately avoiding the word "update" since we decided it was ambiguous. Will stick to "artifact"
States:
- Idle
- Sync
  - Encompasses all bootstrapping, inventory updates, checking for deployment, etc.
- Download
- ArtifactInstall
  - For rootfs this one will just be flipping partition, since content writing is done during download
- ArtifactReboot
  - Long term this is an optional state (package updates usually don't need a reboot)
  - This is ONLY called while rebooting device after installing an update; won't be called during any other reboot attempt
- ArtifactCommit
  - This state will serve as the verify step for install scripts
- ArtifactRollback
  - For rootfs, flips back partition
- ArtifactRollbackReboot
  - Separate from the reboot state because reboot state needs approval, whereas rollback should not
- ArtifactFailure
  - Called if any artifact-related action (download, install, commit, ...) fails

Order that states will usually be executed in:

Idle
Sync
Download
ArtifactInstall
ArtifactReboot
ArtifactCommit

Remaining states are failure states and can be entered almost anywhere

Implementation

Script locations

Because not all states are necessarily dealing with an artifact, there are various locations where scripts will be grabbed from, and this needs to be well defined

State \ script location   Rootfs-hosted   Artifact-hosted^[2]
Idle                      X
Sync                      X
Download                  X
ArtifactInstall                           X
ArtifactReboot                            X
ArtifactCommit                            X
ArtifactRollback                          X
ArtifactRollbackReboot                    X
ArtifactFailure                           X

The two groups are located in different filesystem locations, and should be kept in:

Rootfs-hosted: /etc/mender/scripts
Artifact-hosted: /var/lib/mender/scripts
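
As a rough sketch of the table and the two directories above, a lookup from state to script directory might look like this. The function and constant names are made up for illustration; only the state names and the two paths come from the proposal.

```python
ROOTFS_SCRIPT_DIR = "/etc/mender/scripts"
ARTIFACT_SCRIPT_DIR = "/var/lib/mender/scripts"

# Grouping taken from the state/location table in this proposal.
ROOTFS_HOSTED_STATES = frozenset({"Idle", "Sync", "Download"})
ARTIFACT_HOSTED_STATES = frozenset({
    "ArtifactInstall",
    "ArtifactReboot",
    "ArtifactCommit",
    "ArtifactRollback",
    "ArtifactRollbackReboot",
    "ArtifactFailure",
})


def script_dir(state: str) -> str:
    """Return the directory that scripts for the given state are read from."""
    if state in ROOTFS_HOSTED_STATES:
        return ROOTFS_SCRIPT_DIR
    if state in ARTIFACT_HOSTED_STATES:
        return ARTIFACT_SCRIPT_DIR
    raise ValueError(f"unknown state: {state}")
```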

Along with the scripts, the version of the artifact should be stored, and the Mender client should check this version before attempting to run any script. This is to avoid the situation where you change the version of the agent and the new one doesn't understand the script semantics of the old one. IOW, if you use any scripts at all, then both the agent you upgrade from, and the one you upgrade to, have to understand the version of the artifact you're using to upgrade.
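
A minimal sketch of that version check, under loud assumptions: the proposal does not define the on-disk format, so this assumes a file named `version` next to the scripts, containing JSON with an integer `version` field, and a hypothetical set of versions the running agent understands.

```python
import json
from pathlib import Path

# Hypothetical: the artifact versions this agent knows how to handle.
SUPPORTED_ARTIFACT_VERSIONS = frozenset({2})


def scripts_are_compatible(scripts_dir: str) -> bool:
    """Return True only if the stored artifact version is one we understand.

    Assumes a JSON file called "version" with a "version" field; a missing,
    unreadable, or malformed file is treated as incompatible.
    """
    version_file = Path(scripts_dir) / "version"
    try:
        data = json.loads(version_file.read_text())
        return data.get("version") in SUPPORTED_ARTIFACT_VERSIONS
    except (OSError, ValueError, AttributeError, TypeError):
        return False
```

Treating every failure mode as "incompatible" errs on the side of not running scripts whose semantics the agent cannot verify.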

Script running

All scripts will be run without arguments.

All scripts must run under a timeout, which could potentially be different for different types of scripts (TBD).

In addition, each script invocation must be remembered in the local database, so that it can be repeated if the node reboots spontaneously. Probably the same should be the case for timeouts, so that reboots do not prolong timeouts indefinitely.
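
To illustrate the point about timeouts surviving reboots, here is a small sketch. The `db` dict stands in for the local database (a real one would be persistent), and the function name and the idea of keying on the script name are assumptions made for the example.

```python
from typing import Dict


def remaining_timeout(db: Dict[str, float], script: str,
                      timeout_s: float, now: float) -> float:
    """Return how many seconds the script still has before it times out.

    The first invocation records its start time in db. A repeated invocation
    (for example after a spontaneous reboot) reuses the recorded start time,
    so the reboot does not reset the clock.
    """
    started_at = db.setdefault(script, now)
    return max(0.0, timeout_s - (now - started_at))
```

With a 60 second timeout, a script first started at t=1000 and re-run after a reboot at t=1045 would have 15 seconds left rather than a fresh 60.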

Script form

Each state/event pair can have a number of scripts, indexed by two digits that follow the state and event names. Each script can also have an optional suffix for identification: an underscore followed by an arbitrary string. Example:

Idle_Enter_00
Idle_Enter_01
Idle_Leave_00
Idle_Leave_01
Idle_Error_00
Idle_Error_01
Download_Enter_05_wifi-driver
Download_Enter_10_ask-user
Download_Leave_98_wifi-driver
Download_Leave_99
Download_Error_98_wifi-driver
Download_Error_99
ArtifactInstall_Enter_00
ArtifactInstall_Enter_01
...and so on...

The ordering within each pair does not influence other pairs, IOW, when executing Download_Enter scripts, only digit orderings within that group are considered. All scripts are executed in ascending order.

No two scripts in the same state/event category can have the same two digits, even if the optional name suffix is different (to prevent ambiguous execution order).
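
The naming and ordering rules could be checked with something like the sketch below. It assumes the underscore-separated form used in the example list (State_Event_NN with an optional _name suffix); the regular expression and function name are illustrative, not part of the proposal.

```python
import re
from typing import Dict, Iterable, List

# State_Event_NN with an optional "_name" suffix, as in the examples above.
_SCRIPT_RE = re.compile(r"^([A-Za-z]+)_(Enter|Leave|Error)_(\d{2})(?:_(.+))?$")


def ordered_scripts(names: Iterable[str], state: str, event: str) -> List[str]:
    """Return the scripts for one state/event pair in ascending digit order.

    Names belonging to other pairs, or not matching the pattern, are ignored.
    Raises ValueError if two scripts in the pair share the same two digits.
    """
    by_index: Dict[int, str] = {}
    for name in names:
        match = _SCRIPT_RE.match(name)
        if match is None or (match.group(1), match.group(2)) != (state, event):
            continue
        index = int(match.group(3))
        if index in by_index:
            raise ValueError(
                f"{name} and {by_index[index]} share index {index:02d}"
            )
        by_index[index] = name
    return [by_index[i] for i in sorted(by_index)]
```

Only the digits within the requested pair are compared, so a `Download_Leave_99` script has no effect on the order of the `Download_Enter` scripts.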

See also the changes to the format documentation.

Idle and Sync scripts will reside under /etc/mender/scripts, and are expected to be included in the update itself.

The rest will reside inside the artifact according to the artifact documentation. When the artifact is parsed/prepared, all scripts should be simultaneously extracted to /var/lib/mender/scripts. Alongside them, the version file from the top level of the artifact should also be extracted. On every script execution, the client will check that this version is one that the Mender agent understands.

Footnotes:

^[2] If inside the artifact, the script must be extracted to and run from the /data partition. The rootfs partition cannot be used because we need to support read-only rootfs.

Affects versions

None

Environment

None

Linked issues

relates to

MEN-660

Client state and artifact install scripts - i2

MEN-1193

Client state and artifact install scripts - i1

Activity

Nick Anderson April 25, 2017 at 6:55 PM

:LOGBOOK:
CLOCK: [2017-04-25 Tue 13:40]--[2017-04-25 Tue 13:55] => 0:15
:END:
Hi from spacemacs.

eystein.maloy.stenberg April 25, 2017 at 6:17 PM

Now that this is closed, I think we should continue this discussion in the epic https://northerntech.atlassian.net/browse/MEN-660#icft=MEN-660.

Kristian Amlie April 25, 2017 at 1:10 PM

> With that in mind we could start with the waiting approach, then potentially implement support for retrying later. In fact, a retry type script can be implemented with the waiting script to start with I think. So for now return codes could be 0 = success (proceed), 1 = fail (stop & report), then we could add 255 = intermittent-fail (retry later a certain amount of times).

Actually, @Marcin Pasinski and I realized that we have to have this already in the first iteration. The reason is that it's impossible to distinguish between a script that has hung and a script that is waiting for approval without getting explicit feedback, and you don't want to wait three days for a hung script.

So how we currently envision it is by having approval scripts return a 'retry-later' type of return code and then do just that. The timeout for a running script however, will be much shorter, maybe as short as just a few minutes, so that we can catch bugs in the scripts that cause them to hang.

Kristian Amlie April 25, 2017 at 10:15 AM

> It might be that you have a great handle on this, on my end it feels a bit rushed and that some considerations or side-effects might be overlooked.

No, I feel exactly the same!

I'm finishing the epic as much as I can, but I expect there to be tweaks after the fact.

> With that in mind we could start with the waiting approach, then potentially implement support for retrying later. In fact, a retry type script can be implemented with the waiting script to start with I think. So for now return codes could be 0 = success (proceed), 1 = fail (stop & report), then we could add 255 = intermittent-fail (retry later a certain amount of times).

Agreed. We've also baked versioning into it, so that the client will detect if scripts from an incompatible version are used. This should help when making incompatible changes in the future.

> Will there just be one script per state? So we only allow one postinstall script for example? Or did I misunderstand this? I think we should have a directory of scripts and they would run in numerical/lexo order (some predictable sorting that makes sense).

I mostly made this call based on my experience with packaging, and that multiple scripts for each stage tend to become messy. But I didn't really think about concrete use cases, and I suppose one could argue that getting user approval vs checking some wireless driver are fairly distinct things that should be in different scripts.

But in the spirit of the earlier comment about keeping things minimal, and given that we version the scripts, maybe we should stick to just one script per state/event for now?

eystein.maloy.stenberg April 25, 2017 at 12:25 AM

It might be that you have a great handle on this, on my end it feels a bit rushed and that some considerations or side-effects might be overlooked. That said, it might be best to get going and rather adjust over time (but before a supported release). This is a classic huge and complex piece with too many requirements, and it is typically best to start with something simple (not many config options or too much code logic).

With that in mind we could start with the waiting approach, then potentially implement support for retrying later. In fact, a retry type script can be implemented with the waiting script to start with I think. So for now return codes could be 0 = success (proceed), 1 = fail (stop & report), then we could add 255 = intermittent-fail (retry later a certain amount of times).

Will there just be one script per state? So we only allow one postinstall script for example? Or did I misunderstand this? I think we should have a directory of scripts and they would run in numerical/lexo order (some predictable sorting that makes sense).

Fixed

Details
Assignee
Kristian Amlie
Reporter
Kristian Amlie
Labels
Client
Story Points
5
Priority
Medium
Parent
MEN-1193 Client state and artifact install scripts - i1
Backlog
yes

Created April 11, 2017 at 2:44 PM

Updated June 25, 2024 at 11:55 AM

Resolved April 25, 2017 at 1:45 PM
