Discuss whether to make install scripts support most of Mender's install states
Activity

Nick Anderson April 25, 2017 at 6:55 PM (edited)
:LOGBOOK:
CLOCK: [2017-04-25 Tue 13:40]--[2017-04-25 Tue 13:55] => 0:15
:END:
Hi from spacemacs.

eystein.maloy.stenberg April 25, 2017 at 6:17 PM
Now that this is closed, I think we should continue this discussion in the epic https://northerntech.atlassian.net/browse/MEN-660#icft=MEN-660.

Kristian Amlie April 25, 2017 at 1:10 PM
> With that in mind we could start with the waiting approach, then potentially implement support for retrying later. In fact, a retry type script can be implemented with the waiting script to start with I think. So for now return codes could be 0 = success (proceed), 1 = fail (stop & report), then we could add 255 = intermittent-fail (retry later a certain amount of times).
Actually, @Marcin Pasinski and I realized that we have to have this already in the first iteration. The reason is that it's impossible to distinguish between a script that has hung and a script that is waiting for approval without the script giving explicit feedback, and you don't want to wait three days for a hung script.
So how we currently envision it is by having approval scripts return a 'retry-later' type of return code and then do just that. The timeout for a running script however, will be much shorter, maybe as short as just a few minutes, so that we can catch bugs in the scripts that cause them to hang.

Kristian Amlie April 25, 2017 at 10:15 AM
> It might be that you have a great handle on this, on my end it feels a bit rushed and that some considerations or side-effects might be overlooked.
No, I feel exactly the same!
I'm finishing the epic as much as I can, but I expect there to be tweaks after the fact.
> With that in mind we could start with the waiting approach, then potentially implement support for retrying later. In fact, a retry type script can be implemented with the waiting script to start with I think. So for now return codes could be 0 = success (proceed), 1 = fail (stop & report), then we could add 255 = intermittent-fail (retry later a certain amount of times).
Agreed. We've also baked versioning into it, so that the client will detect if scripts from an incompatible version are used. This should help when making incompatible changes in the future.
> Will there just be one script per state? So we only allow one postinstall script for example? Or did I misunderstand this? I think we should have a directory of scripts and they would run in numerical/lexo order (some predictable sorting that makes sense).
I mostly made this call based on my experience with packaging, where multiple scripts for each stage tend to become messy. But I didn't really think about concrete use cases, and I suppose one could argue that getting user approval vs checking some wireless driver are fairly distinct things that should be in different scripts.
But in the spirit of the earlier comment about keeping things minimal, and given that we version the scripts, maybe we should stick to just one script per state/event for now?

eystein.maloy.stenberg April 25, 2017 at 12:25 AM
It might be that you have a great handle on this, on my end it feels a bit rushed and that some considerations or side-effects might be overlooked. That said, it might be best to get going and rather adjust over time (but before a supported release). This is a classic huge and complex piece with too many requirements, and it is typically best to start with something simple (not many config options or too much code logic).
With that in mind we could start with the waiting approach, then potentially implement support for retrying later. In fact, a retry type script can be implemented with the waiting script to start with I think. So for now return codes could be 0 = success (proceed), 1 = fail (stop & report), then we could add 255 = intermittent-fail (retry later a certain amount of times).
Will there just be one script per state? So we only allow one postinstall script for example? Or did I misunderstand this? I think we should have a directory of scripts and they would run in numerical/lexo order (some predictable sorting that makes sense).
Details
Assignee: Kristian Amlie
Reporter: Kristian Amlie
Story Points: 5
Priority: Medium
Backlog: yes

From customer in this document:
Debian packages have a similar concept where its scripts are called with a state parameter that describes which stage in the install the package manager is currently doing. Pre- and postinstall are two of these stages, but there are many more. I think this generalization makes a lot of sense for Mender as well.
Acceptance criteria for this task:
Figure out if such a state based scripting mechanism makes sense for Mender.
Example: When entering download, call this script:
./install-script download-state
Then, when we are about to reboot, call this script:
./install-script reboot-state
Turn our current Mender states into some sort of stable API.
We can't change this once we publish it.
It doesn't have to be a 1-to-1 relationship, we can have more internal states than we publish in the API.
The proposal should support states that need to be confirmed, such as an ok-to-reboot state, where the script, if it exists, needs to approve the reboot. Obviously there needs to be a way to query this script repeatedly in an efficient manner. This also relates to possible DBUS support, but that may be outside the scope of this ticket.
Current proposal
Architecture
State hooks
Each state has an enter, a leave, and an error hook
All hooks have a configurable timeout; if a hook exceeds it, the update will fail.
The order of hooks for each state is:
Leave old state
If an error happens while executing the action inside a given state, the error hook is called (this also applies to errors that happen while executing the enter and leave scripts)
Enter new state
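The hook ordering above can be sketched roughly as follows. This is only an illustration, not the Mender client's implementation: it assumes that running a single state means Enter hook, then the state's action, then Leave hook, and the names run_state, run_hook and action are hypothetical. What happens if the error hook itself fails is not specified in the proposal, so the sketch simply lets that exception propagate.

```python
from typing import Callable


def run_state(
    state: str,
    action: Callable[[], None],
    run_hook: Callable[[str, str], None],
) -> bool:
    """Run one state with its hooks; return True if everything succeeded.

    run_hook(state, event) is a hypothetical callback that executes the
    scripts for the given state/event pair and raises on failure.
    """
    try:
        run_hook(state, "Enter")
        action()
        run_hook(state, "Leave")
        return True
    except Exception:
        # Errors in the action, the Enter hook, or the Leave hook all
        # end up in the Error hook of the same state.
        run_hook(state, "Error")
        return False


if __name__ == "__main__":
    calls = []
    ok = run_state(
        "Download",
        action=lambda: calls.append("action"),
        run_hook=lambda s, e: calls.append(f"{s}_{e}"),
    )
    print(ok, calls)
```

A caller would move on to the next state when run_state returns True, and to one of the failure states otherwise.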
The states
Like an API, not necessarily 1-to-1 mapping with Mender client implementation
Deliberately avoiding the word "update" since we decided it was ambiguous. Will stick to "artifact"
States:
Idle
Sync
Download
Encompasses all bootstrapping, inventory updates, checking for deployment, etc.
ArtifactInstall
For rootfs this one will just be flipping partition, since content writing is done during download
ArtifactReboot
Long term this is an optional state (package updates usually don't need a reboot)
This is ONLY called while rebooting device after installing an update; won't be called during any other reboot attempt
ArtifactCommit
This state will serve as the verify step for install scripts
ArtifactRollback
For rootfs, flips back partition
ArtifactRollbackReboot
Separate from the reboot state because reboot state needs approval, whereas rollback should not
ArtifactFailure
Called if any of the artifact-related actions (download, install, commit, ...) fails
Order that states will usually be executed in:
Idle
Sync
Download
ArtifactInstall
ArtifactReboot
ArtifactCommit
Remaining states are failure states and can be entered almost anywhere
Implementation
Script locations
Because not all states are necessarily dealing with an artifact, there are various locations where scripts will be grabbed from, and this needs to be well defined
State \ script location   Rootfs-hosted   Artifact-hosted[2]
Idle                      X
Sync                      X
Download                                  X
ArtifactInstall                           X
ArtifactReboot                            X
ArtifactCommit                            X
ArtifactRollback                          X
ArtifactRollbackReboot                    X
ArtifactFailure                           X
The two groups are located in different filesystem locations, and should be kept in:
Rootfs-hosted:
/etc/mender/scripts
Artifact-hosted:
/var/lib/mender/scripts
Along with the scripts, the version of the artifact should be stored, and the Mender client should check this version before attempting to run any script. This is to avoid the situation where you change the version of the agent and the new one doesn't understand the script semantics of the old one. IOW, if you use any scripts at all, then both the agent you upgrade from, and the one you upgrade to, have to understand the version of the artifact you're using to upgrade.
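As a rough sketch of the lookup and the version check described above: the two directories come from this proposal, and Idle and Sync are treated as the rootfs-hosted states. Everything else is an assumption made for illustration, in particular the function names, the SUPPORTED_VERSIONS set, and the idea that the stored version file holds a single integer (the real file format is defined by the artifact documentation, not here).

```python
from pathlib import Path

ROOTFS_SCRIPT_DIR = Path("/etc/mender/scripts")
ARTIFACT_SCRIPT_DIR = Path("/var/lib/mender/scripts")

# Per this proposal, only Idle and Sync scripts live on the rootfs.
ROOTFS_HOSTED_STATES = frozenset({"Idle", "Sync"})

# Hypothetical: versions whose script semantics this client understands.
SUPPORTED_VERSIONS = frozenset({1})


def script_dir(state: str) -> Path:
    """Return the directory that scripts for the given state are read from."""
    if state in ROOTFS_HOSTED_STATES:
        return ROOTFS_SCRIPT_DIR
    return ARTIFACT_SCRIPT_DIR


def version_supported(version_file: Path, supported=SUPPORTED_VERSIONS) -> bool:
    """Check the stored version before running any script.

    Assumes (hypothetically) that the file contains one integer.
    A missing or unparsable file counts as unsupported.
    """
    try:
        return int(version_file.read_text().strip()) in supported
    except (OSError, ValueError):
        return False
```

Treating a missing or unreadable version file as unsupported is a conservative choice for the sketch; the proposal does not say how that case should be handled.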
Script running
All scripts will be run without arguments.
All scripts must run under a timeout, which could potentially be different for different types of scripts (TBD).
In addition, each script invocation must be remembered in the local database, so that it can be repeated if the node reboots spontaneously. Probably the same should be the case for timeouts, so that reboots do not prolong timeouts indefinitely.
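A minimal sketch of running one script without arguments and under a timeout could look like this. The exit-code mapping (0 = success, 255 = retry later, anything else = failure) follows the values floated in the discussion on this ticket and is not a settled interface; Outcome and run_script are hypothetical names. Persisting the invocation and the elapsed timeout in the local database is not covered here.

```python
import enum
import subprocess


class Outcome(enum.Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    RETRY_LATER = "retry-later"
    TIMED_OUT = "timed-out"


def run_script(path: str, timeout_s: float) -> Outcome:
    """Run a single script with no arguments and classify the result."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it has not finished within timeout_s seconds.
        proc = subprocess.run([path], timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return Outcome.TIMED_OUT
    except OSError:
        # Script missing or not executable.
        return Outcome.FAILURE

    if proc.returncode == 0:
        return Outcome.SUCCESS
    if proc.returncode == 255:
        # Assumed "retry later" code, e.g. approval not yet given.
        return Outcome.RETRY_LATER
    return Outcome.FAILURE
```

With this split, an approval script that is still waiting returns quickly with the retry-later code, so the per-invocation timeout can stay short and is only hit by scripts that actually hang.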
Script form
Each state/event pair can have a number of scripts, indexed by two leading digits and an underscore. They can also have an optional dot followed by an arbitrary string, for identification. Example:
Idle_Enter_00
Idle_Enter_01
Idle_Leave_00
Idle_Leave_01
Idle_Error_00
Idle_Error_01
Download_Enter_05_wifi-driver
Download_Enter_10_ask-user
Download_Leave_98_wifi-driver
Download_Leave_99
Download_Error_98_wifi-driver
Download_Error_99
ArtifactInstall_Enter_00
ArtifactInstall_Enter_01
...and so on...
The ordering within each pair does not influence other pairs, IOW, when executing Download_Enter scripts, only digit orderings within that group are considered. All scripts are executed in ascending order.
No script in the same state/event category can have the same two digits, even if the name after the optional dot is different (to prevent ambiguous execution order).
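The ordering and uniqueness rules can be sketched as below. The parser follows the underscore-separated form used in the example names (State_Event_NN with an optional trailing _label); the regular expression, the function name order_scripts, and the choice to skip names that do not match are assumptions for illustration only.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Matches e.g. "Download_Enter_05_wifi-driver" or "Idle_Leave_00".
NAME_RE = re.compile(
    r"^(?P<state>[A-Za-z]+)_(?P<event>Enter|Leave|Error)_(?P<index>\d{2})"
    r"(?:_(?P<label>.+))?$"
)


def order_scripts(names: Iterable[str]) -> Dict[Tuple[str, str], List[str]]:
    """Group script names by (state, event) and sort each group by index.

    Raises ValueError if two scripts in the same group share an index,
    since their execution order would be ambiguous.
    """
    groups: Dict[Tuple[str, str], Dict[int, str]] = defaultdict(dict)
    for name in names:
        match = NAME_RE.match(name)
        if match is None:
            continue  # Not a state script; ignored in this sketch.
        key = (match["state"], match["event"])
        index = int(match["index"])
        if index in groups[key]:
            raise ValueError(
                f"{name} and {groups[key][index]} share index {index:02d}"
            )
        groups[key][index] = name
    return {
        key: [by_index[i] for i in sorted(by_index)]
        for key, by_index in groups.items()
    }
```

Each group is sorted independently, so a 99 in Download_Leave has no effect on where a 05 in Download_Enter runs.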
See also the changes to the format documentation.
Idle and Sync scripts will reside under /etc/mender/scripts, and are expected to be included in the update itself.
The rest will reside inside the artifact according to the artifact documentation. When the artifact is parsed/prepared, all scripts should be simultaneously extracted to /var/lib/mender/scripts. Alongside them, the version file from the top level of the artifact should also be extracted. On every script execution, this file will be checked to confirm that it matches a version that the Mender agent understands.
Footnotes:
[2] If inside artifact, the script must be extracted and run from the /data partition. The rootfs partition can not be used because we need to support read-only rootfs.