Accounts with liveManifestCalls Set to True Have Incorrect Dynamic Lookup Results


Issue

After enabling liveManifestCalls: true, the environment begins exhibiting odd behaviors.  Resources that were deployed or changed in previous stages are not taken into consideration in current stages, leading to errors and issues in the pipeline deployment.

This can be especially detrimental to any pipelines that use a rollout strategy along with a strategy.spinnaker.io/max-version-history annotation, causing an inconsistent state of deployment targets as well as pipeline failures.
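
For context, the annotation referenced above is placed on the manifest that the pipeline deploys. Below is a minimal sketch of a versioned ReplicaSet carrying the annotation; the name, labels, and image are placeholder examples:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-app                       # placeholder name
  annotations:
    # Spinnaker keeps at most this many versions of this resource;
    # older versions are deleted as new ones are deployed.
    strategy.spinnaker.io/max-version-history: "2"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx:1.21          # placeholder image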

Cause

When the flag is set to true, the 'Deploy Manifest' stage waits for the newly-deployed resource by polling the cluster directly instead of checking Spinnaker's cache. In general, the stage finishes more quickly, as it can complete as soon as the resource is ready instead of once the new resource is reflected in the cache.
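
liveManifestCalls is an account-level setting on the Kubernetes provider, which is why the behavior described here follows the account rather than any single pipeline. Below is a minimal sketch of where the flag typically lives; the account name is a placeholder, and the exact path depends on whether the account is defined through Halyard, the Spinnaker Operator, or a Clouddriver profile:

providers:
  kubernetes:
    enabled: true
    accounts:
      - name: my-k8s-account         # placeholder account name
        # When true, 'Deploy Manifest' stages poll the cluster directly
        # instead of waiting for Clouddriver's cache to refresh.
        liveManifestCalls: true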

One significant issue that may occur, though, is that the stage may complete before the cache reflects these changes. Spinnaker expects that stages mutating infrastructure will not complete until the cache has been updated to reflect these mutations.

The result is that any downstream stages that rely on the cache being up to date (as stages are generally allowed to do) will either fail or produce incorrect results.

In the context of this issue, the affected stages are those that use dynamic target selection to patch, enable, or disable a resource. Stages that look in the cache to find the oldest/newest/etc. resource and act based on the state of the cache when they run are also affected.  Finally, rollout strategies are another case where dynamic target selection comes into play and is affected.

Because the cache is not up to date, these stages can omit a resource that was deployed, deleted, or patched by a prior stage.

For pipelines that use a rollout strategy along with a strategy.spinnaker.io/max-version-history annotation, this can be especially painful.

When a max-version-history value is set, Clouddriver knows at the time of execution only of the N ReplicaSets in its cache and will try to disable the N-1 older ReplicaSets.

This means there are situations where, although Orca plans for X Disable Manifest tasks, the oldest ReplicaSet has already been deleted by the time the task executes, causing a pipeline failure.

Furthermore, if a failed pipeline is executed two, three, or more times until it succeeds, it leaves the deployment targets in a very inconsistent state, depending on which Disable Manifest tasks complete and which do not.

Solution

There are two ways to resolve this issue: a short-term resolution, and a longer-term resolution that should be planned and tested before implementing.
 

Temporarily Disable liveManifestCalls

Setting liveManifestCalls: false will allow the pipelines to work as intended again.  However, note that reverting this setting will once again introduce delays in pipeline execution, since stages will wait for the cache to be updated.  It may also cause errors in pipelines that expect the flag to be set to true, so it is suggested to inform end users of this change.

This allows the environment to continue operating, but it should be viewed only as a temporary fix.
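
A minimal sketch of the change, reusing the placeholder account block from the Cause section; after editing, re-apply the configuration (for example, hal deploy apply when the account is managed through Halyard):

providers:
  kubernetes:
    enabled: true
    accounts:
      - name: my-k8s-account         # placeholder account name
        liveManifestCalls: false     # revert to cache-based checks until upgrading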
 

Upgrade to 2.23.x+

As per the discussion in the OSS thread https://github.com/spinnaker/spinnaker/issues/5607, this issue was investigated and a resolution was added in Spinnaker 1.23 (Armory 2.23.x), as per the following list of changes:

https://github.com/spinnaker/spinnaker/issues/5607#issuecomment-692717738

As Tested On Version

2.22.x