State transitions

This section describes the state transitions that resources undergo within the agent framework.

In addition, state transitions are shown for the handling of resources with respect to the ManageFaults service group attribute.

See State transitions with respect to ManageFaults attribute.

The states shown in these diagrams are associated with each resource by the agent framework. These states are used only within the agent framework and are independent of the IState resource attribute values indicated by the engine.

The agent writes resource state transition information into the agent log file when the LogDbg parameter, a static resource type attribute, is set to the value DBG_AGINFO. Agent developers can make use of this information when debugging agents.

Figure: Opening a resource


When the agent starts up, each resource starts with the initial state of Detached. In the Detached state (Enabled=0), the agent rejects all commands to bring a resource online or take it offline.

Figure: Resource in a steady state


When resources are in a steady state of Online or Offline, they are monitored at regular intervals. The intervals are specified by the MonitorInterval attribute in the Online state and by the OfflineMonitorInterval attribute in the Offline state. An Online resource that is unexpectedly detected as Offline is considered to be faulted. Refer to diagrams describing faulted resources.
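The interval selection described above can be expressed as a small sketch. This is illustrative only (the agent framework is not written in Python); the attribute names MonitorInterval and OfflineMonitorInterval come from the text, while the function name and sample values are assumptions:

```python
def monitor_interval(state, attrs):
    """Return the monitoring interval (seconds) for a resource in a steady state."""
    if state == "Online":
        return attrs["MonitorInterval"]
    if state == "Offline":
        return attrs["OfflineMonitorInterval"]
    raise ValueError(f"{state} is not a steady state")

# Sample attribute values for illustration only.
attrs = {"MonitorInterval": 60, "OfflineMonitorInterval": 300}
print(monitor_interval("Online", attrs))   # 60
print(monitor_interval("Offline", attrs))  # 300
```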

Figure: Bringing a resource online: ManageFaults=ALL


When the agent receives a request from the engine to bring the resource online, the resource enters the Going Online state, where the online entry point is invoked.

If the online entry point completes, the resource enters the GoingOnlineWaiting state, where it waits for the next monitor cycle.

If the online entry point times out, the agent calls the clean entry point.

If monitor in the GoingOnlineWaiting state reports a status of Online, the resource moves to the Online state.

If monitor in the GoingOnlineWaiting state reports a status of Intentional Offline, the resource moves to the Offline state.

If, however, monitor times out or returns a status other than Online (that is, Unknown or Offline), the agent takes further action as described in the fault-handling diagrams.
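The GoingOnlineWaiting decision described above can be sketched as follows. The state and status names come from the text; the function name and the string encoding of the outcomes are illustrative assumptions:

```python
def after_online_monitor(status):
    """Map a monitor result in the GoingOnlineWaiting state to the next state (sketch)."""
    if status == "online":
        return "Online"
    if status == "intentional offline":
        return "Offline"
    # Unknown, Offline, or a monitor timeout: fault handling applies,
    # depending on ManageFaults and the restart limits.
    return "fault handling"
```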

Figure: Taking a resource offline and ManageFault = ALL


Upon receiving a request from the engine to take a resource offline, the agent places the resource in the GoingOffline state, stops periodic monitoring, and invokes the offline entry point.

If offline completes, the resource enters the GoingOfflineWaiting state; the agent restarts periodic monitoring of the resource and also inserts a monitor command for the resource. If offline times out, the clean entry point is called for the resource. Whether clean completes or times out, the agent restarts periodic monitoring and moves the resource to the GoingOfflineWaiting state; if clean succeeded, the agent also resets the offline wait count.

If monitor in the GoingOfflineWaiting state returns Offline or Intentional Offline, the resource moves to the Offline state.

If monitor in the GoingOfflineWaiting state returns Unknown or Online, or if monitor times out, the agent takes further action.
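The GoingOfflineWaiting decision above can be sketched in the same style. The state and status names come from the text; the function name and result strings are illustrative:

```python
def after_offline_monitor(status):
    """Map a monitor result in the GoingOfflineWaiting state to the next state (sketch)."""
    if status in ("offline", "intentional offline"):
        return "Offline"
    # Online, Unknown, or a monitor timeout: further handling applies
    # (for example, the offline wait count).
    return "further handling"
```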

Figure: Resource fault when RestartLimit reached and ManageFault = ALL


This diagram describes the activity that occurs when a resource faults and the RestartLimit is reached. A fault occurs when the monitor entry point times out successively and FaultOnMonitorTimeouts is reached, or when monitor returns Offline and the ToleranceLimit is reached.

If the CleanRetryLimit is reached, the agent sets the ADMIN_WAIT flag for the resource and moves the resource to the Online state. If the limit is not reached, the agent invokes the clean entry point.

If clean fails, or if it times out, the agent places the resource in the Online state as if no fault had occurred, and restarts periodic monitoring. If clean succeeds, the resource is placed in the GoingOfflineWaiting state, periodic monitoring restarts, and the agent waits for the next monitor cycle.

Note:

If clean succeeds, the agent moves the resource to the GoingOfflineWaiting state and the resource is marked faulted. If monitoring in the GoingOfflineWaiting state then returns Online, the resource is moved back to the Online state, because the engine does not expect the resource to go offline; the GoingOfflineWaiting state was set by the agent only as a result of clean succeeding.
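The clean-handling decisions described above can be sketched as follows. CleanRetryLimit, ADMIN_WAIT, and the state names come from the text; the function names and the exact threshold comparison are assumptions of this sketch:

```python
def fault_response(clean_attempts, clean_retry_limit):
    """Decide the agent's next step once a fault is confirmed (sketch)."""
    if clean_retry_limit and clean_attempts >= clean_retry_limit:
        return "set ADMIN_WAIT, move to Online"
    return "invoke clean"

def after_clean(clean_succeeded):
    """Next state after clean runs (ManageFaults=ALL, RestartLimit reached)."""
    if clean_succeeded:
        return "GoingOfflineWaiting"  # resource marked faulted; wait for monitor
    return "Online"                   # clean failed or timed out: as if no fault
```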

Figure: Resource fault when RestartLimit not reached and ManageFault = ALL


This diagram describes the activity that occurs when a resource faults and the RestartLimit is not reached. When the monitor entry point times out successively and FaultOnMonitorTimeouts is reached, or when monitor returns Offline and the ToleranceLimit is reached, the agent checks the clean counter to determine whether the clean entry point can be invoked.

If the CleanRetryLimit is reached, the agent sets the ADMIN_WAIT flag for the resource and moves the resource to the Online state. If the limit is not reached, the agent invokes the clean entry point.

Refer to the diagram "Resource fault without automatic restart" for a discussion of the activity that occurs when a resource faults and the RestartLimit is reached.

Figure: Monitoring of persistent resources


If monitor returns Offline and the ToleranceLimit is reached, the resource is placed in the Offline state and marked FAULTED. Likewise, if monitor times out and FaultOnMonitorTimeouts is reached, the resource is placed in the Offline state and marked FAULTED.
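The persistent-resource fault detection above can be sketched as a predicate. ToleranceLimit and FaultOnMonitorTimeouts come from the text; the function name and whether the boundary comparison is strict are assumptions:

```python
def persistent_resource_faulted(offline_count, tolerance_limit,
                                timeout_count, fault_on_monitor_timeouts):
    """Return True when the resource should be set Offline and marked FAULTED (sketch)."""
    # Consecutive Offline reports beyond the ToleranceLimit.
    if offline_count > tolerance_limit:
        return True
    # Consecutive monitor timeouts reaching FaultOnMonitorTimeouts.
    if fault_on_monitor_timeouts and timeout_count >= fault_on_monitor_timeouts:
        return True
    return False
```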

Figure: Closing a resource


The state diagram shows all the states from which a resource can move to the Closing state. The following table describes the actions that move a resource from each state to the Closing state:

State                  Action

Online to Closing      hastop -local -force, or hares -delete, or Enabled = 0 (only if the resource is a persistent resource)

Offline to Closing     Enabled = 0, hastop -local, hastop -local -force, or hares -delete

GoingOnlineWaiting     hastop -local -force or hares -delete

GoingOfflineWaiting    hastop -local -force or hares -delete

GoingMigrateWaiting    hastop -local -force or hares -delete

GoingOnline            hastop -local -force

GoingOffline           hastop -local -force

GoingMigrate           hastop -local -force

Probing                Enabled = 0, hastop -local, hastop -local -force, or hares -delete

Figure: Migrating a resource


The migration process is initiated from the source system, where the virtual machine (VM) is online; the VM is migrated to the target system, where it is offline. When the agent on the source system receives a request from the engine to migrate the resource, the resource enters the GoingMigrate state, where the migrate entry point is invoked. If the migrate entry point fails with return code 255, the resource transitions back to the Online state and the failure of the migrate operation is communicated to the engine. This return code indicates that the migration operation cannot be performed.

The agent framework ignores any return value in the range 101 to 254 and returns the resource to the Online state. If the migrate entry point completes successfully or times out, the resource enters the GoingMigrateWaiting state, where it waits for the next monitor cycle; monitor is called at the frequency configured in MonitorInterval. If monitor returns an Offline status, the resource moves to the Offline state and the migration on the source system is considered complete.
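The handling of the migrate entry point's result on the source system can be sketched as follows. Return code 255 and the ignored 101 to 254 range come from the text; the function name and the treatment of other codes are illustrative assumptions:

```python
def after_migrate(return_code=0, timed_out=False):
    """Map a migrate entry point result to the next state on the source node (sketch)."""
    if timed_out:
        return "GoingMigrateWaiting"   # wait for monitor to confirm the outcome
    if return_code == 255:
        return "Online"                # migration cannot be performed; failure reported
    if 101 <= return_code <= 254:
        return "Online"                # value ignored by the agent framework
    return "GoingMigrateWaiting"       # success: wait for the next monitor cycle
```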

Even after the resource moves to the Offline state, the agent continues to monitor it at the frequency configured in MonitorInterval. This is to detect early whether the VM fails back to the source node. However, if the monitor entry point times out or reports the state as Online or Unknown, the resource waits for the MigrateWaitLimit monitor cycles to complete.

If any monitor within MigrateWaitLimit reports the state as Offline, the resource transitions to the Offline state and this is reported to the engine. If the monitor entry point times out or reports the state as Online or Unknown even after MigrateWaitLimit is reached, the ADMIN_WAIT flag is set.

If the resource migration succeeds on the source node, the agent on the target node changes the monitoring frequency from OfflineMonitorInterval to MonitorInterval so that it detects a successful migration early. However, if the resource is not detected as Online on the target node even after MigrateWaitLimit is reached, the resource is moved to the ADMIN_WAIT state and the agent falls back to the monitoring frequency configured in OfflineMonitorInterval.
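The target-node monitoring cadence described above can be sketched as a simple selector. The attribute names come from the text; the function name and sample values are illustrative assumptions:

```python
def target_monitor_interval(migration_in_progress, attrs):
    """Pick the monitor interval on the target node during migration (sketch)."""
    if migration_in_progress:
        # Monitor at the faster Online cadence to detect a successful migration early.
        return attrs["MonitorInterval"]
    # Normal offline cadence; also used again after the ADMIN_WAIT flag is set.
    return attrs["OfflineMonitorInterval"]
```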

Note:

The agent does not call clean if the migrate entry point times out, or if monitor after the migrate entry point times out or reports the state as Online or Unknown even after MigrateWaitLimit is reached. You must manually clear the ADMIN_WAIT flag after resolving the issue.

Figure: Resource fault: ManageFaults attribute = ALL


Figure: Resource fault (monitor hung): ManageFaults attribute = ALL
