#194 new

Audit Cache/Store (i.e. Replay and Recovery)

Reported by bahuvrihi | March 22nd, 2009 @ 09:47 AM

Been considering how to do error recovery. Unwinding a workflow is now possible in many cases, since Joins can store state as full-fledged objects (although if joins are blocks this isn't true... procs don't serialize). However, when you unwind a partially-run workflow you can lose information held within some types of joins, like a theoretical counter, or within the current Join when running in no-stack mode.

Once you unwind back to app.run, you will have audits of all the results. That suggests a replay is possible but you need something that a given task can query for cached results. Then, in replay mode, you load a clean workflow, load the audits into the cache, and run the tasks so they use the cached results when possible.

This should quickly re-create the state at the time of crash, without re-running upstream tasks. Notes:

  • Tasks cannot have side effects in this case. The result is the only meaningful output of executing with the inputs.
  • A clean, new workflow must be made. This allows joins to have 'side effects' like counters and such.
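The replay idea above could be sketched roughly like this: a cache keyed by task and inputs, loaded from audits, that a task consults before executing. All names here are illustrative, not Tap's actual API.

```ruby
# Hypothetical replay cache: audits recovered after a crash are loaded
# in, and tasks check the cache (keyed by task name + inputs) before
# executing. On a hit the task's block never runs, so no side effects
# are repeated; on a miss the task runs normally and the result is
# recorded for any later replay.
class ReplayCache
  def initialize
    @results = {} # {[task_name, inputs] => result}
  end

  # Load results recovered from audits into the cache.
  def load(audits)
    audits.each { |a| @results[[a[:task], a[:inputs]]] = a[:result] }
  end

  # Return the cached result, or execute the block and cache its result.
  def fetch(task_name, inputs)
    key = [task_name, inputs]
    return @results[key] if @results.key?(key)
    @results[key] = yield
  end
end

cache = ReplayCache.new
cache.load([{task: :sum, inputs: [1, 2], result: 3}])

# Cache hit: the block is not executed.
cache.fetch(:sum, [1, 2]) { raise "should not re-run" } # => 3

# Cache miss: the task runs and the result is recorded.
cache.fetch(:sum, [3, 4]) { 3 + 4 }                     # => 7
```

This is why the no-side-effects note matters: replay only recreates state correctly if the cached result fully stands in for having run the task.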

Note that after unwinding you will also have (via join.inputs/join.outputs) a full representation of the workflow, but this may be 'polluted' in the sense that there may have been state changes. This is useful for inspection during debugging, and potentially for restart.

For restart:

  • Joins would have to track their state explicitly, i.e. with @_results and @index, so that a dump has a handle on the results and on where to restart.
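A minimal sketch of that idea, assuming a sync-style join that collects one result per input before firing. The class name and API are made up for illustration; only the explicit @results/@index state mirrors the note above.

```ruby
# Illustrative join that keeps its state in plain instance variables
# rather than closed-over locals, so the whole object can be dumped
# for recovery (Marshal cannot serialize a Proc, but it can serialize
# an object like this).
class SyncJoin
  attr_reader :results, :index

  def initialize(input_count)
    @results = Array.new(input_count) # results collected so far
    @index   = 0                      # restart position: inputs seen
  end

  # Record one upstream result; returns all results once complete.
  def call(result)
    @results[@index] = result
    @index += 1
    complete? ? @results : nil
  end

  def complete?
    @index == @results.length
  end

  # A dump has a handle on the results and on where to restart.
  def dump
    Marshal.dump(self)
  end
end

join = SyncJoin.new(2)
join.call(:a)             # => nil (still waiting)
restored = Marshal.load(join.dump)
restored.index            # => 1, i.e. restart at the second input
```

The design point is just that state lives in ivars, not in a block, so an unwound workflow can be inspected and restarted from the dump.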

Comments and changes to this ticket

  • bahuvrihi March 23rd, 2009 @ 10:35 AM

    When an error occurs:

    • set termination/raise termination error
    • mark the failing task(s)
    • as you unwind, cache the results for each audit
    • requeue the lead tasks

    Then during debugging:

    • the failed task(s) can be located and manipulated
    • when the app is restarted, the tasks run with cached results, allowing the join state to be recreated

    A similar thing can be accomplished if results are cached along the way, for example to an external data store. How caching/auditing occurs is a separate issue, with speed vs. memory tradeoffs.
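The cache-along-the-way variant might look like a write-through wrapper around task execution: every result is persisted before it is returned, so recovery does not depend on unwinding. The store here is an in-memory Hash standing in for an external data store; all names are assumptions.

```ruby
# Write-through auditing sketch: each task result is recorded in a
# store as the workflow runs. After a crash, the store already holds
# everything needed for replay -- no unwind required.
class WriteThroughAudit
  def initialize(store = {})
    @store = store # stand-in for an external data store
  end

  # Execute a task, persisting the result before returning it.
  def execute(task_name, inputs)
    result = yield(*inputs)
    @store[[task_name, inputs]] = result
    result
  end

  # Look up a previously persisted result during recovery.
  def recovered(task_name, inputs)
    @store[[task_name, inputs]]
  end
end

audit = WriteThroughAudit.new
audit.execute(:double, [5]) { |x| x * 2 } # => 10, and persisted
audit.recovered(:double, [5])             # => 10
```

The speed-vs-memory tradeoff mentioned above shows up in the choice of store: an in-process Hash is fast but lost on crash, while an external store survives the crash at the cost of a write per task.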

  • bahuvrihi June 1st, 2009 @ 12:14 PM

    One issue is that this may run into a duplication problem if a join has enqueued tasks. If:

      a -- b -- c --[0][1,2]q
    

    and task c fails, this is actually ok, because by the time c runs, b will already have been taken off the stack.

    Not ok, however, if:

      a -- b --:q c -- d --[0][1,3]
    

    Here c is enqueued after b runs, and then d fails. Replay will enqueue c again, leading to duplication. If b doesn't run at all, however, then this model is ok.

    So, this would have to be a dispatch-level cache/replay and not a middleware thing. Don't dispatch if completed.
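The "don't dispatch if completed" guard could be sketched as a dispatcher that skips any node already marked complete during replay, which prevents the duplicate-enqueue problem above. The Dispatcher class and its methods are illustrative, not Tap's actual dispatch API.

```ruby
# Dispatch-level replay guard: nodes known to be completed (from the
# audit cache) are skipped when they come off the queue, so a replayed
# upstream task cannot cause a completed node to run -- and enqueue its
# downstream tasks -- a second time.
class Dispatcher
  def initialize
    @completed = {} # node => result, populated during replay
    @queue = []
  end

  def mark_completed(node, result)
    @completed[node] = result
  end

  def enque(node, inputs)
    @queue << [node, inputs]
  end

  # Run the queue; returns the nodes that actually executed.
  def run
    ran = []
    until @queue.empty?
      node, inputs = @queue.shift
      next if @completed.key?(node) # don't dispatch if completed
      ran << node
      @completed[node] = node.call(*inputs)
    end
    ran
  end
end
```

Because the guard sits in the dispatch loop itself, it catches duplicates no matter which join or middleware enqueued them, which is the point of making this dispatch-level rather than a middleware concern.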
