This example describes how to use Dagster's versioning and memoization features.
Dagster can use versions to determine whether or not it is necessary to re-execute a particular step. Given versions of the code from each op in a job, the system can infer whether an upcoming execution of a step will differ from previous executions by tagging op outputs with a version. This allows for the outputs of previous runs to be re-used. We call this process memoized execution.
from dagster import SourceHashVersionStrategy, job
@job(version_strategy=SourceHashVersionStrategy())defthe_job():...
When memoization is enabled, the outputs of ops will be cached. Ops will only be re-run if:
An upstream output's version changes
The config to the op changes
The version of a required resource changes
The value returned by your VersionStrategy for that particular op changes. In the case of SourceHashVersionStrategy, this only occurs when the code within your ops and resources changes.
Memoization is enabled by using a version strategy on your job, in tandem with MemoizableIOManager. In addition to the handle_output and load_input methods from the traditional IOManager, MemoizableIOManagers also implement a has_output method. This is intended to check whether an output already exists that matches specifications.
Before execution occurs, the Dagster system will determine a set of which steps actually need to run. If using memoization, Dagster will check whether all the outputs of a given step have already been memoized by calling the has_output method on the io manager for each output. If has_output returns True for all outputs, then the step will not run again.
Several of the persistent IO managers provided by Dagster are memoizable. This includes Dagster's default io manager, the s3_pickle_io_manager and the fs_io_manager.
There will likely be cases where the default SourceHashVersionStrategy will not suffice. In these cases, it is advantageous to implement your own VersionStrategy to match your requirements.
If you are using a custom IO manager and want to make use of memoization functionality, then your custom IO manager must be memoizable. This means they must implement the has_output function. The OutputContext.get_output_identifier will provide a path which includes version information that you can both store and check outputs to. Check out our implementations of io managers for inspiration.