dplutils.pipeline.ray.RayDataPipelineExecutor

class dplutils.pipeline.ray.RayDataPipelineExecutor(graph, n_batches: int = 100, **kwargs)

Executor using Ray Datasets pipelines.

Ray Datasets can execute a pipeline, represented as a simple graph, across a distributed cluster. This executor feeds tasks from the range source in order to mock an infinite stream.
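The idea of driving a pipeline from a range source can be sketched in plain Python. This is only an illustration, not the executor's implementation: `generate_batch` and `mock_stream` are hypothetical names, and the sketch assumes that `n_batches` bounds how many range items are fed to the source task.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, int]


def generate_batch(index: int) -> List[Row]:
    """Hypothetical source task: builds a fresh batch from a range index."""
    return [{"batch": index, "row": row} for row in range(3)]


def mock_stream(n_batches: int, source: Callable[[int], List[Row]]) -> Iterator[List[Row]]:
    """Yield one batch per range item, imitating a stream of n_batches batches."""
    for index in range(n_batches):
        yield source(index)


# Consume the mocked stream; each range item triggers one source batch.
batches = list(mock_stream(4, generate_batch))
```

The range items carry no data of their own; they only act as triggers, so the source task is responsible for producing the actual rows.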

Some limitations of this executor:

  • Task batch_size is interpreted as a hard requirement: the workload is split into separate remote tasks of that size, but the executor blocks until all of them complete so that it can return a merged result. The primary reason is that Ray Data lacks streaming repartitioning support, while we still want to maximize utilization (particularly of GPUs).

  • There is no ability to pause and resume, so batch sizes should be tuned to provide acceptable fault tolerance.
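The split, block, and merge behavior described in the first limitation can be sketched as follows. This is a minimal stand-in that uses a local thread pool instead of Ray remote tasks; `split_into_batches`, `run_task_in_batches`, and `double` are hypothetical names and not part of dplutils.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_batches(rows: List[int], batch_size: int) -> List[List[int]]:
    """Split rows into chunks of at most batch_size items each."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def run_task_in_batches(
    rows: List[int],
    task: Callable[[List[int]], List[int]],
    batch_size: int,
) -> List[int]:
    """Run task on each chunk concurrently, then wait and merge the results."""
    chunks = split_into_batches(rows, batch_size)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves chunk order; list() blocks until every chunk is done.
        results = list(pool.map(task, chunks))
    # Merge the per-chunk outputs back into a single result.
    return [row for chunk in results for row in chunk]


def double(chunk: List[int]) -> List[int]:
    """Hypothetical task: doubles every value in the chunk."""
    return [value * 2 for value in chunk]


merged = run_task_in_batches(list(range(10)), double, batch_size=4)
```

Because the merge waits for the slowest chunk, a failure or straggler in one chunk delays the whole input batch, which is why batch sizes matter for fault tolerance here.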

Parameters:
__init__(graph, n_batches: int = 100, **kwargs)

Methods

__init__(graph[, n_batches])

execute()

Execute the task graph in batches.

from_graph(graph)

make_pipeline()

run()

Validate and run the pipeline.

set_config([coord, value, from_yaml])

Set task configuration options for this instance.

set_config_from_dict(config)

set_context(key, value)

validate()

writeto(outdir[, partition_by_task, ...])

Run pipeline, writing results to parquet table.

Attributes

run_id

tasks_idx