dplutils.pipeline.ray.RayDataPipelineExecutor#

class dplutils.pipeline.ray.RayDataPipelineExecutor(graph, n_batches: int = 100, **kwargs)[source]#

Executor using ray datasets pipelines.

Ray datasets can execute a pipeline represented as a simple graph across a distributed cluster. This executor feeds tasks using the range source to mock an infinite stream.

Some limitations of this executor:

Task batch_size is interpreted as a hard requirement and will split the workload into separate remote tasks, however it will block until all complete in order to return a merged result. The primary reason for this is ray data lacks streaming repartitioning support, but we want to be able to maximize (particularly GPU) utilization.
There is no ability to pause and resume, so batch sizes should be tuned to capture acceptable fault tolerance.

Parameters:

graph – task graph, see PipelineExecutor.
n_batches – total number of batches to feed using range source.
**kwargs – kwargs passed to PipelineExecutor.

__init__(graph, n_batches: int = 100, **kwargs)[source]#

Methods

`__init__`(graph[, n_batches])
`execute`()	Execute the task graph in batches.
`from_graph`(graph)
`make_pipeline`()
`run`()	Validate and run the pipeline.
`set_config`([coord, value, from_yaml])	Set task configuration options for this instance.
`set_config_from_dict`(config)
`set_context`(key, value)
`validate`()
`writeto`(outdir[, partition_by_task, ...])	Run pipeline, writing results to parquet table.

Attributes

`run_id`
`tasks_idx`

dplutils.pipeline.ray.RayDataPipelineExecutor

Contents

dplutils.pipeline.ray.RayDataPipelineExecutor#