Compute
YTsaurus handles compute resource scheduling and management, providing an abstraction of "jobs". A job is somewhat similar to k8s pod, running one or many containers and consuming resources such as CPU, RAM and GPU, but jobs are well-integrated with the data foundation of YTsaurus.
A group of jobs united by a common task is called an "operation".
MapReduce
The most native data processing framework in YTsaurus is MapReduce. It is a model of parallel processing, allowing to process big datasets with simple code.
When using Python SDK, defining a mapper or reducer is as easy as defining a function. See the example below:
import yt.wrapper as yt
# Assume that there is a table `//home/peter/texts` with a single string column `text`
def mapper(row):
words = row['text'].split()
for word in words:
yield {"word": word}
def reducer(key, rows):
yield {"word": key, "count": sum(1 for row in rows)}
yt.run_map_reduce(mapper, reducer, source_table="//home/peter/texts", destination_table="//home/peter/words", reduce_by=["word"])
There are following kinds of MapReduce operations:
- Map - a simple way to split input into portions and process them independently in parallel.
- Reduce - takes one or many sorted input tables and performs aggregation by some key, which is common for all input tables.
- MapReduce - a compound operation which allows to skip the intermediate step of sorting.
- Sort - sorts an input table (or a union of tables) by some key, producing a schemaful sorted table.
- Merge - merges several input tables into a single, either by simply concatenating them, or by interleaving their rows to meet the required output sort order (provided that the input tables are sorted by the same key).
MapReduce operations are defined by operation specs. Possible spec options may be found in YTsaurus documentation – generic ones and per-operation ones.
Vanilla operations
In order to run arbitrary workload, which are not necessarily data processing, there is a notion of a Vanilla operation. Vanilla operation is roughly equivalent to k8s deployment, allowing to run arbitrary number of containers of a certain kind. Vanilla operations are used as a foundation to allocate resources for other parts of YTsaurus and TractoAI, e.g. for running ClickHouse servers or notebook kernels.
SQL
For SQL-related compute, there is a separate article here.