Cypress

Cypress is a hierarchical distributed storage system under the hood of TractoAI.

Cypress stores various kinds of objects; the most important ones are:

  • Tables
  • Files
  • Directories
  • Notebooks
  • Symbolic Links
  • Workflows

Tables

Tables are the most common kind of object you will encounter in Cypress. They are used to represent datasets.

Tables are horizontally scalable and are designed to entirely hide the distributed nature of the storage: you can store a gigabyte-scale dataset just as easily as a petabyte-scale one.

Tables are schemaful, which means you can define a set of columns and their types. Types may be primitive, such as int64, string, or float64, or complex, such as list<int64> or tagged<string, 'image/png'>. You may also work with non-schematized tables, but this is less convenient for strictly typed data processing engines such as YQL or Spark.

Tables are stored in the YTsaurus internal format, which uses columnar codecs and additional metadata for efficient IO and querying.

Related API methods:

  • create("table", path)
  • read_table(path)
  • write_table(path)

Related YTsaurus documentation.
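As a sketch of how the API methods above fit together, the snippet below defines a schema, creates a table with it, and writes and reads rows. It assumes the YTsaurus Python client (`yt.wrapper`) and a configured cluster connection; the path `//home/examples/events` and the `matches_schema` helper are hypothetical, added for illustration.

```python
# Hypothetical two-column schema for an events table.
SCHEMA = [
    {"name": "user_id", "type": "int64"},
    {"name": "query", "type": "string"},
]

def matches_schema(row, schema=SCHEMA):
    """Local sanity check: a row's keys must be a subset of the schema's columns."""
    names = {col["name"] for col in schema}
    return set(row) <= names

def create_and_fill(path, rows):
    # Requires a configured connection to a TractoAI/YTsaurus cluster.
    import yt.wrapper as yt
    yt.create("table", path, attributes={"schema": SCHEMA}, ignore_existing=True)
    yt.write_table(path, rows)
    return list(yt.read_table(path))
```

With a live cluster, `create_and_fill("//home/examples/events", [{"user_id": 1, "query": "cats"}])` would round-trip the rows; the schema lets the system reject writes with unknown columns or mismatched types.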

Files

Files may be used to store artifacts of your workflows, such as binaries, model checkpoints, etc.

Files in Cypress are tightly integrated with the data processing capabilities of the system: thanks to the artifact caching subsystem and the internal peer-to-peer file distribution mechanism, the same artifact can be used by tens of thousands of parallel jobs without IO bottlenecks.

Related API methods:

  • create("file", path)
  • read_file(path)
  • write_file(path)
  • the file_paths parameter of the user job spec in run_map, run_reduce, run_map_reduce, etc.

Related YTsaurus documentation.

Files vs Tables

It is not recommended to build data processing on top of files: TractoAI's distributed processing paradigm is based on tables, while the system treats files as opaque blobs.

Comparison table:

| Property | Tables | Files |
| --- | --- | --- |
| API | read_table, write_table, run_query, start_operation | read_file, write_file |
| Scalability | scales up to petabytes | scales up to hundreds of gigabytes |
| Schematization | schemaful | opaque |
| Storage efficiency | columnar codecs + compression + erasure coding | only compression + erasure coding |
| Parallel processing | MapReduce, SQL, Spark | manual parallelization via offsets |
| Column-based ACLs | supported | not supported |
| Download/upload formats | JSON, Parquet, CSV, etc. | original blob without any format |

Notebooks

Notebooks are a special kind of object that stores the code and results of your experiments in a format compatible with Jupyter notebooks.

In TractoAI, Jupyter notebooks can be accessed via the UI and may be downloaded from or uploaded to the system.