xarrayvideo

Save multichannel data from xarray datasets as videos to cut storage dramatically (e.g. 20-50x compression) with minimal quality loss.

The library revolves around two functions: xarray2video encodes selected variables into videos, and video2xarray rebuilds the dataset from those videos. The project supports standard ffmpeg codecs such as libx265, vp9, and ffv1, GDAL-backed image codecs such as JP2OpenJPEG, and now also direct wrappers for external codecs such as vvenc and uavs3e.

Features

  • Encode multiband data into groups of three channels per video.
  • Mix lossy and lossless outputs in the same dataset export.
  • Use ffmpeg-backed video codecs or GDAL-backed image codecs.
  • Run optional PCA/KLT over the channel dimension before encoding.
  • Benchmark codecs with the same evaluation pipeline used in the paper.
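To make the first and fourth features more concrete, here is a minimal NumPy sketch of the two ideas: splitting a band list into groups of three, and applying a PCA/KLT along the channel axis. It is an illustration only, not the library's implementation. The helper names `group_bands`, `klt_forward`, and `klt_inverse` are hypothetical, and how xarrayvideo handles an incomplete last group is not covered here.

```python
import numpy as np

def group_bands(bands, size=3):
    """Split a list of band names into consecutive groups of `size`.

    The last group may be shorter; padding it is left to the caller.
    """
    return [tuple(bands[i:i + size]) for i in range(0, len(bands), size)]

def klt_forward(data, n_components):
    """Project the channel axis (last axis) onto its top principal components.

    data: array of shape (..., channels).
    Returns (scores, components, mean) so the transform can be inverted.
    """
    flat = data.reshape(-1, data.shape[-1]).astype(np.float64)
    mean = flat.mean(axis=0)
    centered = flat - mean
    # Eigen-decomposition of the (channels x channels) covariance matrix
    cov = centered.T @ centered / (flat.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]  # shape (channels, n_components)
    scores = centered @ components
    return scores.reshape(*data.shape[:-1], n_components), components, mean

def klt_inverse(scores, components, mean):
    """Map component scores back to the original channel space."""
    return scores @ components.T + mean

# Hypothetical band list, used only for illustration
bands = ["B02", "B03", "B04", "B05", "B06", "B8A", "B08"]
print(group_bands(bands))

# Synthetic cube with shape (time, y, x, channels)
rng = np.random.default_rng(0)
cube = rng.random((5, 16, 16, 6))
scores, components, mean = klt_forward(cube, n_components=3)
approx = klt_inverse(scores, components, mean)
print(scores.shape, approx.shape)
```

Keeping all components makes the transform exactly invertible up to floating-point error; keeping fewer trades reconstruction accuracy for fewer channels to encode.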

Paper

If you find this library useful, please consider citing the accompanying paper:

Pellicer-Valero, O. J., Aybar, C., & Camps-Valls, G. (2025). Video compression for spatiotemporal Earth system data. arXiv. https://doi.org/10.48550/arXiv.2506.19656

Installation

Base install:

git clone https://github.com/OscarPellicer/xarrayvideo.git
cd xarrayvideo
pip install -e ".[all]"

If you prefer to install dependencies manually:

pip install xarray numpy scikit-image scikit-learn pyyaml zarr netcdf4 ffmpeg-python gcsfs pillow tqdm seaborn h5netcdf tacoreader pytortilla tacotoolbox

# Optional helpers
pip install ipython opencv-python
pip install git+https://github.com/OscarPellicer/txyvis.git
pip install torchmetrics

pip install -e . --no-deps

GDAL is optional, but required for GDAL-backed image codecs such as JP2OpenJPEG.

Linux and macOS:

pip install gdal

Windows:

mamba install -c conda-forge gdal
# or
conda install -c conda-forge gdal
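After installing GDAL, it can help to confirm that the Python bindings import and that the JP2OpenJPEG driver is actually compiled into your build. The sketch below uses the standard `osgeo.gdal.GetDriverByName` call; the helper name `has_gdal_driver` is hypothetical and not part of xarrayvideo.

```python
def has_gdal_driver(name="JP2OpenJPEG"):
    """Return True if the GDAL bindings import and expose the named driver."""
    try:
        from osgeo import gdal
    except ImportError:
        # GDAL Python bindings are not installed in this environment
        return False
    # GetDriverByName returns None when the driver is not available
    return gdal.GetDriverByName(name) is not None

print("JP2OpenJPEG available:", has_gdal_driver())
```

If this prints `False` even though GDAL is installed, the build probably lacks OpenJPEG support, and the GDAL-backed JP2 codec will likely be unavailable.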

External codecs: vvenc, uavs3e, and VTM

The main library works with plain ffmpeg alone. You only need the external toolchain if you want to benchmark H.266/VVC or AVS3.

Build the external codecs with the helper script:

bash scripts/install_vvc_avs3_codecs.sh

By default this builds vvenc, vvdec, uavs3e, and uavs3d. To restrict the build, set CODECS_TO_BUILD:

CODECS_TO_BUILD=vvenc,uavs3e bash scripts/install_vvc_avs3_codecs.sh
CODECS_TO_BUILD=vtm bash scripts/install_vvc_avs3_codecs.sh

Then expose the resulting binaries to the current shell:

source scripts/activate_codec_tools.sh

If your builds live outside $HOME/codec_toolchains, pass the root explicitly:

source scripts/activate_codec_tools.sh /path/to/codec_toolchains
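A quick way to check that the activation step worked is to look the binaries up on PATH from Python. This is a sketch under assumptions: the executable names `vvencapp` and `uavs3enc` are guesses at what the builds produce, so substitute the names your toolchain actually installs. The helper `find_tools` is hypothetical.

```python
import shutil

def find_tools(names):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}

# Assumed binary names; adjust to match your build output
tools = find_tools(["ffmpeg", "vvencapp", "uavs3enc"])
for name, path in tools.items():
    print(f"{name}: {path or 'not found on PATH'}")
```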

Notes:

  • vvenc and uavs3e are the practical external codecs integrated into the benchmark path.
  • VTM remains optional and significantly slower; it is mainly kept as a reference path.
  • The benchmark scripts read XV_CODEC_THREADS or SLURM_CPUS_PER_TASK to control external codec threading.
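The thread-count lookup mentioned in the last note can be sketched as follows. The precedence (XV_CODEC_THREADS first, then SLURM_CPUS_PER_TASK) and the fallback of one thread are assumptions made for illustration; the benchmark scripts may resolve these differently. The function name `resolve_codec_threads` is hypothetical.

```python
import os

def resolve_codec_threads(env=None, default=1):
    """Pick a thread count from the environment, falling back to `default`.

    Assumed precedence: XV_CODEC_THREADS, then SLURM_CPUS_PER_TASK.
    """
    env = os.environ if env is None else env
    for key in ("XV_CODEC_THREADS", "SLURM_CPUS_PER_TASK"):
        value = str(env.get(key, "")).strip()
        # Accept only positive integers; ignore empty or malformed values
        if value.isdigit() and int(value) > 0:
            return int(value)
    return default

print(resolve_codec_threads())
```

Passing a plain dict as `env` makes the function easy to check without touching the real environment, for example `resolve_codec_threads({"SLURM_CPUS_PER_TASK": "4"})`.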

Examples

Open the example notebooks in JupyterLab or VS Code:

  • example_dynamicearthnet.ipynb
  • example_deepextremecubes.ipynb
  • example_simples2.ipynb
  • example_era5.ipynb

Basic usage

Example with a DeepExtremeCubes sample:

import xarray as xr
import numpy as np
from xarrayvideo import xarray2video, video2xarray, plot_image

array_id = '-111.49_38.60'
input_path = '../mc_-111.49_38.60_1.2.2_20230702_0.zarr'
output_path = './out'

minicube = xr.open_dataset(input_path, engine='zarr')
minicube['SCL'] = minicube['SCL'].astype(np.uint8)
minicube['cloudmask_en'] = minicube['cloudmask_en'].astype(np.uint8)

lossless_params = {'c:v': 'ffv1'}
lossy_params = {
    'c:v': 'libx265',
    'preset': 'medium',
    'crf': 51,
    'x265-params': 'qpmin=0:qpmax=0.01',
    'tune': 'psnr',
}
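# Each rule maps an output video name to a tuple. Judging from the values below,
# the fields appear to be: variables to pack, dimension order, a setting left at
# 0 here, encoder parameters, and bits per sample (12 for bands, 8 for masks).
conversion_rules = {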
    'rgb': (('B04', 'B03', 'B02'), ('time', 'y', 'x'), 0, lossy_params, 12),
    'ir3': (('B8A', 'B06', 'B05'), ('time', 'y', 'x'), 0, lossy_params, 12),
    'masks': (('SCL', 'cloudmask_en', 'invalid'), ('time', 'y', 'x'), 0, lossless_params, 8),
}

# Compress; compute_stats=True takes a bit longer but reports compression statistics
arr_dict = xarray2video(
    minicube,
    array_id,
    conversion_rules,
    output_path=output_path,
    compute_stats=True,
    loglevel='verbose',
    save_dataset=True,
)

minicube_new = video2xarray(output_path, array_id)

plot_image(minicube, ['B04', 'B03', 'B02'], save_name='./out/RGB_original.jpg')
plot_image(minicube_new, ['B04', 'B03', 'B02'], save_name='./out/RGB_compressed.jpg')
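Besides the visual comparison, you may want a numeric check of the round trip. The sketch below computes PSNR between an original and a reconstructed array while ignoring non-finite values. It is independent of the library's own `compute_stats` output, and the `psnr` helper and its `max_value` argument are hypothetical names introduced here.

```python
import numpy as np

def psnr(original, reconstructed, max_value):
    """Peak signal-to-noise ratio in dB over entries finite in both arrays."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    # Skip NaN/inf entries, which are common in gappy Earth observation cubes
    valid = np.isfinite(original) & np.isfinite(reconstructed)
    mse = np.mean((original[valid] - reconstructed[valid]) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Synthetic check: a constant error of 0.01 on data in [0, 1] gives about 40 dB
a = np.linspace(0.0, 1.0, 100).reshape(10, 10)
b = a + 0.01
print(round(psnr(a, b, max_value=1.0), 2))
```

With the example above, a call such as `psnr(minicube['B04'].values, minicube_new['B04'].values, max_value=1.0)` would apply, assuming the band is stored as reflectance in [0, 1]; use the actual data range otherwise.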

Testing and benchmarks

There is no dedicated pytest suite in this repository at the moment. The canonical regression checks are the benchmark and validation scripts in scripts/.

Main benchmark driver:

python scripts/run_tests.py --dataset deepextremes --rules_name gapfill3
python scripts/run_tests.py --dataset dynamicearthnet --rules_name 4channels2
python scripts/run_tests.py --dataset custom --rules_name pca
python scripts/run_tests.py --dataset era5 --rules_name all

Useful options for fast validation:

python scripts/run_tests.py \
  --dataset deepextremes \
  --rules_name smoke-vvc-avs3-vtm-fast-qplow \
  --codec_names vvenc,uavs3e,vtm \
  --sample_limit 1 \
  --quality_limit 1 \
  --skip_plot \
  --skip_latex \
  --debug

Sample image generation:

python scripts/run_tests.py --dataset deepextremes --rules_name img --id 10.38_50.15 --plot_samples
python scripts/run_tests.py --dataset dynamicearthnet --rules_name img --id 8077_5007 --plot_samples
python scripts/run_tests.py --dataset custom --rules_name img --id cubo1 --plot_samples
python scripts/run_tests.py --dataset era5 --rules_name img --plot_samples

Cluster launchers wrapping scripts/run_tests.py:

  • scripts/launchers/run_cpu_xv_codec_smoke.sh: minimal external-codec smoke test.
  • scripts/launchers/run_cpu_xv_codec_calibration.sh: calibration run for the new codec ladders.
  • scripts/launchers/run_cpu_xv_custom_codecs.sh: custom-dataset codec benchmark.
  • scripts/launchers/run_cpu_xv_deepextremes_codecs.sh: DeepExtremeCubes codec benchmark.
  • scripts/launchers/run_cpu_xv_dynamicearthnet_codecs.sh: DynamicEarthNet codec benchmark.
  • scripts/launchers/run_cpu_xv_era5_codecs.sh: ERA5 codec benchmark.
  • scripts/launchers/run_cpu_build_vvc_avs3.sh: cluster launcher for building the external codec toolchain.

Outputs:

  • Final result pickles, plots, and tables are written under results/.
  • Temporary encoded cubes are written under testing/.
  • Most benchmark logs are written at the repo root by the shell or SLURM launchers.

Additional regression-oriented scripts:

python scripts/run_xarrayvideo_single_cube.py --help
python scripts/synthetic_missing_tests.py
python scripts/reproduce_repetition_comprehensive.py --help

TerraCodec comparison workflow

The repository now includes dedicated scripts used for the direct TerraCodec comparisons reported during review.

  • scripts/run_terracodec_tests.py: benchmark one cube and emit the same MultiIndex pickle layout used by scripts/run_tests.py.
  • scripts/run_terracodec_suite.py: launch the repo's benchmark subset repeatedly across multiple cubes.
  • scripts/check_terracodec_scaling.py: confirm reflectance scaling before running TerraCodec on public Sentinel-2 cubes.
  • scripts/launchers/run_gpu_terracodec_node10.slurm and scripts/launchers/run_gpu_terracodec_suite_node10.slurm: GPU launchers used for those comparisons.
  • scripts/run_xarrayvideo_single_cube.py: classical codec baseline on the same cube slices used for TerraCodec.

These scripts are intentionally separate from the default ffmpeg-centric workflow because TerraCodec has different environment, hardware, and model assumptions.

TACO and Tortilla integration

This repo also contains packaging helpers for .tortilla and .taco datasets.

  • pytortilla is used to wrap single samples.
  • tacotoolbox is used to assemble collections of samples.

Relevant scripts:

  • scripts/process_deepextremes.py: convert DeepExtremeCubes into xarrayvideo/TACO-friendly form.
  • scripts/process_dynamicearthet.py: convert DynamicEarthNet into xarray first, then into xarrayvideo/TACO form. The filename has a historical typo but is the current tracked script.
  • scripts/download_from_hf.py: download prepared artifacts from Hugging Face.
  • scripts/upload_taco.py: upload packaged datasets.
  • scripts/legacy/deepextremecubes_to_tacov2.py and scripts/legacy/dynamicearthnet_to_tacov2.py: migration helpers for older TACO layouts.

Scripts overview

Benchmarking and validation:

  • scripts/run_tests.py: main paper benchmark runner.
  • scripts/run_xarrayvideo_single_cube.py: single-cube rate-distortion check.
  • scripts/reproduce_repetition_comprehensive.py: repetition-versus-padding compression study.
  • scripts/synthetic_missing_tests.py: missing-data handling tests.
  • scripts/era5_diagnostics.py: post-process ERA5 benchmark outputs into diagnostics.

Codec setup and probing:

  • scripts/install_vvc_avs3_codecs.sh: build external codec toolchains.
  • scripts/activate_codec_tools.sh: add codec binaries to PATH.
  • scripts/find_encoders.sh: inspect ffmpeg encoder and pixel-format support.
  • scripts/launchers/: cluster and batch launchers kept out of the repo root.

TerraCodec utilities:

  • scripts/run_terracodec_tests.py: per-cube TerraCodec benchmark.
  • scripts/run_terracodec_suite.py: multi-cube TerraCodec benchmark orchestration.
  • scripts/check_terracodec_scaling.py: verify TerraCodec input scaling.

Dataset preparation and packaging:

  • scripts/process_deepextremes.py: DeepExtremeCubes processing.
  • scripts/process_dynamicearthet.py: DynamicEarthNet processing.
  • scripts/gen_patches.py: patch extraction helper.
  • scripts/fix_metadata.py: metadata repair utility.
  • scripts/upload_taco.py: Hugging Face upload helper.
  • scripts/download_from_hf.py: Hugging Face download helper.

Legacy and one-off helpers kept for reproducibility:

  • scripts/legacy/README.md: quick index for archived helpers.
  • scripts/legacy/check_max_vals.py: inspect dataset value ranges.
  • scripts/legacy/find_processing_gap.py: locate processing gaps in prepared datasets.
  • scripts/legacy/measure_ram.py: memory measurement helper.
  • scripts/legacy/plot_synthetic.py: plotting helper for synthetic missing-data experiments.
  • scripts/legacy/deepextremecubes_to_tacov2.py: DeepExtremeCubes TACO migration.
  • scripts/legacy/dynamicearthnet_to_tacov2.py: DynamicEarthNet TACO migration.
  • scripts/uavs3e_ra.cfg: AVS3 encoder configuration used by the benchmarks.

The main scripts/ folder now contains the active workflows. Older migration and ad hoc analysis utilities live under scripts/legacy/ so the top-level script surface stays focused.

Contact

Email oscar.pellicer [at] uv.es or open an issue on the repository.
