polario

Polars IO utility library

Helpers that make it easier to read and write Hive-partitioned Parquet datasets with Polars.

It is primarily a library for working with datasets, but it also includes a command-line interface for inspecting Parquet files and datasets.

Dataset

Example usage of polario.hive_dataset.HiveDataset:

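import polars as pl

from polario.hive_dataset import HiveDataset

# Two rows that differ in the partition column p1
df = pl.from_dicts(
    [
        {"p1": 1, "v": 1},
        {"p1": 2, "v": 1},
    ]
)

# Dataset rooted at /tmp/, partitioned on p1
ds = HiveDataset("file:///tmp/", partition_columns=["p1"])

ds.write(df)

# Read the data back one partition at a time
for partition_df in ds.read_partitions():
    print(partition_df)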

To model data storage, we use three layers: dataset, partition, fragment.

  • Each dataset is a lexically ordered set of partitions
  • Each partition is a lexically ordered set of fragments
  • Each fragment is a file on disk with rows in any order
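
The three layers above can be illustrated with a small standard-library sketch. It does not use polario itself: the `list_layers` helper, the `p1=1`-style directory names (the usual Hive convention), and the fragment file names are all assumptions made for illustration, and the files are empty placeholders rather than real Parquet data.

```python
import tempfile
from pathlib import Path


def list_layers(root: Path) -> dict[str, list[str]]:
    """Map each partition directory to its fragment files, both in lexical order."""
    layers = {}
    for partition in sorted(p for p in root.iterdir() if p.is_dir()):
        layers[partition.name] = sorted(f.name for f in partition.glob("*.parquet"))
    return layers


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: directories are partitions, files are fragments.
    # Created out of order on purpose to show that listing sorts them.
    for partition, fragment in [
        ("p1=2", "b.parquet"),
        ("p1=1", "b.parquet"),
        ("p1=1", "a.parquet"),
    ]:
        (root / partition).mkdir(exist_ok=True)
        (root / partition / fragment).touch()  # empty placeholder file

    layers = list_layers(root)
    print(layers)
```

Partitions and fragments come back sorted by name regardless of creation order; the rows inside a fragment carry no ordering guarantee.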

"""
.. include:: ../README.md
"""
from importlib.metadata import version
from typing import Optional, TypeVar

__version__ = version(__name__)

T = TypeVar("T")


def unwrap(value: Optional[T]) -> T:
    """Simple unwrap method to read datasets that are assumed to have data

    Example:
    ```python
    dataset.write(pl.DataFrame(...))
    unwrap(dataset.scan()).collect() # Should not raise
    ```

    Raises:
        ValueError: If value is None
    """
    if value is None:
        raise ValueError("Value is None")
    return value

def unwrap(value: Optional[T]) -> T:

Simple unwrap method to read datasets that are assumed to have data

Example:

dataset.write(pl.DataFrame(...))
unwrap(dataset.scan()).collect() # Should not raise

Raises:
    ValueError: If value is None
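
A minimal sketch of both outcomes of `unwrap`, written so it runs without polario installed. The `unwrap` body is a local copy of the source shown on this page; `maybe_scan` is a hypothetical stand-in for a read that returns `None` when nothing is stored, which is presumably the case `unwrap` is meant to guard against.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def unwrap(value: Optional[T]) -> T:
    """Local copy of unwrap so the sketch is self-contained."""
    if value is None:
        raise ValueError("Value is None")
    return value


def maybe_scan(has_data: bool) -> Optional[list[int]]:
    """Hypothetical read: returns rows, or None when there is no data."""
    return [1, 2] if has_data else None


# A present value is returned unchanged.
rows = unwrap(maybe_scan(True))

# A missing value raises ValueError instead of propagating None.
try:
    unwrap(maybe_scan(False))
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)

print(rows, raised)
```

The benefit over checking for `None` inline is mostly for type checkers: the return type is narrowed from `Optional[T]` to `T`, so chained calls such as `.collect()` type-check without an explicit guard.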