# polario
Polars IO utility library
Helpers that make it easier to read and write Hive-partitioned Parquet datasets with Polars.

It is primarily a library for working with datasets, but it also ships a command-line interface that makes inspecting Parquet files and datasets easier.
## Dataset
Example use of `polario.hive_dataset.HiveDataset`:
```python
from polario.hive_dataset import HiveDataset
import polars as pl

df = pl.from_dicts(
    [
        {"p1": 1, "v": 1},
        {"p1": 2, "v": 1},
    ]
)

ds = HiveDataset("file:///tmp/", partition_columns=["p1"])

ds.write(df)

for partition_df in ds.read_partitions():
    print(partition_df)
```
To model data storage, we use three layers: dataset, partition, fragment.

- Each dataset is a lexically ordered set of partitions
- Each partition is a lexically ordered set of fragments
- Each fragment is a file on disk with rows in any order
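These layers map directly onto the on-disk layout of standard Hive partitioning: each `p1=<value>` directory is a partition, and each Parquet file inside it is a fragment. For the two-row example dataset written above, the tree would look roughly like this (the fragment file names are illustrative, not the exact names polario generates):

```
/tmp/
├── p1=1/
│   └── fragment-0.parquet
└── p1=2/
    └── fragment-0.parquet
```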
````python
"""
.. include:: ../README.md
"""

from typing import Optional, TypeVar

from ._version import __version__ as __version__

T = TypeVar("T")


def unwrap(value: Optional[T]) -> T:
    """Simple unwrap method to read datasets that are assumed to have data

    Example:
    ```python
    dataset.write(pl.DataFrame(...))
    unwrap(dataset.scan()).collect()  # Should not raise
    ```

    Raises:
        ValueError: If value is None
    """
    if value is None:
        raise ValueError("Value is None")
    return value
````