DataFrame Utilities
The genome_kit.df subpackage contains utilities for working with Polars DataFrames that contain GenomeKit objects. This includes utilities for serializing DataFrames with GenomeKit objects to Parquet and deserializing them back to GenomeKit objects. This is useful when sharing tabular data sets, or when saving intermediate DataFrames to disk during data processing.
Important
genome_kit.df depends on Polars, an optional dependency that is not installed by default. Install it with the [df] extra:
pip install "genomekit[df]"
If you are running an x86 build of Python on an Apple Silicon Mac (e.g. an M1 chip), this also installs the polars-runtime-compat package, which Polars needs in that configuration because of AVX instruction compatibility issues.
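Because the dependency is optional, it can be helpful to check for it before importing genome_kit.df. The following is a minimal sketch using only the standard library; has_module and require_df_extra are hypothetical helper names, not part of GenomeKit.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(name) is not None


def require_df_extra() -> None:
    """Raise a descriptive error if Polars is missing (hypothetical helper)."""
    if not has_module("polars"):
        raise ImportError(
            'genome_kit.df needs Polars; install it with: pip install "genomekit[df]"'
        )
```

Calling require_df_extra() at the top of a script turns a missing extra into an error message that names the install command.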
Quickstart
The serialization and deserialization entry points are read_parquet() and write_parquet():
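import polars as pl
import genome_kit as gk
from genome_kit.df import read_parquet, write_parquet

genome = gk.Genome("ncbi_refseq.v110")

# Build a DataFrame with one column of GenomeKit objects.
df = pl.DataFrame(
    {
        "gene": [genome.genes[0], genome.genes[1]],
        "score": [0.1, 0.8],
    }
)

# Serialize the DataFrame, including the GenomeKit objects, to Parquet.
write_parquet(df, "genes.parquet")
...
...
# Read the file back and restore the GenomeKit objects.
restored_df = read_parquet("genes.parquet")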
Note
The written Parquet files can be read by any software that supports the Parquet format, but the GenomeKit objects are restored only when the files are read with genome_kit.df.read_parquet().
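To see what other Parquet readers would encounter, the file can be inspected with plain Polars. This is a sketch only: raw_schema is a hypothetical helper, and this page does not specify how GenomeKit objects are encoded on disk, so the dtypes reported for those columns are not predicted here.

```python
from __future__ import annotations

from pathlib import Path


def raw_schema(path: str | Path) -> dict:
    """Return column names and dtypes as plain Polars sees them."""
    import polars as pl  # imported lazily because Polars is an optional dependency

    # Reads only the Parquet metadata, not the data itself.
    return dict(pl.read_parquet_schema(path))
```

For example, raw_schema("genes.parquet") would list the stored form of the "gene" column next to the ordinary "score" column, without restoring any GenomeKit objects.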
Supported GenomeKit Objects
The currently supported GenomeKit objects for serialization are:
genome_kit.CDS
genome_kit.UTR
Public API
- read_parquet(path: str | Path, lazy: bool = False) → pl.DataFrame | pl.LazyFrame
Deserialize a Parquet file containing GenomeKit objects into a Polars DataFrame or LazyFrame.
- Parameters:
path – The file path to read the Parquet file from.
lazy – If True, return a LazyFrame. Otherwise, return a DataFrame.
- Returns:
A Polars DataFrame or LazyFrame with deserialized GenomeKit objects.
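The lazy flag can be sketched as follows. The example assumes the file has a numeric "score" column, as in the Quickstart, and that the returned LazyFrame supports the usual Polars filter and collect operations; top_scoring is a hypothetical helper name.

```python
from __future__ import annotations

from pathlib import Path


def top_scoring(path: str | Path, min_score: float):
    """Lazily read a file and keep rows whose score is at least min_score."""
    import polars as pl  # lazy imports: both packages need the [df] extra
    from genome_kit.df import read_parquet

    lazy_frame = read_parquet(path, lazy=True)  # documented to return a LazyFrame
    # Assumption: standard LazyFrame operations apply to the result.
    return lazy_frame.filter(pl.col("score") >= min_score).collect()
```

With lazy=False (the default), the same call returns a DataFrame directly and no collect() step is needed.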
- write_parquet(df: pl.DataFrame | pl.LazyFrame, path: str | Path, infer_schema_length: int = 100) → None
Serialize a DataFrame with GenomeKit objects to a Parquet file.
- Parameters:
df – A Polars DataFrame or LazyFrame with columns containing GenomeKit objects.
path – The file path to write the Parquet file to.
infer_schema_length – The number of rows to use for schema inference when writing the Parquet file.
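As an illustration of the infer_schema_length parameter, the sketch below passes the full row count so that every row is available for schema inference. Whether a larger value changes the result for a given DataFrame is an assumption not confirmed by this page, and write_with_full_inference is a hypothetical helper name.

```python
from __future__ import annotations

from pathlib import Path


def write_with_full_inference(df, path: str | Path) -> None:
    """Write an eager DataFrame, using all of its rows for schema inference."""
    from genome_kit.df import write_parquet  # lazy import: needs the [df] extra

    # df.height is the row count of an eager pl.DataFrame (not a LazyFrame);
    # max(..., 1) avoids passing 0 for an empty frame.
    write_parquet(df, path, infer_schema_length=max(df.height, 1))
```

The default of 100 rows is likely cheaper for large frames, so a sketch like this only makes sense when the first rows are not representative of the column contents.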