Genomes
GenomeKit now supports building and using assemblies and annotations. For assemblies, the schema will follow the UCSC format, and for annotations, they must be specified in GENCODE/Ensembl/NCBI GFF3 formats.
Examples
starter/build.sh
is a bash script that builds a few assemblies and annotations to get you started.
Assemblies
Copy the 2bit, chrom.sizes, and chromAlias.txt files from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/. into your
python -c 'import appdirs; print(appdirs.user_data_dir("genome_kit"))'
directory.
If you need to generate from a fasta:
conda create -n ucsc-tools ucsc-fatotwobit ucsc-twobitinfo
conda activate ucsc-tools
follow the instructions at https://genome.ucsc.edu/goldenPath/help/twoBit.html
optionally create an chromAlias.txt with any contig aliases required.
Annotations
python -c 'import genome_kit as gk; print(gk.GenomeAnnotation.build_gencode("MY_ANNO.gff3", "MY_ANNO", gk.Genome("MY_ASSEMBLY")))'
Copy the resulting files into the
python -c 'import appdirs; print(appdirs.user_data_dir("genome_kit"))'
directory.
The .dganno file contains the compiled
GFF3
and the .cfg file contains metadata, such asrefg=hg38
.