Genomes

GenomeKit now supports building and using assemblies and annotations. Assemblies follow the UCSC format, and annotations must be supplied as GFF3 files in GENCODE, Ensembl, or NCBI format.

Examples

Clone the GenomeKit git repo to see scripts under data-src/ for examples of how to build annotation data files.

git clone https://github.com/deepgenomics/GenomeKit.git
pushd GenomeKit

Scripts under data-src are used to obtain and generate the data files:

  • data-src/<assembly>/assembly for the assembly, e.g. data-src/hg19/assembly

  • data-src/<assembly>/<annotation-source>/<annotation> for an annotation, e.g. data-src/hg19/GENCODE/v26lift37
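
The layout above can be written down as a small path helper. This is only a sketch: assembly_script_dir and annotation_script_dir are hypothetical names, not part of GenomeKit, and they do nothing more than join the path components listed above.

```python
from pathlib import PurePosixPath

# Hypothetical helpers (not part of GenomeKit): they only mirror the
# data-src/ layout described above.
DATA_SRC = PurePosixPath("data-src")


def assembly_script_dir(assembly):
    """Return data-src/<assembly>/assembly."""
    return DATA_SRC / assembly / "assembly"


def annotation_script_dir(assembly, source, annotation):
    """Return data-src/<assembly>/<annotation-source>/<annotation>."""
    return DATA_SRC / assembly / source / annotation


print(assembly_script_dir("hg19"))
print(annotation_script_dir("hg19", "GENCODE", "v26lift37"))
```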

Assemblies

  1. Generate a hash file (replace hg19 with the desired assembly name):

    echo $(python -c 'import genome_kit as gk; print(gk.Genome._refg_hash("hg19"))') > hg19.hash

  2. Copy the 2bit, chrom.sizes, and chromAlias.txt files from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ (the URL shown is for hg38), together with the hash file you generated, into your GenomeKit data directory. The following command prints its location:

    python -c 'import os ; import appdirs ; print(os.environ.get("GENOMEKIT_DATA_DIR", appdirs.user_data_dir("genome_kit")))'

    If you need to generate these files from a FASTA file:

    1. conda create -n ucsc-tools ucsc-fatotwobit ucsc-twobitinfo

    2. conda activate ucsc-tools

    3. Follow the instructions at https://genome.ucsc.edu/goldenPath/help/twoBit.html

    4. Optionally, create a chromAlias.txt with any contig aliases required.
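
The copy step can also be scripted. The sketch below is a minimal illustration, not GenomeKit's own tooling: resolve_data_dir and install_assembly_files are hypothetical helpers, and the file names (<assembly>.2bit, <assembly>.chrom.sizes, <assembly>.chromAlias.txt, <assembly>.hash) are an assumption based on the UCSC download names and the hash file created in step 1. The chromAlias.txt file is treated as optional here.

```python
import os
import shutil
from pathlib import Path


def resolve_data_dir():
    """Return the data directory, mirroring the one-liner shown above."""
    override = os.environ.get("GENOMEKIT_DATA_DIR")
    if override is not None:
        return Path(override)
    # appdirs is imported lazily so the override works without it installed.
    import appdirs
    return Path(appdirs.user_data_dir("genome_kit"))


def install_assembly_files(src_dir, assembly, data_dir=None):
    """Copy the files for `assembly` from src_dir into the data directory.

    File names are assumed to be <assembly>.<suffix>; adjust if yours differ.
    """
    src_dir = Path(src_dir)
    data_dir = Path(data_dir) if data_dir is not None else resolve_data_dir()

    required = [src_dir / f"{assembly}.{s}" for s in ("2bit", "chrom.sizes", "hash")]
    optional = src_dir / f"{assembly}.chromAlias.txt"

    # Check everything up front so a missing file does not leave a partial copy.
    missing = [str(p) for p in required if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing assembly files: " + ", ".join(missing))

    sources = required + ([optional] if optional.is_file() else [])
    data_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(p, data_dir / p.name)) for p in sources]
```

For example, install_assembly_files(".", "hg19") would copy the hg19 files from the current directory into the resolved data directory.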

Annotations

  1. Build the annotation from the GFF3 file, here with build_gencode:

    python -c 'import genome_kit as gk; print(gk.GenomeAnnotation.build_gencode("MY_ANNO.gff3", "MY_ANNO", gk.Genome("MY_ASSEMBLY")))'

  2. Copy the resulting files into the directory printed by:

    python -c 'import appdirs; print(appdirs.user_data_dir("genome_kit"))'

    The .dganno file contains the compiled GFF3 and the .cfg file contains metadata, such as refg=hg38.
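
To check the metadata before copying, a small reader can help. The sketch below assumes the .cfg file is plain text with one key=value pair per line, which the refg=hg38 example suggests but the text above does not confirm. read_cfg is a hypothetical helper, not a GenomeKit function.

```python
from pathlib import Path


def read_cfg(path):
    """Parse a key=value file into a dict (assumed format, see above)."""
    metadata = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or unrecognised lines
        key, value = line.split("=", 1)
        metadata[key.strip()] = value.strip()
    return metadata
```

If the assumption holds, read_cfg("MY_ANNO.cfg").get("refg") returns the name of the assembly the annotation was built against.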