CaeNDR | Data Releases

C. tropicalis releases

Release Notes

The 20250627 release includes genotypes from whole-genome sequences. Genotypes are compared for concordance, and strains that are 99.9915% identical to each other are grouped into isotypes. One strain within each isotype is the reference strain for that isotype. To look up isotype assignment, see Isotype List. All isotype reference strains are available on CaeNDR.
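The grouping step described above can be illustrated with a minimal sketch. It is not the CaeNDR pipeline: the strain names and genotype vectors are made up, sites missing in either strain are skipped, the cutoff is treated as "at least 99.9915%", and strains are merged by single linkage (union-find). All of these are assumptions made for illustration.

```python
from itertools import combinations

THRESHOLD = 0.999915  # 99.9915%, the cutoff quoted in the release notes


def concordance(a, b):
    """Fraction of sites with identical calls; sites missing (None) in either strain are skipped."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)


def group_isotypes(genotypes, threshold=THRESHOLD):
    """Group strains whose pairwise concordance meets the threshold (single linkage, an assumption)."""
    parent = {strain: strain for strain in genotypes}

    def find(strain):
        # Follow parent links to the group representative, compressing the path.
        while parent[strain] != strain:
            parent[strain] = parent[parent[strain]]
            strain = parent[strain]
        return strain

    for s1, s2 in combinations(genotypes, 2):
        if concordance(genotypes[s1], genotypes[s2]) >= threshold:
            parent[find(s1)] = find(s2)

    groups = {}
    for strain in genotypes:
        groups.setdefault(find(strain), []).append(strain)
    return sorted(sorted(members) for members in groups.values())


# Toy data: 100,000 sites, hypothetical strain names.
n_sites = 100_000
toy = {
    "STRAIN_A": [0] * n_sites,
    "STRAIN_B": [1] * 5 + [0] * (n_sites - 5),    # differs from STRAIN_A at 5 sites
    "STRAIN_C": [1] * 50 + [0] * (n_sites - 50),  # differs from STRAIN_A at 50 sites
}
isotype_groups = group_isotypes(toy)
print(isotype_groups)
```

In this toy set, STRAIN_A and STRAIN_B fall into one group and STRAIN_C stays on its own. How the reference strain is chosen within a group is not covered here.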

Release Summary

Strains: 785
WGS strains: 785
Isotypes: 622
Genome: NIC58
BioProject: PRJNA53597

Datasets

Datasets available for this species.
Dataset	Description	Download
Strain Data	Includes strain, isotype, location information, and more.	20250627_c_tropicalis_strain_data.csv
Strain Issues		This link contains all strain issues up to this release
Alignment Data	Alignment data are stored as BAM files, which are binary representations of the Sequence Alignment/Map format. The specifications for these file formats continue to develop. Current specifications for BAM and VCF can be found at hts-specs.	This link contains all alignment data as BAM or BAI files.
Variant Data	Strain-level variant information is stored in the VCF and genomic VCF (gVCF) formats. The gVCF format contains information for every base, regardless of whether a variant is present, and is suitable for compiling and joint-calling variants across a custom strain set. These files were produced by GATK.	This link contains all genomic variant data as VCF, TBI, or gVCF files.
Soft-Filtered Variants	Variant information is stored in the VCF format, which is a tab-delimited format for storing variant calls and individual genotypes. It is able to store all variant calls from single nucleotide variants to insertions and deletions (~50 bp). The soft-filtered VCF includes all variants and annotations called by the GATK pipeline. The QC status of each variant is recorded in the `FILTER` column, and the QC status of each genotype is recorded in the `FT` FORMAT field. The specifications for these file formats continue to develop. Current specifications for BAM and VCF can be found at hts-specs.	All Strains WI.20250627.soft-filter.vcf.gz WI.20250627.soft-filter.vcf.gz.tbi Isotypes WI.20250627.soft-filter.isotype.vcf.gz WI.20250627.soft-filter.isotype.vcf.gz.tbi
Hard-Filtered Variants	Variant information is stored in the VCF format, which is a tab-delimited format for storing variant calls and individual genotypes. It is able to store all variant calls from single nucleotide variants to insertions and deletions (~50 bp). The hard-filtered VCF includes only high-quality variants after all variants and genotypes with a failed QC status are removed. To obtain a VCF for a single strain or a subset of strains, use `bcftools view --samples`. The specifications for these file formats continue to develop. Current specifications for BAM and VCF can be found at hts-specs.	All Strains WI.20250627.hard-filter.vcf.gz WI.20250627.hard-filter.vcf.gz.tbi Isotypes WI.20250627.hard-filter.isotype.vcf.gz WI.20250627.hard-filter.isotype.vcf.gz.tbi
Annotated Variants	Variant information is stored in the VCF format, which is a tab-delimited format for storing variant calls and individual genotypes. It is able to store all variant calls from single nucleotide variants to insertions and deletions (~50 bp). The annotated VCFs include all the variants from the hard-filtered Isotype VCF and have been annotated using four different tools: ANNOVAR, CSQ, SnpEff, and VEP. The specifications for these file formats continue to develop. Current specifications for BAM and VCF can be found at hts-specs.	ANNOVAR WI.20250627.annovar.isotype.vcf.gz WI.20250627.annovar.isotype.vcf.gz.tbi WI.20250627.annovar.strain-annotation.csv.gz CSQ WI.20250627.csq.isotype.vcf.gz WI.20250627.csq.isotype.vcf.gz.tbi WI.20250627.csq.strain-annotation.csv.gz SnpEff WI.20250627.snpeff.isotype.vcf.gz WI.20250627.snpeff.isotype.vcf.gz.tbi WI.20250627.snpeff.strain-annotation.csv.gz VEP WI.20250627.vep.isotype.vcf.gz WI.20250627.vep.isotype.vcf.gz.tbi WI.20250627.vep.strain-annotation.csv.gz
Imputed Variants	Variant information is stored in the VCF format, which is a tab delimited format for storing variant calls and individual genotypes. It is able to store all variant calls from single nucleotide variants to insertions and deletions (~50 bp). The imputed VCF includes all the variants from the hard-filtered Isotype VCF, but all missing genotypes have been imputed using Beagle v5.1. The specifications for these file formats continue to develop. Current specifications for BAM and VCF can be found at hts-specs.	Imputed WI.20250627.impute.isotype.vcf.gz WI.20250627.impute.isotype.vcf.gz.tbi
Reference Genome FASTA (NIC58)	The reference genome build from Noble, 2021 used for alignment and annotation.	20250627_c_tropicalis_June2021.genome.fa
Gene models	Gene models were constructed using a combination of BRAKER (short-read) and StringTie + TransDecoder (long-read) followed by QC with AGAT and manual curation with Apollo using the reference genome NIC58.	canonical_geneset.gtf.gz annotations.gff3.gz current.geneIDs.txt.gz
Tree	Trees were generated using the neighbour-joining algorithm as implemented in QuickTree and are provided in Newick and PDF formats.	All Strains WI.20250627.hard-filter.min4.tree WI.20250627.hard-filter.min4.tree.pdf Isotype WI.20250627.hard-filter.isotype.min4.tree WI.20250627.hard-filter.isotype.min4.tree.pdf
Haplotypes	Haplotypes for isotypes were calculated and plotted as described in Lee et al.	20250627_c_tropicalis_haplotype.png 20250627_c_tropicalis_haplotype.pdf
Hyper-Divergent Regions	The hyper-divergent regions are characterized by higher-than-average density of small variants and large genomic spans where short sequence reads fail to align to the reference genome. For more information, see the FAQ.	20250627_c_tropicalis_divergent_regions_strain.bed 20250627_c_tropicalis_divergent_regions_strain.bed.gz
Download BAMs Script	You can batch download individual strain BAMs using this script.	20250627_c_tropicalis_bam_bai_download.sh
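
The Hard-Filtered Variants entry above suggests `bcftools view --samples` for extracting a single strain or a subset of strains. A minimal sketch of assembling that command is shown here. The strain names and output filename are placeholders, the helper `bcftools_subset_command` is hypothetical, and the sketch assumes `bcftools` is installed and the VCF has been downloaded alongside its `.tbi` index.

```python
import shlex


def bcftools_subset_command(vcf_path, strains, out_path):
    """Return the argument list for subsetting a VCF to the given strains (hypothetical helper)."""
    if not strains:
        raise ValueError("at least one strain name is required")
    return [
        "bcftools", "view",
        "--samples", ",".join(strains),  # comma-separated list of sample names
        "-O", "z",                       # write bgzip-compressed VCF
        "-o", out_path,                  # output file
        vcf_path,
    ]


cmd = bcftools_subset_command(
    "WI.20250627.hard-filter.vcf.gz",
    ["STRAIN_A", "STRAIN_B"],            # placeholder strain names
    "subset.hard-filter.vcf.gz",         # placeholder output name
)
print(shlex.join(cmd))                   # shell-ready command line
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to run the command. Building the arguments as a list rather than a single string avoids shell-quoting problems with unusual strain names.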

Methods are not available at this time.
