Abstract

Bioinformatic analysis—such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, file format conversion, and processing and analysis—is integrated into diverse disciplines in the biological sciences. Several command-line pieces of software have been developed to conduct some of these individual analyses, but unified toolkits that conduct all these analyses are lacking. To address this gap, we introduce BioKIT, a versatile command line toolkit that has, upon publication, 42 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we conducted a comprehensive examination of relative synonymous codon usage across 171 fungal genomes that use alternative genetic codes, showed that the novel metric of gene-wise relative synonymous codon usage can accurately estimate gene-wise codon optimization, evaluated the quality and characteristics of 901 eukaryotic genome assemblies, and calculated alignment summary statistics for 10 phylogenomic data matrices. BioKIT will be helpful in facilitating and streamlining sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPi (https://pypi.org/project/jlsteenwyk-biokit/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/jlsteenwyk-biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).

Introduction

Bioinformatics is the application of computational tools to process and analyze biological data, such as nucleotide or amino acid sequences in the form of genome assemblies, gene annotations, and multiple sequence alignments (Bayat 2002). Diverse disciplines in the biological sciences rely on bioinformatic methods and software (Wren 2016). Researchers have acknowledged the need to consider diverse types of biological scientists with different levels of experience, including those with no experience—such as students in classroom settings, when developing software (Kumar and Dudley 2007). It is also essential to implement high standards of software development that ensure software functionality and archival stability (Mangul, Martin, et al. 2019; Mangul, Mosqueiro, et al. 2019). For example, code quality can be improved by utilizing tests that help ensure faithful function of code and facilitate debugging (Darriba et al. 2018). More specifically, integration tests examine end-to-end functionality of software, whereas unit tests evaluate the functionality of smaller chunks of code. Both tests ensure specific inputs result in expected outputs. As a result, the development of effective and user-friendly software for diverse biologists often requires an interdisciplinary team of software engineers, biologists, and others.

Even though numerous bioinformatic pieces of software are available, there are still several barriers to creating seamless and reproducible workflows (Kim et al. 2018). This issue in part stems from different pieces of software requiring different input file formats, being unable to account for nonstandard biological phenomena, such as the use of alternative genetic codes, or can only be executed using web servers or graphical user interfaces, which cannot be incorporated into high-throughput pipelines. Another factor is that multiple pieces of software or custom scripts are typically needed to execute different steps in a larger bioinformatic pipeline; for example, bioinformatic workflows often rely on 1 software/script for converting file formats, another software/script for translating sequences using standard and nonstandard genetic codes, another software/script to examine the properties of genomes or multiple sequence alignments, etc. As a result, maintaining efficacious bioinformatic workflows is cumbersome (Kulkarni et al. 2018). To address this need, package, and workflow managers have facilitated stringing together numerous pieces of software. An alternative approach to addressing this issue is the development of multipurpose toolkits that contain diverse processing and analysis functions and therefore reduce the overall number of pieces of software users need to install regardless of the use of package and workflow managers.

To address this need, we—an interdisciplinary team of software engineers, evolutionary biologists, molecular biologists, microbiologists, and others—developed BioKIT, a versatile toolkit that currently has 42 functions, several of which were requested by the research community, that conduct routine and novel processing and analysis of diverse sequence files including genome assemblies, multiple sequence alignments, protein coding sequences, and sequencing data (Table 1). Functions implemented in BioKIT facilitate a wide variety of standard bioinformatic analyses, including processing and analysis of protein coding sequences (e.g. translation using 26 genetic codes including user-specified translation tables, GC content at the first, second, and third codon positions, and relative synonymous codon usage), evaluation of genome assembly quality [e.g. N50, L50, assembly size, guanine–cytosine (GC) content, number of scaffolds, and others], and the calculation of multiple sequence alignment properties (i.e. number of taxa, alignment length, the number of constant sites, the number of parsimony-informative sites, and the number of variable sites). To demonstrate the utility of BioKIT, we calculated relative synonymous codon usage in 171 fungal genomes, estimated codon optimization in each gene from 2 Saccharomyces budding yeast species using a novel metric, gene-wise relative synonymous codon usage (gRSCU), examined the genome assembly quality of 901 eukaryotic genomes, and evaluated the properties of 10 phylogenomic data matrices. We hope that BioKIT will improve reproducibility and accessibility of diverse bioinformatic analysis, serve as an effective teaching tool, and facilitate discovery in the biological sciences.

Table 1.

Summary of 42 functions available in BioKIT at the time of publication.

Function nameDescriptionType of functionInput dataCitationExample software that performs this function
alignment_lengthCalculate alignment lengthAnalysisMultiple-sequence file in FASTA formatNAAMAS (Borowiec 2016)
alignment_recodingRecode alignments using reduced character statesProcessingMultiple-sequence file in FASTA formatWoese et al. (1991), Embley et al. (2003), Kosiol et al. (2004), Hrdy et al. (2004), Susko and Roger (2007)Custom scripts (Hernandez and Ryan 2021)
alignment_summarySummarize diverse properties of a multiple sequence alignmentAnalysisMultiple-sequence file in FASTA formatNAAMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016); PhyKIT (Steenwyk, Buida, et al. 2021)
consensus_sequenceGenerates a consensus sequenceAnalysisMultiple-sequence file in FASTA formatSternke et al. (2019)Geneious (https://www.geneious.com)
constant_sitesDetermine the number of constant sites in an alignmentAnalysisMultiple-sequence file in FASTA formatKumar et al. (2016)IQ-TREE (Minh et al. 2020)
parsimony_ informative_sitesDetermine the number of parsimony-informative sites in an alignmentAnalysisMultiple-sequence file in FASTA formatKumar et al. (2016)AMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016)
position_specific_ score_matrixGenerates a position specific score matrix for an alignmentAnalysisMultiple-sequence file in FASTA formatGribskov et al. (1987)BLAST+ (Camacho et al. 2009)
variable_sitesDetermine the number of variable sites in an alignmentAnalysisMultiple-sequence file in FASTA formatShen, Salichos, et al. (2016)AMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016); PhyKIT (Steenwyk, Buida, et al. 2021)
gc_content_first_ positionDetermine the GC content of the first codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gc_content_second_ positionDetermine the GC content of the second codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gc_content_third_ positionDetermine the GC content of the third codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gene_wise_relative_ synonymous_ codon_usageCalculate gene-wise relative synonymous codon usageAnalysisProtein coding sequences in FASTA formatThis studyThis study
relative_synonymous_ codon_usageCalculate relative synonymous codon usageAnalysisProtein coding sequences in FASTA formatXu et al. (2008)MEGA (Kumar et al. 2016)
translate_sequenceTranslate protein coding sequences to amino acid sequencesProcessingProtein coding sequences in FASTA formatNAEMBOSS (Rice et al. 2000)
fastq_read_lengthsExamine the distribution of read lengthsAnalysisSequence reads in FASTQ formatNAFQStat (Chanumolu et al. 2019)
subset_pe_fastq_readsDown sample paired-end readsProcessingSequence reads in FASTQ formatNASeqKit (Shen, Le, et al. 2016)
subset_se_fastq_readsDown sample single-end readsProcessingSequence reads in FASTQ formatNASeqKit (Shen, Le, et al. 2016)
trim_pe_fastq_readsTrim paired-end reads based on quality and length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_se_fastq_readsTrim single-end reads based on quality and length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_pe_adapters_fastq_readsTrim adapters from paired-end reads and implement length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_se_ adapters_fastq_readsTrim adapters from single-end reads and implement length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
gc_contentDetermine GC contentAnalysisFASTA file of nucleotide sequencesRomiguier et al. (2010)custom scripts (Shen, Salichos, et al. 2016); GC-Profile (Gao and Zhang 2006)
genome_assembly_metricsDetermine diverse properties of a genome assembly for quality assessment and characterizationAnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013); REAPR (Hunt et al. 2013)
l50L50AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
l90L90AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
longest_scaffoldDetermine the length of the longest entry in a FASTA fileAnalysisFASTA fileGurevich et al. (2013)Custom scripts (Ou et al. 2020)
n50N50AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
n90N90AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
number_of_large_scaffoldsDetermine the number and length of scaffolds longer than 500 nucleotides. Length threshold of 500 nucleotides can be modified by the userAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
number_of_scaffoldsDetermine the number of FASTA entriesAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
sum_of_scaffold_lengthsDetermine the total length of all FASTA entriesAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
character_frequencyDetermine the frequency of each character. Gaps are assumed to be represented as ‘?’ and ‘-’ charactersAnalysisFASTA fileNABiostrings (https://rdrr.io/bioc/Biostrings/)
faidxGet sequence entry from FASTA fileProcessingFASTA fileNASAMtools (Li et al. 2009)
file_format_converterConverts multiple sequence alignments from one format to anotherProcessingFASTA, Clustal, MAF, Mauve, Phylip, Phylip-sequential, Phylip-relaxed, and StockholmNAALTER (Glez-Pena et al. 2010)
multiple_line_to_single_line_fastaReformat sequences to be represented on one lineProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
remove_fasta_entryRemove sequence based on entry identifierProcessingFASTA fileNANA
remove_short_sequencesRemove short sequencesProcessingFASTA fileNANA
rename_fasta_entriesRename FASTA entriesProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
reorder_by_sequence_lengthReorder FASTA entries by lengthProcessingFASTA fileNASeqKit (Shen, Le, et al. 2016)
sequence_complementGenerate sequence complements in the forward or reverse directionProcessingFASTA fileBritten (1998)EMBOSS (Rice et al. 2000)
sequence_lengthCalculate the length of each FASTA fileAnalysisFASTA fileNAbioawk (https://github.com/lh3/bioawk)
single_line_to_multiple_line_fastaReformat sequences to be represented on multiple linesProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
Function nameDescriptionType of functionInput dataCitationExample software that performs this function
alignment_lengthCalculate alignment lengthAnalysisMultiple-sequence file in FASTA formatNAAMAS (Borowiec 2016)
alignment_recodingRecode alignments using reduced character statesProcessingMultiple-sequence file in FASTA formatWoese et al. (1991), Embley et al. (2003), Kosiol et al. (2004), Hrdy et al. (2004), Susko and Roger (2007)Custom scripts (Hernandez and Ryan 2021)
alignment_summarySummarize diverse properties of a multiple sequence alignmentAnalysisMultiple-sequence file in FASTA formatNAAMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016); PhyKIT (Steenwyk, Buida, et al. 2021)
consensus_sequenceGenerates a consensus sequenceAnalysisMultiple-sequence file in FASTA formatSternke et al. (2019)Geneious (https://www.geneious.com)
constant_sitesDetermine the number of constant sites in an alignmentAnalysisMultiple-sequence file in FASTA formatKumar et al. (2016)IQ-TREE (Minh et al. 2020)
parsimony_ informative_sitesDetermine the number of parsimony-informative sites in an alignmentAnalysisMultiple-sequence file in FASTA formatKumar et al. (2016)AMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016)
position_specific_ score_matrixGenerates a position specific score matrix for an alignmentAnalysisMultiple-sequence file in FASTA formatGribskov et al. (1987)BLAST+ (Camacho et al. 2009)
variable_sitesDetermine the number of variable sites in an alignmentAnalysisMultiple-sequence file in FASTA formatShen, Salichos, et al. (2016)AMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016); PhyKIT (Steenwyk, Buida, et al. 2021)
gc_content_first_ positionDetermine the GC content of the first codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gc_content_second_ positionDetermine the GC content of the second codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gc_content_third_ positionDetermine the GC content of the third codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gene_wise_relative_ synonymous_ codon_usageCalculate gene-wise relative synonymous codon usageAnalysisProtein coding sequences in FASTA formatThis studyThis study
relative_synonymous_ codon_usageCalculate relative synonymous codon usageAnalysisProtein coding sequences in FASTA formatXu et al. (2008)MEGA (Kumar et al. 2016)
translate_sequenceTranslate protein coding sequences to amino acid sequencesProcessingProtein coding sequences in FASTA formatNAEMBOSS (Rice et al. 2000)
fastq_read_lengthsExamine the distribution of read lengthsAnalysisSequence reads in FASTQ formatNAFQStat (Chanumolu et al. 2019)
subset_pe_fastq_readsDown sample paired-end readsProcessingSequence reads in FASTQ formatNASeqKit (Shen, Le, et al. 2016)
subset_se_fastq_readsDown sample single-end readsProcessingSequence reads in FASTQ formatNASeqKit (Shen, Le, et al. 2016)
trim_pe_fastq_readsTrim paired-end reads based on quality and length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_se_fastq_readsTrim single-end reads based on quality and length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_pe_adapters_fastq_readsTrim adapters from paired-end reads and implement length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_se_ adapters_fastq_readsTrim adapters from single-end reads and implement length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
gc_contentDetermine GC contentAnalysisFASTA file of nucleotide sequencesRomiguier et al. (2010)custom scripts (Shen, Salichos, et al. 2016); GC-Profile (Gao and Zhang 2006)
genome_assembly_metricsDetermine diverse properties of a genome assembly for quality assessment and characterizationAnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013); REAPR (Hunt et al. 2013)
l50L50AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
l90L90AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
longest_scaffoldDetermine the length of the longest entry in a FASTA fileAnalysisFASTA fileGurevich et al. (2013)Custom scripts (Ou et al. 2020)
n50N50AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
n90N90AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
number_of_large_scaffoldsDetermine the number and length of scaffolds longer than 500 nucleotides. Length threshold of 500 nucleotides can be modified by the userAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
number_of_scaffoldsDetermine the number of FASTA entriesAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
sum_of_scaffold_lengthsDetermine the total length of all FASTA entriesAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
character_frequencyDetermine the frequency of each character. Gaps are assumed to be represented as ‘?’ and ‘-’ charactersAnalysisFASTA fileNABiostrings (https://rdrr.io/bioc/Biostrings/)
faidxGet sequence entry from FASTA fileProcessingFASTA fileNASAMtools (Li et al. 2009)
file_format_converterConverts multiple sequence alignments from one format to anotherProcessingFASTA, Clustal, MAF, Mauve, Phylip, Phylip-sequential, Phylip-relaxed, and StockholmNAALTER (Glez-Pena et al. 2010)
multiple_line_to_single_line_fastaReformat sequences to be represented on one lineProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
remove_fasta_entryRemove sequence based on entry identifierProcessingFASTA fileNANA
remove_short_sequencesRemove short sequencesProcessingFASTA fileNANA
rename_fasta_entriesRename FASTA entriesProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
reorder_by_sequence_lengthReorder FASTA entries by lengthProcessingFASTA fileNASeqKit (Shen, Le, et al. 2016)
sequence_complementGenerate sequence complements in the forward or reverse directionProcessingFASTA fileBritten (1998)EMBOSS (Rice et al. 2000)
sequence_lengthCalculate the length of each FASTA fileAnalysisFASTA fileNAbioawk (https://github.com/lh3/bioawk)
single_line_to_multiple_line_fastaReformat sequences to be represented on multiple linesProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
Table 1.

Summary of 42 functions available in BioKIT at the time of publication.

Function nameDescriptionType of functionInput dataCitationExample software that performs this function
alignment_lengthCalculate alignment lengthAnalysisMultiple-sequence file in FASTA formatNAAMAS (Borowiec 2016)
alignment_recodingRecode alignments using reduced character statesProcessingMultiple-sequence file in FASTA formatWoese et al. (1991), Embley et al. (2003), Kosiol et al. (2004), Hrdy et al. (2004), Susko and Roger (2007)Custom scripts (Hernandez and Ryan 2021)
alignment_summarySummarize diverse properties of a multiple sequence alignmentAnalysisMultiple-sequence file in FASTA formatNAAMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016); PhyKIT (Steenwyk, Buida, et al. 2021)
consensus_sequenceGenerates a consensus sequenceAnalysisMultiple-sequence file in FASTA formatSternke et al. (2019)Geneious (https://www.geneious.com)
constant_sitesDetermine the number of constant sites in an alignmentAnalysisMultiple-sequence file in FASTA formatKumar et al. (2016)IQ-TREE (Minh et al. 2020)
parsimony_ informative_sitesDetermine the number of parsimony-informative sites in an alignmentAnalysisMultiple-sequence file in FASTA formatKumar et al. (2016)AMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016)
position_specific_ score_matrixGenerates a position specific score matrix for an alignmentAnalysisMultiple-sequence file in FASTA formatGribskov et al. (1987)BLAST+ (Camacho et al. 2009)
variable_sitesDetermine the number of variable sites in an alignmentAnalysisMultiple-sequence file in FASTA formatShen, Salichos, et al. (2016)AMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016); PhyKIT (Steenwyk, Buida, et al. 2021)
gc_content_first_ positionDetermine the GC content of the first codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gc_content_second_ positionDetermine the GC content of the second codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gc_content_third_ positionDetermine the GC content of the third codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gene_wise_relative_ synonymous_ codon_usageCalculate gene-wise relative synonymous codon usageAnalysisProtein coding sequences in FASTA formatThis studyThis study
relative_synonymous_ codon_usageCalculate relative synonymous codon usageAnalysisProtein coding sequences in FASTA formatXu et al. (2008)MEGA (Kumar et al. 2016)
translate_sequenceTranslate protein coding sequences to amino acid sequencesProcessingProtein coding sequences in FASTA formatNAEMBOSS (Rice et al. 2000)
fastq_read_lengthsExamine the distribution of read lengthsAnalysisSequence reads in FASTQ formatNAFQStat (Chanumolu et al. 2019)
subset_pe_fastq_readsDown sample paired-end readsProcessingSequence reads in FASTQ formatNASeqKit (Shen, Le, et al. 2016)
subset_se_fastq_readsDown sample single-end readsProcessingSequence reads in FASTQ formatNASeqKit (Shen, Le, et al. 2016)
trim_pe_fastq_readsTrim paired-end reads based on quality and length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_se_fastq_readsTrim single-end reads based on quality and length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_pe_adapters_fastq_readsTrim adapters from paired-end reads and implement length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_se_ adapters_fastq_readsTrim adapters from single-end reads and implement length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
gc_contentDetermine GC contentAnalysisFASTA file of nucleotide sequencesRomiguier et al. (2010)custom scripts (Shen, Salichos, et al. 2016); GC-Profile (Gao and Zhang 2006)
genome_assembly_metricsDetermine diverse properties of a genome assembly for quality assessment and characterizationAnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013); REAPR (Hunt et al. 2013)
l50L50AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
l90L90AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
longest_scaffoldDetermine the length of the longest entry in a FASTA fileAnalysisFASTA fileGurevich et al. (2013)Custom scripts (Ou et al. 2020)
n50N50AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
n90N90AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
number_of_large_scaffoldsDetermine the number and length of scaffolds longer than 500 nucleotides. Length threshold of 500 nucleotides can be modified by the userAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
number_of_scaffoldsDetermine the number of FASTA entriesAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
sum_of_scaffold_lengthsDetermine the total length of all FASTA entriesAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
character_frequencyDetermine the frequency of each character. Gaps are assumed to be represented as ‘?’ and ‘-’ charactersAnalysisFASTA fileNABiostrings (https://rdrr.io/bioc/Biostrings/)
faidxGet sequence entry from FASTA fileProcessingFASTA fileNASAMtools (Li et al. 2009)
file_format_converterConverts multiple sequence alignments from one format to anotherProcessingFASTA, Clustal, MAF, Mauve, Phylip, Phylip-sequential, Phylip-relaxed, and StockholmNAALTER (Glez-Pena et al. 2010)
multiple_line_to_single_line_fastaReformat sequences to be represented on one lineProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
remove_fasta_entryRemove sequence based on entry identifierProcessingFASTA fileNANA
remove_short_sequencesRemove short sequencesProcessingFASTA fileNANA
rename_fasta_entriesRename FASTA entriesProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
reorder_by_sequence_lengthReorder FASTA entries by lengthProcessingFASTA fileNASeqKit (Shen, Le, et al. 2016)
sequence_complementGenerate sequence complements in the forward or reverse directionProcessingFASTA fileBritten (1998)EMBOSS (Rice et al. 2000)
sequence_lengthCalculate the length of each FASTA fileAnalysisFASTA fileNAbioawk (https://github.com/lh3/bioawk)
single_line_to_multiple_line_fastaReformat sequences to be represented on multiple linesProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
Function nameDescriptionType of functionInput dataCitationExample software that performs this function
alignment_lengthCalculate alignment lengthAnalysisMultiple-sequence file in FASTA formatNAAMAS (Borowiec 2016)
alignment_recodingRecode alignments using reduced character statesProcessingMultiple-sequence file in FASTA formatWoese et al. (1991), Embley et al. (2003), Kosiol et al. (2004), Hrdy et al. (2004), Susko and Roger (2007)Custom scripts (Hernandez and Ryan 2021)
alignment_summarySummarize diverse properties of a multiple sequence alignmentAnalysisMultiple-sequence file in FASTA formatNAAMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016); PhyKIT (Steenwyk, Buida, et al. 2021)
consensus_sequenceGenerates a consensus sequenceAnalysisMultiple-sequence file in FASTA formatSternke et al. (2019)Geneious (https://www.geneious.com)
constant_sitesDetermine the number of constant sites in an alignmentAnalysisMultiple-sequence file in FASTA formatKumar et al. (2016)IQ-TREE (Minh et al. 2020)
parsimony_ informative_sitesDetermine the number of parsimony-informative sites in an alignmentAnalysisMultiple-sequence file in FASTA formatKumar et al. (2016)AMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016)
position_specific_ score_matrixGenerates a position specific score matrix for an alignmentAnalysisMultiple-sequence file in FASTA formatGribskov et al. (1987)BLAST+ (Camacho et al. 2009)
variable_sitesDetermine the number of variable sites in an alignmentAnalysisMultiple-sequence file in FASTA formatShen, Salichos, et al. (2016)AMAS (Borowiec 2016); custom scripts (Shen, Salichos, et al. 2016); PhyKIT (Steenwyk, Buida, et al. 2021)
gc_content_first_ positionDetermine the GC content of the first codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gc_content_second_ positionDetermine the GC content of the second codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gc_content_third_ positionDetermine the GC content of the third codon position among protein coding sequencesAnalysisProtein coding sequences in FASTA formatBentele et al. (2013)Custom scripts (Bentele et al. 2013)
gene_wise_relative_ synonymous_ codon_usageCalculate gene-wise relative synonymous codon usageAnalysisProtein coding sequences in FASTA formatThis studyThis study
relative_synonymous_ codon_usageCalculate relative synonymous codon usageAnalysisProtein coding sequences in FASTA formatXu et al. (2008)MEGA (Kumar et al. 2016)
translate_sequenceTranslate protein coding sequences to amino acid sequencesProcessingProtein coding sequences in FASTA formatNAEMBOSS (Rice et al. 2000)
fastq_read_lengthsExamine the distribution of read lengthsAnalysisSequence reads in FASTQ formatNAFQStat (Chanumolu et al. 2019)
subset_pe_fastq_readsDown sample paired-end readsProcessingSequence reads in FASTQ formatNASeqKit (Shen, Le, et al. 2016)
subset_se_fastq_readsDown sample single-end readsProcessingSequence reads in FASTQ formatNASeqKit (Shen, Le, et al. 2016)
trim_pe_fastq_readsTrim paired-end reads based on quality and length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_se_fastq_readsTrim single-end reads based on quality and length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_pe_adapters_fastq_readsTrim adapters from paired-end reads and implement length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
trim_se_ adapters_fastq_readsTrim adapters from single-end reads and implement length thresholdsAnalysisSequence reads in FASTQ formatNATrimmomatic (Bolger et al. 2014)
gc_contentDetermine GC contentAnalysisFASTA file of nucleotide sequencesRomiguier et al. (2010)custom scripts (Shen, Salichos, et al. 2016); GC-Profile (Gao and Zhang 2006)
genome_assembly_metricsDetermine diverse properties of a genome assembly for quality assessment and characterizationAnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013); REAPR (Hunt et al. 2013)
l50L50AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
l90L90AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
longest_scaffoldDetermine the length of the longest entry in a FASTA fileAnalysisFASTA fileGurevich et al. (2013)Custom scripts (Ou et al. 2020)
n50N50AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
n90N90AnalysisFASTA file of a genome assemblyGurevich et al. (2013)QUAST (Gurevich et al. 2013)
number_of_large_scaffoldsDetermine the number and length of scaffolds longer than 500 nucleotides. Length threshold of 500 nucleotides can be modified by the userAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
number_of_scaffoldsDetermine the number of FASTA entriesAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
sum_of_scaffold_lengthsDetermine the total length of all FASTA entriesAnalysisFASTA fileNAQUAST (Gurevich et al. 2013)
character_frequencyDetermine the frequency of each character. Gaps are assumed to be represented as ‘?’ and ‘-’ charactersAnalysisFASTA fileNABiostrings (https://rdrr.io/bioc/Biostrings/)
faidxGet sequence entry from FASTA fileProcessingFASTA fileNASAMtools (Li et al. 2009)
file_format_converterConverts multiple sequence alignments from one format to anotherProcessingFASTA, Clustal, MAF, Mauve, Phylip, Phylip-sequential, Phylip-relaxed, and StockholmNAALTER (Glez-Pena et al. 2010)
multiple_line_to_single_line_fastaReformat sequences to be represented on one lineProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
remove_fasta_entryRemove sequence based on entry identifierProcessingFASTA fileNANA
remove_short_sequencesRemove short sequencesProcessingFASTA fileNANA
rename_fasta_entriesRename FASTA entriesProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
reorder_by_sequence_lengthReorder FASTA entries by lengthProcessingFASTA fileNASeqKit (Shen, Le, et al. 2016)
sequence_complementGenerate sequence complements in the forward or reverse directionProcessingFASTA fileBritten (1998)EMBOSS (Rice et al. 2000)
sequence_lengthCalculate the length of each FASTA fileAnalysisFASTA fileNAbioawk (https://github.com/lh3/bioawk)
single_line_to_multiple_line_fastaReformat sequences to be represented on multiple linesProcessingFASTA fileNAFASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)

Materials and methods

BioKIT is an easy-to-install command-line software that conducts diverse bioinformatic analyses. BioKIT is written in the Python programming language and has few dependencies, namely Biopython (Cock et al. 2009) and NumPy (Van Der Walt et al. 2011).

BioKIT currently has 42 functions that process and analyze sequence files such as genome assemblies, multiple-sequence alignments, protein coding sequences, and sequencing data (Table 1). Processing functions include those that convert various file formats, subset sequence reads from FASTQ files, rename entries in FASTA files, and others. Analysis functions include those that trim sequence reads in FASTQ files according to quality and length thresholds, calculate relative synonymous codon usage, estimate codon optimization, and others. Where applicable, BioKIT functions can also take standard output as input thereby enabling the use of POSIX pipes.

Details about each function, their usage, tutorials, and other information such as how to request additional functions can be found in the online documentation (https://jlsteenwyk.com/BioKIT). To demonstrate the utility of BioKIT, we highlight 4 use-cases: (1) determination of relative synonymous codon usage using different genetic codes; (2) determination of a novel metric for estimation of gene-wise codon optimization, gene-wise relative synonymous codon usage (gRSCU); (3) genome assembly quality assessment; and (4) summarizing properties of multiple sequence alignments.

Examining features of coding sequences including relative synonymous codon usage

BioKIT contains multiple functions that process or analyze protein coding sequences including translating protein coding sequences into amino acids. One distinct advantage of BioKIT is its ability to account for diverse genetic codes (currently, BioKIT comes complete with 26 genetic codes) or take as input a user-defined translation table. Other coding sequence-related analyses include determining the GC content at the first, second, and third codon positions.

Here, we highlight the relative_synonymous_codon_usage function, which calculates relative synonymous codon usage, the ratio of the observed frequency of synonymous codons to an expected frequency in which all synonymous codons are used equally (Xu et al. 2008). In this analysis, overrepresented codons have relative synonymous codon usage values greater than one whereas underrepresented codons have relative synonymous codon usage values less than 1. Relative synonymous codon usage values of one fit the neutral expectation. The relative_synonymous_codon_usage function can be used with one of 26 genetic codes including user-specified translation tables. The ability of BioKIT to account for diverse genetic codes, which cannot be done by other currently available software, makes it uniquely suitable for analyses of lineages that contain multiple genetic codes (Krassowski et al. 2018; LaBella et al. 2019). Other software that conduct similar analyses include DAMBE and GCUA (McInerney 1998; Xia 2013).

We also highlight the gene_wise_relative_synonymous_codon_usage function, which calculates a novel metric, gRSCU, to examine biases in codon usage among individual genes encoded in a genome. Specifically, the gRSCU is calculated by determining the mean or median relative synonymous codon usage value for all codons in each gene based on their genome-wide values. Thus, BioKIT calculates relative synonymous codon usage for each codon based on codon usage in an entire set of protein coding genes, individually re-examines each gene and the relative synonymous codon usage value for each codon therein, and then determines the mean or median relative synonymous codon usage value for the individual gene. The formula for the mean gRSCU calculation is as follows:
gRSCUa= i=1jRSCUin,
where gRSCUa is the gene that gRSCU is being calculated for, RSCUi is the relative synonymous codon usage value (calculated from all protein coding genes in a genome) for the ith codon of j codons in a gene, and n is the number of codons in a gene. To evaluate within-gene variation in relative synonymous codon usage, BioKIT also reports the standard deviation of relative synonymous codon usage values for each gene. Like the relative_synonymous_codon_usage function, gRSCU can be calculated using alternative genetic codes including user-specified ones. Taken together, these functions can be used individually or in tandem to investigate diverse biological phenomena, including codon usage bias (Brandis and Hughes 2016; LaBella et al. 2019) while also accounting for diverse genetic codes.

Genome assembly quality assessment

Determination of genome assembly properties is essential when evaluating assembly quality (Gurevich et al. 2013; Hunt et al. 2013). To facilitate these analyses, the genome_assembly_metrics function in BioKIT calculates 14 diverse properties of genome assemblies that evaluate assembly quality and characteristics including:

  • assembly size: sum length of all contigs/scaffolds;

  • L50 (and L90): the number of contigs/scaffolds that make up 50% (or, in the case of L90, 90%) of the total length of the genome assembly;

  • N50 (and N90): the length of the contig/scaffold which, along with all contigs/scaffolds longer than or equal to that contig/scaffold, contain 50% (or, in the case of N90, 90%) the length of a particular genome assembly;

  • GC content: fraction of total bases that are either G or C;

  • number of scaffolds: total number of contigs/scaffolds;

  • number and sum length of large scaffolds: total number and sum length of contigs/scaffolds above 500 nucleotides in length (length threshold of a “large scaffold” can be modified by the user); and

  • frequency of nucleotides: fraction of occurrences for adenine (A), thymine, (T), G, and C nucleotides.

Each metric can also be called using individual functions (e.g. the n50 function calculates the N50 of an assembly and the number_of_large_scaffolds function calculates the number of large scaffolds in an assembly). We anticipate the ability of BioKIT to summarize genome assembly properties will be helpful for assessing genome quality as well as in comparative studies of genome properties, such as the evolution of genome size and GC content (Walker et al. 2015; Shen et al. 2020). Other pieces of software that conduct similar analyses include QUAST, REAPR, and GenomeQC (Gurevich et al. 2013; Hunt et al. 2013; Manchanda et al. 2020).

Processing and assessing the properties of multiple sequence alignments

Multiple sequence alignments—the alignment of 3 or more biological sequences—contain a wealth of information. To facilitate easy use and manipulation of multiple sequence alignments, BioKIT implements 16 functions that process or analyze alignments including: generating consensus sequences; generating a position-specific score matrix (which represents the frequency of observing a particular amino acid or nucleotide at a specific position); recoding an alignment using different schemes, such as the RY-nucleotide scheme for nucleotide alignments (Woese et al. 1991; Phillips et al. 2001) or the Dayhoff-6, S&R-6, and KGB-6 schemes for amino acid alignments (Embley et al. 2003; Hrdy et al. 2004; Kosiol et al. 2004; Susko and Roger 2007); converting alignments among the following formats: FASTA, Clustal, MAF, Mauve, PHYLIP, PHYLIP-sequential, PHYLIP-relaxed, and Stockholm; extracting entries in FASTA files; removing entries from FASTA file; removing short sequences from a FASTA file; and others.

We highlight the alignment_summary function, which calculates numerous summary statistics for a multiple sequence alignment, a common step in many molecular evolutionary analyses (Plomion et al. 2018; Winterton et al. 2018). More specifically, the alignment_summary function calculates:

  • alignment length: the total number of sites in an alignment;

  • number of taxa: the total number of sequences in an alignment;

  • number of parsimony-informative sites: a site in an alignment with at least 2 distinct nucleotides or amino acids that each occur at least twice;

  • number of variable sites: a site in an alignment with at least 2 distinct nucleotides or amino acids;

  • number of constant sites: sites with the same nucleotide or amino acid (excluding gaps); and

  • the frequency of all character states: the fraction of occurrence for all nucleotides or amino acids (including gap characters represented as “–” or “?” in an alignment).

Like the genome_assembly_metrics function, each metric can be calculated individually (e.g. the constant_sites function calculates the number of constant sites in an alignment and the character_frequency function calculates the frequency of all character states). We anticipate the alignment_summary function will assist researchers in statistically evaluating the properties of their alignments, which may help guide subsampling strategies in phylogenomic studies (Edwards 2016; Mongiardino Koch 2021). Other pieces of software that perform similar operations include AMAS (Borowiec 2016) and Mesquite (Mesquite Project Team 2014). One advantage of BioKIT compared with other software [such as Phyutility (Smith and Dunn 2008)] the ability to calculate diverse summary statistics.

Implementing high standards of software development

Archival instability is a concern for bioinformatic tools and threatens the reproducibility of bioinformatic research. For example, in an analysis that aimed to evaluate the “installability” of bioinformatic software, 28% of over 36,000 bioinformatic tools failed to properly install due to implementation errors (Mangul, Mosqueiro, et al. 2019). To ensure archival stability of BioKIT, we implemented a previously established protocol (Steenwyk et al. 2020; Steenwyk, Buida, et al. 2021; Steenwyk, Goltz, et al. 2021; Steenwyk and Rokas 2021b) for high standards of software development and design practices. More specifically, we wrote 339 unit and integration tests that ensure faithful functionality of BioKIT and span 95.35% of the codebase. We also implemented a continuous integration pipeline, which builds, packages, installs, and tests the functionality of BioKIT across Python versions 3.6, 3.7, 3.8, and 3.9. To accommodate diverse installation workflows, we also made BioKIT freely available under the MIT license across popular platforms including GitHub (https://github.com/JLSteenwyk/BioKIT), PyPi (https://pypi.org/project/jlsteenwyk-biokit/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/jlsteenwyk-biokit). To make BioKIT more user-friendly, we wrote online documentation, user tutorials, and instructions for requesting new features (https://jlsteenwyk.com/BioKIT). We anticipate our strategy to implement high standards of software development, coupled to our approach to decrease barriers to software usage (e.g. easy installation and extensive documentation), will address instabilities observed among many bioinformatic software and increase the long-term usability of BioKIT.

Results

Relative synonymous codon usage in 107 budding yeast and filamentous fungi

To demonstrate the utility of BioKIT in analyzing protein coding sequences, we calculated the relative synonymous codon usage of all codons in the protein coding sequences of 103 Eurotiomycetes (filamentous fungi) and 68 Saccharomycetes (budding yeasts) genomes obtained from the RefSeq database of NCBI (Fig. 1). This example also demonstrates the flexibility of BioKIT to account for nonstandard genetic codes, which are observed among some budding yeasts that use the CUG codon to encode a serine or alanine rather than a leucine (Krassowski et al. 2018). Hierarchical clustering of relative synonymous codon usage values per codon (columns in Fig. 1) revealed similar patterns across groups of codons. For example, CUA, AUA, and GUA—3 of the 4 codons that end in UA—were underrepresented in all fungi. Hierarchical clustering of relative synonymous codon usage values per species (rows in Fig. 1) revealed filamentous fungi and budding yeasts often clustered separately. For example, UGA, GUG, AAC, UAC, AAG, UUC, UCC, ACC, GCC, CGC, CUG, AUC, GUC, CUC, and GGC are more often overrepresented among filamentous fungi in comparison to budding yeasts; in contrast, UUG, GUU, CCA, and GGU are more often overrepresented among budding yeasts in comparison to filamentous fungi. Variation within each lineage was also observed; for example, UUA was underrepresented in most, but not all, budding yeasts. These findings demonstrate how BioKIT can be used to examine patterns of RSCU in a single genome or across species.

Fig. 1.

Relative synonymous codon usage across 171 fungal genomes. Relative synonymous codon usage (RSCU) was calculated from the coding sequences of 103 Eurotiomycetes (filamentous fungi) and 68 Saccharomycetes (budding yeasts) genomes obtained from NCBI. Hierarchical clustering was conducted across the fungal species (rows) and codons (columns). Eight groups of clustered rows were identified; 7 groups of clustered columns were identified. Broad differences were observed in the RSCU values of Eurotiomycetes and Saccharomycetes genomes. For example, Saccharomycetes tended to have higher RSCU values for the AGA codon, whereas Eurotiomycetes tended to have higher RSCU values for the CUG codon. To account for the use of an alternative genetic code in budding yeast genomes from the CUG-Ser1 and CUG-Ser2 lineages (Krassowski et al. 2018), the alternative yeast nuclear code—which is one of 26 alternative genetic codes incorporated into BioKIT—was used during RSCU determination. User’s may also provide their own genetic code if it is unavailable in BioKIT. This figure was made using pheatmap (Kolde 2012).

Patterns of gene-wise codon usage bias can be used to assess codon optimization and predict steady-state gene expression levels

To evaluate the utility of BioKIT in examining gene-wise codon usage biases, we calculated the mean and median gRSCU value, a novel metric introduced in the present manuscript, for individual protein coding genes in the genome of the baker’s yeast Saccharomyces cerevisiae (Fig. 2a). Mean and median gRSCU values were often, but not always, similar—the average absolute difference between mean and median gRSCU is 0.05 ± 0.04. In S. cerevisiae, as well as other organisms, genes encoding ribosomal components and histones are known to be codon optimized and highly expressed (Sharp et al. 1986; Hershberg and Petrov 2009; LaBella et al. 2021). Therefore, we hypothesized that genes with high gRSCU values will have functions related to ribosomes or histones because patterns of gene-wise codon usage bias may be indicative of codon optimization. Supporting this hypothesis, examination of the 10 genes with the highest mean gRSCU revealed 5 genes with ribosome-related functions [RPL41B (YDL133C-A), mean gRSCU: 1.60; RPL41A (YDL184C), mean gRSCU: 1.58; RPS14A (YCR031C), mean gRSCU: 1.44; RPS9B (YBR189W), mean gRSCU: 1.43; and RPL18A (YOL120C), mean gRSCU: 1.43] and 4 genes with histone-related functions [HHF1 (YBR009C), mean gRSCU: 1.45; HTA2 (YBL003C), mean gRSCU: 1.44; HHF2 (YNL030W), mean gRSCU: 1.43; and HTA1 (YDR225W), mean gRSCU: 1.43]. Examination of the 10 most optimized genes according to median gRSCU revealed similar observations wherein 9 genes had ribosome-related functions [RPS14A (YCR031C), median gRSCU: 1.48; RPS12 (YOR369C), median gRSCU: 1.40; RPS30B (YOR182C), median gRSCU: 1.40; RPP2A (YOL039W), median gRSCU: 1.40; RPL18A (YOL120C), median gRSCU; RPS3 (YNL178W), median gRSCU: 1.40; RPL13B (YMR142C), median gRSCU: 1.40; RPP0 (YLR340W), median gRSCU: 1.40; and RPS0B (YLR048W), median gRSCU: 1.40]. More broadly, genes associated with the 60S and 40S ribosomal units (gold color in Fig. 2a) tended to have high gRSCU values. These results suggest gRSCU values may be useful for estimating codon optimization.

Fig. 2.

Mean gene-wise relative synonymous codon usage accurately estimates codon optimization. a) Gene-wise relative synonymous codon usage (gRSCU), the mean (x-axis) or median (y-axis) relative synonymous codon usage value per gene (based on RSCU values calculated from the entire set of protein coding genes), was calculated from the coding sequences of the model budding yeast Saccharomyces cerevisiae. b, c) In S. cerevisiae, a significant correlation was observed between tRNA adaptation index (tAI), a well-known measure of codon optimization (Sabi and Tuller 2014), and mean as well as median gRSCU (r2 = 0.52, P < 0.001 and r2 = 0.25, P < 0.001, respectively; Pearson's Correlation Coefficient). d) Using previously published data, a correlation is observed between median log2 gene expression and tAI in Saccharomyces mikatae (LaBella et al. 2019), which is evidence of tAI values being indicative of codon optimization. Comparison of mean and median gRSCU (e and f, respectively) and median log2 gene expression revealed similarly strong correlations (r2 = 0.57, P < 0.001 and r2 = 0.41, P < 0.001, respectively; Pearson’s correlation coefficient). Of note, mean gRSCU had a strong correlation to gene expression than median gRSCU. Each gene is represented by a dot. In (a), the size of each dot represents the standard deviation of RSCU values observed in the gene.

To further explore the relationship between gRSCU and codon optimization, we compared gRSCU values to the values of the tRNA adaptation index, a measure of codon optimization (Sabi and Tuller 2014), in S. cerevisiae as well as in steady state gene expression data from Saccharomyces mikatae (LaBella et al. 2019). In S. cerevisiae, strong correlation was observed between mean gRSCU and tRNA adaptation index values (Fig. 2b) and a less robust, but still significant, correlation was observed between median gRSCU and tRNA adaptation index values (Fig. 2c). Examination of gRSCU and gene expression data from S. mikatae revealed a robust correlation (Fig. 2, e and f) suggesting gRSCU, and in particular the mean gRSCU, can serve as a measure of gene-wise codon optimization.

Genome assembly quality and characteristics among 901 eukaryotic genomes

Other functions in BioKIT can facilitate, but are not limited to, examining data quality, a prerequisite for many downstream analyses. For example, BioKIT can be useful in examining genome assembly quality and characteristics. To do so, we calculated 14 diverse genome assembly metrics among 901 scaffold-level haploid assemblies of eukaryotic genomes, which were obtained from NCBI, and span 3 major classes of animals (Mammalia; N = 350), plants (Magnoliopsida; N = 336), and fungi (Eurotiomycetes; N = 215). Genome assembly properties exhibited variation both within and between the 3 classes (Fig. 3). For example, fungi had the smallest average genome size of 32.71 ± 7.04 Megabases (Mb) whereas mammals had the largest average genome size of 2,645.50 ± 487.48 Mb. Extensive variation in genome size within each class corroborates previous findings of extreme genome size variation among eukaryotes (Elliott and Gregory 2015). Variation in GC content, a genome property that has been actively investigated for decades (Galtier et al. 2001; Romiguier et al. 2010; Serres-Giardi et al. 2012), was observed among the 3 eukaryotic classes—animals, plants, and fungi had an average GC content of 0.40 ± 0.04, 0.35 ± 0.04, and 0.49 ± 0.03, respectively. Lastly, there was wide variation in genome assembly metrics associated with continuity of assembly. For example, the average N50 values for animals, plants, and fungi were 12,287.64 ± 25,317.31 Mb, 5,030.15 ± 19,358.58 Mb, and 1,370.77 ± 1,552.13 Mb, respectively. Taken together, these results demonstrate BioKIT can assist researchers in examining genome assembly quality but can also be used for studying genome evolution—such as the evolution of genome size and GC content.

Fig. 3.

Summary of genome assembly metrics across 901 genomes from 3 eukaryotic classes. Nine hundred and one scaffold-level genome assemblies from 3 major eukaryotic classes [215 Eurotiomycetes (kingdom: Fungi), 336 Magnoliopsida (kingdom: Plantae), 350 Mammalia (kingdom: Animalia)] were obtained from NCBI and examined for diverse metrics including assembly size, GC content, frequency of A, T, C, and G, N50, N90, L50, L90, number of scaffolds, number of large scaffolds (defined as being greater than 500 nucleotides, which can be modified by the user), sum length of large scaffolds, and longest scaffold in the assembly. Bar plots represent the mean for each taxonomic class. Error bars represent the standard deviation of values. This figure was made using ggplot2 (Wickham 2009) and ggpubfigs (Steenwyk and Rokas 2021a).

Properties of multiple sequence alignment from 10 phylogenomic studies

The properties of multiple sequence alignments are routinely used to evaluate aspects of alignment information content (Shen, Salichos, et al. 2016)—such as in phylogenomic and systematic studies (Oliveira et al. 2011; Pyron et al. 2013). To demonstrate the utility of BioKIT in calculating summary statistics for multiple sequence alignments, we calculated 6 properties across 10 previously published phylogenomic data matrices of amino acid sequences (Misof et al. 2014; Nagy et al. 2014; Borowiec et al. 2015; Chen et al. 2015; Struck et al. 2015; Whelan et al. 2015; Yang et al. 2015; Shen, Zhou, et al. 2016, 2018; Steenwyk et al. 2019) (Fig. 4). Phylogenomic data matrices varied in the number of taxa (mean = 109.50 ± 87.26; median = 94; max = 343; min = 36). Alignment length is associated with greater phylogenetic accuracy and bipartition support (Shen, Salichos, et al. 2016); however, recent analyses suggest that in some instances shorter alignments that contain a wealth of informative sites (such as parsimony-informative sites) harbor robust phylogenetic signal (Steenwyk et al. 2020). Interestingly, the longest observed alignment (1,806,035 sites; Chen, Vertebrates in Fig. 4) (Chen et al. 2015) contained the highest number of constant sites (N = 610,994), which are phylogenetically uninformative, as well as the highest number of variable sites (N = 1,195,041), which are phylogenetically informative (Shen, Salichos, et al. 2016). In contrast to the multiple sequence alignment of vertebrate sequences, the second longest alignment of budding yeast sequences (1,162,805 sites; Shen, 332 Yeast in Fig. 4) has few constant sites (N = 2,761) and many parsimony-informative (N = 1,152,145) and variable sites (N = 1,160,044). This observation may be driven in part by the rapid rate of budding yeast evolution compared to animals (Shen et al. 2018). These results demonstrate BioKIT is useful in summarizing properties of multiple sequence alignments.

Fig. 4.

Summary metrics among multiple sequence alignments from phylogenomic studies. Multiple sequence alignments of amino acid sequences from 10 phylogenomic data matrices (Misof et al. 2014; Nagy et al. 2014; Borowiec et al. 2015; Chen et al. 2015; Struck et al. 2015; Whelan et al. 2015; Yang et al. 2015; Shen, Zhou, et al. 2016; Shen et al. 2018; Steenwyk et al. 2019) were examined for 5 metrics: number of taxa, alignment length, number of constant sites, number of parsimony-informative sites, and number of variable sites. The x-axis depicts the last name of the first author of the phylogenomic study followed by a description of the organisms that were under study. The abbreviation PI represents parsimony-informative sites. Although excluded here for simplicity and clarity, BioKIT also determines character state frequency (nucleotide or amino acid) when summarizing alignment metrics. This figure was made using ggplot2 (Wickham 2009) and ggpubfigs (Steenwyk and Rokas 2021a).

Discussion

The exemplary analyses discussed in the present manuscript represent a small fraction of the ways to implement BioKIT. To further demonstrate how BioKIT may be used in multistep bioinformatic workflows, we depict how BioKIT can be incorporated into genome assembly workflows, genomic analysis pipelines, and phylogenomic workflows (Fig. 5). Taken together, these analyses demonstrate how BioKIT, alone or in tandem with other software, can facilitate the execution of diverse bioinformatic pipelines.

Fig. 5.

BioKIT can be incorporated into diverse bioinformatic workflows. This flowchart depicts how BioKIT can be incorporated into numerous bioinformatic pipeline using genome assembly (left), genome analysis (middle), and phylogenomic analysis (right) as examples. For each workflow, steps that can be executed by BioKIT are depicted in bold-faced font.

As noted throughout the manuscript, there are numerous other tools that perform similar functions to BioKIT. Our intent is not to replace these toolkits but provide a convenient package that conducts diverse analyses. Toolkits that conduct diverse analyses, like BioKIT, can help facilitate the long-term stability of bioinformatic workflows—such as those created using SnakeMake—by reducing the number of dependencies. We also believe BioKIT can be used as an effective teaching tool. The first step in teaching bioinformatics is often downloading and installing software, however, this step can be challenging and time-consuming due to lack of experience, differing installation requirements, and installation errors. Thus, BioKIT helps overcome this hurdle in the classroom.

Future releases of BioKIT will likely feature improvements in its scope and speed. BioKIT is a community-driven tool, thus, users are encouraged to request new functions or contribute their own. We recommend following standard GitHub operating procedures when doing so. In brief, new functions can be requested as GitHub issues and novel functions can be developed in a separate branch and then integrated into the main branch (or the distributed package). To help ensure long-term stability of BioKIT, users should write integration and unit tests for new functions prior to requesting to add them to the main branch. Beyond other functions, visualizing results will be considered in future releases of BioKIT. For the present manuscript, visualization of results was done with ggplot2 (Wickham 2009), pheatmap (Kolde 2012), and ggpubfigs (Steenwyk and Rokas 2021a) in the R programming language. Other future improvements may be related to the speed of BioKIT. Although numerous functions are relatively fast—for example, summarizing genome assembly metrics took 1.25 and 2.55 s for the genomes of S. cerevisiae and Aspergillus fumigatus—some functions, especially those that process large files, such as FASTQ files, may benefit from codebase modifications that improve speed. Speed improvements will likely be guided by user needs.

Conclusion

BioKIT is a multipurpose toolkit that has diverse applications for bioinformatics research. The utilities implemented in BioKIT aim to facilitate the execution of seamless bioinformatic workflows that handle diverse sequence file types. Implementation of state-of-the-art software development and design principles in BioKIT help ensure faithful function and archival stability. BioKIT will be helpful for bioinformaticians with varying levels of expertise and biologists from diverse disciplines including molecular biology.

Data availability

BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPi (https://pypi.org/project/jlsteenwyk-biokit/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/jlsteenwyk-biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).

Acknowledgments

We thank the Rokas lab for helpful discussion and feedback. We also thank the bioinformatic community for providing suggestions of functions to implement in BioKIT.

Funding

JLS and AR were funded by the Howard Hughes Medical Institute through the James H. Gilliam Fellowships for Advanced Study program. Research in AR’s lab is supported by grants from the National Science Foundation (DEB-1442113 and DEB-2110404), the National Institutes of Health/National Institute of Allergy and Infectious Diseases (R56 AI146096 and R01 AI153356), and the Burroughs Wellcome Fund.

Conflicts of interest

AR is a scientific consultant for LifeMine Therapeutics, Inc. JLS is a scientific consultant for Latch AI Inc.

Literature cited

Bayat
A.
Science, medicine, and the future: ioinformatics
.
BMJ
.
2002
;
324
(
7344
):
1018
1022
.

Bentele
K
,
Saffert
P
,
Rauscher
R
,
Ignatova
Z
,
Blüthgen
N.
Efficient translation initiation dictates codon usage at gene start
.
Mol Syst Biol
.
2013
;
9
:
675
.

Bolger
AM
,
Lohse
M
,
Usadel
B.
Trimmomatic: a flexible trimmer for Illumina sequence data
.
Bioinformatics
.
2014
;
30
(
15
):
2114
2120
.

Borowiec
ML.
AMAS: a fast tool for alignment manipulation and computing of summary statistics
.
PeerJ
.
2016
;
4
:
e1660
.

Borowiec
ML
,
Lee
EK
,
Chiu
JC
,
Plachetzki
DC.
Extracting phylogenetic signal and accounting for bias in whole-genome data sets supports the Ctenophora as sister to remaining Metazoa
.
BMC Genomics
.
2015
;
16
:
987
.

Brandis
G
,
Hughes
D.
The selective advantage of synonymous codon usage bias in Salmonella
.
PLoS Genet
.
2016
;
12
(
3
):
e1005926
.

Britten
RJ.
Precise sequence complementarity between yeast chromosome ends and two classes of just-subtelomeric sequences
.
Proc Natl Acad Sci U S A
.
1998
;
95
(
11
):
5906
5912
.

Camacho
C
,
Coulouris
G
,
Avagyan
V
,
Ma
N
,
Papadopoulos
J
,
Bealer
K
,
Madden
TL.
BLAST+: architecture and applications
.
BMC Bioinformatics
.
2009
;
10
:
421
.

Chanumolu
SK
,
Albahrani
M
,
Otu
HH.
FQStat: a parallel architecture for very high-speed assessment of sequencing quality metrics
.
BMC Bioinformatics
.
2019
;
20
(
1
):
424
.

Chen
M-Y
,
Liang
D
,
Zhang
P.
Selecting question-specific genes to reduce incongruence in phylogenomics: a case study of jawed vertebrate backbone phylogeny
.
Syst Biol
.
2015
;
64
(
6
):
1104
1120
.

Cock
PJA
,
Antao
T
,
Chang
JT
,
Chapman
BA
,
Cox
CJ
,
Dalke
A
,
Friedberg
I
,
Hamelryck
T
,
Kauff
F
,
Wilczynski
B
, et al. 
Biopython: freely available Python tools for computational molecular biology and bioinformatics
.
Bioinformatics
.
2009
;
25
(
11
):
1422
1423
.

Darriba
D
,
Flouri
T
,
Stamatakis
A.
The state of software for evolutionary biology
.
Mol Biol Evol
.
2018
;
35
(
5
):
1037
1046
.

Edwards
SV.
Phylogenomic subsampling: a brief review
.
Zool Scr
.
2016
;
45
(
S1
):
63
74
.

Elliott
TA
,
Gregory
TR.
What’s in a genome? The C-value enigma and the evolution of eukaryotic genome content
.
Philos Trans R Soc Lond B Biol Sci
.
2015
;
370
(
1678
):
20140331
.

Embley
M
,
van der Giezen
M
,
Horner
DS
,
Dyal
PL
,
Foster
P.
Mitochondria and hydrogenosomes are two forms of the same fundamental organelle
.
Philos Trans R Soc Lond B Biol Sci
.
2003
;
358
(
1429
):
191
203
.

Galtier
N
,
Piganeau
G
,
Mouchiroud
D
,
Duret
L.
GC-content evolution in mammalian genomes: the biased gene conversion hypothesis
.
Genetics
.
2001
;
159
(
2
):
907
911
.

Gao
F
,
Zhang
C-T.
GC-Profile: a web-based tool for visualizing and analyzing the variation of GC content in genomic sequences
.
Nucleic Acids Res
.
2006
;
34
(
Web Server Issue
):
W686
W691
.

Glez-Pena
D
,
Gomez-Blanco
D
,
Reboiro-Jato
M
,
Fdez-Riverola
F
,
Posada
D.
ALTER: program-oriented conversion of DNA and protein alignments
.
Nucleic Acids Res
.
2010
;
38
(
Web Server Issue
):
W14
W18
.

Gribskov
M
,
McLachlan
AD
,
Eisenberg
D.
Profile analysis: detection of distantly related proteins
.
Proc Natl Acad Sci U S A
.
1987
;
84
(
13
):
4355
4358
.

Gurevich
A
,
Saveliev
V
,
Vyahhi
N
,
Tesler
G.
QUAST: quality assessment tool for genome assemblies
.
Bioinformatics
.
2013
;
29
(
8
):
1072
1075
.

Hernandez
AM
,
Ryan
JF.
Six-state amino acid recoding is not an effective strategy to offset compositional heterogeneity and saturation in phylogenetic analyses
.
Syst Biol
.
2021
;70(6):
1200
1212
.

Hershberg
R
,
Petrov
DA.
General rules for optimal codon choice
.
PLoS Genet
.
2009
;
5
(
7
):
e1000556
.

Hrdy
I
,
Hirt
RP
,
Dolezal
P
,
Bardonová
L
,
Foster
PG
,
Tachezy
J
,
Embley
TM.
Trichomonas hydrogenosomes contain the NADH dehydrogenase module of mitochondrial complex I
.
Nature
.
2004
;
432
(
7017
):
618
622
.

Hunt
M
,
Kikuchi
T
,
Sanders
M
,
Newbold
C
,
Berriman
M
,
Otto
TD.
REAPR: a universal tool for genome assembly evaluation
.
Genome Biol
.
2013
;
14
(
5
):
R47
.

Kim
Y-M
,
Poline
J-B
,
Dumas
G.
Experimenting with reproducibility: a case study of robustness in bioinformatics
.
Gigascience
.
2018
;
7
(
7
):giy077.

Kolde
R.
Package ‘pheatmap’
.
Bioconductor
.
2012
;
1
6
.

Kosiol
C
,
Goldman
N
,
Buttimore
NH.
A new criterion and method for amino acid classification
.
J Theor Biol
.
2004
;
228
(
1
):
97
106
.

Krassowski
T
,
Coughlan
AY
,
Shen
X-X
,
Zhou
X
,
Kominek
J
,
Opulente
DA
,
Riley
R
,
Grigoriev
IV
,
Maheshwari
N
,
Shields
DC
, et al. 
Evolutionary instability of CUG-Leu in the genetic code of budding yeasts
.
Nat Commun
.
2018
;
9
(
1
):
1887
.

Kulkarni
N
,
Alessandrì
L
,
Panero
R
,
Arigoni
M
,
Olivero
M
,
Ferrero
G
,
Cordero
F
,
Beccuti
M
,
Calogero
RA.
Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines
.
BMC Bioinformatics
.
2018
;
19
(
Suppl 10
):
349
.

Kumar
S
,
Dudley
J.
Bioinformatics software for biologists in the genomics era
.
Bioinformatics
.
2007
;
23
(
14
):
1713
1717
.

Kumar
S
,
Stecher
G
,
Tamura
K.
MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets
.
Mol Biol Evol
.
2016
;33(7):
1870
1874
.

LaBella
AL
,
Opulente
DA
,
Steenwyk
JL
,
Hittinger
CT
,
Rokas
A.
Signatures of optimal codon usage in metabolic genes inform budding yeast ecology
.
PLoS Biol
.
2021
;
19
(
4
):
e3001185
.

LaBella
AL
,
Opulente
DA
,
Steenwyk
JL
,
Hittinger
CT
,
Rokas
A.
Variation and selection on codon usage bias across an entire subphylum
.
PLoS Genet
.
2019
;
15
(
7
):
e1008304
.

Li
H
,
Handsaker
B
,
Wysoker
A
,
Fennell
T
,
Ruan
J
,
Homer
N
,
Marth
G
,
Abecasis
G
,
Durbin
R
; 1000 Genome Project Data Processing Subgroup.
The Sequence Alignment/Map format and SAMtools
.
Bioinformatics
.
2009
;
25
(
16
):
2078
2079
.

Manchanda
N
,
Portwood
JL
,
Woodhouse
MR
,
Seetharam
AS
,
Lawrence-Dill
CJ
,
Andorf
CM
,
Hufford
MB.
GenomeQC: a quality assessment tool for genome assemblies and gene structure annotations
.
BMC Genomics
.
2020
;
21
(
1
):
193
.

Mangul
S
,
Martin
LS
,
Eskin
E
,
Blekhman
R.
Improving the usability and archival stability of bioinformatics software
.
Genome Biol
.
2019
;
20
(
1
):
47
.

Mangul
S
,
Mosqueiro
T
,
Abdill
RJ
,
Duong
D
,
Mitchell
K
,
Sarwal
V
,
Hill
B
,
Brito
J
,
Littman
RJ
,
Statz
B
, et al. 
Challenges and recommendations to improve the installability and archival stability of omics computational tools
.
PLoS Biol
.
2019
;
17
(
6
):
e3000333
.

McInerney
JO.
GCUA: general codon usage analysis
.
Bioinformatics
.
1998
;
14
(
4
):
372
373
.

Mesquite Project Team Mesquite. A Modular System for Evolutionary Analysis;

2014
. https://bomeara.github.io/wikispaces_mesquite (accessed: 2022 May 11).

Minh
BQ
,
Schmidt
HA
,
Chernomor
O
,
Schrempf
D
,
Woodhams
MD
,
von Haeseler
A
,
Lanfear
R.
IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era
.
Mol Biol Evol
.
2020
;
37
(
5
):
1530
1534
.

Misof
B
,
Liu
S
,
Meusemann
K
,
Peters
RS
,
Donath
A
,
Mayer
C
,
Frandsen
PB
,
Ware
J
,
Flouri
T
,
Beutel
RG
, et al. 
Phylogenomics resolves the timing and pattern of insect evolution
.
Science
.
2014
;
346
(
6210
):
763
767
.

Mongiardino Koch
N.
Phylogenomic subsampling and the search for phylogenetically reliable loci
.
Mol Biol Evol
.
2021
;38(9):
4025
4038
.

Nagy
LG
,
Ohm
RA
,
Kovács
GM
,
Floudas
D
,
Riley
R
,
Gácser
A
,
Sipiczki
M
,
Davis
JM
,
Doty
SL
,
de Hoog
GS
, et al. 
Latent homology and convergent regulatory evolution underlies the repeated emergence of yeasts
.
Nat Commun
.
2014
;
5
:
4471
.

Oliveira
C
,
Avelino
GS
,
Abe
KT
,
Mariguela
TC
,
Benine
RC
,
Ortí
G
,
Vari
RP
,
Corrêa e Castro
RM.
Phylogenetic relationships within the speciose family Characidae (Teleostei: Ostariophysi: Characiformes) based on multilocus analysis and extensive ingroup sampling
.
BMC Evol Biol
.
2011
;
11
:
275
.

Ou
S
,
Liu
J
,
Chougule
KM
,
Fungtammasan
A
,
Seetharam
AS
,
Stein
JC
,
Llaca
V
,
Manchanda
N
,
Gilbert
AM
,
Wei
S
, et al. 
Effect of sequence depth and length in long-read assembly of the maize inbred NC358
.
Nat Commun
.
2020
;
11
(
1
):
2288
.

Phillips
MJ
,
Lin
Y-H
,
Harrison
G
,
Penny
D.
Mitochondrial genomes of a bandicoot and a brushtail possum confirm the monophyly of Australidelphian marsupials
.
Proc Biol Sci
.
2001
;
268
(
1475
):
1533
1538
.

Plomion
C
,
Aury
J-M
,
Amselem
J
,
Leroy
T
,
Murat
F
,
Duplessis
S
,
Faye
S
,
Francillonne
N
,
Labadie
K
,
Le Provost
G
, et al. 
Oak genome reveals facets of long lifespan
.
Nat Plants
.
2018
;
4
(
7
):
440
452
.

Pyron
R
,
Burbrink
FT
,
Wiens
JJ.
A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes
.
BMC Evol Biol
.
2013
;
13
:
93
.

Rice
P
,
Longden
I
,
Bleasby
A.
EMBOSS: The European Molecular Biology Open Software Suite
.
Trends Genet
.
2000
;
16
(
6
):
276
277
.

Romiguier
J
,
Ranwez
V
,
Douzery
EJP
,
Galtier
N.
Contrasting GC-content dynamics across 33 mammalian genomes: relationship with life-history traits and chromosome sizes
.
Genome Res
.
2010
;
20
(
8
):
1001
1009
.

Sabi
R
,
Tuller
T.
Modelling the efficiency of codon–tRNA interactions based on codon usage bias
.
DNA Res
.
2014
;
21
(
5
):
511
526
.

Serres-Giardi
L
,
Belkhir
K
,
David
J
,
Glémin
S.
Patterns and evolution of nucleotide landscapes in seed plants
.
Plant Cell
.
2012
;
24
(
4
):
1379
1397
.

Sharp
PM
,
Tuohy
TMF
,
Mosurski
KR.
Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes
.
Nucleic Acids Res
.
1986
;
14
(
13
):
5125
5143
.

Shen
W
,
Le
S
,
Li
Y
,
Hu
F.
SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
.
PLoS One
.
2016
;
11
(
10
):
e0163962
.

Shen
X-X
,
Opulente
DA
,
Kominek
J
,
Zhou
X
,
Steenwyk
JL
,
Buh
KV
,
Haase
MAB
,
Wisecaver
JH
,
Wang
M
,
Doering
DT
, et al. 
Tempo and mode of genome evolution in the budding yeast subphylum
.
Cell
.
2018
;
175
(
6
):
1533
1545.e20
.

Shen
X-X
,
Salichos
L
,
Rokas
A.
A genome-scale investigation of how sequence, function, and tree-based gene properties influence phylogenetic inference
.
Genome Biol Evol
.
2016
;
8
(
8
):
2565
2580
.

Shen
X-X
,
Steenwyk
JL
,
LaBella
AL
,
Opulente
DA
,
Zhou
X
,
Kominek
J
,
Li
Y
,
Groenewald
M
,
Hittinger
CT
,
Rokas
A
, et al. 
Genome-scale phylogeny and contrasting modes of genome evolution in the fungal phylum Ascomycota
.
Sci Adv
.
2020
;
6
(
45
):
eabd0079
.

Shen
X-X
,
Zhou
X
,
Kominek
J
,
Kurtzman
CP
,
Hittinger
CT
,
Rokas
A.
Reconstructing the backbone of the Saccharomycotina yeast phylogeny using genome-scale data
.
G3 (Bethesda)
.
2016
;
6
(
12
):
3927
3939
.

Smith
SA
,
Dunn
CW.
Phyutility: a phyloinformatics tool for trees, alignments and molecular data
.
Bioinformatics
.
2008
;
24
(
5
):
715
716
.

Steenwyk
JL
,
Buida
TJ
,
Labella
AL
,
Li
Y
,
Shen
X-X
,
Rokas
A.
PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data
.
Bioinformatics
.
2021
;
37
(
16
):
2325
2331
.

Steenwyk
JL
,
Buida
TJ
,
Li
Y
,
Shen
X-X
,
Rokas
A.
ClipKIT: a multiple sequence alignment trimming software for accurate phylogenomic inference
.
PLoS Biol
.
2020
;
18
(
12
):
e3001007
.

Steenwyk
JL
,
Goltz
DC
,
Buida
TJ
,
Li
Y
,
Shen
X-X
, Rokas A. orthoSNAP: a tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees. bioRxiv 2021.10.30.466607;
2021
.

Steenwyk
JL
,
Rokas
A.
ggpubfigs: colorblind-friendly color palettes and ggplot2 graphic system extensions for publication-quality scientific figures
.
Microbiol Resour Announc
.
2021a
;
10
(
44
):
e00871
21
.

Steenwyk
JL
,
Rokas
A.
orthofisher: a broadly applicable tool for automated gene identification and retrieval
.
2021b
;
11
(
9
):jkab250.

Steenwyk
JL
,
Shen
X-X
,
Lind
AL
,
Goldman
GH
,
Rokas
A.
A robust phylogenomic time tree for biotechnologically and medically important fungi in the Genera Aspergillus and Penicillium
.
MBio
.
2019
;
10
(
4
):
e00925
19
.

Sternke
M
,
Tripp
KW
,
Barrick
D.
Consensus sequence design as a general strategy to create hyperstable, biologically active proteins
.
Proc Natl Acad Sci U S A
.
2019
;
116
(
23
):
11275
11284
.

Struck
TH
,
Golombek
A
,
Weigert
A
,
Franke
FA
,
Westheide
W
,
Purschke
G
,
Bleidorn
C
,
Halanych
KM.
The evolution of Annelids reveals two adaptive routes to the interstitial realm
.
Curr Biol
.
2015
;
25
(
15
):
1993
1999
.

Susko
E
,
Roger
AJ.
On reduced amino acid alphabets for phylogenetic inference
.
Mol Biol Evol
.
2007
;
24
(
9
):
2139
2150
.

Walker
PJ
,
Firth
C
,
Widen
SG
,
Blasdell
KR
,
Guzman
H
,
Wood
TG
,
Paradkar
PN
,
Holmes
EC
,
Tesh
RB
,
Vasilakis
N
, et al. 
Evolution of genome size and complexity in the Rhabdoviridae
.
PLoS Pathog
.
2015
;
11
(
2
):
e1004664
.

Van Der Walt
S
,
Colbert
SC
,
Varoquaux
G.
The NumPy array: a structure for efficient numerical computation
.
Comput Sci Eng
.
2011
;
13
(
2
):
22
30
.

Whelan
NV
,
Kocot
KM
,
Moroz
LL
,
Halanych
KM.
Error, signal, and the placement of Ctenophora sister to all other animals
.
Proc Natl Acad Sci U S A
.
2015
;
112
(
18
):
5773
5778
.

Wickham
H.
ggplot2
.
New York (NY
):
Springer
;
2009
.

Winterton
SL
,
Lemmon
AR
,
Gillung
JP
,
Garzon
IJ
,
Badano
D
,
Bakkes
DK
,
Breitkreuz
LC
,
Engel
MS
,
Lemmon
EM
,
Liu
X
, et al. 
Evolution of lacewings and allied orders using anchored phylogenomics (Neuroptera, Megaloptera, Raphidioptera)
.
Syst Entomol
.
2018
;
43
(
2
):
330
354
.

Woese
CR
,
Achenbach
L
,
Rouviere
P
,
Mandelco
L.
Archaeal phylogeny: reexamination of the phylogenetic position of Archaeoglohus fulgidus in light of certain composition-induced artifacts
.
Syst Appl Microbiol
.
1991
;
14
(
4
):
364
371
.

Wren
JD.
Bioinformatics programs are 31-fold over-represented among the highest impact scientific papers of the past two decades
.
Bioinformatics
.
2016
;
32
(
17
):
2686
2691
.

Xia
X.
DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution
.
Mol Biol Evol
.
2013
;
30
(
7
):
1720
1728
.

Xu
X
,
Liu
Q
,
Fan
L
,
Cui
X
,
Zhou
X.
Analysis of synonymous codon usage and evolution of Begomoviruses
.
J Zhejiang Univ Sci B
.
2008
;
9
(
9
):
667
674
.

Yang
Y
,
Moore
MJ
,
Brockington
SF
,
Soltis
DE
,
Wong
GK-S
,
Carpenter
EJ
,
Zhang
Y
,
Chen
L
,
Yan
Z
,
Xie
Y
, et al. 
Dissecting molecular evolution in the highly diverse plant clade caryophyllales using transcriptome sequencing
.
Mol Biol Evol
.
2015
;
32
(
8
):
2001
2014
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Editor: J Stajich
J Stajich
Editor
Search for other works by this author on: