nasp package¶
Submodules¶
nasp.configuration_parser module¶
nasp.convert_external_genome module¶
nasp.dispatcher module¶
nasp.find_duplicates module¶
nasp.format_fasta module¶
nasp.matrix_DTO module¶
-
class
nasp.matrix_DTO.
NaspFile
(path, name, aligner, snpcaller)¶ Bases:
tuple
-
aligner
¶ Alias for field number 2
-
name
¶ Alias for field number 1
-
path
¶ Alias for field number 0
-
snpcaller
¶ Alias for field number 3
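The field aliases above match the layout of a named tuple. A minimal sketch, assuming NaspFile behaves like a standard collections.namedtuple (the example values are invented for illustration):

```python
from collections import namedtuple

# Hypothetical re-creation of the documented tuple; the field order follows
# the "Alias for field number" entries above.
NaspFile = namedtuple("NaspFile", ["path", "name", "aligner", "snpcaller"])

# Invented example values, not taken from any real run.
sample = NaspFile(
    path="sample1.vcf",
    name="sample1",
    aligner="example_aligner",
    snpcaller="example_snpcaller",
)

# Fields can be read by name or by index.
first_field = sample[0]
last_field = sample.snpcaller
```

Because the object is a tuple, it is immutable and can be unpacked positionally in the documented field order.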
-
nasp.nasp module¶
nasp.nasp_objects module¶
-
class
nasp.nasp_objects.
CollectionStatistics
[source]¶ Bases:
object
Stores a running tally of the statistics for the run. Stats fall into two categories: per-contig and per-sample. Contig stats are basic counts, and percentages based on reference length. Sample stats are tallied as x out of y for each position, and then counts for each sample, all samples, and any samples can be computed automatically. For this reason, the object needs to know when the run moves on to the next position; calling flush_cumulative_stat_cache signals this.
-
flush_cumulative_stat_cache
()[source]¶ Evaluates the p/t ratio in the stat cache for the current position and writes the result to the sample stats for the any/all counts:
- p > 0: any++
- p = t: all++
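The any/all rule can be sketched as follows. This is an illustration only: the function name flush_position and the tallies dictionary are hypothetical and are not part of the nasp API.

```python
def flush_position(passed, total, tallies):
    """Fold one position's p-out-of-t count into the any/all tallies.

    Hypothetical helper: `passed` is p, `total` is t, and `tallies` is a
    plain dict with "any" and "all" counters.
    """
    if passed > 0:
        tallies["any"] += 1  # at least one sample passed at this position
    if passed == total:
        tallies["all"] += 1  # every sample passed at this position
    return tallies


tallies = {"any": 0, "all": 0}
# Three invented positions: 2 of 3 passed, 3 of 3 passed, 0 of 3 passed.
for passed, total in [(2, 3), (3, 3), (0, 3)]:
    flush_position(passed, total, tallies)
```

The sketch applies the two rules exactly as stated and does not special-case a position where t is zero.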
-
-
class
nasp.nasp_objects.
FastaGenome
[source]¶ Bases:
nasp.nasp_objects.Genome
,nasp.nasp_objects.GenomeMeta
A special type of genome where we know the data came from a fasta file, and so we can omit the depth and proportion filters. Meant to mimic a VCFGenome object, with filter checks hard-coded.
-
class
nasp.nasp_objects.
Genome
[source]¶ Bases:
nasp.nasp_objects.GenomeStatus
A special type of GenomeStatus where the genome information being stored is always actual base calls, as strings.
-
get_call
(first_position, last_position=None, contig_name=None, filler_value='X')[source]¶ Alias of get_value, for code clarity
-
import_fasta_file
(fasta_filename, contig_prefix='')[source]¶ Read in a fasta file.
- Args:
- fasta_filename (str): fasta file to import
- contig_prefix (str): the prefix will be removed from the parsed contig names
-
static
reverse_complement
(dna_string)[source]¶ - Args:
- dna_string (str): nucleotide sequence to reverse complement
- Returns:
- string: nucleotide sequence reverse complement
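A minimal sketch of a reverse complement, assuming only the unambiguous bases A, C, G, and T need handling. The documented method may treat degenerate codes or other symbols differently, and the translation table here is an assumption.

```python
# Translation table covering upper- and lower-case unambiguous bases only.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna_string):
    """Complement each base, then reverse the sequence."""
    return dna_string.translate(_COMPLEMENT)[::-1]


example = reverse_complement("AACG")
```

Characters outside the table (for example 'N') pass through unchanged in this sketch.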
-
set_call
(new_data, first_position, missing_range_filler='X', contig_name=None)[source]¶ Alias of set_value, for code clarity
-
static
simple_call
(dna_string, allow_x=False, allow_del=False)[source]¶ Standardizes the DNA call assumed to be the base at position one. Discards insertion data, changes ‘U’ to ‘T’, and changes degeneracies to ‘N’. ‘X’ and deletes are changed to ‘N’ by default.
- Args:
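- dna_string (str): only the first position is considered
- allow_x (bool): if True, ‘X’ is kept rather than changed to ‘N’
- allow_del (bool): if True, a deletion is kept as ‘.’ rather than changed to ‘N’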
- Returns:
- string: ‘A’, ‘C’, ‘G’, ‘T’, or ‘N’ with optional ‘X’ and ‘.’
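The standardization rules can be sketched as below. This is not the library's implementation: it assumes a deletion is written as '.', that input is case-insensitive, and that an empty string yields 'N'.

```python
def simple_call(dna_string, allow_x=False, allow_del=False):
    """Reduce a call to one of A, C, G, T, N (optionally X or '.')."""
    call = "N"  # default for degeneracies and anything unrecognized
    if dna_string:
        # Only the first character matters; trailing insertion data is dropped.
        first = dna_string[0].upper()
        if first == "U":
            first = "T"
        if first in ("A", "C", "G", "T"):
            call = first
        elif first == "X" and allow_x:
            call = "X"
        elif first == "." and allow_del:
            call = "."
    return call
```

For example, a call of "AT" keeps only the leading 'A', and a degenerate code such as 'R' becomes 'N'.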
-
-
class
nasp.nasp_objects.
GenomeCollection
[source]¶ Bases:
nasp.nasp_objects.CollectionStatistics
A “master matrix” object, of sorts. Carries all the data necessary to make a matrix, although some of it is computed on-the-fly as the matrix is actually written, for performance reasons. Stats aren’t available until after the matrix is written, for this reason.
-
add_genome
(genome)[source]¶ Adds the genome to the collection, then makes sure the genome list is properly set and in order.
-
send_to_matrix_handles
(matrix_formats)[source]¶ Writes headers and handles per-matrix logic. Calls _write_matrix_line to handle the per-line computation and analysis.
-
-
class
nasp.nasp_objects.
GenomeMeta
[source]¶ Bases:
object
Stores the metadata associated with a genome.
-
add_generators
(generator_array)[source]¶ - Args:
- generator_array: A list of analysis tools that have been run on input files to produce this data, from earliest to latest.
-
static
generate_nickname_from_filename
(filename)[source]¶ For single-sample input files that don’t carry any sample name metadata within the file, generate a nickname for the sample by removing the extension. If this fails, generate a random name in the format “file_XXXXXXXX” where XXXXXXXX is an 8-digit random integer.
- Args:
- filename (str):
- Returns:
- string: filename sans extension or “file_XXXXXXXX” where XXXXXXXX is an 8-digit random integer.
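A rough sketch of the nickname rule follows. The helper name nickname_from_filename is hypothetical, and stripping the directory part of the path is an assumption the description does not state.

```python
import os
import random


def nickname_from_filename(filename):
    """Return the filename without its extension, or a random fallback."""
    # Assumption: the directory part is dropped along with the extension.
    stem = os.path.splitext(os.path.basename(filename))[0]
    if stem:
        return stem
    # Fallback: "file_" followed by eight zero-padded random digits.
    return "file_%08d" % random.randrange(10 ** 8)


nickname = nickname_from_filename("/data/sample1.vcf")
fallback = nickname_from_filename("")
```

Neither result is guaranteed to be unique, which is consistent with the warning given for identifier().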
-
identifier
()[source]¶ - Returns:
- string: meant to be recognizable to the user to differentiate their sample-analyses. This value cannot be assumed to be unique, because the nickname is often not unique. As a result, it should not be used in code (e.g. as a dictionary key) to differentiate samples. The best we can do is a ( _file_path, _nickname ) tuple, and even that may not be guaranteed to be unique.
-
-
class
nasp.nasp_objects.
GenomeStatus
[source]¶ Bases:
object
Contains and manipulates any generic data that is per-contig-position. This could be any single type of data, like actual bases, filter data, depth information, insertion data, pileups, etc. In the Perl version, this object was originally a hash of lists, and was converted to a hash of strings for performance. In this version, it was originally a dictionary of strings, and was converted to a dictionary of lists for performance and flexibility. Whether single characters, numbers, boolean values, or a mixture of data types are used does not seem to affect memory and performance. Storing lists per-contig-position with this class is complicated, because several of the manipulation functions assume that a list refers to a continuous range of positions rather than a single position.
-
add_contig
(contig_name)[source]¶ Defines a new empty contig in the genome. By default, if an unrecognized contig is encountered, a new empty contig will be created and then acted upon. Otherwise, add_contig must be called on a new contig first, or an InvalidContigName will be thrown.
- Args:
- contig_name (str): Unique contig description.
- Raises:
- InvalidContigName: If contig_name is undefined.
-
append_contig
(genome_data, contig_name=None)[source]¶ Places the passed-in data at the position following the last defined position on the contig. If passed a list, will give each item in the list its own position.
- Args:
- genome_data (list): List of nucleotide symbols.
- contig_name (str): Unique contig description.
-
extend_contig
(new_length, missing_range_filler, contig_name=None)[source]¶ Ensures the contig is at least new_length positions long
- Args:
- new_length (int): Minimum contig length.
- missing_range_filler (str): Placeholder character for undefined areas at the end of the contig.
- contig_name (str): Unique contig description.
-
get_contig_length
(contig_name=None)[source]¶ - Args:
- contig_name (str): Unique contig description.
- Returns:
- int: Number of positions defined in the contig
-
get_value
(first_position, last_position=None, contig_name=None, filler_value=None)[source]¶ - Args:
- contig_name (str): Unique contig description.
- first_position (int): 1-indexed first position number.
- last_position (int): Optional last position to select a range or -1 to specify the end of the contig.
- filler_value (str): Optional filler for undefined regions beyond the genome data. Does not modify the data.
- Returns:
- Returns the nucleotide at first_position, list of values from first_position to last_position inclusive, or None.
-
send_to_fasta_handle
(output_handle, contig_prefix='', max_chars_per_line=80)[source]¶ Assumes the genome data is a string, or can be converted to one, with one character per position, and writes it to a handle that is open for writing. The output resembles a typical FASTA file if the genome data are base calls, but no checks are performed.
The contigs are sorted by name, not in the order they were created.
- Args:
- output_handle (file object): File to append FASTA string.
- contig_prefix (str): Prefix for all contig names.
- max_chars_per_line (int): A positive value will limit the max chars per line.
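The output layout described for send_to_fasta_handle can be sketched with a standalone function. The name write_fasta and the contigs dictionary argument are hypothetical; the real method reads its data from the object itself.

```python
import io


def write_fasta(contigs, output_handle, contig_prefix="", max_chars_per_line=80):
    """Write a {contig name: list of per-position values} mapping as FASTA."""
    for name in sorted(contigs):  # sorted by name, not by creation order
        sequence = "".join(str(value) for value in contigs[name])
        output_handle.write(">" + contig_prefix + name + "\n")
        if max_chars_per_line > 0:
            # Wrap the sequence into fixed-width lines.
            for start in range(0, len(sequence), max_chars_per_line):
                output_handle.write(sequence[start:start + max_chars_per_line] + "\n")
        else:
            output_handle.write(sequence + "\n")


handle = io.StringIO()
write_fasta(
    {"b": list("ACGTAC"), "a": ["A", "C"]},
    handle,
    contig_prefix="x_",
    max_chars_per_line=4,
)
fasta_text = handle.getvalue()
```

Writing to any file-like object, here an in-memory io.StringIO, is what makes the handle-based form convenient for unit tests.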
-
set_current_contig
(contig_name, create_contig=True)[source]¶ Sets the most-recently-referenced contig without actually performing any action on the data. Can be called to return the current contig without changing it if given a contig_name of None. Will create the contig if it has not been encountered yet by default, or throw an InvalidContigName otherwise.
- Args:
- contig_name (str): Unique contig description or None to query the current contig name.
- create_contig (bool): If True and the contig does not exist, an empty contig will be created.
- Returns:
- str: Name of the last accessed contig or None.
- Raises:
- InvalidContigName: If create_contig is False and the contig does not exist.
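The query, create, and raise behaviors can be sketched together. Both classes below are local stand-ins written for illustration; they are not the nasp implementations, and the error message text is invented.

```python
class InvalidContigName(Exception):
    """Local stand-in for the documented exception."""

    def __init__(self, invalid_contig, contig_list):
        super().__init__(
            "Unknown contig %r; known contigs: %s"
            % (invalid_contig, ", ".join(contig_list))
        )


class ContigTracker:
    """Hypothetical holder for contig data and the current contig name."""

    def __init__(self):
        self._contigs = {}
        self._current = None

    def set_current_contig(self, contig_name, create_contig=True):
        if contig_name is None:
            return self._current  # query only, nothing changes
        if contig_name not in self._contigs:
            if not create_contig:
                raise InvalidContigName(contig_name, sorted(self._contigs))
            self._contigs[contig_name] = []  # new empty contig
        self._current = contig_name
        return self._current


tracker = ContigTracker()
before = tracker.set_current_contig(None)
created = tracker.set_current_contig("contig_a")
after = tracker.set_current_contig(None)
```

Passing create_contig=False for a name that has not been seen raises the exception instead of silently creating the contig.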
-
set_value
(new_data, position_number, missing_range_filler='!', contig_name=None)[source]¶ Sets the value at position_number on the contig. If passed a list, will change the continuous range of positions starting at position_number, one position per list item. Will extend the contig with missing_range_filler filling undefined values if the position to set is beyond the end of the contig.
- Args:
- new_data (str or list): Single or list of nucleotide symbols.
- position_number (int): 1-indexed contig position number.
- missing_range_filler (str): Filler for undefined regions before the set value. Modifies the data.
- contig_name (str): Unique contig description.
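The 1-indexed addressing and the two kinds of filler used by get_value and set_value can be sketched with a small dictionary-of-lists store. The class name ContigStore and the default contig name are hypothetical simplifications; the real class falls back to the current contig when contig_name is None.

```python
class ContigStore:
    """Hypothetical stand-in for a dictionary-of-lists genome store."""

    def __init__(self):
        self._contigs = {}

    def set_value(self, new_data, position_number,
                  missing_range_filler="!", contig_name="contig"):
        contig = self._contigs.setdefault(contig_name, [])
        values = list(new_data) if isinstance(new_data, list) else [new_data]
        start = position_number - 1  # 1-indexed position to list index
        end = start + len(values)
        if len(contig) < end:
            # Extending with filler modifies the stored data.
            contig.extend([missing_range_filler] * (end - len(contig)))
        contig[start:end] = values

    def get_value(self, first_position, last_position=None,
                  contig_name="contig", filler_value=None):
        contig = self._contigs.get(contig_name, [])
        if last_position is None:
            index = first_position - 1
            return contig[index] if index < len(contig) else filler_value
        if last_position == -1:
            last_position = len(contig)  # -1 selects through the contig end
        values = contig[first_position - 1:last_position]
        missing = (last_position - first_position + 1) - len(values)
        # Padding the result leaves the stored data untouched.
        return values + [filler_value] * missing


store = ContigStore()
store.set_value("A", 3)
store.set_value(["C", "G"], 4)
```

After these two calls the contig holds '!', '!', 'A', 'C', 'G': positions 1 and 2 were never set, so they carry the missing-range filler.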
-
write_to_fasta_file
(output_filename, contig_prefix='', max_chars_per_line=80)[source]¶ Opens the passed filename and passes to send_to_fasta_handle. This is a separate function so that unit testing is easier, and file names or open file handles can be used as destinations.
- Args:
- output_filename (str): Output filename.
- contig_prefix (str): Prefix for all contig names.
- max_chars_per_line (int): A positive value will limit the max contig chars per line.
-
-
class
nasp.nasp_objects.
IndelList
[source]¶ Bases:
object
For storing indel data separately from the reference-indexed position data. Mostly a relic from when calls were stored as a long string with one character per position. Might still have some use for indel implementation, but might be no longer useful.
-
exception
nasp.nasp_objects.
InvalidContigName
(invalid_contig, contig_list)[source]¶ Bases:
Exception
A contig was referenced that is not known to exist in this run. For the most part, this exception isn’t used because encountering an unknown contig either means we should create it or we should discard the data that follows. However, there are potential times where we need to take more extreme action.
-
exception
nasp.nasp_objects.
MalformedInputFile
(data_file, error_message=None)[source]¶ Bases:
Exception
There was an error parsing the input file. Give the information we can to the user in the stderr stream. Let’s hope they read it.
-
exception
nasp.nasp_objects.
ReferenceCallMismatch
(old_call, new_call, data_file=None, contig_name=None, position_number=None)[source]¶ Bases:
Exception
At two different points in the run, the reference call appeared, and the two calls disagreed. This probably means the user is analyzing two different sets of data as if they belong together, in error. This is a very bad thing.
-
class
nasp.nasp_objects.
ReferenceGenome
[source]¶ Bases:
nasp.nasp_objects.Genome
A special type of genome that is to be used as our reference. Unlike other genomes, we know there will only be one, it carries duplicate region data, and we don’t need to store any metadata about it.
-
class
nasp.nasp_objects.
VCFGenome
[source]¶ Bases:
nasp.nasp_objects.Genome
,nasp.nasp_objects.GenomeMeta
A standard sample for analysis. Has genome data, metadata, and data for the three filters.
-
class
nasp.nasp_objects.
VCFRecord
(file_path)[source]¶ Bases:
object
VCF parser, object representing an input VCF being read.
nasp.vcf_to_matrix module¶
-
nasp.vcf_to_matrix.
determine_file_type
(input_file)[source]¶ Get the file type from the packed input filename string
-
nasp.vcf_to_matrix.
get_file_path
(input_file)[source]¶ Get the file path from the packed input filename string
-
nasp.vcf_to_matrix.
import_external_fasta
(input_file)[source]¶ Create a FastaGenome object, set its metadata, and populate it with the data from a fasta file. Must return the data as the single item in an array, because other file formats potentially contain several genomes.
-
nasp.vcf_to_matrix.
import_reference
(reference, reference_path, dups_path)[source]¶ Take an empty reference object and populate it with the data from a reference file, and a dups file if any. Does not return anything, as the passed-in object is modified.
-
nasp.vcf_to_matrix.
main
()[source]¶ Main flow control function for the script. Runs the following steps in sequence:
1. Parse arguments
2. Read in reference
3. Read in all query genomes in parallel
4. Write output matrices
5. Write stats files
-
nasp.vcf_to_matrix.
manage_input_thread
(reference, min_coverage, min_proportion, input_q, output_q)[source]¶ Manage one input file worker thread, for reading the data from the file. Input filenames are pulled one at a time from the input queue, and the genome data from the read-in files is placed on the output queue. When an input filename of “None” appears, we know we’re done and put “None” on the output queue so the controlling thread knows we won’t be adding more data.
-
nasp.vcf_to_matrix.
parse_input_files
(input_files, num_threads, genomes, min_coverage, min_proportion)[source]¶ Use a pool of worker threads to, in parallel, read in the input files. Populate the genome collection with the read-in data. This is the “poison pill” thread management algorithm, where threads are each given a “you can stop now” task once the actual queue of tasks is complete.
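The "poison pill" pattern can be sketched with the standard library's queue and threading modules. This is a generic illustration, not the nasp code: the worker, the run_pool helper, and the upper-casing that stands in for parsing a file are all assumptions.

```python
import queue
import threading


def worker(input_q, output_q):
    """Process items until a None (the poison pill) arrives."""
    while True:
        item = input_q.get()
        if item is None:
            output_q.put(None)  # tell the controller this worker is done
            break
        output_q.put(item.upper())  # placeholder for reading one input file


def run_pool(items, num_threads):
    input_q, output_q = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(input_q, output_q))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        input_q.put(item)
    for _ in threads:
        input_q.put(None)  # one pill per worker, queued after the real tasks

    results, finished = [], 0
    while finished < num_threads:
        result = output_q.get()
        if result is None:
            finished += 1
        else:
            results.append(result)
    for thread in threads:
        thread.join()
    return results


results = run_pool(["a.vcf", "b.vcf", "c.vcf"], num_threads=2)
```

Results arrive in completion order, so a caller that needs a stable order has to sort them or carry an index alongside each item.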
-
nasp.vcf_to_matrix.
read_vcf_file
(reference, min_coverage, min_proportion, input_file)[source]¶ Submit the VCF file to the VCF parser, populate genome data and filter data from the parsed VCF data, and return a list of the read-in genomes.
-
nasp.vcf_to_matrix.
set_genome_metadata
(genome, input_file)[source]¶ Get all the metadata from the packed input filename string
-
nasp.vcf_to_matrix.
write_output_matrices
(genomes, matrix_folder, matrix_format_choices)[source]¶ Write matrices from genome collection data. Defines the matrix types for future expansion of custom matrix options. This information should eventually come from the user interface and be included in the XML configuration file, rather than being hardcoded here. The matrix_format_choices option is passed here from the XML configuration.