nasp package

Submodules

nasp.configuration_parser module

nasp.configuration_parser.main()[source]
nasp.configuration_parser.parse_config(config_file)[source]
Args:
config_file (str): path to an XML formatted configuration file
Returns:
dict: configuration dictionary
nasp.configuration_parser.write_config(configuration)[source]
Args:
configuration:

nasp.convert_external_genome module

nasp.convert_external_genome.generate_delta_file(nucmer_path, nucmer_args, delta_filter_path, delta_filter_args, external_nickname, reference_path, external_path)[source]
nasp.convert_external_genome.main()[source]
nasp.convert_external_genome.parse_delta_file(delta_filename, franken_genome, external_genome)[source]

nasp.dispatcher module

nasp.dispatcher.begin(configuration)[source]
nasp.dispatcher.main()[source]

nasp.find_duplicates module

nasp.find_duplicates.main()[source]
nasp.find_duplicates.parse_delta_file(delta_filename, dups_data)[source]
nasp.find_duplicates.run_nucmer_on_reference(nucmer_path, reference_path)[source]

nasp.format_fasta module

nasp.format_fasta.format_fasta(inputfasta, outputfasta)[source]
nasp.format_fasta.main()[source]

nasp.matrix_DTO module

class nasp.matrix_DTO.NaspFile(path, name, aligner, snpcaller)

Bases: tuple

aligner

Alias for field number 2

name

Alias for field number 1

path

Alias for field number 0

snpcaller

Alias for field number 3
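
The field layout above (a tuple subclass with four named fields in the order path, name, aligner, snpcaller) is what `collections.namedtuple` produces. A minimal sketch of an equivalent definition follows; the sample values are hypothetical and the real class may be declared differently.

```python
from collections import namedtuple

# Equivalent shape to the documented class: fields 0..3 in this order.
NaspFile = namedtuple("NaspFile", ["path", "name", "aligner", "snpcaller"])

# Hypothetical values, for illustration only.
sample = NaspFile(path="sample1.vcf", name="sample1", aligner="bwa", snpcaller="gatk")

# Each field is reachable by name or by its field number.
aligner_by_name = sample.aligner
aligner_by_index = sample[2]
```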

nasp.matrix_DTO.main()[source]
nasp.matrix_DTO.parse_dto(xml_file)[source]
Args:
xml_file:
Returns:
dict: The xml file as a dictionary.
nasp.matrix_DTO.write_dto(matrix_parms, franken_fastas, vcf_files, xml_file)[source]

nasp.nasp module

nasp.nasp.complete(text, state)[source]

Tab autocomplete for prompts.

Args:
text (str): current user input
state (int): index from 0 to n until the function returns a non-string value
Returns:
str: matching file at the current state index
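
The (text, state) signature follows the completer protocol of Python's readline module: the function is called with state 0, 1, 2, ... until it returns a non-string value. A minimal sketch of that protocol is shown here, assuming file matches come from a simple glob on the typed prefix; `complete_path` is a hypothetical name and the actual matching rules in nasp.nasp may differ.

```python
import glob


def complete_path(text, state):
    """Return the state-th path that starts with text, or None when exhausted."""
    matches = sorted(glob.glob(text + "*"))
    return matches[state] if state < len(matches) else None


# Typical registration (left commented out because readline is not
# available on every platform):
# import readline
# readline.set_completer(complete_path)
# readline.parse_and_bind("tab: complete")
```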
nasp.nasp.gonasp_path()[source]
nasp.nasp.main()[source]

nasp.nasp_objects module

class nasp.nasp_objects.CollectionStatistics[source]

Bases: object

Stores a running tally for the statistics for the run. Stats are in two categories: per-contig and per-sample. Contig stats are basic counts, and percentages based on reference length. Sample stats are tallied as x out of y for each position, and then counts for each sample, all samples, and any samples, can be computed automatically. For this reason, the object needs to know when the run moves on to the next position, and the flush_cumulative_stat_cache function does this.

flush_cumulative_stat_cache()[source]

Computes the p/t ratio from the stat cache for the current position and updates the any/all counts in the sample stats: if p > 0, the any count is incremented; if p = t, the all count is incremented.
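
The any/all rule can be sketched as follows, assuming p is the number of passing entries and t the total number of entries cached for the position. The function name and the dictionary layout are hypothetical, not the class's internal structure.

```python
def flush_position(passed, total, tallies):
    """Fold one position's passed/total counts into running any/all tallies."""
    if passed > 0:
        tallies["any"] += 1  # at least one entry passed at this position
    if passed == total:
        # Literal reading of "p = t". With total == 0 this also counts as
        # "all"; the source does not say how that case is handled.
        tallies["all"] += 1
    return tallies


tallies = {"any": 0, "all": 0}
flush_position(1, 3, tallies)  # some passed
flush_position(3, 3, tallies)  # every entry passed
flush_position(0, 3, tallies)  # none passed
```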

get_contig_stat(stat_id, contig_name=None)[source]
get_cumulative_stat(stat_id, cum_type, sample_nickname=None)[source]
get_sample_stat(stat_id, sample_nickname, sample_identifier, sample_path)[source]
increment_contig_stat(stat_id, contig_name=None)[source]

Increments the stat for the specified contig, and automatically increments the count for the all-contigs tally on the same stat.

record_sample_stat(stat_id, sample_nickname, sample_identifier, sample_path, did_pass)[source]

Updates the sample stat, and then the cumulative stat cache. Increments the cache for all samples, and for all analyses on this sample.

class nasp.nasp_objects.FastaGenome[source]

Bases: nasp.nasp_objects.Genome, nasp.nasp_objects.GenomeMeta

A special type of genome where we know the data came from a fasta file, and so we can omit the depth and proportion filters. Meant to mimic a VCFGenome object, with filter checks hard-coded.

get_coverage_pass(current_pos, contig_name=None)[source]

This filter is not applicable.

get_proportion_pass(current_pos, contig_name=None)[source]

This filter is not applicable.

get_was_called(current_pos, contig_name=None)[source]

A stand-in for the was-called filter that just makes sure the position isn’t an “N”.

class nasp.nasp_objects.Genome[source]

Bases: nasp.nasp_objects.GenomeStatus

A special type of GenomeStatus where the genome information being stored is always actual base calls, as strings.

get_call(first_position, last_position=None, contig_name=None, filler_value='X')[source]

Alias of get_value, for code clarity

import_fasta_file(fasta_filename, contig_prefix='')[source]

Read in a fasta file.

Args:
fasta_filename (str): fasta file to import
contig_prefix (str): the prefix will be removed from the parsed contig names
static reverse_complement(dna_string)[source]
Args:
dna_string (str): nucleotide sequence to reverse complement
Returns:
string: nucleotide sequence reverse complement
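
A minimal sketch of a reverse complement, assuming only A, C, G, and T (either case) are complemented and every other symbol passes through unchanged. How the actual method treats other symbols is not stated here.

```python
# Translation table for the four unambiguous bases, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(dna_string):
    """Complement each base, then reverse the order of the sequence."""
    return dna_string.translate(_COMPLEMENT)[::-1]
```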
set_call(new_data, first_position, missing_range_filler='X', contig_name=None)[source]

Alias of set_value, for code clarity

static simple_call(dna_string, allow_x=False, allow_del=False)[source]

Standardizes the DNA call assumed to be the base at position one. Discards insertion data, changes ‘U’ to ‘T’, and changes degeneracies to ‘N’. ‘X’ and deletes are changed to ‘N’ by default.

Args:
dna_string (str): only the first position is considered
allow_x (bool):
allow_del (bool):
Returns:
string: ‘A’, ‘C’, ‘G’, ‘T’, or ‘N’ with optional ‘X’ and ‘.’
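
The standardization rules can be sketched as below. Several details are assumptions rather than documented behavior: that '.' is the deletion symbol (taken from the Returns line), that input is treated case-insensitively, and that an empty string yields 'N'.

```python
def simple_call_sketch(dna_string, allow_x=False, allow_del=False):
    """Reduce a call to a single standardized symbol."""
    # Keep only the first position, which drops any insertion data.
    call = dna_string[:1].upper()
    if call == "U":
        return "T"
    if call in ("A", "C", "G", "T"):
        return call
    if call == "X" and allow_x:
        return "X"
    if call == "." and allow_del:
        return "."
    # Degeneracies, disallowed 'X' or deletions, and anything else.
    return "N"
```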
class nasp.nasp_objects.GenomeCollection[source]

Bases: nasp.nasp_objects.CollectionStatistics

A “master matrix” object, of sorts. Carries all the data necessary to make a matrix, although some of it is computed on-the-fly as the matrix is actually written, for performance reasons. Stats aren’t available until after the matrix is written, for this reason.

add_failed_genome(genome_path)[source]
add_genome(genome)[source]

Adds the genome to the collection, then makes sure the genome list is properly set and in order.

get_contigs()[source]
get_dups_call(first_position, last_position=None, contig_name=None)[source]
reference()[source]
send_to_matrix_handles(matrix_formats)[source]

Writes headers and handles per-matrix logic. Calls _write_matrix_line to handle the per-line computation and analysis.

set_current_contig(contig_name)[source]
set_reference(reference)[source]
write_to_matrices(matrix_formats)[source]

Opens files for writing; abstracted for flexibility/testing.

write_to_stats_files(general_filename, sample_filename)[source]

Opens files for writing; abstracted for flexibility/testing.

class nasp.nasp_objects.GenomeMeta[source]

Bases: object

Stores the metadata associated with a genome.

add_generators(generator_array)[source]
Args:
generator_array: A list of analysis tools that have been run on input files to produce this data, from earliest to latest.
file_path()[source]
Returns:
string:
file_type()[source]
Returns:
string:
static generate_nickname_from_filename(filename)[source]

For single-sample input files that don’t carry any sample name metadata within the file, generate a nickname for the sample by removing the extension. If this fails, generate a random name in the format “file_XXXXXXXX” where XXXXXXXX is an 8-digit random integer.

Args:
filename (str):
Returns:
string: filename sans extension or “file_XXXXXXXX” where XXXXXXXX is an 8-digit random integer.
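
A minimal sketch of this fallback logic, under two assumptions that the description does not confirm: that the directory part of the path is dropped along with the extension, and that "fails" means the remaining name is empty.

```python
import os
import random


def nickname_from_filename_sketch(filename):
    """Strip directory and extension; fall back to a random file_XXXXXXXX name."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    if stem:
        return stem
    # Zero-padded so the numeric part is always eight digits long.
    return "file_%08d" % random.randrange(10 ** 8)
```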
identifier()[source]
Returns:
string: meant to be recognizable to the user to differentiate their sample-analyses. There can be no assumption that this value is unique, because the nickname is often not unique. As a result, this should not be used in code (e.g., as dictionary keys) to differentiate samples. The best we can do is a ( _file_path, _nickname ) tuple, and even that may not be guaranteed to be unique.
nickname()[source]
Returns:
string:
static reverse_complement(dna_string)[source]
set_file_path(file_path)[source]
Args:
file_path (str):
set_file_type(type_string)[source]
Args:
type_string (str):
set_nickname(nickname)[source]

This value should be unique per-file, for multi-sample input files, but there is no expectation that this would be unique per run, and nothing to enforce even per-file uniqueness.

Args:
nickname (str):
class nasp.nasp_objects.GenomeStatus[source]

Bases: object

Contains and manipulates any generic data that is per-contig-position. This could be any single type of data, like actual bases, filter data, depth information, insertion data, pileups, etc. In the Perl version, this object was originally a hash of lists, and converted to a hash of strings for performance. In this version, it was originally a dictionary of strings, and converted to a dictionary of lists for performance and flexibility. Whether single characters, numbers, boolean values, or a mixture of data types are used does not seem to affect memory and performance. Storing lists per-contig-position with this class is a complicated affair, as several of the manipulation functions assume you mean to manipulate a continuous range of positions instead of a single position when you do that.

add_contig(contig_name)[source]

Defines a new empty contig in the genome. By default, if an unrecognized contig is encountered, a new empty contig will be created and then acted upon. Otherwise, add_contig must be called on a new contig first, or an InvalidContigName will be thrown.

Args:
contig_name (str): Unique contig description.
Raises:
InvalidContigName: If contig_name is undefined.
append_contig(genome_data, contig_name=None)[source]

Places the passed-in data at the position following the last defined position on the contig. If passed a list, will give each item in the list its own position.

Args:
genome_data (list): List of nucleotide symbols.
contig_name (str): Unique contig description.
extend_contig(new_length, missing_range_filler, contig_name=None)[source]

Ensures the contig is at least new_length positions long

Args:
new_length (int): Minimum contig length.
missing_range_filler (str): Placeholder character for undefined areas at the end of the contig.
contig_name (str): Unique contig description.
get_contig_length(contig_name=None)[source]
Args:
contig_name (str): Unique contig description.
Returns:
int: Number of positions defined in the contig
get_contigs()[source]
Returns:
list: Sorted list of contig names.
get_value(first_position, last_position=None, contig_name=None, filler_value=None)[source]
Args:
contig_name (str): Unique contig description.
first_position (int): 1-indexed first position number.
last_position (int): Optional last position to select a range or -1 to specify the end of the contig.
filler_value (str): Optional filler for undefined regions beyond the genome data. Does not modify the data.
Returns:
Returns the nucleotide at first_position, list of values from first_position to last_position inclusive, or None.
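
The 1-indexed, inclusive lookup can be sketched with a plain list standing in for one contig. This is an illustration under assumptions, not the class's code: `get_value_sketch` is a hypothetical name, positions are assumed to be at least 1, and a single position beyond the data is assumed to return the filler (None by default).

```python
def get_value_sketch(contig, first_position, last_position=None, filler_value=None):
    """Look up one position or an inclusive range without modifying contig."""
    if last_position is None:
        index = first_position - 1  # convert 1-indexed to 0-indexed
        return contig[index] if 0 <= index < len(contig) else filler_value
    if last_position == -1:
        last_position = len(contig)  # -1 selects through the end of the contig
    values = contig[first_position - 1:last_position]
    missing = (last_position - first_position + 1) - len(values)
    if filler_value is not None and missing > 0:
        # Pad the returned copy only; the stored data stays untouched.
        values = values + [filler_value] * missing
    return values
```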
send_to_fasta_handle(output_handle, contig_prefix='', max_chars_per_line=80)[source]

Assumes the genome data is in string format, or stringifiable, with one character per position, and writes it to the handle open for writing. The output resembles a typical fasta file when the genome data are base calls (but no checks are performed).

The contigs are sorted by name, not the order they were created.

Args:
output_handle (file object): File to append FASTA string.
contig_prefix (str): Prefix for all contig names.
max_chars_per_line (int): A positive value will limit the max chars per line.
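
A minimal sketch of this kind of writer, assuming the genome is a dictionary mapping contig names to lists with one item per position. The function name, the dictionary layout, and the treatment of a non-positive max_chars_per_line as "no wrapping" are assumptions for illustration.

```python
def write_fasta_sketch(contigs, output_handle, contig_prefix="", max_chars_per_line=80):
    """Write contigs sorted by name, wrapping sequence lines when a limit is set."""
    for name in sorted(contigs):
        output_handle.write(">" + contig_prefix + name + "\n")
        # Stringify each position; no check is made that these are base calls.
        sequence = "".join(str(item) for item in contigs[name])
        if max_chars_per_line > 0:
            for start in range(0, len(sequence), max_chars_per_line):
                output_handle.write(sequence[start:start + max_chars_per_line] + "\n")
        else:
            output_handle.write(sequence + "\n")
```

Taking an open handle rather than a filename lets an in-memory buffer such as io.StringIO serve as the destination in tests.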
set_current_contig(contig_name, create_contig=True)[source]

Sets the most-recently-referenced contig without actually performing any action on the data. Can be called to return the current contig without changing it if given a contig_name of None. Will create the contig if it has not been encountered yet by default, or throw an InvalidContigName otherwise.

Args:
contig_name (str): Unique contig description or None to query the current contig name.
create_contig (bool): If True and the contig does not exist, an empty contig will be created.
Returns:
str: Name of the last accessed contig or None.
Raises:
InvalidContigName: If create_contig is False and the contig does not exist.
set_value(new_data, position_number, missing_range_filler='!', contig_name=None)[source]

Sets the value at position_number on the contig. If passed a list, will change the continuous range of positions starting at position_number, one position per list item. Will extend the contig with missing_range_filler filling undefined values if the position to set is beyond the end of the contig.

Args:
new_data (str or list): Single nucleotide symbol or list of nucleotide symbols.
position_number (int): 1-indexed contig position number.
missing_range_filler (str): Filler for undefined regions before the set value. Modifies the data.
contig_name (str): Unique contig description.
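
The extend-then-assign behavior can be sketched with a plain list standing in for one contig. `set_value_sketch` is a hypothetical name, and the real method also resolves contig_name, which is omitted here.

```python
def set_value_sketch(contig, new_data, position_number, missing_range_filler="!"):
    """Set one position, or a continuous range when new_data is a list."""
    items = new_data if isinstance(new_data, list) else [new_data]
    end = position_number - 1 + len(items)  # 0-indexed exclusive end
    if len(contig) < end:
        # Grow the contig, filling the undefined gap with the filler.
        contig.extend([missing_range_filler] * (end - len(contig)))
    contig[position_number - 1:end] = items
    return contig
```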
write_to_fasta_file(output_filename, contig_prefix='', max_chars_per_line=80)[source]

Opens the passed filename and passes to send_to_fasta_handle. This is a separate function so that unit testing is easier, and file names or open file handles can be used as destinations.

Args:
output_filename (str): Output filename.
contig_prefix (str): Prefix for all contig names.
max_chars_per_line (int): A positive value will limit the max contig chars per line.
class nasp.nasp_objects.IndelList[source]

Bases: object

For storing indel data separately from the reference-indexed position data. Mostly a relic from when calls were stored as a long string with one character per position. Might still have some use for indel implementation, but might be no longer useful.

exception nasp.nasp_objects.InvalidContigName(invalid_contig, contig_list)[source]

Bases: Exception

A contig was referenced that is not known to exist in this run. For the most part, this exception isn’t used because encountering an unknown contig either means we should create it or we should discard the data that follows. However, there are potential times where we need to take more extreme action.

exception nasp.nasp_objects.MalformedInputFile(data_file, error_message=None)[source]

Bases: Exception

There was an error parsing the input file. Give the information we can to the user in the stderr stream. Let’s hope they read it.

exception nasp.nasp_objects.ReferenceCallMismatch(old_call, new_call, data_file=None, contig_name=None, position_number=None)[source]

Bases: Exception

At two different points in the run, the reference call appeared, and the two calls disagreed. This probably means the user is analyzing two different sets of data as if they belong together, in error. This is a very bad thing.

class nasp.nasp_objects.ReferenceGenome[source]

Bases: nasp.nasp_objects.Genome

A special type of genome that is to be used as our reference. Unlike other genomes, we know there will only be one, it carries duplicate region data, and we don’t need to store any metadata about it.

get_dups_call(first_position, last_position=None, contig_name=None)[source]
Args:
first_position (int):
last_position (int):
contig_name (str):
Returns:
list:
import_dups_file(dups_filename, contig_prefix='')[source]

Wrapper for _import_dups_line for flexibility and testing.

Args:
dups_filename (str):
contig_prefix (str):
class nasp.nasp_objects.VCFGenome[source]

Bases: nasp.nasp_objects.Genome, nasp.nasp_objects.GenomeMeta

A standard sample for analysis. Has genome data, metadata, and data for the three filters.

get_coverage_pass(current_pos, contig_name=None)[source]
get_proportion_pass(current_pos, contig_name=None)[source]
get_was_called(current_pos, contig_name=None)[source]
set_coverage_pass(pass_value, current_pos, contig_name=None)[source]
set_proportion_pass(pass_value, current_pos, contig_name=None)[source]
set_was_called(pass_value, current_pos, contig_name=None)[source]
class nasp.nasp_objects.VCFRecord(file_path)[source]

Bases: object

VCF parser, object representing an input VCF being read.

fetch_next_record()[source]

We’re done with the current line, let’s move to the next. Populates all the information at the position we’ll need for parsing parts of the next line.

get_contig()[source]
get_coverage(current_sample)[source]
get_position()[source]
get_proportion(current_sample, sample_coverage, is_a_snp)[source]
get_reference_call()[source]
get_sample_call(current_sample)[source]
get_sample_info(current_sample)[source]
get_samples()[source]
nasp.nasp_objects.main()[source]

nasp.vcf_to_matrix module

nasp.vcf_to_matrix.determine_file_type(input_file)[source]

Get the file type from the packed input filename string

nasp.vcf_to_matrix.get_file_path(input_file)[source]

Get the file path from the packed input filename string

nasp.vcf_to_matrix.import_external_fasta(input_file)[source]

Create a FastaGenome object, set its metadata, and populate it with the data from a fasta file. Must return the data as the single item in an array, because other file formats potentially contain several genomes.

nasp.vcf_to_matrix.import_reference(reference, reference_path, dups_path)[source]

Take an empty reference object and populate it with the data from a reference file, and a dups file if any. Does not return anything, as the passed-in object is modified.

nasp.vcf_to_matrix.main()[source]

Main flow control function for the script. Runs the following steps in sequence:
1. Parse arguments
2. Read in reference
3. Read in all query genomes in parallel
4. Write output matrices
5. Write stats files

nasp.vcf_to_matrix.manage_input_thread(reference, min_coverage, min_proportion, input_q, output_q)[source]

Manage one input file worker thread, for reading the data from the file. Input filenames are pulled one at a time from the input queue, and the genome data from the read-in files is placed on the output queue. When an input filename of “None” appears, we know we’re done and put “None” on the output queue so the controlling thread knows we won’t be adding more data.

nasp.vcf_to_matrix.parse_input_files(input_files, num_threads, genomes, min_coverage, min_proportion)[source]

Use a pool of worker threads to, in parallel, read in the input files. Populate the genome collection with the read-in data. This is the “poison pill” thread management algorithm, where threads are each given a “you can stop now” task once the actual queue of tasks is complete.
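
The "poison pill" pattern described above can be sketched with the standard queue and threading modules. The names `worker` and `run_pool` and the `handle` callback are hypothetical, and the sketch assumes `handle` never returns None, since None is reserved as the stop signal on both queues.

```python
import queue
import threading


def worker(input_q, output_q, handle):
    """Process items until a None pill arrives, then report completion."""
    while True:
        item = input_q.get()
        if item is None:
            output_q.put(None)  # tell the controlling thread this worker is done
            return
        output_q.put(handle(item))


def run_pool(items, num_threads, handle):
    """Fan items out to worker threads and collect the results."""
    input_q, output_q = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(input_q, output_q, handle))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        input_q.put(item)
    for _ in threads:
        input_q.put(None)  # one pill per worker, queued after the real tasks
    results, finished = [], 0
    while finished < num_threads:
        result = output_q.get()
        if result is None:
            finished += 1
        else:
            results.append(result)
    for thread in threads:
        thread.join()
    return results
```

Results arrive in completion order, not input order, so a caller that needs a stable order has to sort or index them afterwards.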

nasp.vcf_to_matrix.read_vcf_file(reference, min_coverage, min_proportion, input_file)[source]

Submits the VCF to the VCF parser, populates genome data and filter data from the parsed VCF data, and returns a list of the read-in genomes.

nasp.vcf_to_matrix.set_genome_metadata(genome, input_file)[source]

Get all the metadata from the packed input filename string

nasp.vcf_to_matrix.write_output_matrices(genomes, matrix_folder, matrix_format_choices)[source]

Write matrices from genome collection data. Defines the matrix types for future expansion of custom matrix options. This information eventually should come from the user interface and be included in the XML configuration file, rather than hardcoded here. The matrix_format_choices option is passed in from the XML configuration.

nasp.vcf_to_matrix.write_stats_data(genomes, stats_folder)[source]

Write stats data from genome collection to preset filenames

Module contents