Documentation
Encodings
Atomic Number
dna_parser.atomic_encoding(sequences, pad_type= "after", pad_length= -2, n_jobs= 1)
Function Arguments
- sequences (list of str): list of genomic sequences.
- pad_type (str): pad (or trim) "before" or "after" the sequences.
- pad_length (int): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, any positive number for a fixed length.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Output
Numpy array with shape (number of sequences, length of sequences). Each nucleotide is encoded as its atomic number:
- A= 70
- C= 58
- G= 78
- T/U= 66
- Other characters or gaps = 0
import dna_parser as dps
sequences= ["agt","acc"]
encoding= dps.atomic_encoding(sequences)
print(encoding)
print(encoding.shape)
# Output:
#[[70 78 66]
# [70 58 58]]
#
# (2, 3)
Chaos Game
dna_parser.chaos_encoding(sequences, pad_type= "after", pad_length= -2, n_jobs= 1)
Function Arguments:
- sequences (list of str): list of genomic sequences.
- pad_type (str): pad (or trim) "before" or "after" the sequences.
- pad_length (int): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, any positive number for a fixed length.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Output:
Numpy array with shape (number of sequences, length of sequences, 2).
Each sequence is encoded in a square with vertices A: (1,1), C: (-1,-1),
G: (1,-1), T/U: (-1,1). The sequence representation starts at the center of the square in (0,0). The first nucleotide is represented as a point halfway between the starting point and its corresponding vertice.
Each following nucleotide a new point halfaway between the previous point and its corresponding vertice. If a character other than A,C,G,T or U is encountered, the values are not updated and values from the previous point are used.
import dna_parser as dps
sequences= ["agt","acc"]
encoding= dps.chaos_encoding(sequences)
print(encoding)
print(encoding.shape)
# Output:
# x , y
#[[[ 0.5 0.5 ]
# [ 0.75 -0.25 ]
# [-0.125 0.375]]
#
# [[ 0.5 0.5 ]
# [-0.25 -0.25 ]
# [-0.625 -0.625]]]
#(2, 3, 2)
Cross
dna_parser.cross_encoding(sequences, pad_type= "after", pad_length= -2, n_jobs= 1)
Function Arguments
- sequences (list of str): list of genomic sequences.
- pad_type (str): pad (or trim) "before" or "after" the sequences.
- pad_length (int): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, any positive number for a fixed length.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Output:
Numpy array with shape (number of sequences, length of sequences,2). Each nucleotide is encoded as follows:
- A= [0,-1]
- C= [-1,0]
- G= [1,0]
- T/U= [0,1]
- Other characters or gaps = [0,0]
import dna_parser as dps
sequences= ["agt","acc"]
encoding= dps.cross_encoding(sequences)
print(encoding)
print(encoding.shape)
# Output:
#[[[ 1 1]
# [-1 1]
# [ 1 -1]]
#
# [[ 1 1]
# [-1 -1]
# [-1 -1]]]
#
#(2, 3, 2)
DNA Walk
dna_parser.dna_walk(sequences, pad_type= "after", pad_length= -2, n_jobs= 1)
Function Arguments
- sequences (list of str): list of genomic sequences.
- pad_type (str): pad (or trim) "before" or "after" the sequences.
- pad_length (int): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, any positive number for a fixed length.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Output
Numpy array with shape (number of sequences, length of sequences, 2).
Each sequence is represented on a 2D grid, with its representation starting at coordinates (0,0). For each nucleotide in the sequence coordinates are updated to form a path as follows:
- A: x_{n+1}= x_{n}-1; y_{n+1}= y_{n}
- C: x_{n+1}= x_{n}; y_{n+1}= y_{n}-1
- G: x_{n+1}= x_{n}; y_{n+1}= y_{n}+1
- T/U: x_{n+1}= x+1; y_{n+1}= y_{n}
- Other characters or gaps: x_{n+1}= x_{n}; y_{n+1}= y_{n}
import dna_parser as dps
sequences= ["agt","acc"]
encoding= dps.dna_walk(sequences)
print(encoding)
print(encoding.shape)
# Output:
# x , y
#[[[-1 0]
# [-1 1]
# [ 0 1]]
#
# [[-1 0]
# [-1 -1]
# [-1 -2]]]
#(2, 3, 2)
EIIP
dna_parser.eiip_encoding(sequences, pad_type= "after", pad_length= -2, n_jobs= 1)
Function Arguments
- sequences (list of str): list of genomic sequences.
- pad_type (str): pad (or trim) "before" or "after" the sequences.
- pad_length (int): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, any positive number for a fixed length.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Output
Numpy array with shape (number of sequences, length of sequences). Each nucleotide is encoded as its electron-ion interaction pseudopotential (EIIP):
- A= 0.1260
- C= 0.1340
- G= 0.0806
- T/U= 0.1335
- Other characters or gaps = 0.0
import dna_parser as dps
sequences= ["agt","acc"]
encoding= dps.eiip_encoding(sequences)
print(encoding)
print(encoding.shape)
# Output:
#[[0.126 0.0806 0.1335]
# [0.126 0.134 0.134 ]]
#
# (2, 3)
Fickett Score
dna_parser.fickett_score(sequences, n_jobs= 1)
Function Arguments
- sequences (list of str): list of genomic sequences.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Output
Numpy array with shape (number of sequences).
Compute the probability of each sequence to be a coding sequence. See the About section for more details.
import dna_parser as dps
sequences= ["agt","acc"]
encoding= dps.fickett_score(sequences)
print(encoding)
print(encoding.shape)
# Output:
#[0.3203 0.407 ]
#
#(2,)
Onehot (Voss)
dna_parser.onehot_encoding(sequences, pad_type= "after", pad_length= -2, n_jobs= 1)
Function Arguments
- sequences (list of str): list of genomic sequences.
- pad_type (str): pad (or trim) "before" or "after" the sequences.
- pad_length (int): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, any positive number for a fixed length.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Output
Numpy array with shape (number of sequences, length of sequences, 4). Each nucleotide is encoded as follows:
- C= [1,0,0,0]
- G= [0,1,0,0]
- A= [0,0,1,0]
- T/U= [0,0,0,1]
- Other characters or gaps = [0,0,0,0]
import dna_parser as dps
sequences= ["agt","acc"]
encoding= dps.onehot_encoding(sequences)
print(encoding)
print(encoding.shape)
# Output:
#[[[0 0 1 0]
# [0 1 0 0]
# [0 0 0 1]]
# [[0 0 1 0]
# [1 0 0 0]
# [1 0 0 0]]]
# (2, 3, 4)
Real-number (or PAM)
dna_parser.real_encoding(sequences, pad_type= "after", pad_length= -2, n_jobs= 1)
Function Arguments
- sequences (list of str): list of genomic sequences.
- pad_type (str): pad (or trim) "before" or "after" the sequences.
- pad_length (int): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, any positive number for a fixed length.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Output
Numpy array with shape (number of sequences, length of sequences). Each nucleotide is encoded as follows:
- A= -1.5
- G= -0.5
- C= 0.5
- T/U= 1.5
- Other characters or gaps = 0
import dna_parser as dps
sequences= ["agt","acc"]
encoding= dps.real_encoding(sequences)
print(encoding)
print(encoding.shape)
# Output:
#[-1.5 -0.5 1.5]
#[-1.5 0.5 0.5]]
#
# (2, 3)
TF-IDF
#Class
dna_parser.Tfidf(corpus , kmer, vocabulary= None)
Parameters:
- corpus (list of str): list of genomic sequences.
- kmer (int): length to use to generate kmers in the sequences.
- vocabulary (dict(str:int)): Dictionary mapping each kmers to consider for encoding to a unique integer value.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Methods:
add_to_corpus
Tfidf.add_to_corpus(new_corpus)
- new_corpus: (list of str): list of genomic sequences. Adds sequences to the existing corpus.
fit
Tfidf.fit()
Fits the Tfidf instance. Compiles the vocabulary if it is not provided. Computes the Inverse Document Frequency.
fit_transform
Tfidf.fit_transform(sequences= None, normalization= "L2")
- sequences (list of str or None): list of genomic sequences to transform.
- normalization (str): "L2" for L2 normalization. Anything else results in no normalization.
Fits the Tfidf instance. Compiles the vocabulary if it is not provided. Computes the Inverse Document Frequency and transforms the sequences in their TF-IDF representation. If "sequences= None", transforms the corpus.
set_threads
Tfidf.set_threads(n_jobs)
n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Adjusts the number of threads used to encode the sequences in parallel.
set_vocabulary
Tfidf.set_vocabulary(vocabulary):
- vocabulary: (dict(str:int)) Dictionary mapping each kmers to consider for encoding to a unique integer value.
The integer associated with each kmer needs to be unique, continuous, and start at 0.
#this vocabulary is correct:
{"gtc":0, "atg":1, "acg":2}
#these are not correct:
{"gtc":1, "atg":2, "acg":3}
{"gtc":0, "atg":2, "acg":3}
transform
Tfidf.transform(sequences= None, normalization= "L2")
- sequences: (list of str or None): list of genomic sequences to transform.
- normalization: (str): "L2" for L2 normalization. Anything else results in no normalization.
Transforms the sequences in their TF-IDF representation. If "sequences= None", transforms the corpus. The Tfidf instance needs to be fitted with the fit() or fit_transform() function before calling transform().
Attributes
Tfidf.vocabulary # None or dict(str:int)
Tfidf.corpus # List(str)
Tfidf.kmer_size # Int
Tfidf.idf # None or numpy array
Tfidf.is_idf_uptodate # Bool
Tfidf.n_jobs # Int
Examples
import dna_parser as dps
sequences= ["agtcgc","accgtc"]
tfidf= dps.Tfidf(sequences,2)
tfidf.fit()
encoding= tfidf.transform()
print(encoding)
print(encoding.shape)
# Output:
#<Compressed Sparse Row sparse matrix of dtype 'float64'
# with 2 stored elements and shape (2, 5)>
# Coords Values
# (0, 1) -0.1351550360360548
# (1, 1) -0.1351550360360548
#
# (2, 5)
import dna_parser as dps
sequences= ["attcggagt","attctggga"]
tfidf= dps.Tfidf(sequences,3)
encodings= tfidf.fit_transform()
tfidf.add_to_corpus(["agccgcgga"])
encodings2= tfidf.fit_transform(normalization= None)
print(encodings)
print(encodings2)
# Output:
#<Compressed Sparse Row sparse matrix of dtype 'float64'
# with 6 stored elements and shape (2, 5)>
# Coords Values
# (0, 0) 0.0
# (0, 1) 0.7071067811865476
# (0, 2) 0.7071067811865476
# (1, 0) 0.0
# (1, 3) 0.7071067811865476
# (1, 4) 0.7071067811865476
#<Compressed Sparse Row sparse matrix of dtype 'float64'
# with 7 stored elements and shape (3, 5)>
# Coords Values
# (0, 0) 0.4054651081081644
# (0, 1) 1.0986122886681098
# (0, 2) 1.0986122886681098
# (1, 0) 0.4054651081081644
# (1, 3) 1.0986122886681098
# (1, 4) 0.4054651081081644
# (2, 4) 0.4054651081081644
Z-Curve
dna_parser.zcurve_encoding(sequences, pad_type= "after", pad_length= -2, n_jobs= 1)
Function Arguments
- sequences (list of str): list of genomic sequences.
- pad_type (str): pad (or trim) "before" or "after" the sequences.
- pad_length (int): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, any positive number for a fixed length.
- n_jobs (int): number of threads used to encode the sequences in parallel. 0 to use all CPUs available.
Output
Numpy array with shape (number of sequences, length of sequences, 3).
The sequences are encoded within a cube. At each nucleotide position the Z-curve encoding gives the disparity between purines (r) and pyrimidines (y), the disparity between nucleotides with an amino (m) and a keto (k) group, and the disparity between nucleotide with weak (w) and strong (s) bonds.
import dna_parser as dps
sequences= ["agtc","acc"]
encoding= dps.zcurve_encoding(sequences)
print(encoding)
print(encoding.shape)
# Output:
# r-y m-k w-s
#[[[ 1 1 1]
# [ 2 0 0]
# [ 1 -1 1]
# [ 0 0 0]]
# [[ 1 1 1]
# [ 0 2 0]
# [-1 3 -1]
# [-1 3 -1]]]
#
#(2, 3, 3)
Importing Sequences
Importing Fasta Files
dna_parser.load_fasta(path)
Function Arguments
- path (str or list of str): a path or list of paths of files to import.
Output
A list of tuples containing the metadata and sequences of each entry in the fasta file (metadata, sequence).
import dna_parser as dps
sequences= dps.load_fasta("path/to/fasta/file")
print(sequences)
# Output:
#[('>sequence1', 'acgtatgcgtcgtc'), ('>sequence2', 'cccgtga---gtcgat'), ('>sequence3', 'xgtcgycaaatcg-?')]
Importing Sequences Only
dna_parser.load_sequences(path)
Function Arguments
- path (str or list of str): a path or list of paths of files to import.
Output
A list of str containing the sequences imported from fasta files.
import dna_parser as dps
sequences= dps.load_sequences("tests/seq_test.fasta")
print(sequences)
# Output:
# ['acgtatgcgtcgtc', 'cccgtga---gtcgat', 'xgtcgycaaatcg-?']
Importing Metadata Only
dna_parser.load_metadata(path)
Function Arguments
- path (str or list of str): a path or list of paths of files to import.
Output
A list of str containing the metadata imported from fasta files.
import dna_parser as dps
metadata= dps.load_metadata("tests/seq_test.fasta")
print(metadata)
# Output:
# ['>sequence1', '>sequence2', '>sequence3']
Other Functions
Kmers
dna_parser.make_kmers(seq, k)
Function Arguments
- seq (str): a genomic sequence.
- k (int): a number representing the length of kmers.
Output
A new sequence with white spaces inserted to form kmers of length k.
import dna_parser as dps
kmer_seq= dps.make_kmers("agtcgtgcgtggaagagt", 3)
print(kmer_seq)
# Output:
# 'agt cgt gcg tgg aag agt '
Generating Random Sequences
dna_parser.random_seq(length, nb_of_seq, seq_type= "dna", n_jobs= 1)
Function Arguments
- length (int): length of sequences to generate.
- nb_of_seq (int): number of sequences to generate.
- seq_type (str): type of sequence to generate. either "dna", "rna", or "aa" for amino acid
- n_jobs (int): number of threads used to generate sequences in parallel.
Output
A list of str representing the random sequences generated from a uniform probability distribution.
import dna_parser as dps
sequences= dps.random_seq(15,3)
print(sequences)
# Output:
# ['tagtccaaccacttg', 'gcagtactaaactca', 'caaggccatgaggta']