data

Data preprocessing and transform tools for CNN Virus data.

CNN Virus project data structure

There are many different types of files and datasets for this project. All data are located in directory data, under the project root. The following is an overview of the main types of data and where they sit in the directory tree.

Data directory

Let’s create an instance of ProjectFileSystem to access the data directory and its content.

pfs = ProjectFileSystem()
pfs.info()
Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentorch 
 - Data Dir .... /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev 
 - Notebooks ... /home/vtec/projects/bio/metagentorch/nbs

Note that we have setup this notebook to use the development data directory metagentorch/nbs-dev/data_dev and not the standard metagentorch/data.

In each directory, a readme.md file or another *.md file can be added to provide a description of the directory content.

These readme.md files can be conveniently accessed using the .readme(path) method available with the class ProjectFileSystem (from core module).

pfs.readme()

ReadMe file for directory nbs-dev/data_dev:


Data directory for this package development

This directory includes all data required to validate and test this package code.

data_dev
 |--- CNN_Virus_data
 |     |--- 50mer_ds_100_seq
 |     |--- 150mer_ds_100_seq
 |     |--- train_short
 |     |--- val_short
 |     |--- weight_of_classes
 |--- ncbi
 |     |--- infer_results
 |     |     |--- cnn_virus
 |     |     |--- csv
 |     |     |--- xlsx
 |     |     |--- testdb.db
 |     |--- refsequences
 |     |     |--- cov
 |     |     |     |--cov_virus_sequence_one_metadata.json
 |     |     |     |--sequences_two_no_matching_rule.fa
 |     |     |     |--another_sequence.fa
 |     |     |     |--cov_virus_sequences_two.fa
 |     |     |     |--cov_virus_sequences_two_metadata.json
 |     |     |     |--cov_virus_sequence_one.fa
 |     |     |     |--single_1seq_150bp
 |     |     |     |    |--single_1seq_150bp.fq
 |     |     |     |    |--single_1seq_150bp.aln
 |     |     |     |--paired_1seq_150bp
 |     |     |     |    |--paired_1seq_150bp2.aln
 |     |     |     |    |--paired_1seq_150bp2.fq
 |     |     |     |    |--paired_1seq_150bp1.fq 
 |     |     |     |    |--paired_1seq_150bp1.aln 
 |     |--- simreads
 |     |     |--- cov
 |     |     |     |--- paired_1seq_50bp
 |     |     |     |      |--- paired_1seq_50bp_1.aln
 |     |     |     |      |--- paired_1seq_50bp_1.fq
 |     |     |     |--- single_1seq_50bp
 |     |     |     |      |--- single_1seq_50bp_1.aln
 |     |     |     |      |--- single_1seq_50bp_1.fq
 |     |     |--- cov
 |     |     |     |--single_1seq_50bp
 |     |     |     |    |--single_1seq_50bp.aln
 |     |     |     |    |--single_1seq_50bp.fq
 |     |     |     |--single_1seq_150bp
 |     |     |     |    |--single_1seq_150bp.fq
 |     |     |     |    |--single_1seq_150bp.aln
 |     |     |     |--paired_1seq_150bp
 |     |     |     |    |--paired_1seq_150bp2.aln
 |     |     |     |    |--paired_1seq_150bp2.fq
 |     |     |     |    |--paired_1seq_150bp1.fq
 |     |     |     |    |--paired_1seq_150bp1.aln
 |--- saved           
 |--- readme.md               

Original datasets

The CNN Virus team provides training and validation/test datasets for their model. These datasets and the pretrained model parameters are are available on their Google drive shared directory here. A set of shortened datasets are also available in the data-dev development data directory for testing the code.

pfs.readme(pfs.data/'CNN_Virus_data')

ReadMe file for directory nbs-dev/data_dev/CNN_Virus_data:


CNN Virus data (development directory version)

This directory includes a set of short data files used to test train and inference with the CNN Virus.

File list and description:

50-mer

50-mer reads and their labels, in text format with one line per sample. Each line consists of three components, separated by tabs: the 50-mer read or sequence, the virus species label and the position label:

'TTACNAGCTCCAGTCTAAGATTGTAACTGGCCTTTTTAAAGATTGCTCTA    94    5\n'

Files:

  • 50mer_ds_100_seq: small dataset with 100 reads
  • 5train_short: small 1000-read subset from the original training dataset for experiments
  • val_short: small 500-read subset from the original validation dataset for experiments
150-mer

150-mer reads and their labels in text format in a similar format as above:

'TTCTTTCACCACCACAACCAGTCGGCCGTGGAGAGGCGTCGCCGCGTCTCGTTCGTCGAGGCCGATCGACTGCCGCATGAGAGCGGGTGGTATTCTTCCGAAGACGACGGAGACCGGGACGGTGATGAGGAAACTGGAGAGAGCCACAAC    6    0\n'

Files:

  • 150mer_ds_100_reads: small subset of 100 reads from original ICTV_150mer_benchmarking file
Other files:
  • virus_name_mapping: mapping between virus species and their numerical label
  • weight_of_classes: weights for each virus species class in the training dataset


source

OriginalLabels

 OriginalLabels (p2mapping:pathlib.Path|None=None)

Converts between labels and species name as per original training dataset

Type Default Details
p2mapping pathlib.Path | None None Path to the mapping file. Uses virus_name_mapping by default

Original data include 187 viruses, with label from 0 to 186.

With the class method .label2species(n) and .species2label(species) we can convert between the label and the species name.

species = OriginalLabels()
for n in [0, 94, 117, 118]:
    print(f"{n:3d} -> {species.label2species(n)}")
  0 -> Variola_virus
 94 -> Middle_East_respiratory_syndrome-related_coronavirus
117 -> Severe_acute_respiratory_syndrome-related_coronavirus
118 -> Yellow_fever_virus
for s in ['Variola_virus', 'Yellow_fever_virus']:
    print(f"{s:20s} -> {species.species2label(s)}")
Variola_virus        -> 0
Yellow_fever_virus   -> 118

When looking for a numerical specie label it is often more convenient to use a partial name, and the method .search(species) because we do not need to know the full specie name.


source

OriginalLabels.search

 OriginalLabels.search (s:str)

Prints all species whose name contains the passed string, with their numerical label

Type Details
s str string to search through all original virus species
species.search('fever')
Sandfly_fever_Naples_phlebovirus. Label: 35
Crimean-Congo_hemorrhagic_fever_orthonairovirus. Label: 76
Yellow_fever_virus. Label: 118
Rift_Valley_fever_phlebovirus. Label: 156

NCBI reference sequences

We simulate reads using many reference sequences from the NCBI GenBank database. We group all reference sequences as well as all reads simulated from these reference sequences under the ncbi directory.

pfs = ProjectFileSystem()
pfs.readme(pfs.data / 'ncbi')

ReadMe file for directory nbs-dev/data_dev/ncbi:


NCBI Data

This directory includes all data related to the work done with reference sequences from NCBI.

The data is organized in the following subfolders:

  • refsequences: reference CoV sequences downloaded from NCBI, and related metadata
  • simreads: all data from simulated reads, using ART Illumina simulator and the reference sequences
  • infer_results: results from the inference using models with the simulated reads
  • ds: datasets in proper format for training or inference/prediction using the CNN Virus model

pfs.readme(pfs.data / 'ncbi/refsequences')

ReadMe file for directory nbs-dev/data_dev/ncbi/refsequences:

No markdown file in this folder
pfs.readme(pfs.data / 'ncbi/simreads')

ReadMe file for directory nbs-dev/data_dev/ncbi/simreads:


NCBI simulated reads

This directory includes all sets of simulated read sequence files generated from NCBI viral sequences using ARC Illumina.

this-directory
    |--cov
    |    |
    |    |--single_10seq_50bp
    |    |    |--single_10seq_50bp.fq
    |    |    |--single_10seq_50bp.alnEnd
    |    |-- ...
    |    |--single_100seq_150bp
    |    |    |--single_100seq_150bp.fq
    |    |    |--single_100seq_150bp.aln
    |    |--paired_100seq_50bp
    |    |    |--paired_100seq_50bp2.aln
    |    |    |--paired_100seq_50bp1.aln
    |    |    |--paired_100seq_50bp2.fq
    |    |    |--paired_100seq_50bp1.fq
    |    |-- ...
    |    |
    |---yf
    |    |
    |    |--yf_AY968064-single-150bp
    |    |    |--yf_AY968064-single-1seq-150bp.fq
    |    |    |--yf_AY968064-single-1seq-150bp.aln
    |    |
    |--mRhiFer1
    |    |--mRhiFer1_v1.p.dna_rm.primary_assembly.1
    |    |    |--mRhiFer1_v1.p.dna_rm.primary_assembly.1.fq
    |    |    |--mRhiFer1_v1.p.dna_rm.primary_assembly.1.aln
    |    |

This directory includes several subdirectories, each for one virus, e.g. cov for corona, yf for yellow fever.

In each virus subdirectory, several simreads directory includes simulated reads with various parameters, named as <method>_<nb-seq>_<nb-bp> where” - <method> is either single or paired depending on the simulation method - <nb-seq> is the number of reference sequences used for simulation, and refers to the fa file used - <nb-bp> is the number of base pairs used to simulate reads

Each sub-directory includes simreads files made using a simulation method and a specific number of reference sequences. - xxx.fq and xxx.aln files when method is single - xxx1.fq, xxx2.fq, xxx1.aln and xxx2.aln files when method is paired.

Example: - paired_10seq_50bp means that the simreads were generated by using the paired method to simulate 50-bp reads, and using the fa file cov_virus_sequences_010-seqs.fa. - single_100seq_50bp means that the simreads were generated by using the single method to simulate 50-bp reads, and using the fa file cov_virus_sequences_100-seqs.fa. Note that this generated 20,660,104 reads !

Simread file formats

Simulated reads information is split between two files: - FASTQ (.fq) files providing the read sequences and their ASCII quality scores - ALN (.aln) files with alignment information

FASTQ (.fq)

FASTQ files generated by ART Illumina have the following structure (showing 5 reads), with 4 lines for each read:

@2591237:ncbi:1-60400
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
+
CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG
@2591237:ncbi:1-60399
GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC
+
BCBCCFGGGGGGGG1CGGGG<GGBGGGGGFGCGGGGGGDGGG/GG1GGGG
@2591237:ncbi:1-60398
ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC
+
CCCCCGGGEGG1GGF1G/GGEGGGGGGGGGGGGFFGGGGGGGGGGDGGDG
@2591237:ncbi:1-60397
CGTAAAGTAGAGGCTGTATGGTAGCTAGCACAAATGCCAGCACCAATAGG
+
BCCCCGGGFGGGGGGFGGGGFGG1GGGGGGG>GG1GGGGGGGGGGE<GGG
@2591237:ncbi:1-60396
GGTATCGGGTATCTCCTGCATCAATGCAAGGTCTTACAAAGATAAATACT
+
CBCCCGGG@CGGGGGGGGGGGG=GFGGGGDGGGFG1GGGGGGGG@GGGGG

The following information can be parsed from the each read sequence in the FASTQ file:

  • Line 1: readid, a unique ID for the read, under for format @readid
  • Line 2: readseq, the sequence of the read
  • Line 3: a separator +
  • Line 4: read_qscores, the base quality scores encoded in ASCII

Example:

@2591237:ncbi:1-60400
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
+
CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG
  • readid = 2591237:ncbi:1-60400
  • readseq = ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG, a 50 bp read
  • read_qscores = CCCBCGFGBGGGGGGGBGGGGGGGGG>GGG1G=/GGGGGGGGGGGGGGGG

ALN (.aln)

ALN files generated by ART Illumina consist of : - a header with the ART-Ilumina command used for the simulation (@CM) and info on each of the reference sequences used for the simulations (@SQ). Header always starts with ##ART_Illumina and ends with ##Header End : - the body with 3 lines for each read: 1. definition line with readid, - reference sequence identification number refseqid, - the position in the read in the reference sequence aln_start_pos - the strand the read was taken from ref_seq_strand. + for coding strand and - for template strand 2. aligned reference sequence, that is the sequence segment in the original reference corresponding to the read 3. aligned read sequence, that is the simmulated read sequence, where each bp corresponds to the reference sequence bp in the same position.

Example of a ALN file generated by ART Illumina (showing 5 reads):

##ART_Illumina    read_length    50
@CM    /bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/cov_data/cov_virus_sequences_ten.fa -ss HS25 -l 50 -f 100 -o /home/vtec/projects/bio/metagentools/data/cov_simreads/single_10seq_50bp/single_10seq_50bp -rs 1674660835
@SQ    2591237:ncbi:1 1   MK211378    2591237    ncbi    1     Coronavirus BtRs-BetaCoV/YN2018D    30213
@SQ    11128:ncbi:2   2   LC494191    11128    ncbi    2     Bovine coronavirus    30942
@SQ    31631:ncbi:3   3   KY967361    31631    ncbi    3     Human coronavirus OC43        30661
@SQ    277944:ncbi:4  4   LC654455    277944    ncbi    4     Human coronavirus NL63    27516
@SQ    11120:ncbi:5   5   MN987231    11120    ncbi    5     Infectious bronchitis virus    27617
@SQ    28295:ncbi:6   6   KU893866    28295    ncbi    6     Porcine epidemic diarrhea virus    28043
@SQ    28295:ncbi:7   7   KJ645638    28295    ncbi    7     Porcine epidemic diarrhea virus    27998
@SQ    28295:ncbi:8   8   KJ645678    28295    ncbi    8     Porcine epidemic diarrhea virus    27998
@SQ    28295:ncbi:9   9   KR873434    28295    ncbi    9     Porcine epidemic diarrhea virus    28038
@SQ    1699095:ncbi:10 10  KT368904    1699095    ncbi    10     Camel alphacoronavirus    27395
##Header End
>2591237:ncbi:1    2591237:ncbi:1-60400    14770    +
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
ACAACTCCTATTCGTAGTTGAAGTTGTTGACAAATACTTTGATTGTTACG
>2591237:ncbi:1    2591237:ncbi:1-60399    17012    -
GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC
GATCAATGTGGCATCTACAATACAGACAGCATGAAGCACCACCAAAGGAC
>2591237:ncbi:1    2591237:ncbi:1-60398    9188    +
ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC
ATCTACCAGTGGTAGATGGGTTCTTAATAATGAACATTATAGAGCTCTAC
.....

Parsing sequence files

The following classes make it easier to read and parse files of different formats into their underlying components to generated the training, validation, testing and inference datasets for the model.

Each class inherits from TextFileBaseReader and adds:

  • One or several text parsing method(s) to parse metadata according to a specific format
  • A file parsing method to extract metadata from all elements in the file, returning it as a key:value dictionary and optionally save the metadata as a json file.

FASTA file

Extension of TextFileBaseReader class for fasta sequence files.

Structure of a FASTA sequence file:

>definition line - format varies from dataset to dataset
sequence line: sequence of bases

Example for the NCBI datasets:

>seqid accession taxonomyid source seqnb organism
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT ...
>2591237:ncbi:1 MK211378    2591237 ncbi    1 Coronavirus BtRs-BetaCoV/YN2018D
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT ...
p2fasta = pfs.data / 'ncbi/refsequences/cov/cov_virus_sequences_two.fa'

fasta = TextFileBaseReader(p2fasta, nlines=1)
for i, t in enumerate(fasta):
    txt = t.replace('\n', '')[:80] + ' ...'
    print(f"{txt}")
>2591237:ncbi:1 1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018D ...
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT ...
>11128:ncbi:2   2   LC494191    11128   ncbi    Bovine coronavirus ...
CATCCCGCTTCACTGATCTCTTGTTAGATCTTTTCATAATCTAAACTTTATAAAAACATCCACTCCCTGTAGTCTATGCC ...

source

FastaFileReader

 FastaFileReader (path:str|pathlib.Path)

Wrap a FASTA file and retrieve its content in raw format and parsed format

Type Details
path str | pathlib.Path path to the Fasta file

As an iterator, FastaFileReader returns a dict at each step, as follows:

{
    'definition line': 'string in file as the definition line for the sequence',
    'sequence': 'the full sequence'
}

Illustration:

p2fasta = pfs.data / 'ncbi/refsequences/cov/cov_virus_sequences_two.fa'
fasta = FastaFileReader(p2fasta)
iteration_output = next(fasta)

print(iteration_output['definition line'][:80], '...')
print(iteration_output['sequence'][:80], '...')
>2591237:ncbi:1 1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018D ...
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT ...
print(f"output type :     {type(iteration_output)}")
print(f"keys :            {iteration_output.keys()}")
print(f"definition line : {iteration_output['definition line'][:80]} ...'")
print(f"sequence :       '{iteration_output['sequence'][:100]} ...'")
output type :     <class 'dict'>
keys :            dict_keys(['definition line', 'sequence'])
definition line : >2591237:ncbi:1   1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018D ...'
sequence :       'TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGC ...'

The definition line is a string, with tab separated values.

display(iteration_output['definition line'])
'>2591237:ncbi:1\t1\tMK211378\t2591237\tncbi\tCoronavirus BtRs-BetaCoV/YN2018D'

source

FastaFileReader.review

 FastaFileReader.review ()

Prints the first and last sequences and metadata in the fasta file and returns the nb or sequences

nb_seqs = fasta.review()
nb_seqs
There are 2 sequences in this file

First Sequence:
>2591237:ncbi:1 1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018D
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT ...
{'seqid': '2591237:ncbi:1', 'taxonomyid': '2591237', 'source': 'ncbi', 'seqnb': '1', 'accession': 'MK211378', 'organism': 'Coronavirus BtRs-BetaCoV/YN2018D'}

Last Sequence:
>11128:ncbi:2   2   LC494191    11128   ncbi    Bovine coronavirus
CATCCCGCTTCACTGATCTCTTGTTAGATCTTTTCATAATCTAAACTTTATAAAAACATCCACTCCCTGTAGTCTATGCC ...
{'seqid': '11128:ncbi:2', 'taxonomyid': '11128', 'source': 'ncbi', 'seqnb': '2', 'accession': 'LC494191', 'organism': 'Bovine coronavirus'}
2

source

FastaFileReader.print_first_chunks

 FastaFileReader.print_first_chunks (nchunks:int=3)

Print the first nchunks chunks of text from the file

Type Default Details
nchunks int 3 number of chunks to print out

This is convenient to quickly discover and explore new fasta files in raw text format:

fasta = FastaFileReader(p2fasta)
fasta.print_first_chunks(nchunks=2)

Sequence 1:
>2591237:ncbi:1 1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018D
TATTAGGTTTTCTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT ...

Sequence 2:
>11128:ncbi:2   2   LC494191    11128   ncbi    Bovine coronavirus
CATCCCGCTTCACTGATCTCTTGTTAGATCTTTTCATAATCTAAACTTTATAAAAACATCCACTCCCTGTAGTCTATGCC ...

Parsing metadata

The class also provides methods to parse metadata from the file content (definition line, headers, …).

A regex pattern is used for parsing metadata fom the definition lines in the reference sequence fasta file.

Below, we parse the data from the definition line of our Corona virus NCBI dataset (rule fasta_ncbi_std):

Sequence 1:

  • Definition Line:
>2591237:ncbi:1 [MK211378]  2591237 ncbi    1 [MK211378] 2591237    Coronavirus YN2018D     scientific name
  • Metadata:
    • seqid = 2591237:ncbi:1
    • taxonomyid = 2591237
    • source = ncbi
    • seqnb = 1
    • accession = MK211378
    • species = Coronavirus BtRs-BetaCoV/YN2018D

Sequence 2:

  • Definition Line
    >11128:ncbi:2 [LC494191]
  • Metadata:
    • seqid = 11128:ncbi:2
    • taxonomyid = 11128
    • source = ncbi
    • seqnb = 2
    • accession = LC494191
    • species = ''

FastaFileReader offers: - parse_text a method to parse the metadata - an option to set a default “parsing rule” for one instance with set_parsing_rules. - parse_file a method to parse the metadata from all sequences in the file and save it as a json file.


source

TextFileBaseReader.parse_text

 TextFileBaseReader.parse_text (txt:str, pattern:str|None=None)

Parse text using regex pattern with groups. Return a metadata dictionary.

Type Default Details
txt str text to parse
pattern str | None None If None, uses standard regex pattern to extract metadata, otherwise, uses passed regex
Returns dict parsed metadata in key/value format

Running the parser function with specifically defined pattern and keys.

fasta = FastaFileReader(p2fasta)
dfn_line, sequence = next(fasta).values()
print(dfn_line.replace('\n', ''))
>2591237:ncbi:1 1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018D
pattern = r"^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))[\s\t]*(?P=seqnb)[\s\t](?P<accession>[\w\d]*)([\s\t]*(?P=taxonomyid)[\s\t]*(?P=source)[\s\t][\s\t]*(?P<organism>[\w\s\-\_\/]*))?"
fasta.parse_text(dfn_line, pattern=pattern)
{'seqid': '2591237:ncbi:1',
 'taxonomyid': '2591237',
 'source': 'ncbi',
 'seqnb': '1',
 'accession': 'MK211378',
 'organism': 'Coronavirus BtRs-BetaCoV/YN2018D'}

When a FastaFileReader instance is created, all existing rules in the file default_parsing_rules.json are tested on the first definition line of the fasta file and the one rule that parses the most matches will be selected automatically and saved in instance attributes re_rule_name, re_pattern and re_keys.

parse_file extract metadata from each definition line in the fasta file and return a dictionary with all metadata.

print(fasta.re_rule_name)
print(fasta.re_pattern)
print(fasta.re_keys)
fasta_ncbi_std
^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))[\s\t]*(?P=seqnb)[\s\t](?P<accession>[\w\d]*)([\s\t]*(?P=taxonomyid)[\s\t]*(?P=source)[\s\t][\s\t]*(?P<organism>[\w\s\-\_/]*))?
['seqid', 'taxonomyid', 'source', 'seqnb', 'accession', 'organism']
fasta.parse_text(dfn_line)
{'seqid': '2591237:ncbi:1',
 'taxonomyid': '2591237',
 'source': 'ncbi',
 'seqnb': '1',
 'accession': 'MK211378',
 'organism': 'Coronavirus BtRs-BetaCoV/YN2018D'}

When another fasta file, which has another definition line structure, is used, another parsing rule is selected.

p2other = pfs.data / 'ncbi/refsequences/cov/another_sequence.fa'
assert p2other.is_file()

it2 = FastaFileReader(path=p2other)

dfn_line, sequence = next(it2).values()
print(dfn_line.replace('\n', ''))
>1 dna_rm:primary_assembly primary_assembly:mRhiFer1_v1.p:1:1:124933378:1 REF
print(it2.re_rule_name)
print(it2.re_pattern)
print(it2.re_keys)
fasta_rhinolophus_ferrumequinum
^>\d[\s\t](?P<seq_type>dna_rm):(?P<id_type>[\w\_]*)[\s\w](?P=id_type):(?P<assy>[\w\d\_]*)\.(?P<seq_level>[\w]*):\d*:\d*:(?P<taxonomy>\d*):(?P<id>\d*)[\s    ]REF$
['seq_type', 'id_type', 'assy', 'seq_level', 'taxonomy', 'id']
pprint(it2.parse_text(dfn_line))
{'assy': 'mRhiFer1_v1',
 'id': '1',
 'id_type': 'primary_assembly',
 'seq_level': 'p',
 'seq_type': 'dna_rm',
 'taxonomy': '124933378'}

This rule selection is performed by the class method set_parsing_rule. The method can also be called with specific pattern and keys to force parsing rule not yet saved in the json file.


source

TextFileBaseReader.set_parsing_rules

 TextFileBaseReader.set_parsing_rules (pattern:str|None=None,
                                       verbose:bool=False)

*Set the standard regex parsing rule for the file.

Rules can be set:

  1. manually by passing specific custom values for pattern and keys
  2. automatically, by testing all parsing rules saved in parsing_rule.json

Automatic selection of parsing rules works by testing each rule saved in parsing_rule.json on the first definition line of the fasta file, and selecting the one rule that generates the most metadata matches.

Rules consists of two parameters:

  • The regex pattern including one group for each metadata item, e.g (?P<group_name>regex_code)
  • The list of keys, i.e. the list with the name of each regex groups, used as key in the metadata dictionary

This method updates the three following class attributes: re_rule_name, re_pattern, re_keys*

Type Default Details
pattern str | None None regex pattern to apply to parse the text, search in parsing rules json if None
verbose bool False when True, provides information on each rule
Returns None
fasta = FastaFileReader(p2fasta)
dfn_line, sequence = next(fasta).values()
print(f"definition line: '{dfn_line[:-1]}'")
definition line: '>2591237:ncbi:1   1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018'

Automatic parsing works by testing each saved rule for the value of definition line in the first sequence in the fasta file.

print(f"key for text to parse: {fasta.text_to_parse_key}\n")
fasta.reset_iterator()
print('Text to parse for testing (extracted from first iteration):')
print(next(fasta)[fasta.text_to_parse_key])
print()
fasta.set_parsing_rules(verbose=True)
key for text to parse: definition line

Text to parse for testing (extracted from first iteration):
>2591237:ncbi:1 1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018D

--------------------------------------------------------------------------------
6 matches generated with rule <fasta_ncbi_std> 
--------------------------------------------------------------------------------
^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))[\s\t]*(?P=seqnb)[\s\t](?P<accession>[\w\d]*)([\s\t]*(?P=taxonomyid)[\s\t]*(?P=source)[\s\t][\s\t]*(?P<organism>[\w\s\-\_/]*))?
['seqid', 'taxonomyid', 'source', 'seqnb', 'accession', 'organism']
{'seqid': '2591237:ncbi:1', 'taxonomyid': '2591237', 'source': 'ncbi', 'seqnb': '1', 'accession': 'MK211378', 'organism': 'Coronavirus BtRs-BetaCoV/YN2018D'}
--------------------------------------------------------------------------------
0 matches generated with rule <fastq_art_illumina_ncbi_std> 
--------------------------------------------------------------------------------
^@(?P<readid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*)-(?P<readnb>\d*(\/\d)?))$
['readid', 'reftaxonomyid', 'refsource', 'refseqnb', 'readnb']
{'readid': '', 'reftaxonomyid': '', 'refsource': '', 'refseqnb': '', 'readnb': ''}
--------------------------------------------------------------------------------
0 matches generated with rule <aln_art_illumina_ncbi_std> 
--------------------------------------------------------------------------------
^>(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))(\s| )*(?P<readid>(?P=reftaxonomyid):(?P=refsource):(?P=refseqnb)-(?P<readnb>\d*(\/\d(-\d)?)?))(\s|\t)(?P<aln_start_pos>\d*)(\s|\t)(?P<refseq_strand>(-|\+))$
['refseqid', 'reftaxonomyid', 'refsource', 'refseqnb', 'readid', 'readnb', 'aln_start_pos', 'refseq_strand']
{'refseqid': '', 'reftaxonomyid': '', 'refsource': '', 'refseqnb': '', 'readid': '', 'readnb': '', 'aln_start_pos': '', 'refseq_strand': ''}
--------------------------------------------------------------------------------
0 matches generated with rule <aln_art_illumina-refseq-ncbi-std> 
--------------------------------------------------------------------------------
^@SQ[\t\s]*(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))[\t\s]*(?P=refseqnb)[\t\s]*(?P<refseq_accession>[\w\d]*)[\t\s]*(?P=reftaxonomyid)[\t\s]*(?P=refsource)[\t\s](?P<organism>.*)[\t\s](?P<refseq_length>\d*)$
['refseqid', 'reftaxonomyid', 'refsource', 'refseqnb', 'refseq_accession', 'organism', 'refseq_length']
{'refseqid': '', 'reftaxonomyid': '', 'refsource': '', 'refseqnb': '', 'refseq_accession': '', 'organism': '', 'refseq_length': ''}
--------------------------------------------------------------------------------
0 matches generated with rule <fasta_ncbi_cov> 
--------------------------------------------------------------------------------
^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))[\s\t]*\[(?P<accession>[\w\d]*)\]([\s\t]*(?P=taxonomyid)[\s\t]*(?P=source)[\s\t]*(?P=seqnb)[\s\t]*\[(?P=accession)\][\s\t]*(?P=taxonomyid)[\s\t]*(?P<organism>[\w\s\-\_\/]*))?
['seqid', 'taxonomyid', 'source', 'seqnb', 'accession', 'organism']
{'seqid': '', 'taxonomyid': '', 'source': '', 'seqnb': '', 'accession': '', 'organism': ''}
--------------------------------------------------------------------------------
0 matches generated with rule <fasta_rhinolophus_ferrumequinum> 
--------------------------------------------------------------------------------
^>\d[\s\t](?P<seq_type>dna_rm):(?P<id_type>[\w\_]*)[\s\w](?P=id_type):(?P<assy>[\w\d\_]*)\.(?P<seq_level>[\w]*):\d*:\d*:(?P<taxonomy>\d*):(?P<id>\d*)[\s    ]REF$
['seq_type', 'id_type', 'assy', 'seq_level', 'taxonomy', 'id']
{'seq_type': '', 'id_type': '', 'assy': '', 'seq_level': '', 'taxonomy': '', 'id': ''}
--------------------------------------------------------------------------------
Selected rule with most matches: fasta_ncbi_std

If no saved rule generates a match, re_rule_name, re_pattern and re_keys remain None and a warning message is issued to ask user to add a parsing rule manually.

p2nomatch = pfs.data / 'ncbi/refsequences/cov/sequences_two_no_matching_rule.fa'
fasta2 = FastaFileReader(p2nomatch)
/home/vtec/projects/bio/metagentorch/metagentorch/core.py:651: UserWarning: 
        None of the saved parsing rules were able to extract metadata from the first line in this file.
        You must set a custom rule (pattern + keys) before parsing text, by using:
            `self.set_parsing_rules(custom_pattern)`
                
  warnings.warn(msg, category=UserWarning)
fasta2.re_rule_name is None
True

But we still can set a standard rule manually, by passing a re pattern and the corresponding list of keys.

pat = r"^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))\s*(?P<text>[\w\s]*)$"
keys = "seqid taxonomyid source seqnb text".split()
fasta2.set_parsing_rules(pattern=pat)

print(fasta2.re_rule_name)
print(fasta2.re_pattern)
print(fasta2.re_keys)
Custom Rule
^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))\s*(?P<text>[\w\s]*)$
['seqid', 'taxonomyid', 'source', 'seqnb', 'text']
fasta2.reset_iterator()
dfn_line, sequence = next(fasta2).values()
print(f"definition line: '{dfn_line[:-1]}'")
fasta2.parse_text(dfn_line)
definition line: '>2591237:ncbi:1 this sequence does not match any saved parsing rul'
{'seqid': '2591237:ncbi:1',
 'taxonomyid': '2591237',
 'source': 'ncbi',
 'seqnb': '1',
 'text': 'this sequence does not match any saved parsing rule'}

source

FastaFileReader.parse_file

 FastaFileReader.parse_file (add_seq:bool=False, save_json:bool=False)

Read fasta file and return a dictionary with definition line metadata and optionally sequences

Type Default Details
add_seq bool False When True, add the full sequence to the parsed metadata dictionary
save_json bool False When True, save the file metadata as a json file of same stem name
Returns dict Metadata as Key/Values pairs
fasta = FastaFileReader(p2fasta)
pprint(fasta.parse_file())
{'11128:ncbi:2': {'accession': 'LC494191',
                  'organism': 'Bovine coronavirus',
                  'seqid': '11128:ncbi:2',
                  'seqnb': '2',
                  'source': 'ncbi',
                  'taxonomyid': '11128'},
 '2591237:ncbi:1': {'accession': 'MK211378',
                    'organism': 'Coronavirus BtRs-BetaCoV/YN2018D',
                    'seqid': '2591237:ncbi:1',
                    'seqnb': '1',
                    'source': 'ncbi',
                    'taxonomyid': '2591237'}}
fasta.parse_file(save_json=True);
Metadata for 'cov_virus_sequences_two.fa'> saved as <cov_virus_sequences_two_metadata.json> in  
/home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/ncbi/refsequences/cov
with open('../default_parsing_rules.json', 'r') as fp:
    pprint(json.load(fp), width=20)
{'aln_art_illumina-refseq-ncbi-std': {'keys': 'refseqid '
                                              'reftaxonomyid '
                                              'refsource '
                                              'refseqnb '
                                              'refseq_accession '
                                              'organism '
                                              'refseq_length',
                                      'pattern': '^@SQ[\\t\\s]*(?P<refseqid>(?P<reftaxonomyid>\\d*):(?P<refsource>\\w*):(?P<refseqnb>\\d*))[\\t\\s]*(?P=refseqnb)[\\t\\s]*(?P<refseq_accession>[\\w\\d]*)[\\t\\s]*(?P=reftaxonomyid)[\\t\\s]*(?P=refsource)[\\t\\s](?P<organism>.*)[\\t\\s](?P<refseq_length>\\d*)$'},
 'aln_art_illumina_ncbi_std': {'keys': 'refseqid '
                                       'reftaxonomyid '
                                       'refsource '
                                       'refseqnb '
                                       'readid '
                                       'readnb '
                                       'aln_start_pos '
                                       'refseq_strand',
                               'pattern': '^>(?P<refseqid>(?P<reftaxonomyid>\\d*):(?P<refsource>\\w*):(?P<refseqnb>\\d*))(\\s|\t'
                                          ')*(?P<readid>(?P=reftaxonomyid):(?P=refsource):(?P=refseqnb)-(?P<readnb>\\d*(\\/\\d(-\\d)?)?))(\\s|\\t)(?P<aln_start_pos>\\d*)(\\s|\\t)(?P<refseq_strand>(-|\\+))$'},
 'fasta_ncbi_cov': {'keys': 'seqid '
                            'taxonomyid '
                            'source '
                            'accession '
                            'seqnb '
                            'organism',
                    'pattern': '^>(?P<seqid>(?P<taxonomyid>\\d+):(?P<source>ncbi):(?P<seqnb>\\d*))[\\s\\t]*\\[(?P<accession>[\\w\\d]*)\\]([\\s\\t]*(?P=taxonomyid)[\\s\\t]*(?P=source)[\\s\\t]*(?P=seqnb)[\\s\\t]*\\[(?P=accession)\\][\\s\\t]*(?P=taxonomyid)[\\s\\t]*(?P<organism>[\\w\\s\\-\\_\\/]*))?'},
 'fasta_ncbi_std': {'keys': 'seqid '
                            'taxonomyid '
                            'source '
                            'accession '
                            'seqnb '
                            'organism',
                    'pattern': '^>(?P<seqid>(?P<taxonomyid>\\d+):(?P<source>ncbi):(?P<seqnb>\\d*))[\\s\\t]*(?P=seqnb)[\\s\\t](?P<accession>[\\w\\d]*)([\\s\\t]*(?P=taxonomyid)[\\s\\t]*(?P=source)[\\s\\t][\\s\\t]*(?P<organism>[\\w\\s\\-\\_/]*))?'},
 'fasta_rhinolophus_ferrumequinum': {'keys': 'seq_type '
                                             'id_type '
                                             'assy '
                                             'seq_level '
                                             'taxonomy '
                                             'id',
                                     'pattern': '^>\\d[\\s\\t](?P<seq_type>dna_rm):(?P<id_type>[\\w\\_]*)[\\s\\w](?P=id_type):(?P<assy>[\\w\\d\\_]*)\\.(?P<seq_level>[\\w]*):\\d*:\\d*:(?P<taxonomy>\\d*):(?P<id>\\d*)[\\s\t'
                                                ']REF$'},
 'fastq_art_illumina_ncbi_std': {'keys': 'readid '
                                         'reftaxonomyid '
                                         'refsource '
                                         'refseqnb '
                                         'readnb',
                                 'pattern': '^@(?P<readid>(?P<reftaxonomyid>\\d*):(?P<refsource>\\w*):(?P<refseqnb>\\d*)-(?P<readnb>\\d*(\\/\\d)?))$'}}
p2fasta = pfs.data / 'ncbi/refsequences/cov/cov_virus_sequence_one.fa'
fasta = FastaFileReader(p2fasta)
fasta_meta = fasta.parse_file(save_json=True)
pprint(fasta_meta)
Metadata for 'cov_virus_sequence_one.fa'> saved as <cov_virus_sequence_one_metadata.json> in  
/home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/ncbi/refsequences/cov

{'2591237:ncbi:1': {'accession': 'MK211378',
                    'organism': 'Coronavirus BtRs-BetaCoV/YN2018D',
                    'seqid': '2591237:ncbi:1',
                    'seqnb': '1',
                    'source': 'ncbi',
                    'taxonomyid': '2591237'}}

FASTQ file

Extension of TextFileBaseReader class for fastq sequence files.

Structure of a FASTQ sequence file:

p2fastq = pfs.data / 'ncbi/simreads/cov/single_1seq_150bp/single_1seq_150bp.fq'

fastq = TextFileBaseReader(p2fastq, nlines=1)
for i, t in enumerate(fastq):
    txt = t.replace('\n', '')[:80]
    print(f"{txt}")
    if i >= 11: break
@2591237:ncbi:1-40200
TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAAC
+
CCCGGGCGGGGGCJGJJJGJJGJJJGJGGJGJJJGJGGGGGGGGCJGJGGGGGJJJJGCCGGGGGJCGCGJGJCG=GGGG
@2591237:ncbi:1-40199
TCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTCTATTCTTTGCACTAATGGCGTACTTAAGAGTCATTTGA
+
=CCGGGGCGGGGGJJGJJGJGJG=GJJGJCGJJJCJ=JJJJGGJJCJGJGG=JGC1JJGG8GCJCGGGCGG(GCGGCGC=
@2591237:ncbi:1-40198
TAACATAGTGGTTCGTTTATCAAGGATAATCTATCTCCATAGGTTCTTCATCATCTAACTCTGAATATTTATTCTTAGTT
+
C=CGGGGGGGGGGCJJJJ=JJJJJJJJJJJGGJJJJ1GJJ8GJJGGGJGGJJC=JJGGGCCGG88GG=GGGGGGCJGGGG

source

FastqFileReader

 FastqFileReader (path:str|pathlib.Path)

Iterator going through a fastq file’s sequences and return each section + prob error as a dict

Type Details
path str | pathlib.Path path to the fastq file
Returns dict key/value with keys: definition line; sequence; q score; prob error
fastq = FastqFileReader(p2fastq)
iteration_output = next(fastq)

print(type(iteration_output))
print(iteration_output.keys())
print(f"Definition line:  {iteration_output['definition line']}")
print(f"Read sequence:    {iteration_output['sequence']}")
print(f"Q scores (ASCII): {iteration_output['read_qscores']}")
print(f"Prob error:       {','.join([f'{p:.4f}' for p in iteration_output['probs error']])}")
<class 'dict'>
dict_keys(['definition line', 'sequence', 'read_qscores', 'probs error'])
Definition line:  @2591237:ncbi:1-40200
Read sequence:    TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAACTTACATAGCTCGCGTCTCAGTTTCAAGGAACTTTTAGTGTATGCTGCTGATCCAGCCATGCATGCAGCTT
Q scores (ASCII): CCCGGGCGGGGGCJGJJJGJJGJJJGJGGJGJJJGJGGGGGGGGCJGJGGGGGJJJJGCCGGGGGJCGCGJGJCG=GGGG=CGGGGGG1GCGCGGGGCCGJC8GGGGGGGGGGGCGGGGGGGGGGGC8GGGGGGCGGC1GGGCGGGGGCC
Prob error:       0.0004,0.0004,0.0004,0.0002,0.0002,0.0002,0.0004,0.0002,0.0002,0.0002,0.0002,0.0002,0.0004,0.0001,0.0002,0.0001,0.0001,0.0001,0.0002,0.0001,0.0001,0.0002,0.0001,0.0001,0.0001,0.0002,0.0001,0.0002,0.0002,0.0001,0.0002,0.0001,0.0001,0.0001,0.0002,0.0001,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0004,0.0001,0.0002,0.0001,0.0002,0.0002,0.0002,0.0002,0.0002,0.0001,0.0001,0.0001,0.0001,0.0002,0.0004,0.0004,0.0002,0.0002,0.0002,0.0002,0.0002,0.0001,0.0004,0.0002,0.0004,0.0002,0.0001,0.0002,0.0001,0.0004,0.0002,0.0016,0.0002,0.0002,0.0002,0.0002,0.0016,0.0004,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0251,0.0002,0.0004,0.0002,0.0004,0.0002,0.0002,0.0002,0.0002,0.0004,0.0004,0.0002,0.0001,0.0004,0.0050,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0004,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0004,0.0050,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0004,0.0002,0.0002,0.0004,0.0251,0.0002,0.0002,0.0002,0.0004,0.0002,0.0002,0.0002,0.0002,0.0002,0.0004,0.0004

Five largest probabilities of error:

np.sort(iteration_output['probs error'])[-5:]
array([0.00158489, 0.00501187, 0.00501187, 0.02511886, 0.02511886])
np.argsort(iteration_output['probs error'])[-5:]
array([ 80, 127, 102, 138,  88])
dfn_line = iteration_output['definition line']
meta = fastq.parse_text(dfn_line)
meta
{'readid': '2591237:ncbi:1-40200',
 'reftaxonomyid': '2591237',
 'refsource': 'ncbi',
 'refseqnb': '1',
 'readnb': '40200'}
fastq = FastqFileReader(p2fastq)
next(fastq).keys()
dict_keys(['definition line', 'sequence', 'read_qscores', 'probs error'])

source

FastqFileReader.parse_file

 FastqFileReader.parse_file (add_readseq:bool=False,
                             add_qscores:bool=False,
                             add_probs_error:bool=False,
                             save_json:bool=False)

Read fastq file, return a dict with definition line metadata and optionally read sequence and q scores, …

Type Default Details
add_readseq bool False When True, add the full sequence to the parsed metadata dictionary
add_qscores bool False Add the read ASCII Q Scores to the parsed dictionary when True
add_probs_error bool False Add the read probability of error to the parsed dictionary when True
save_json bool False When True, save the file metadata as a json file of same stem name
Returns dict Metadata as Key/Values pairs
parsed = fastq.parse_file(add_readseq=False, add_qscores=False, add_probs_error=False)
for i, (k, v) in enumerate(parsed.items()):
    print(k)
    pprint(v)
    if i >=3: break
2591237:ncbi:1-40200
{'readid': '2591237:ncbi:1-40200',
 'readnb': '40200',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
2591237:ncbi:1-40199
{'readid': '2591237:ncbi:1-40199',
 'readnb': '40199',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
2591237:ncbi:1-40198
{'readid': '2591237:ncbi:1-40198',
 'readnb': '40198',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
2591237:ncbi:1-40197
{'readid': '2591237:ncbi:1-40197',
 'readnb': '40197',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
metadata = fastq.parse_file(add_readseq=True)
df = pd.DataFrame(metadata).T
df.head(10)
readid reftaxonomyid refsource refseqnb readnb readseq
2591237:ncbi:1-40200 2591237:ncbi:1-40200 2591237 ncbi 1 40200 TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCG...
2591237:ncbi:1-40199 2591237:ncbi:1-40199 2591237 ncbi 1 40199 TCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTCTATT...
2591237:ncbi:1-40198 2591237:ncbi:1-40198 2591237 ncbi 1 40198 TAACATAGTGGTTCGTTTATCAAGGATAATCTATCTCCATAGGTTC...
2591237:ncbi:1-40197 2591237:ncbi:1-40197 2591237 ncbi 1 40197 TAATCACTGATAGCAGCATTGCCATCCTGAGCAAAGAAGAAGTGTT...
2591237:ncbi:1-40196 2591237:ncbi:1-40196 2591237 ncbi 1 40196 CTAATGTCAGTACGCCTACAATGCCTGCATCACGCATAGCATCGCA...
2591237:ncbi:1-40195 2591237:ncbi:1-40195 2591237 ncbi 1 40195 AAGCTGAAGCATACATAACACAGTCCTTAAGCCGATAACCAGACAA...
2591237:ncbi:1-40194 2591237:ncbi:1-40194 2591237 ncbi 1 40194 AGTGGAAGAACTTCACCGTCAAGATGAAACTCGACGGGGCTCTCCA...
2591237:ncbi:1-40193 2591237:ncbi:1-40193 2591237 ncbi 1 40193 GCGTCTCGAGTGCTTCGAGTTCACCGTTCTTGAGAACAACCTCCTC...
2591237:ncbi:1-40192 2591237:ncbi:1-40192 2591237 ncbi 1 40192 CTGGTAGTATCTAAGGCTCCACTGAAATACTTGTACTTGTTATATA...
2591237:ncbi:1-40191 2591237:ncbi:1-40191 2591237 ncbi 1 40191 GTCTCTATCTGTAGTACTATGACAAATAGACAGTTTCATCAGAAAT...
fastq.set_parsing_rules(verbose=True)
--------------------------------------------------------------------------------
0 matches generated with rule <fasta_ncbi_std> 
--------------------------------------------------------------------------------
^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))[\s\t]*(?P=seqnb)[\s\t](?P<accession>[\w\d]*)([\s\t]*(?P=taxonomyid)[\s\t]*(?P=source)[\s\t][\s\t]*(?P<organism>[\w\s\-\_/]*))?
['seqid', 'taxonomyid', 'source', 'seqnb', 'accession', 'organism']
{'seqid': '', 'taxonomyid': '', 'source': '', 'seqnb': '', 'accession': '', 'organism': ''}
--------------------------------------------------------------------------------
5 matches generated with rule <fastq_art_illumina_ncbi_std> 
--------------------------------------------------------------------------------
^@(?P<readid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*)-(?P<readnb>\d*(\/\d)?))$
['readid', 'reftaxonomyid', 'refsource', 'refseqnb', 'readnb']
{'readid': '2591237:ncbi:1-40200', 'reftaxonomyid': '2591237', 'refsource': 'ncbi', 'refseqnb': '1', 'readnb': '40200'}
--------------------------------------------------------------------------------
0 matches generated with rule <aln_art_illumina_ncbi_std> 
--------------------------------------------------------------------------------
^>(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))(\s| )*(?P<readid>(?P=reftaxonomyid):(?P=refsource):(?P=refseqnb)-(?P<readnb>\d*(\/\d(-\d)?)?))(\s|\t)(?P<aln_start_pos>\d*)(\s|\t)(?P<refseq_strand>(-|\+))$
['refseqid', 'reftaxonomyid', 'refsource', 'refseqnb', 'readid', 'readnb', 'aln_start_pos', 'refseq_strand']
{'refseqid': '', 'reftaxonomyid': '', 'refsource': '', 'refseqnb': '', 'readid': '', 'readnb': '', 'aln_start_pos': '', 'refseq_strand': ''}
--------------------------------------------------------------------------------
0 matches generated with rule <aln_art_illumina-refseq-ncbi-std> 
--------------------------------------------------------------------------------
^@SQ[\t\s]*(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))[\t\s]*(?P=refseqnb)[\t\s]*(?P<refseq_accession>[\w\d]*)[\t\s]*(?P=reftaxonomyid)[\t\s]*(?P=refsource)[\t\s](?P<organism>.*)[\t\s](?P<refseq_length>\d*)$
['refseqid', 'reftaxonomyid', 'refsource', 'refseqnb', 'refseq_accession', 'organism', 'refseq_length']
{'refseqid': '', 'reftaxonomyid': '', 'refsource': '', 'refseqnb': '', 'refseq_accession': '', 'organism': '', 'refseq_length': ''}
--------------------------------------------------------------------------------
0 matches generated with rule <fasta_ncbi_cov> 
--------------------------------------------------------------------------------
^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))[\s\t]*\[(?P<accession>[\w\d]*)\]([\s\t]*(?P=taxonomyid)[\s\t]*(?P=source)[\s\t]*(?P=seqnb)[\s\t]*\[(?P=accession)\][\s\t]*(?P=taxonomyid)[\s\t]*(?P<organism>[\w\s\-\_\/]*))?
['seqid', 'taxonomyid', 'source', 'seqnb', 'accession', 'organism']
{'seqid': '', 'taxonomyid': '', 'source': '', 'seqnb': '', 'accession': '', 'organism': ''}
--------------------------------------------------------------------------------
0 matches generated with rule <fasta_rhinolophus_ferrumequinum> 
--------------------------------------------------------------------------------
^>\d[\s\t](?P<seq_type>dna_rm):(?P<id_type>[\w\_]*)[\s\w](?P=id_type):(?P<assy>[\w\d\_]*)\.(?P<seq_level>[\w]*):\d*:\d*:(?P<taxonomy>\d*):(?P<id>\d*)[\s    ]REF$
['seq_type', 'id_type', 'assy', 'seq_level', 'taxonomy', 'id']
{'seq_type': '', 'id_type': '', 'assy': '', 'seq_level': '', 'taxonomy': '', 'id': ''}
--------------------------------------------------------------------------------
Selected rule with most matches: fastq_art_illumina_ncbi_std

ALN Alignment Files

Extension of TextFileBaseReader class for ALN read/sequence alignment files.

Structure of a ALN sequence file:

p2aln = pfs.data / 'ncbi/simreads/cov/single_1seq_150bp/single_1seq_150bp.aln'
assert p2aln.is_file()

aln = TextFileBaseReader(p2aln, nlines=1)
for i, t in enumerate(aln):
    txt = t.replace('\n', '')[:80]
    print(f"{txt}")
    if i >= 12: break
##ART_Illumina  read_length 150
@CM /usr/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/ncbi/refs
@SQ 2591237:ncbi:1  1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018D    3021
##Header End
>2591237:ncbi:1 2591237:ncbi:1-40200    14370   +
TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAAC
TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAAC
>2591237:ncbi:1 2591237:ncbi:1-40199    15144   -
TCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTCTATTCTTTGCACTAATGGCGTACTTAAGATTCATTTGA
TCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTCTATTCTTTGCACTAATGGCGTACTTAAGAGTCATTTGA
>2591237:ncbi:1 2591237:ncbi:1-40198    2971    -
TAACATAGTGGTTCGTTTATCAAGGATAATCTATCTCCATAGGTTCTTCATCATCTAACTCTGAATATTTATTCTTAGTT
TAACATAGTGGTTCGTTTATCAAGGATAATCTATCTCCATAGGTTCTTCATCATCTAACTCTGAATATTTATTCTTAGTT

source

AlnFileReader

 AlnFileReader (path:str|pathlib.Path)

Iterator going through an ALN file

Type Details
path str | pathlib.Path path to the aln file
Returns dict key/value with keys:
aln = AlnFileReader(p2aln)

AlnFileReader iterator returns elements one by one, as dictionaries with each data line related to the read, accessible through the following keys:

  • key 'definition line': read definition line, including read metadata
  • key 'ref_seq_aligned': aligned reference sequence, that is the sequence segment in the original reference corresponding to the read
  • key 'read_seq_aligned': aligned read, that is the simmulated read sequence, where each bp corresponds to the reference sequence bp in the same position.
one_iteration = next(aln)
one_iteration.keys()
dict_keys(['definition line', 'ref_seq_aligned', 'read_seq_aligned'])
pprint(one_iteration)
{'definition line': '>2591237:ncbi:1\t2591237:ncbi:1-40200\t14370\t+',
 'read_seq_aligned': 'TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAACTTACATAGCTCGCGTCTCAGTTTCAAGGAACTTTTAGTGTATGCTGCTGATCCAGCCATGCATGCAGCTT',
 'ref_seq_aligned': 'TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAACTTACATAGCTCGCGTCTCAGTTTCAAGGAACTTTTAGTGTATGCTGCTGATCCAGCCATGCATGCAGCTT'}
dfn_line, ref_seq_aligned, read_seq_aligned = one_iteration.values()
dfn_line
'>2591237:ncbi:1\t2591237:ncbi:1-40200\t14370\t+'
ref_seq_aligned[:100]
'TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAACTTACATAGCTCGCGTCTCAG'
read_seq_aligned[:100]
'TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAACTTACATAGCTCGCGTCTCAG'
another_iteration = next(aln)
pprint(another_iteration)
{'definition line': '>2591237:ncbi:1\t2591237:ncbi:1-40199\t15144\t-',
 'read_seq_aligned': 'TCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTCTATTCTTTGCACTAATGGCGTACTTAAGAGTCATTTGAGTTATAGTAGGGATGACATTACGCTTAGTATACGCGAAAAGTGCATCTTGATCCTCATAACTCATTGAGT',
 'ref_seq_aligned': 'TCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTCTATTCTTTGCACTAATGGCGTACTTAAGATTCATTTGAGTTATAGTAGGGATGACATTACGCTTAGTATACGCGAAAAGTGCATCTTGATCCTCATAACTCATTGAGT'}
aln.reset_iterator()
for i, d in enumerate(aln):
    print(d['definition line'])
    print(d['ref_seq_aligned'][:80], '...')
    print(d['read_seq_aligned'][:80], '...\n')
    if i >= 3: break
>2591237:ncbi:1 2591237:ncbi:1-40200    14370   +
TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAAC ...
TTGTAGATGGTGTTCCTTTTGTTGTTTCAACTGGATACCATTTTCGTGAGTTAGGAGTTGTACATAATCAGGATGTAAAC ...

>2591237:ncbi:1 2591237:ncbi:1-40199    15144   -
TCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTCTATTCTTTGCACTAATGGCGTACTTAAGATTCATTTGA ...
TCATAGTACTACAGATAGAGACACCAGCTACGGTGCGAGCTCTATTCTTTGCACTAATGGCGTACTTAAGAGTCATTTGA ...

>2591237:ncbi:1 2591237:ncbi:1-40198    2971    -
TAACATAGTGGTTCGTTTATCAAGGATAATCTATCTCCATAGGTTCTTCATCATCTAACTCTGAATATTTATTCTTAGTT ...
TAACATAGTGGTTCGTTTATCAAGGATAATCTATCTCCATAGGTTCTTCATCATCTAACTCTGAATATTTATTCTTAGTT ...

>2591237:ncbi:1 2591237:ncbi:1-40197    15485   -
TAATCACTGATAGCAGCATTGCCATCCTGAGCAAAGAAGAAGTGTTTTAGTTCAACAGAACTTCCTTCCTTAAAGAAACC ...
TAATCACTGATAGCAGCATTGCCATCCTGAGCAAAGAAGAAGTGTTTTAGTTCAACAGAACTTCCTTCCTTAAAGAAACC ...

Once instantiated, the AlnFileReader iterator gives access to the file’s header information through header instance attribute. It is a dictionary with two keys: 'command' and 'reference sequences':

    {'command':             'art-illumina command used to create the reads',
     'reference sequences': ['@SQ metadata on reference sequence 1 used for the reads',
                             '@SQ metadata on reference sequence 2 used for the reads', 
                             ...
                            ]
    }
print(aln.header['command'])
/usr/bin/art_illumina -i /home/vtec/projects/bio/metagentools/data/ncbi/refsequences/cov/cov_refseq_001-seq1.fa -ss HS25 -l 150 -f 200 -o /home/vtec/projects/bio/metagentools/data/ncbi/simreads/cov/single_1seq_150bp/single_1seq_150bp -rs 1723893089
for seq_info in aln.header['reference sequences']:
    print(seq_info)
@SQ 2591237:ncbi:1  1   MK211378    2591237 ncbi    Coronavirus BtRs-BetaCoV/YN2018D    30213

The read definition line includes key metadata, which need to be parsed using the appropriate parsing rule.


source

TextFileBaseReader.parse_text

 TextFileBaseReader.parse_text (txt:str, pattern:str|None=None)

Parse text using regex pattern with groups. Return a metadata dictionary.

Type Default Details
txt str text to parse
pattern str | None None If None, uses standard regex pattern to extract metadata, otherwise, uses passed regex
Returns dict parsed metadata in key/value format
aln.parse_text(dfn_line, pattern)
{'refseqid': '2591237:ncbi:1',
 'reftaxonomyid': '2591237',
 'refsource': 'ncbi',
 'refseqnb': '1',
 'readid': '2591237:ncbi:1-40200',
 'readnb': '40200',
 'aln_start_pos': '14370',
 'refseq_strand': '+'}

source

AlnFileReader.parse_definition_line_with_position

 AlnFileReader.parse_definition_line_with_position (dfn_line:str)

Parse definition line and adds relative position

Type Details
dfn_line str fefinition line string to be parsed
Returns dict parsed metadata in key/value format + relative position of the read

Upon instance creation, AlnFileReader automatically checks the default_parsing_rules.json file for a workable rule among saved rules. Saved rules include the rule for ART Illumina ALN files.

aln.re_rule_name
'aln_art_illumina_ncbi_std'

It is therefore not required to pass a specific pattern and keys parameter.

aln.parse_text(dfn_line)
{'refseqid': '2591237:ncbi:1',
 'reftaxonomyid': '2591237',
 'refsource': 'ncbi',
 'refseqnb': '1',
 'readid': '2591237:ncbi:1-40200',
 'readnb': '40200',
 'aln_start_pos': '14370',
 'refseq_strand': '+'}

ART Ilumina ALN files definition lines consist of:

  • The read ID: readid, e.g. 2591237:ncbi:1-20100
  • the read number (order in the file): readnb, e.g. 20100
  • The read start position in the reference sequence: aln_start_pos, e.g. 23878
  • The reference sequence ID: readid, e.g. 2591237:ncbi:1-20100
  • The reference sequence number: refseqnb, e.g. 1
  • The reference sequence source: refsource, e.g. ncbi
  • The reference sequence taxonomy: reftaxonomyid, e.g. 2591237
  • The reference sequence strand: refseq_strand wich is either + or -,

source

AlnFileReader.parse_file

 AlnFileReader.parse_file (add_ref_seq_aligned:bool=False,
                           add_read_seq_aligned:bool=False)

Read ALN file, return a dict w/ alignment info for each read and optionaly aligned reference sequence & read

Type Default Details
add_ref_seq_aligned bool False Add the reference sequence aligned to the parsed dictionary when True
add_read_seq_aligned bool False Add the read sequence aligned to the parsed dictionary when True
Returns dict
parsed = aln.parse_file()

for i, (k, v) in enumerate(parsed.items()):
    print(k)
    pprint(v)
    if i > 3: break
2591237:ncbi:1-40200
{'aln_start_pos': '14370',
 'readid': '2591237:ncbi:1-40200',
 'readnb': '40200',
 'refseq_strand': '+',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
2591237:ncbi:1-40199
{'aln_start_pos': '15144',
 'readid': '2591237:ncbi:1-40199',
 'readnb': '40199',
 'refseq_strand': '-',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
2591237:ncbi:1-40198
{'aln_start_pos': '2971',
 'readid': '2591237:ncbi:1-40198',
 'readnb': '40198',
 'refseq_strand': '-',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
2591237:ncbi:1-40197
{'aln_start_pos': '15485',
 'readid': '2591237:ncbi:1-40197',
 'readnb': '40197',
 'refseq_strand': '-',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}
2591237:ncbi:1-40196
{'aln_start_pos': '16221',
 'readid': '2591237:ncbi:1-40196',
 'readnb': '40196',
 'refseq_strand': '-',
 'refseqid': '2591237:ncbi:1',
 'refseqnb': '1',
 'refsource': 'ncbi',
 'reftaxonomyid': '2591237'}

source

AlnFileReader.parse_header_reference_sequences

 AlnFileReader.parse_header_reference_sequences (pattern:str|None=None)

Extract metadata from all header reference sequences

Type Default Details
pattern str | None None regex pattern to apply to parse the reference sequence info
Returns dict parsed metadata in key/value format
pprint(aln.parse_header_reference_sequences())
{'2591237:ncbi:1': {'organism': 'Coronavirus BtRs-BetaCoV/YN2018D',
                    'refseq_accession': 'MK211378',
                    'refseq_length': '30213',
                    'refseqid': '2591237:ncbi:1',
                    'refseqnb': '1',
                    'refsource': 'ncbi',
                    'reftaxonomyid': '2591237'}}

source

AlnFileReader.set_header_parsing_rules

 AlnFileReader.set_header_parsing_rules (pattern:str|None=None,
                                         verbose:bool=False)

*Set the regex parsing rule for reference sequence in ALN header.

Updates 3 class attributes: re_header_rule_name, re_header_pattern, re_header_keys

TODO: refactor this and the method in Core: to use a single function for the common part and a parameter for the text_to_parse*

Type Default Details
pattern str | None None regex pattern to apply to parse the text, search in parsing rules json if None
verbose bool False when True, provides information on each rule
Returns None
aln.set_header_parsing_rules(verbose=True)
--------------------------------------------------------------------------------
Rule <fasta_ncbi_std> generated 0 matches
--------------------------------------------------------------------------------
^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))[\s\t]*(?P=seqnb)[\s\t](?P<accession>[\w\d]*)([\s\t]*(?P=taxonomyid)[\s\t]*(?P=source)[\s\t][\s\t]*(?P<organism>[\w\s\-\_/]*))?
['seqid', 'taxonomyid', 'source', 'accession', 'seqnb', 'organism']
{'seqid': '', 'taxonomyid': '', 'source': '', 'seqnb': '', 'accession': '', 'organism': ''}
--------------------------------------------------------------------------------
Rule <fastq_art_illumina_ncbi_std> generated 0 matches
--------------------------------------------------------------------------------
^@(?P<readid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*)-(?P<readnb>\d*(\/\d)?))$
['readid', 'reftaxonomyid', 'refsource', 'refseqnb', 'readnb']
{'readid': '', 'reftaxonomyid': '', 'refsource': '', 'refseqnb': '', 'readnb': ''}
--------------------------------------------------------------------------------
Rule <aln_art_illumina_ncbi_std> generated 0 matches
--------------------------------------------------------------------------------
^>(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))(\s| )*(?P<readid>(?P=reftaxonomyid):(?P=refsource):(?P=refseqnb)-(?P<readnb>\d*(\/\d(-\d)?)?))(\s|\t)(?P<aln_start_pos>\d*)(\s|\t)(?P<refseq_strand>(-|\+))$
['refseqid', 'reftaxonomyid', 'refsource', 'refseqnb', 'readid', 'readnb', 'aln_start_pos', 'refseq_strand']
{'refseqid': '', 'reftaxonomyid': '', 'refsource': '', 'refseqnb': '', 'readid': '', 'readnb': '', 'aln_start_pos': '', 'refseq_strand': ''}
--------------------------------------------------------------------------------
Rule <aln_art_illumina-refseq-ncbi-std> generated 7 matches
--------------------------------------------------------------------------------
^@SQ[\t\s]*(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))[\t\s]*(?P=refseqnb)[\t\s]*(?P<refseq_accession>[\w\d]*)[\t\s]*(?P=reftaxonomyid)[\t\s]*(?P=refsource)[\t\s](?P<organism>.*)[\t\s](?P<refseq_length>\d*)$
['refseqid', 'reftaxonomyid', 'refsource', 'refseqnb', 'refseq_accession', 'organism', 'refseq_length']
{'refseqid': '2591237:ncbi:1', 'reftaxonomyid': '2591237', 'refsource': 'ncbi', 'refseqnb': '1', 'refseq_accession': 'MK211378', 'organism': 'Coronavirus BtRs-BetaCoV/YN2018D', 'refseq_length': '30213'}
--------------------------------------------------------------------------------
Rule <fasta_ncbi_cov> generated 0 matches
--------------------------------------------------------------------------------
^>(?P<seqid>(?P<taxonomyid>\d+):(?P<source>ncbi):(?P<seqnb>\d*))[\s\t]*\[(?P<accession>[\w\d]*)\]([\s\t]*(?P=taxonomyid)[\s\t]*(?P=source)[\s\t]*(?P=seqnb)[\s\t]*\[(?P=accession)\][\s\t]*(?P=taxonomyid)[\s\t]*(?P<organism>[\w\s\-\_\/]*))?
['seqid', 'taxonomyid', 'source', 'accession', 'seqnb', 'organism']
{'seqid': '', 'taxonomyid': '', 'source': '', 'seqnb': '', 'accession': '', 'organism': ''}
--------------------------------------------------------------------------------
Rule <fasta_rhinolophus_ferrumequinum> generated 0 matches
--------------------------------------------------------------------------------
^>\d[\s\t](?P<seq_type>dna_rm):(?P<id_type>[\w\_]*)[\s\w](?P=id_type):(?P<assy>[\w\d\_]*)\.(?P<seq_level>[\w]*):\d*:\d*:(?P<taxonomy>\d*):(?P<id>\d*)[\s    ]REF$
['seq_type', 'id_type', 'assy', 'seq_level', 'taxonomy', 'id']
{'seq_type': '', 'id_type': '', 'assy': '', 'seq_level': '', 'taxonomy': '', 'id': ''}
--------------------------------------------------------------------------------
Selected rule with most matches: aln_art_illumina-refseq-ncbi-std
print(aln.re_header_rule_name)
print(aln.re_header_pattern)
print(aln.re_header_keys)
aln_art_illumina-refseq-ncbi-std
^@SQ[\t\s]*(?P<refseqid>(?P<reftaxonomyid>\d*):(?P<refsource>\w*):(?P<refseqnb>\d*))[\t\s]*(?P=refseqnb)[\t\s]*(?P<refseq_accession>[\w\d]*)[\t\s]*(?P=reftaxonomyid)[\t\s]*(?P=refsource)[\t\s](?P<organism>.*)[\t\s](?P<refseq_length>\d*)$
['refseqid', 'reftaxonomyid', 'refsource', 'refseqnb', 'refseq_accession', 'organism', 'refseq_length']

Build datasets

Reads are provided in various formats (plain text for original data, fastq + aln for simulated reads). The CNN_Virus model requires inputs in the form of three tensors:

  • read_seq_batch: a batch of 50-mer read sequences, in “base-hot-encoded” format (BHE)
  • label_batch: a batch of the species’ labels, in one-hot-encoded format (187 classes) (OHE)
  • pos_batch: a batch of the read’s relative positions in the reference sequence, in one-hot-encoded format (10 classes) (OHE)

The classes below are custom pytorch Dataset used to load the reads from their file and transform it into BHE or OHE tensors.

Plain text based data

The datasets provided by CNN Virus team are in plain text format, one sequence per line, with format as follows:

AAAAAGATTTTGAGAGAGGTCGACCTGTCCTCCTAAAACGTTTACAAAAG  71  0
CATGTAACGCAGCTTAGTCCGATCGTGGCTATAATCCGTCTTTCGATTTG  1   7
AACAACATCTTGTTGATGATAACCGTCAAAGTGTTTTGGGTCTGGAGGGA  158 6
AGTACCTGGAGAGCGTTAAGAAACACAAACGGCTGGATGTAGTGCCGCGC  6   7
CCACGTCGATGAAGCTCCGACGAGAGTCGGCGCTGAGCCCGCGCACCTCC  71  6
AGCTCGTGGATCTCCCCTCCTTCTGCAGTTTCAACATCAGAAGCCCTGAA  87  1

The first column is the k-base (k-mer) read, followed byt the specie label and the relative position.

When using this type of data, the data pipeline consist of:

  • Creating a TextFileDataset instance to load the data

  • Using the created dataset to in a Dataloader used for training or inference.


source

TextFileDataset

 TextFileDataset (p2file:str|pathlib.Path)

Load data from text file and yield (BHE sequence tensor, (label OHE tensor, position OHE tensor))

Type Details
p2file str | pathlib.Path path to the file to read

This IterableDataset reads the file line by line and yields a tupple (seq_bhe, (lbl_ohe, pos_ohe)):

  • seq_bhe: tensor of shape [k, 5] with the k-mer read in base-hot-encoded format
  • lbl_ohe: tensor of shape [187] with the specie label in one-hot-encoded format
  • pos_ohe: tensor of shape [10] with the relative position label in one-hot-encoded format

The dataset is then used with a pytorch DataLoader to handle batching. When using the GPU, the pin_memory=True should be used to make transfer of data to the GPU faster.

Let’s use a small data file to illustrate the process.

p2file = Path('data_dev/CNN_Virus_data/50mer_ds_100_seq')
assert p2file.is_file()

ds = TextFileDataset(p2file)
dl = DataLoader(ds, batch_size=8, num_workers=2, pin_memory=False)

for seq_batch, (lbl_batch, pos_batch) in dl:
    print(seq_batch.shape, lbl_batch.shape, pos_batch.shape)
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([8, 50, 5]) torch.Size([8, 187]) torch.Size([8, 10])
torch.Size([4, 50, 5]) torch.Size([4, 187]) torch.Size([4, 10])
torch.Size([4, 50, 5]) torch.Size([4, 187]) torch.Size([4, 10])

The read sequence tensor is BHE:

seq_batch[:2, :3, :]
tensor([[[1, 0, 0, 0, 0],
         [1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0]],

        [[1, 0, 0, 0, 0],
         [0, 0, 0, 1, 0],
         [1, 0, 0, 0, 0]]])
lbl_batch[:2, :], lbl_batch.argmax(dim=1)[:2]
(tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0.]]),
 tensor([94, 18]))

ALN file based

All simulated reads are stored in a .fastq and .aln file. We use the .aln file because it includes all metadata information required to retrieve both the specie and the relative position of each read.

AlnFileDataset allows to load reads and metadata stored in ART Illumina simulated ALN file.

p2aln = pfs.data / 'ncbi/simreads/cov/single_1seq_150bp/single_1seq_150bp.aln'
aln = AlnFileReader(p2aln)

for i, d in enumerate(aln):
    pass
print(f"{i+1:,d} reads in this ALN file")
40,200 reads in this ALN file

source

AlnFileDataset

 AlnFileDataset (p2file:str|pathlib.Path, label:int=118,
                 return_metadata:bool=False, refseqid:str='')

*Load data and metadata from ALN file, yield BHE sequence, OHE label, OHE position tensors + metadata

The iterator yield tupple (read tensor,(label tensor, position tensor)):

  • kmer read tensor in base hot encoded format (shape [k, 5])
  • label tensor in one hot encoded format (shape [187])
  • position tensor in one hote encoded format (shape [10])

It also optionally returns a dictionary of the read metadata available in the ALN file.*

Type Default Details
p2file str | pathlib.Path path to the file to read
label int 118 label for this batch (assuming all reads are from the same species)
return_metadata bool False yield each read metadata as a dictionary when True
refseqid str refseqid to filter reference sequence (default is no filtering)

Create a dataset from a ALN file, then pass it to the class Dataloader to create batches of data

ds = AlnFileDataset(p2aln, label=118)

dl = DataLoader(ds, batch_size=16, shuffle=False)

for i, (s, (lbl,pos)) in enumerate(dl):
    print(f"Sample {i+1:000d}:")
    print(f"  Shapes of each data tensor:")
    print(f"    seq: {s.shape}, label: {lbl.shape}, pos: {pos.shape}")
    if i+1 >= 3:break
Sample 1:
  Shapes of each data tensor:
    seq: torch.Size([16, 150, 5]), label: torch.Size([16, 187]), pos: torch.Size([16, 10])
Sample 2:
  Shapes of each data tensor:
    seq: torch.Size([16, 150, 5]), label: torch.Size([16, 187]), pos: torch.Size([16, 10])
Sample 3:
  Shapes of each data tensor:
    seq: torch.Size([16, 150, 5]), label: torch.Size([16, 187]), pos: torch.Size([16, 10])

When an AlnFileDataset instance with return_metadata set as True is passed to DataLoader, the dataloader will yield (read_tensor,(lbl_tensor, pos_tensor), metadata) where:

  • read_tensor is the batch of kmer read tensor in base hot encoded format (shape [bs, k, 5])
  • lbl_tensor is the batch of label tensor in one hot encoded format (shape [bs, 187])
  • pos_tensor is the batch of position tensor in one hote encoded format (shape [bs, 10])
  • metadata is a dictionary of form key:list like:
{
    'aln_start_pos': ['14370', '15144', '2971'], 
    'readid': ['2591237:ncbi:1-40200', '2591237:ncbi:1-40199', '2591237:ncbi:1-40198'], 
    'readnb': ['40200', '40199', '40198'], 
    'refseq_strand': ['+', '-', '-'], 
    'refseqid': ['2591237:ncbi:1', '2591237:ncbi:1', '2591237:ncbi:1'], 
    'refseqnb': ['1', '1', '1'], 
    'refsource': ['ncbi', 'ncbi', 'ncbi'], 
    'reftaxonomyid': ['2591237', '2591237', '2591237'], 
    'read_pos': tensor([ 5,  6,  1])
}
ds = AlnFileDataset(p2aln, label=118, return_metadata=True)
dl = DataLoader(ds, batch_size=4, shuffle=False)

for i, (s, (lbl,pos), d) in enumerate(dl):
    print(f"Sample {i+1:000d}:")
    print(f"  Shapes of each data tensor:")
    print(f"    seq: {s.shape}, label: {lbl.shape}, pos: {pos.shape}")
    print(f"  Metadata dictionary for the batch")
    print(f"    readid:   {d['readid']} \n    refseqid: {d['refseqid']}")
    if i+1 >= 3:break
Sample 1:
  Shapes of each data tensor:
    seq: torch.Size([4, 150, 5]), label: torch.Size([4, 187]), pos: torch.Size([4, 10])
  Metadata dictionary for the batch
    readid:   ['2591237:ncbi:1-40200', '2591237:ncbi:1-40199', '2591237:ncbi:1-40198', '2591237:ncbi:1-40197'] 
    refseqid: ['2591237:ncbi:1', '2591237:ncbi:1', '2591237:ncbi:1', '2591237:ncbi:1']
Sample 2:
  Shapes of each data tensor:
    seq: torch.Size([4, 150, 5]), label: torch.Size([4, 187]), pos: torch.Size([4, 10])
  Metadata dictionary for the batch
    readid:   ['2591237:ncbi:1-40196', '2591237:ncbi:1-40195', '2591237:ncbi:1-40194', '2591237:ncbi:1-40193'] 
    refseqid: ['2591237:ncbi:1', '2591237:ncbi:1', '2591237:ncbi:1', '2591237:ncbi:1']
Sample 3:
  Shapes of each data tensor:
    seq: torch.Size([4, 150, 5]), label: torch.Size([4, 187]), pos: torch.Size([4, 10])
  Metadata dictionary for the batch
    readid:   ['2591237:ncbi:1-40192', '2591237:ncbi:1-40191', '2591237:ncbi:1-40190', '2591237:ncbi:1-40189'] 
    refseqid: ['2591237:ncbi:1', '2591237:ncbi:1', '2591237:ncbi:1', '2591237:ncbi:1']

AlnFileDataset also allows to filter reads to match one specific reference sequence. If this option is selected, only simulated reads generated from that specific reference sequence will be loaded and yielded.

p2aln_2seq = pfs.data / 'ncbi/simreads/cov/single_1seq_150bp/single_2seq_150bp.aln'
assert p2aln_2seq.is_file()

ds = AlnFileDataset(p2aln_2seq, label=118, return_metadata=True, refseqid='2591237:ncbi:2')
dl = DataLoader(ds, batch_size=3, shuffle=False)

for i, (s, (lbl,pos), d) in enumerate(dl):
    print(f"Sample {i+1:000d}:")
    print(f"  Shapes of each data tensor:")
    print(f"    seq: {s.shape}, label: {lbl.shape}, pos: {pos.shape}")
    print(f"  Metadata dictionary for the batch")
    print(f"    readid:   {d['readid']} \n    refseqid: {d['refseqid']}")
    if i+1 >= 3:break
Filtering reads to only keep those generated from refseqid 2591237:ncbi:2
Sample 1:
  Shapes of each data tensor:
    seq: torch.Size([3, 150, 5]), label: torch.Size([3, 187]), pos: torch.Size([3, 10])
  Metadata dictionary for the batch
    readid:   ['2591237:ncbi:2-40199', '2591237:ncbi:2-40196', '2591237:ncbi:2-40194'] 
    refseqid: ['2591237:ncbi:2', '2591237:ncbi:2', '2591237:ncbi:2']
Sample 2:
  Shapes of each data tensor:
    seq: torch.Size([3, 150, 5]), label: torch.Size([3, 187]), pos: torch.Size([3, 10])
  Metadata dictionary for the batch
    readid:   ['2591237:ncbi:2-40193', '2591237:ncbi:2-40185', '2591237:ncbi:2-40183'] 
    refseqid: ['2591237:ncbi:2', '2591237:ncbi:2', '2591237:ncbi:2']
Sample 3:
  Shapes of each data tensor:
    seq: torch.Size([1, 150, 5]), label: torch.Size([1, 187]), pos: torch.Size([1, 10])
  Metadata dictionary for the batch
    readid:   ['2591237:ncbi:2-30794'] 
    refseqid: ['2591237:ncbi:2']

Handle long reads

Preprocessing

CNN Virus handles 50-mer reads only. Longer reads need to be split into 50-mer reads by sliding a window of size 50 along the read sequence. The set of multiple 50-mer reads is then used as input to the model. After prediction on the set of 50-mer reads, the final prediction is obtained by filtering 50-mer prediction that yield a hight enough probability and then voting for the most predicted species.

A k-mer (150-mer) read will be split in k-49 (101) 50-mer reads which will yield k-49 (101) probability tensors. The final prediction will be the species with the highest number of votes, as long as their respective probabilities > 90%.


source

split_kmer_batch_into_50mers

 split_kmer_batch_into_50mers (kmer_b:torch.Tensor,
                               labels:tuple[torch.Tensor,torch.Tensor]|Non
                               e=None)

*Convert a batch of k-mer reads into 50-mer reads, by shifting the k-mer one base at a time.

Shapes:

  • kmer_b: [b,k,5] -> [b * (k - 49), 50, 5]
  • label/position_b: [b, nb_class] -> [b * (k - 49), nb_class]

Technical Note: we use advanced indexing of the tensor to create the 50-mer and roll them, with no loop.*

Type Default Details
kmer_b Tensor tensor representing a batch of k-mer reads, BHE format, shape [b, k, 5]
labels tuple[torch.Tensor, torch.Tensor] | None None optional tuple with tensor for label and position batches
Returns tuple batch tensor for 50-mer reads, (optional) labels and positions batch tensor

We can test the function, with a test tensor designed to make the validation easier.

We create a tensor with a batch of b k-mer reads in the following format:

read 1: 10001, 10002, 10003, .... 10149
read 2: 20001, 20002, 20003, .... 20149
read 3: 30001, 30002, 30003, .... 30149
    ...

We then add an additional dimension to simulate the BHE encoding, by repeating the same value 5 times.

b, k = 4, 150

def _create_kmer_batch(b:int, k:int)-> torch.Tensor:
    """Create a batch of kmer reads, with a fake encoding in 5 dimensions (like BHE)"""
    # Build a tensor of shape (b,k) with the pattern described above
    kmer_b = torch.stack([(i+1) * 10000 + torch.arange(k) for i in range(b)], dim=0)
    # simulate the BHE buy duplicating the 5 dimensions
    kmer_b = torch.stack([kmer_b]*5, dim=2)
    return kmer_b

kmer_b = _create_kmer_batch(b,k)
print(kmer_b.shape)
# Slice the tensor to only show the first and last three bases, and only one of the 5 BHE dimensions
idx_bases = np.array([0,1,2,147,148,149])
kmer_b[:, idx_bases, 0]
torch.Size([4, 150, 5])
tensor([[10000, 10001, 10002, 10147, 10148, 10149],
        [20000, 20001, 20002, 20147, 20148, 20149],
        [30000, 30001, 30002, 30147, 30148, 30149],
        [40000, 40001, 40002, 40147, 40148, 40149]])

We see that the function split_kmer_batch_into_50mers returns a tensor with shape [b, k-49, 50, 5] as expected.

split_50mer_read_b, (split_50mer_label_b, split_50mer_position_b) = split_kmer_batch_into_50mers(kmer_b)
print(kmer_b.shape)
print(split_50mer_read_b.shape)
torch.Size([4, 150, 5])
torch.Size([404, 50, 5])

Lets slice only the first and last two 50-mer reads for each batch and a selection of bases to confirm that the batch is properly split.

dim0_idxs = [0,1,99,100,101,102,200,201,202,203,301,302]
dim1_idxs = [0,1,25,26,48,49]
dim2_idxs = [0,1,2,3,4]
print(split_50mer_read_b[np.ix_(dim0_idxs, dim1_idxs, dim2_idxs)][:,:,0])
tensor([[10000, 10001, 10025, 10026, 10048, 10049],
        [10001, 10002, 10026, 10027, 10049, 10050],
        [10099, 10100, 10124, 10125, 10147, 10148],
        [10100, 10101, 10125, 10126, 10148, 10149],
        [20000, 20001, 20025, 20026, 20048, 20049],
        [20001, 20002, 20026, 20027, 20049, 20050],
        [20099, 20100, 20124, 20125, 20147, 20148],
        [20100, 20101, 20125, 20126, 20148, 20149],
        [30000, 30001, 30025, 30026, 30048, 30049],
        [30001, 30002, 30026, 30027, 30049, 30050],
        [30099, 30100, 30124, 30125, 30147, 30148],
        [30100, 30101, 30125, 30126, 30148, 30149]])

Postprocessing

The model CNN Virus requires post inderence processing of the predictions: - before inference, each k-mer was split into \(k-49\) 50-mer, where were presented to the model for inference. Each k-mer led to \(k-49\) predictions (probability tensors) - now, we need to combine these $n = k-49 $ probability tensor into a single prediction for the original k-mer. This is done by first filtering all 50-reads that gave a max probability lower than a specific threshold (0.9 by default) and then combining the prediction by a simple vote

Now we can build the function combine_predictions step by step.

First, lets create a test probabilities tensor representing a batch of bs k-mers, each split into k-49 50-mers, with a probability tensor of shape [bs, k-49, nb_class].

def create_test_probs(bs=4, k=150, nb_class=187):
    print(f"Creating a batch of probabilities for {bs} {k}-mer reads with {nb_class} classes:")

    probs = torch.from_numpy(np.random.rand(bs * n ,nb_class)).reshape(-1,n,nb_class)
    print(f"probs {probs.shape}")
    return probs

bs, k, nb_class = 4, 55, 10
n = k - 49
probs = create_test_probs(bs, k, nb_class)
Creating a batch of probabilities for 4 55-mer reads with 10 classes:
probs torch.Size([4, 6, 10])

Step 1: Extract the predictions from probabilities for each 50-mer read, then filter the predictions with a probability lower than a threshold.

threshold = 0.8
preds = probs.argmax(dim=2)
print(f"preds {preds.shape}:\n",preds)

INVALID = 9999
invalid_labels_filter = probs.max(dim=2).values <= threshold
preds[invalid_labels_filter] = INVALID
print(f"preds {preds.shape}:\n",preds)
preds torch.Size([4, 6]):
 tensor([[9, 4, 8, 0, 9, 9],
        [2, 6, 7, 3, 5, 9],
        [9, 3, 4, 9, 0, 4],
        [7, 1, 9, 9, 6, 7]])
preds torch.Size([4, 6]):
 tensor([[   9,    4, 9999,    0,    9,    9],
        [   2,    6,    7,    3, 9999, 9999],
        [   9,    3,    4,    9,    0, 9999],
        [   7, 9999,    9,    9,    6,    7]])

Step 2: Extract the prediction unique values and their counts.

Code explanation note:

We do not want to use a python loop. This is why we use torch.unique(preds) with return_inverse. The function returns a flat tensor with the unique values accross the entire preds and a tensor of the same shape as preds with the indices of the unique values in preds.

    x = torch.tensor([[10, 30, 20, 30],[20,20,20,50]], dtype=torch.long)

    uniques, inverse = torch.unique(x, sorted=True, return_inverse=True)

    > unique:  tensor([10, 20, 30, 50])
      inverse: tensor([[0, 2, 1, 2],
                       [1, 1, 1, 3]]))
# Get unique values and their counts for the entire tensor
unique_values, inverse_indices = torch.unique(preds, return_inverse=True)
inverse_indices = inverse_indices.view(preds.shape)
print(f"unique_values {unique_values.shape}:\n",unique_values)
print(f"inverse_indices {inverse_indices.shape}:\n",inverse_indices)
unique_values torch.Size([8]):
 tensor([   0,    2,    3,    4,    6,    7,    9, 9999])
inverse_indices torch.Size([4, 6]):
 tensor([[6, 3, 7, 0, 6, 6],
        [1, 4, 5, 2, 7, 7],
        [6, 2, 3, 6, 0, 7],
        [5, 7, 6, 6, 4, 5]])

Step 3: Compute the counts of unique values in each 50-mer read and store it in a tensor of shape (batch size, nb of unique values in the the shole tensor preds).

# Create a tensor to hold the counts (shape (bs, nb_unique_values_across_the_batch))
counts = torch.zeros((preds.shape[0], unique_values.shape[0]), dtype=torch.int64)
# Count occurrences of each unique value per row
counts.scatter_add_(dim=1, index=inverse_indices, src=torch.ones_like(inverse_indices, dtype=torch.int64))
print(f"unique_values {unique_values.shape}:\n", unique_values)
print(f"counts {counts.shape}:\n", counts)
unique_values torch.Size([8]):
 tensor([   0,    2,    3,    4,    6,    7,    9, 9999])
counts torch.Size([4, 8]):
 tensor([[1, 0, 0, 1, 0, 0, 3, 1],
        [0, 1, 1, 0, 1, 1, 0, 2],
        [1, 0, 1, 1, 0, 0, 2, 1],
        [0, 0, 0, 0, 1, 2, 2, 1]])

Step 4: Get the index of the most frequent value for each 50-mer read, excluding the INVALID placeholder, and extract the corresponding values into a tensor of shape (bs).

# get value most voted per 50-read (vertical tensor), excluding the placeholder INVALID
most_voted_idxs = counts[:, :-1].argmax(dim=1)
most_voted_value = unique_values[most_voted_idxs][:, None]
print(f"most_voted_value {most_voted_value.shape}:\n", most_voted_value)
most_voted_value torch.Size([4, 1]):
 tensor([[9],
        [2],
        [9],
        [7]])

We put it all in the function combine_predictions


source

combine_predictions

 combine_predictions (label_probs:torch.Tensor, pos_probs:torch.Tensor,
                      threshold:float=0.9)

*Combine a batch of 50-mer probabilities into one batch of final prediction for label and position

Note: the input must be of shape (batch_size, n, c) where n is k-49 and c is the nb of labels or positions*

Type Default Details
label_probs Tensor Probabilities for the labels classes for each 50-mer (shape: [bs, k-49,187])
pos_probs Tensor Probabilities for the position classes for each 50-mer (shape: [bs, k-49,10])
threshold float 0.9 Threshold to consider a prediction as valid
Returns tuple Predicted labels and positions for each read (shape: [bs, 187], [bs, 10])

Testing the function with the same test tensor

combine_predictions(probs, probs[:, :,:10])
(tensor([0, 2, 9, 7]), tensor([0, 2, 9, 7]))

Testing the function with test tensor correponding to real probabilities

bs, k = 1, 150
label_probs = create_test_probs(bs, k, 187)
pos_probs = create_test_probs(bs, k, 10)

print('\ncombine for a batch of data')
p = combine_predictions(label_probs,pos_probs)
print(p)
print('\ncombine for a single sample (not a batch)')
p = combine_predictions(label_probs[0,:,:],pos_probs[0,:,:])
print(p)
Creating a batch of probabilities for 1 150-mer reads with 187 classes:
probs torch.Size([1, 6, 187])
Creating a batch of probabilities for 1 150-mer reads with 10 classes:
probs torch.Size([1, 6, 10])

combine for a batch of data
(tensor(41), tensor(6))

combine for a single sample (not a batch)
Converting probability tensors to 3 dimensions
(tensor(41), tensor(6))

Deprecated Items

When any of the following classes and functions is called, it will raise an exception with an error message indicating how to handle the required code refactoring.

Example:

---------------------------------------------------------------------------
DeprecationWarning                        Traceback (most recent call last)
Input In [484], in <cell line: 1>()
----> 1 combine_prediction_batch(label_probs, pos_probs)

Input In [481], in combine_prediction_batch(*args, **kwargs)
      3 """Deprecated"""
      4 msg = """
      5 `combine_prediction_batch` is deprecated. 
      6 Use `combine_predictions` instead, with same capabilities and more."""
----> 7 raise DeprecationWarning(msg)

DeprecationWarning: 
    `combine_prediction_batch` is deprecated. 
    Use `combine_predictions` instead, with same capabilities and more.

source

combine_prediction_batch

 combine_prediction_batch (*args, **kwargs)

Deprecated