core

Base classes, functions and other objects used across the package.

This module includes all base classes, functions and other objects that are used across the package. It is imported by all other modules in the package.

core includes utility classes and functions to make it easier to work with the complex file systems adopted for the project, as well as base classes such as a file reader with additional functionality.

Utility Classes and Functions

Handling files and file structure

Utility classes and functions to represent and manage the project file system.


source

ProjectFileSystem

 ProjectFileSystem (*args, **kwargs)

*Represent a project file system, return paths to key directories, and provide methods to manage the file system.

  • Paths to key directories are based on whether the code is running locally or in the cloud.
  • The first time it is used on a local computer, it must be registered as local and a project root path must be set.
  • A user configuration file is created in the user’s home directory to store the project root path and whether the machine is local or not.

Technical note: ProjectFileSystem is a singleton class*
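The singleton pattern mentioned in the technical note can be sketched in a few lines (a generic sketch of the pattern, not ProjectFileSystem's actual code):

```python
class SingletonExample:
    # Generic sketch of the singleton pattern: every call to the class
    # returns the same shared instance.
    _instance = None

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a = SingletonExample()
b = SingletonExample()
assert a is b  # only one instance ever exists
```

This is why creating several `ProjectFileSystem()` instances is safe: they all refer to the same underlying object and configuration.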

Reference Project File System:

This project adopts a unified file structure to make coding and collaboration easier. In addition, the code can run locally (from a project-root directory) or in the cloud (colab, kaggle, others).

The unified file structure when running locally is:

    project-root   
        |--- data
        |      |--- CNN_Virus_data  (all data from CNN Virus original paper)
        |      |--- saved           (trained and finetuned models, saved preprocessed datasets)
        |      |--- ....            (raw or pre-processed data from various sources, results, ... )  
        |      
        |--- nbs  (all reference and work notebooks)
        |      |--- cnn_virus
        |      |        |--- notebooks.ipynb

When running on Google Colab, it is assumed that a Google Drive is mounted on the Colab server instance, and that this drive's root includes a shortcut named Metagenomics pointing to the project shared directory. The project shared directory is accessible here if you are an authorized project member.
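Detecting whether the code runs on Colab can be illustrated with a common heuristic: probing for a Colab-only module. A hedged sketch (the function name is hypothetical, and this is not necessarily how ProjectFileSystem implements it):

```python
import importlib.util

def running_on_colab() -> bool:
    # The `google.colab` package only exists on Colab server instances,
    # so probing for it is a common detection heuristic.
    try:
        return importlib.util.find_spec('google.colab') is not None
    except ModuleNotFoundError:
        # the parent `google` namespace package is not installed at all
        return False
```

On a local machine this returns False; on a Colab instance it returns True.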

ProjectFileSystem at work:

If you use this class for the first time on a local computer, read the two Important Notes below.

pfs = ProjectFileSystem()

Once created, the instance of ProjectFileSystem gives access to key directories’ paths:

  • project root: Path to the project root directory
  • data: Path to the data directory
  • nbs: Path to the notebooks directory

It also provides additional information regarding the computer on which the code is running:

  • os: a string providing the name of the operating system the code is running on
  • is_colab: True if the code is running on google colab
  • is_kaggle: True if the code is running on kaggle server (NOT IMPLEMENTED YET)
  • is_local: True if the code is running on a computer registered as local
for p in [pfs.project_root, pfs.data, pfs.nbs]:
    print(p)
/home/vtec/projects/bio/metagentorch
/home/vtec/projects/bio/metagentorch/data
/home/vtec/projects/bio/metagentorch/nbs
print(f"Operating System: {pfs.os}")
print(f"Local Computer: {pfs.is_local}, Colab: {pfs.is_colab}, Kaggle: {pfs.is_kaggle}")
Operating System: linux
Local Computer: True, Colab: False, Kaggle: False

source

ProjectFileSystem.info

 ProjectFileSystem.info ()

Print basic info on the file system and the device

pfs.info()
Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentorch 
 - Data Dir .... /home/vtec/projects/bio/metagentorch/data 
 - Notebooks ... /home/vtec/projects/bio/metagentorch/nbs

source

ProjectFileSystem.readme

 ProjectFileSystem.readme (dir_path:pathlib.Path|None=None)

*Display readme.md file or any other .md file in dir_path.

This provides a convenient way to get information on each directory's content*

  • dir_path (pathlib.Path | None, default None): Path to the directory to inquire. If None, display the readme file from project_root.
  • Returns: None
pfs.readme(Path('data_dev'))

ReadMe file for directory /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev:


Data directory for this package development

This directory includes all data required to validate and test this package code.

data_dev
 |--- CNN_Virus_data
 |     |--- 50mer_ds_100_seq
 |     |--- 150mer_ds_100_seq
 |     |--- train_short
 |     |--- val_short
 |     |--- weight_of_classes
 |--- ncbi
 |     |--- infer_results
 |     |     |--- cnn_virus
 |     |     |--- csv
 |     |     |--- xlsx
 |     |     |--- testdb.db
 |     |--- refsequences
 |     |     |--- cov
 |     |     |     |--cov_virus_sequence_one_metadata.json
 |     |     |     |--sequences_two_no_matching_rule.fa
 |     |     |     |--another_sequence.fa
 |     |     |     |--cov_virus_sequences_two.fa
 |     |     |     |--cov_virus_sequences_two_metadata.json
 |     |     |     |--cov_virus_sequence_one.fa
 |     |     |     |--single_1seq_150bp
 |     |     |     |    |--single_1seq_150bp.fq
 |     |     |     |    |--single_1seq_150bp.aln
 |     |     |     |--paired_1seq_150bp
 |     |     |     |    |--paired_1seq_150bp2.aln
 |     |     |     |    |--paired_1seq_150bp2.fq
 |     |     |     |    |--paired_1seq_150bp1.fq 
 |     |     |     |    |--paired_1seq_150bp1.aln 
 |     |--- simreads
 |     |     |--- cov
 |     |     |     |--- paired_1seq_50bp
 |     |     |     |      |--- paired_1seq_50bp_1.aln
 |     |     |     |      |--- paired_1seq_50bp_1.fq
 |     |     |     |--- single_1seq_50bp
 |     |     |     |      |--- single_1seq_50bp_1.aln
 |     |     |     |      |--- single_1seq_50bp_1.fq
 |     |     |--- cov
 |     |     |     |--single_1seq_50bp
 |     |     |     |    |--single_1seq_50bp.aln
 |     |     |     |    |--single_1seq_50bp.fq
 |     |     |     |--single_1seq_150bp
 |     |     |     |    |--single_1seq_150bp.fq
 |     |     |     |    |--single_1seq_150bp.aln
 |     |     |     |--paired_1seq_150bp
 |     |     |     |    |--paired_1seq_150bp2.aln
 |     |     |     |    |--paired_1seq_150bp2.fq
 |     |     |     |    |--paired_1seq_150bp1.fq
 |     |     |     |    |--paired_1seq_150bp1.aln
 |--- saved           
 |--- readme.md               

Important Note 1:

When using the package on a local computer for the first time, you must register the computer as a local computer. Otherwise, ProjectFileSystem will raise an error. Once registered, the configuration file will be updated and ProjectFileSystem will detect that and run without error.


source

ProjectFileSystem.register_as_local

 ProjectFileSystem.register_as_local ()

Update the configuration file to register the machine as a local machine

cfg = pfs.register_as_local()

Important Note 2:

When using the package on a local computer for the first time, it is also required to set the project root directory. This allows users to locate their local project folder anywhere they want. Once set, the path to the project root is saved in the configuration file.


source

ProjectFileSystem.set_project_root

 ProjectFileSystem.set_project_root (p2project:str|pathlib.Path,
                                     data_dir:str='data')

Update the configuration file to set the project root

  • p2project (str | pathlib.Path): string or Path to the project directory; can be absolute or relative to home.
  • data_dir (str, default 'data'): directory name for data under the project root.
  • Returns: ConfigParser
pfs.set_project_root('/home/vtec/projects/bio/metagentorch/');
Project Root set to:   /home/vtec/projects/bio/metagentorch
Data directory set to: /home/vtec/projects/bio/metagentorch/data

source

ProjectFileSystem.read_config

 ProjectFileSystem.read_config ()

Read the config from the configuration file if it exists; return an empty config if it does not

cfg = pfs.read_config()
cfg['Infra']['registered_as_local']
'True'
cfg['Infra']['project_root']
'/home/vtec/projects/bio/metagentorch'
cfg['Infra']['data_dir']
'data'

Technical Note for Developers

The current notebook and all other development notebooks use a minimal set of data that comes with the repository under nbs-dev/data_dev, instead of the standard data directory, which is much too large for testing and development.

Therefore, when creating the instance of ProjectFileSystem, use the parameter config_fname to pass a specific development configuration, also included in the repository.

p2dev_cfg = PACKAGE_ROOT / 'nbs-dev/metagentorch-dev.cfg'
pfs = ProjectFileSystem(config_fname=p2dev_cfg)
pfs.info()
Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
 - Root ........ /home/vtec/projects/bio/metagentorch 
 - Data Dir .... /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev 
 - Notebooks ... /home/vtec/projects/bio/metagentorch/nbs

SQlite Database Helper Class


source

SqliteDatabase

 SqliteDatabase (p2db:pathlib.Path)

*Manage a SQLite db file, execute SQL queries, return results, provide context manager functionality.

Example usage as a context manager

db_path = Path('your_database.db')
db = SqliteDatabase(db_path)

with db as database:
    result = database.get_result("SELECT * FROM your_table")
    print(result)*

p2db = pfs.data / 'ncbi/infer_results/cov-ncbi/testdb.db'
db = SqliteDatabase(p2db)

db.print_schema()
predictions (table)
 columns: id, readid, refseqid, refsource, refseq_strand, taxonomyid, lbl_true, lbl_pred, pos_true, pos_pred, top_5_lbl_pred_0, top_5_lbl_pred_1, top_5_lbl_pred_2, top_5_lbl_pred_3, top_5_lbl_pred_4
 index: idx_preds
   indexed columns: readid, refseqid, pos_true

label_probabilities (table)
 columns: id, read_kmer_id, read_50mer_nb, prob_000, prob_001, prob_002, prob_003, prob_004, prob_005, prob_006, prob_007, prob_008, prob_009, prob_010, prob_011, prob_012, prob_013, prob_014, prob_015, prob_016, prob_017, prob_018, prob_019, prob_020, prob_021, prob_022, prob_023, prob_024, prob_025, prob_026, prob_027, prob_028, prob_029, prob_030, prob_031, prob_032, prob_033, prob_034, prob_035, prob_036, prob_037, prob_038, prob_039, prob_040, prob_041, prob_042, prob_043, prob_044, prob_045, prob_046, prob_047, prob_048, prob_049, prob_050, prob_051, prob_052, prob_053, prob_054, prob_055, prob_056, prob_057, prob_058, prob_059, prob_060, prob_061, prob_062, prob_063, prob_064, prob_065, prob_066, prob_067, prob_068, prob_069, prob_070, prob_071, prob_072, prob_073, prob_074, prob_075, prob_076, prob_077, prob_078, prob_079, prob_080, prob_081, prob_082, prob_083, prob_084, prob_085, prob_086, prob_087, prob_088, prob_089, prob_090, prob_091, prob_092, prob_093, prob_094, prob_095, prob_096, prob_097, prob_098, prob_099, prob_100, prob_101, prob_102, prob_103, prob_104, prob_105, prob_106, prob_107, prob_108, prob_109, prob_110, prob_111, prob_112, prob_113, prob_114, prob_115, prob_116, prob_117, prob_118, prob_119, prob_120, prob_121, prob_122, prob_123, prob_124, prob_125, prob_126, prob_127, prob_128, prob_129, prob_130, prob_131, prob_132, prob_133, prob_134, prob_135, prob_136, prob_137, prob_138, prob_139, prob_140, prob_141, prob_142, prob_143, prob_144, prob_145, prob_146, prob_147, prob_148, prob_149, prob_150, prob_151, prob_152, prob_153, prob_154, prob_155, prob_156, prob_157, prob_158, prob_159, prob_160, prob_161, prob_162, prob_163, prob_164, prob_165, prob_166, prob_167, prob_168, prob_169, prob_170, prob_171, prob_172, prob_173, prob_174, prob_175, prob_176, prob_177, prob_178, prob_179, prob_180, prob_181, prob_182, prob_183, prob_184, prob_185, prob_186
 index: idx_probs
   indexed columns: read_kmer_id, read_50mer_nb

preds_probs (view)
 columns: refseqid,lbl_true,lbl_pred,pos_true,pos_pred,top_5_lbl_pred_0,top_5_lbl_pred_1,top_5_lbl_pred_2,top_5_lbl_pred_3,top_5_lbl_pred_4,top_5_lbl_pred_0:1,top_5_lbl_pred_1:1,top_5_lbl_pred_2:1,top_5_lbl_pred_3:1,top_5_lbl_pred_4:1,top_5_lbl_pred_0:2,top_5_lbl_pred_1:2,top_5_lbl_pred_2:2,top_5_lbl_pred_3:2,top_5_lbl_pred_4:2,top_5_lbl_pred_0:3,top_5_lbl_pred_1:3,top_5_lbl_pred_2:3,top_5_lbl_pred_3:3,top_5_lbl_pred_4:3,top_5_lbl_pred_0:4,top_5_lbl_pred_1:4,top_5_lbl_pred_2:4,top_5_lbl_pred_3:4,top_5_lbl_pred_4:4,read_kmer_id,read_50mer_nb,prob_000,prob_001,prob_002,prob_003,prob_004,prob_005,prob_006,prob_007,prob_008,prob_009,prob_010,prob_011,prob_012,prob_013,prob_014,prob_015,prob_016,prob_017,prob_018,prob_019,prob_020,prob_021,prob_022,prob_023,prob_024,prob_025,prob_026,prob_027,prob_028,prob_029,prob_030,prob_031,prob_032,prob_033,prob_034,prob_035,prob_036,prob_037,prob_038,prob_039,prob_040,prob_041,prob_042,prob_043,prob_044,prob_045,prob_046,prob_047,prob_048,prob_049,prob_050,prob_051,prob_052,prob_053,prob_054,prob_055,prob_056,prob_057,prob_058,prob_059,prob_060,prob_061,prob_062,prob_063,prob_064,prob_065,prob_066,prob_067,prob_068,prob_069,prob_070,prob_071,prob_072,prob_073,prob_074,prob_075,prob_076,prob_077,prob_078,prob_079,prob_080,prob_081,prob_082,prob_083,prob_084,prob_085,prob_086,prob_087,prob_088,prob_089,prob_090,prob_091,prob_092,prob_093,prob_094,prob_095,prob_096,prob_097,prob_098,prob_099,prob_100,prob_101,prob_102,prob_103,prob_104,prob_105,prob_106,prob_107,prob_108,prob_109,prob_110,prob_111,prob_112,prob_113,prob_114,prob_115,prob_116,prob_117,prob_118,prob_119,prob_120,prob_121,prob_122,prob_123,prob_124,prob_125,prob_126,prob_127,prob_128,prob_129,prob_130,prob_131,prob_132,prob_133,prob_134,prob_135,prob_136,prob_137,prob_138,prob_139,prob_140,prob_141,prob_142,prob_143,prob_144,prob_145,prob_146,prob_147,prob_148,prob_149,prob_150,prob_151,prob_152,prob_153,prob_154,prob_155,prob_156,prob_157,prob_158,prob_159,prob_160,prob
_161,prob_162,prob_163,prob_164,prob_165,prob_166,prob_167,prob_168,prob_169,prob_170,prob_171,prob_172,prob_173,prob_174,prob_175,prob_176,prob_177,prob_178,prob_179,prob_180,prob_181,prob_182,prob_183,prob_184,prob_185,prob_186
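The context-manager behaviour described in the docstring above can be sketched with the standard sqlite3 module (an illustrative sketch, not the package's actual implementation):

```python
import sqlite3

class SqliteDatabaseSketch:
    # Illustrative sketch only: open the connection on entry, close on exit.
    def __init__(self, p2db):
        self.p2db = p2db
        self.conn = None

    def __enter__(self):
        self.conn = sqlite3.connect(self.p2db)
        return self

    def __exit__(self, exc_type, exc, tb):
        self.conn.close()
        return False  # never suppress exceptions

    def get_result(self, sql):
        return self.conn.execute(sql).fetchall()

with SqliteDatabaseSketch(':memory:') as db_sketch:
    result = db_sketch.get_result("SELECT 1 AS one")
```

Opening the connection in `__enter__` and closing it in `__exit__` guarantees the database file is released even if a query raises.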

Other utility classes


source

JsonDict

 JsonDict (p2json:str|pathlib.Path, dictionary:dict|None=None)

*Dictionary whose current value is mirrored in a json file, and which can be initiated from a json file.

JsonDict requires a path to a json file at creation. An optional dict can be passed as argument.

Behavior at creation:

  • JsonDict(p2json, dict) will create a JsonDict with key-values from dict, and mirrored in p2json
  • JsonDict(p2json) will create a JsonDict with empty dictionary and load json content if file exists

Once created, JsonDict instances behave exactly as a dictionary*

  • p2json (str | pathlib.Path): path to the json file to mirror with the dictionary.
  • dictionary (dict | None, default None): optional dictionary to initialize the JsonDict.

Create a new dictionary mirrored to a JSON file:

d = {'a': 1, 'b': 2, 'c': 3}
p2json = pfs.data / 'jsondict-test.json'
jsondict = JsonDict(p2json, d)
jsondict
dict mirrored in /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/jsondict-test.json
{'a': 1, 'b': 2, 'c': 3}

Once created, the JsonDict instance behaves exactly like a dictionary, with the added benefit that any change to the dictionary is automatically saved to the JSON file.

jsondict['a'], jsondict['b'], jsondict['c']
(1, 2, 3)
for k, v in jsondict.items():
    print(f"key: {k}; value: {v}")
key: a; value: 1
key: b; value: 2
key: c; value: 3

Adding or removing a value works in the same way as for a normal dictionary, but the json file is automatically updated.

jsondict['d'] = 4
jsondict
dict mirrored in /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/jsondict-test.json
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
with open(p2json, 'r') as fp:
    print(fp.read())
{
    "a": 1,
    "b": 2,
    "c": 3,
    "d": 4
}
del jsondict['a']
jsondict
dict mirrored in /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/jsondict-test.json
{'b': 2, 'c': 3, 'd': 4}
with open(p2json, 'r') as fp:
    print(fp.read())
{
    "b": 2,
    "c": 3,
    "d": 4
}
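The mirroring behaviour demonstrated above can be sketched as a small dict subclass that rewrites its JSON file on every mutation (illustrative only, with a hypothetical class name, not the package's implementation):

```python
import json
from pathlib import Path

class MirroredDict(dict):
    # Illustrative sketch: rewrite the JSON file after every mutation.
    def __init__(self, p2json, dictionary=None):
        super().__init__(dictionary or {})
        self.p2json = Path(p2json)
        if dictionary is None and self.p2json.exists():
            self.update(json.loads(self.p2json.read_text()))
        self._dump()

    def _dump(self):
        self.p2json.write_text(json.dumps(self, indent=4))

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._dump()

    def __delitem__(self, key):
        super().__delitem__(key)
        self._dump()
```

Overriding `__setitem__` and `__delitem__` is what keeps the file in sync without any explicit save call.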

source

JsonFileReader

 JsonFileReader (path:str|pathlib.Path)

Mirror a JSON file and a dictionary

  • path (str | pathlib.Path): path to the json file.
jd = JsonFileReader(pfs.data / 'test.json')
pprint(jd.d)
{'item 1': {'keys': 'key key key key', 'pattern': 'pattern 1'},
 'item 2': {'keys': 'key key key key', 'pattern': 'pattern 2'},
 'item 3': {'keys': 'key key key key', 'pattern': 'pattern 3'}}

Now we can add an item to the dictionary/json

new_item = {'keys': 'key key key key', 'pattern': 'another pattern'}
jd.add_item(key='another item', item=new_item)
{'item 1': {'keys': 'key key key key', 'pattern': 'pattern 1'},
 'item 2': {'keys': 'key key key key', 'pattern': 'pattern 2'},
 'item 3': {'keys': 'key key key key', 'pattern': 'pattern 3'},
 'another item': {'keys': 'key key key key', 'pattern': 'another pattern'}}

After saving the updated JSON file, we can load it again and see the changes.

jd.save_to_file()
jd = JsonFileReader(pfs.data / 'test.json')
pprint(jd.d)
{'another item': {'keys': 'key key key key', 'pattern': 'another pattern'},
 'item 1': {'keys': 'key key key key', 'pattern': 'pattern 1'},
 'item 2': {'keys': 'key key key key', 'pattern': 'pattern 2'},
 'item 3': {'keys': 'key key key key', 'pattern': 'pattern 3'}}

Other utility functions


source

list_available_devices

 list_available_devices ()
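list_available_devices reports the compute devices visible to the package. Its implementation is not shown here; a hedged sketch of such a helper, assuming a torch-based package with a CPU fallback (the function name and logic below are illustrative):

```python
def list_devices_sketch():
    # Illustrative sketch (hypothetical helper, not the package's code):
    # CPU is always available; accelerator devices are appended when torch
    # is installed and reports them.
    devices = ['cpu']
    try:
        import torch
        devices += [f'cuda:{i}' for i in range(torch.cuda.device_count())]
        mps = getattr(torch.backends, 'mps', None)  # Apple silicon backend
        if mps is not None and mps.is_available():
            devices.append('mps')
    except ImportError:
        pass
    return devices
```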

Base Classes

File Readers

Base classes to be extended in order to create readers for specific file formats.


source

TextFileBaseReader

 TextFileBaseReader (path:str|pathlib.Path, nlines:int=1)

*Iterator going through a text file by chunks of nlines lines. The iterator can be reset to the file start.

The class is mainly intended to be extended, as is done for handling sequence files of various formats such as FastaFileReader.*

  • path (str | pathlib.Path): path to the file.
  • nlines (int, default 1): number of lines in one chunk.

Once initialized, the iterator runs over each chunk of line(s) in the text file, sequentially.

pfs.data
Path('/home/vtec/projects/bio/metagentorch/nbs-dev/data_dev')
p2textfile = pfs.data / 'CNN_Virus_data/train_short'
it = TextFileBaseReader(path=p2textfile, nlines=3)

one_iteration = next(it)

print(one_iteration)
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC  76  0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG  4   9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA  181 0

Let’s create a new instance of the file reader, and get several iterations.

reader = TextFileBaseReader(path=p2textfile, nlines=3)

one_iteration = next(reader)
print(one_iteration)
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC  76  0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG  4   9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA  181 0
another_iteration = next(reader)
print(another_iteration)
one_more_iteration = next(reader)
print(one_more_iteration)
GGAGCGGAGCCAACCCCTATGCTCACTTGCAACCCAAGGGGCGTTCCAGT  74  3
TGGATCCTGCGCGGGACGTCCTTTGTCTACGTCCCGTCGGCGCATCCCGC  60  3
GAGAGACTTACTAAAAAGCTGGCACTTACCATCAGTGTTTCACCTACATG  44  0

ACACACGACACTAGAGATAATGTGTCAGTGGATTATAAACAAACCAAGTT  43  7
TTGTAGCATAAGAACTGGTCTTCGCTGAAATTCTTGTCTTGATCTCATCT  35  2
TGGCCCTGCGGTCTGGGGCCCAGAAGCATATGTCAAGTCCTTTGAGAAGT  73  4

If we want to access the start of the file again, we need to re-initialize the file handle.


source

TextFileBaseReader.reset_iterator

 TextFileBaseReader.reset_iterator ()

Reset the iterator to point to the first line in the file.

reader.reset_iterator()
one_iteration = next(reader)
print(one_iteration)
another_iteration = next(reader)
print(another_iteration)
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC  76  0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG  4   9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA  181 0

GGAGCGGAGCCAACCCCTATGCTCACTTGCAACCCAAGGGGCGTTCCAGT  74  3
TGGATCCTGCGCGGGACGTCCTTTGTCTACGTCCCGTCGGCGCATCCCGC  60  3
GAGAGACTTACTAAAAAGCTGGCACTTACCATCAGTGTTTCACCTACATG  44  0

source

TextFileBaseReader.print_first_chunks

 TextFileBaseReader.print_first_chunks (nchunks:int=3)

*Print the first nchunks chunks of text from the file.

After printing, the iterator is reset again to its first line.*

  • nchunks (int, default 3): number of chunks to print.
  • Returns: None
reader = TextFileBaseReader(path=p2textfile, nlines=3)

reader.print_first_chunks(nchunks=3)
3-line chunk 1
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC  76  0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG  4   9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA  181 0

3-line chunk 2
GGAGCGGAGCCAACCCCTATGCTCACTTGCAACCCAAGGGGCGTTCCAGT  74  3
TGGATCCTGCGCGGGACGTCCTTTGTCTACGTCCCGTCGGCGCATCCCGC  60  3
GAGAGACTTACTAAAAAGCTGGCACTTACCATCAGTGTTTCACCTACATG  44  0

3-line chunk 3
ACACACGACACTAGAGATAATGTGTCAGTGGATTATAAACAAACCAAGTT  43  7
TTGTAGCATAAGAACTGGTCTTCGCTGAAATTCTTGTCTTGATCTCATCT  35  2
TGGCCCTGCGGTCTGGGGCCCAGAAGCATATGTCAAGTCCTTTGAGAAGT  73  4

source

TextFileBaseReader.parse_text

 TextFileBaseReader.parse_text (txt:str, pattern:str|None=None)

Parse text using regex pattern with groups. Return a metadata dictionary.

  • txt (str): text to parse.
  • pattern (str | None, default None): if None, uses the standard regex pattern to extract metadata; otherwise, uses the passed regex.
  • Returns: dict — parsed metadata in key/value format
text = '>2591237:ncbi:1'
pattern = r"^>(?P<id>\d+):(?P<source>ncbi):(?P<nb>\d*)"

reader.parse_text(text, pattern)
{'id': '2591237', 'source': 'ncbi', 'nb': '1'}

Extending the base class

TextFileBaseReader is a base class, intended to be extended into specific file format readers.

The following methods will typically be extended to match specific data file and structured text file formats:

  • __next__, to customize how the iterator parses files into “elements”. For instance, in a FASTA file, one element consists of two lines: a “definition line” and the sequence itself. Extending TextFileBaseReader makes it possible to read pairs of lines sequentially and return each element as a dictionary. For instance, FastaFileReader iterates over each pair of lines in a Fasta file and returns each pair as a dictionary as follows:
    {
    'definition line': '>2591237:ncbi:1 [MK211378]\t2591237\tncbi\t1 [MK211378] '
                       '2591237\tCoronavirus BtRs-BetaCoV/YN2018D\t\tscientific '
                       'name\n',
    'sequence':        'TATTAGGTTTTCTACCTACCCAGGA'
    }
  • Methods for parsing metadata from the file. For instance, the parse_file method defines how the reader iterates over the full file and returns a dictionary for the entire file.
  • Extended classes will also define specific attributes (text_to_parse_key, re_pattern, re_keys, …)
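A minimal sketch of such an extension, with a toy two-lines-per-element reader in the spirit of FastaFileReader (class names and details are illustrative, not the package's code):

```python
import tempfile
from pathlib import Path

class MiniTextReader:
    # Minimal stand-in for TextFileBaseReader: iterate by chunks of nlines.
    def __init__(self, path, nlines=1):
        self.nlines = nlines
        self.fh = open(Path(path))

    def __iter__(self):
        return self

    def __next__(self):
        lines = [self.fh.readline() for _ in range(self.nlines)]
        if lines[0] == '':
            self.fh.close()
            raise StopIteration
        return ''.join(lines)

class MiniFastaReader(MiniTextReader):
    # Extension: one element = two lines (definition line + sequence),
    # returned as a dictionary instead of a raw string.
    def __init__(self, path):
        super().__init__(path, nlines=2)

    def __next__(self):
        definition, sequence = super().__next__().splitlines()
        return {'definition line': definition, 'sequence': sequence}

fasta = ">2591237:ncbi:1\nTATTAGGTTTTCTACCTACCCAGGA\n"
with tempfile.NamedTemporaryFile('w', suffix='.fa', delete=False) as f:
    f.write(fasta)
element = next(MiniFastaReader(f.name))
```

Only `__next__` changes between the two classes: the chunking machinery stays in the base class, and the subclass decides how a chunk maps to an element.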

source

TextFileBaseReader.set_parsing_rules

 TextFileBaseReader.set_parsing_rules (pattern:str|None=None,
                                       verbose:bool=False)

*Set the standard regex parsing rule for the file.

Rules can be set:

  1. manually, by passing specific custom values for pattern and keys
  2. automatically, by testing all parsing rules saved in parsing_rule.json

Automatic selection of parsing rules works by testing each rule saved in parsing_rule.json on the first definition line of the fasta file, and selecting the rule that generates the most metadata matches.

Rules consist of two parameters:

  • The regex pattern, including one group for each metadata item, e.g. (?P<group_name>regex_code)
  • The list of keys, i.e. the names of the regex groups, used as keys in the metadata dictionary

This method updates the three following class attributes: re_rule_name, re_pattern, re_keys*

  • pattern (str | None, default None): regex pattern used to parse the text; when None, searched in the parsing rules json.
  • verbose (bool, default False): when True, provides information on each rule.
  • Returns: None

Important Note for Developers

Method set_parsing_rules exists to allow TextFileBaseReader’s descendant classes to automatically select a parsing rule, by applying rules saved in a json file to a string extracted from the first element in the file.

It assumes that the iterator returns its elements as dictionaries {section_name:section, ...} and not as a pure string. The key self.text_to_parse_key is then used to extract the text to parse when testing the rules. The base class iterator returns a simple string, and self.text_to_parse_key is set to None.

To enable setting a default parsing rule for the reader instance, the iterator must return a dictionary and self.text_to_parse_key must be set to the key in the dictionary corresponding to the text to parse.

See the implementation in FastaFileReader.

Calling set_parsing_rules on a class that does not satisfy these requirements will do nothing and return a warning.

reader.set_parsing_rules()
/tmp/ipykernel_8720/3621596767.py:133: UserWarning: 
            `text_to_parse_key` is not defined in this class. 
            It is not possible to set a parsing rule. Must be define, e.g. 'definition line'
            
  warnings.warn(msg, category=UserWarning)
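The automatic selection described above (try every saved rule, keep the one producing the most metadata matches) can be sketched as follows, with a hypothetical rules dictionary standing in for the content of parsing_rule.json:

```python
import re

def select_best_rule(text, rules):
    # Try every rule on `text` and keep the one whose named groups
    # produce the most non-empty metadata matches.
    best_name, best_meta = None, {}
    for name, pattern in rules.items():
        m = re.search(pattern, text)
        meta = {k: v for k, v in m.groupdict().items() if v} if m else {}
        if len(meta) > len(best_meta):
            best_name, best_meta = name, meta
    return best_name, best_meta

# hypothetical rules, standing in for those saved in parsing_rule.json
rules = {
    'ncbi': r"^>(?P<id>\d+):(?P<source>ncbi):(?P<nb>\d*)",
    'plain': r"^>(?P<id>\S+)",
}
name, meta = select_best_rule('>2591237:ncbi:1', rules)
```

Here the 'ncbi' rule extracts three metadata items against one for 'plain', so it wins.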

Deprecated Items

When any of the following classes or functions is called, it raises an exception with an error message indicating how to handle the required code refactoring.

Example:

DeprecationWarning                        Traceback (most recent call last)
Input In [140], in <cell line: 1>()
----> 1 TextFileBaseIterator(p2textfile)

Input In [139], in TextFileBaseIterator.__init__(self, *args, **kwargs)
      4 def __init__(self, *args, **kwargs):
      5     msg = """
      6     `TextFileBaseIterator` is deprecated. 
      7     Use `TextFileBaseReader` instead, with same capabilities and more."""
----> 8     raise DeprecationWarning(msg)

DeprecationWarning: 
        `TextFileBaseIterator` is deprecated. 
        Use `TextFileBaseReader` instead, with same capabilities and more.

source

TextFileBaseIterator

 TextFileBaseIterator (*args, **kwargs)

TextFileBaseIterator is a deprecated class, to be replaced by TextFileBaseReader