core
This module includes all base classes, functions and other objects that are used across the package. It is imported by all other modules in the package.
core includes utility classes and functions to make it easier to work with the complex file systems adopted for the project, as well as base classes such as a file reader with additional functionality.
Utility Classes and Functions
Handling files and file structure
Utility classes to represent and manage the project file system
ProjectFileSystem
ProjectFileSystem (*args, **kwargs)
*Represent a project file system, return paths to key directories, provide methods to manage the file system.
- Paths to key directories are based on whether the code is running locally or in the cloud.
- First time it is used on a local computer, it must be registered as local and a project root path must be set.
- A user configuration file is created in the user’s home directory to store the project root path and whether the machine is local or not.
Technical note:
`ProjectFileSystem` is a singleton class*
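The singleton behavior mentioned in the note can be sketched as follows; `SingletonSketch` is a hypothetical illustration of the pattern, not the package's actual implementation:

```python
class SingletonSketch:
    """Hypothetical illustration of the singleton pattern: every
    instantiation returns the same shared object."""
    _instance = None

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a = SingletonSketch()
b = SingletonSketch()
print(a is b)  # True
```

Because of this, creating `ProjectFileSystem()` anywhere in the code base always returns the same instance, so configuration is loaded once and shared.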
Reference Project File System:
This project adopts a unified file structure to make coding and collaboration easier. In addition, we can run the code locally (from a project-root directory) or in the cloud (colab, kaggle, others).
The unified file structure when running locally is:
project-root
|--- data
| |--- CNN_Virus_data (all data from CNN Virus original paper)
| |--- saved (trained and finetuned models, saved preprocessed datasets)
| |--- .... (raw or pre-processed data from various sources, results, ... )
|
|--- nbs (all reference and work notebooks)
| |--- cnn_virus
| | |--- notebooks.ipynb
When running on Google Colab, it is assumed that a Google Drive is mounted on the Colab server instance, and that this Google Drive root includes a shortcut named Metagenomics pointing to the project shared directory. The project shared directory is accessible here if you are an authorized project member.
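One common way to detect a Colab runtime is to check whether the `google.colab` module is importable. This is a hedged sketch; `running_on_colab` is a hypothetical helper, and the actual detection logic inside `ProjectFileSystem` may differ:

```python
import importlib.util

def running_on_colab() -> bool:
    # Hypothetical check: the `google.colab` module only exists on Colab servers.
    # The actual detection logic inside ProjectFileSystem may differ.
    return importlib.util.find_spec("google.colab") is not None

print(f"Colab: {running_on_colab()}")
```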
ProjectFileSystem at work:
If you use this class for the first time on a local computer, read the two Important Notes below.
Once created, the instance of ProjectFileSystem gives access to key directories’ paths:
- `project_root`: `Path` to the project root directory
- `data`: `Path` to the data directory
- `nbs`: `Path` to the notebooks directory
It also provides additional information regarding the computer on which the code is running:
- `os`: a string providing the name of the operating system the code is running on
- `is_colab`: `True` if the code is running on Google Colab
- `is_kaggle`: `True` if the code is running on a Kaggle server (NOT IMPLEMENTED YET)
- `is_local`: `True` if the code is running on a computer registered as local
```python
pfs = ProjectFileSystem()
for p in [pfs.project_root, pfs.data, pfs.nbs]:
    print(p)
```
/home/vtec/projects/bio/metagentorch
/home/vtec/projects/bio/metagentorch/data
/home/vtec/projects/bio/metagentorch/nbs
```python
print(f"Operating System: {pfs.os}")
print(f"Local Computer: {pfs.is_local}, Colab: {pfs.is_colab}, Kaggle: {pfs.is_kaggle}")
```
Operating System: linux
Local Computer: True, Colab: False, Kaggle: False
ProjectFileSystem.info
ProjectFileSystem.info ()
Print basic info on the file system and the device
```python
pfs.info()
```
Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
- Root ........ /home/vtec/projects/bio/metagentorch
- Data Dir .... /home/vtec/projects/bio/metagentorch/data
- Notebooks ... /home/vtec/projects/bio/metagentorch/nbs
ProjectFileSystem.readme
ProjectFileSystem.readme (dir_path:pathlib.Path|None=None)
*Display readme.md file or any other .md file in dir_path.

This provides a convenient way to get information on each directory's content*
| | Type | Default | Details |
|---|---|---|---|
| dir_path | `pathlib.Path \| None` | `None` | Path to the directory to inquire. If None, display readme file from project_root. |
| **Returns** | **None** | | |
```python
pfs.readme(Path('data_dev'))
```
ReadMe file for directory /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev:
Data directory for this package development
This directory includes all data required to validate and test this package code.
data_dev
|--- CNN_Virus_data
| |--- 50mer_ds_100_seq
| |--- 150mer_ds_100_seq
| |--- train_short
| |--- val_short
| |--- weight_of_classes
|--- ncbi
| |--- infer_results
| | |--- cnn_virus
| | |--- csv
| | |--- xlsx
| | |--- testdb.db
| |--- refsequences
| | |--- cov
| | | |--cov_virus_sequence_one_metadata.json
| | | |--sequences_two_no_matching_rule.fa
| | | |--another_sequence.fa
| | | |--cov_virus_sequences_two.fa
| | | |--cov_virus_sequences_two_metadata.json
| | | |--cov_virus_sequence_one.fa
| | | |--single_1seq_150bp
| | | | |--single_1seq_150bp.fq
| | | | |--single_1seq_150bp.aln
| | | |--paired_1seq_150bp
| | | | |--paired_1seq_150bp2.aln
| | | | |--paired_1seq_150bp2.fq
| | | | |--paired_1seq_150bp1.fq
| | | | |--paired_1seq_150bp1.aln
| |--- simreads
| | |--- cov
| | | |--- paired_1seq_50bp
| | | | |--- paired_1seq_50bp_1.aln
| | | | |--- paired_1seq_50bp_1.fq
| | | |--- single_1seq_50bp
| | | | |--- single_1seq_50bp_1.aln
| | | | |--- single_1seq_50bp_1.fq
| | |--- cov
| | | |--single_1seq_50bp
| | | | |--single_1seq_50bp.aln
| | | | |--single_1seq_50bp.fq
| | | |--single_1seq_150bp
| | | | |--single_1seq_150bp.fq
| | | | |--single_1seq_150bp.aln
| | | |--paired_1seq_150bp
| | | | |--paired_1seq_150bp2.aln
| | | | |--paired_1seq_150bp2.fq
| | | | |--paired_1seq_150bp1.fq
| | | | |--paired_1seq_150bp1.aln
|--- saved
|--- readme.md
Important Note 1:
When using the package on a local computer for the first time, you must register the computer as a local computer. Otherwise, `ProjectFileSystem` will raise an error. Once registered, the configuration file will be updated and `ProjectFileSystem` will detect that and run without error.
ProjectFileSystem.register_as_local
ProjectFileSystem.register_as_local ()
Update the configuration file to register the machine as local machine
```python
cfg = pfs.register_as_local()
```
Important Note 2:
When using the package on a local computer for the first time, it is also required to set the project root directory. This allows users to locate their local project folder anywhere they want. Once set, the path to the project root will be saved in the configuration file.
ProjectFileSystem.set_project_root
ProjectFileSystem.set_project_root (p2project:str|pathlib.Path, data_dir:str='data')
Update the configuration file to set the project root
| | Type | Default | Details |
|---|---|---|---|
| p2project | `str \| pathlib.Path` | | string or Path to the project directory. Can be absolute or relative to home |
| data_dir | `str` | `data` | Directory name for data under project root |
| **Returns** | **ConfigParser** | | |
```python
pfs.set_project_root('/home/vtec/projects/bio/metagentorch/');
```
Project Root set to: /home/vtec/projects/bio/metagentorch
Data directory set to: /home/vtec/projects/bio/metagentorch/data
ProjectFileSystem.read_config
ProjectFileSystem.read_config ()
Read config from the configuration file if it exists and return an empty config if it does not
```python
cfg = pfs.read_config()
cfg['Infra']['registered_as_local']
```
'True'
```python
cfg['Infra']['project_root']
```
'/home/vtec/projects/bio/metagentorch'
```python
cfg['Infra']['data_dir']
```
'data'
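The configuration layout shown above can be handled with the standard library's `configparser`. This sketch assumes an `[Infra]` section with the three keys shown in the outputs above; the file name and exact write/read logic are illustrative, not the package's actual code:

```python
import configparser
import tempfile
from pathlib import Path

# Hypothetical reconstruction of the configuration layout shown above:
# an [Infra] section with registered_as_local, project_root and data_dir.
cfg = configparser.ConfigParser()
cfg["Infra"] = {
    "registered_as_local": "True",
    "project_root": "/home/vtec/projects/bio/metagentorch",
    "data_dir": "data",
}

with tempfile.TemporaryDirectory() as tmp:
    p2cfg = Path(tmp) / "metagentorch.cfg"
    with open(p2cfg, "w") as fp:
        cfg.write(fp)                    # persist the configuration to disk
    cfg2 = configparser.ConfigParser()
    cfg2.read(p2cfg)                     # read it back, as read_config would

print(cfg2["Infra"]["project_root"])  # /home/vtec/projects/bio/metagentorch
```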
Technical Note for Developers
The current notebook and all other development notebooks use a minimal set of data that comes with the repository under nbs-dev/data_dev, instead of the standard data directory, which is much too large for testing and developing.
Therefore, when creating the instance of ProjectFileSystem, use the parameter `config_fname` to pass a specific development configuration, also coming with the repository.
```python
p2dev_cfg = PACKAGE_ROOT / 'nbs-dev/metagentorch-dev.cfg'
pfs = ProjectFileSystem(config_fname=p2dev_cfg)
pfs.info()
```
Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
- Root ........ /home/vtec/projects/bio/metagentorch
- Data Dir .... /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev
- Notebooks ... /home/vtec/projects/bio/metagentorch/nbs
SQLite Database Helper Class
SqliteDatabase
SqliteDatabase (p2db:pathlib.Path)
*Manage a SQLite db file, execute SQL queries, return results, provide context manager functionality.

Example usage as a context manager

```python
db_path = Path('your_database.db')
db = SqliteDatabase(db_path)
with db as database:
    result = database.get_result("SELECT * FROM your_table")
    print(result)
```*
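The context-manager behavior described above can be sketched with the standard `sqlite3` module. `SqliteDbSketch` and its `get_result` method are illustrative stand-ins, not the real `SqliteDatabase` API:

```python
import sqlite3
from pathlib import Path

class SqliteDbSketch:
    """Minimal sketch of a SQLite helper usable as a context manager.
    The real SqliteDatabase offers more (schema printing, etc.)."""

    def __init__(self, p2db):
        self.p2db = Path(p2db)

    def __enter__(self):
        self.conn = sqlite3.connect(self.p2db)
        return self

    def __exit__(self, exc_type, exc, tb):
        self.conn.close()
        return False  # do not suppress exceptions

    def get_result(self, sql, params=()):
        return self.conn.execute(sql, params).fetchall()

# demo on an in-memory database
with SqliteDbSketch(":memory:") as db:
    db.conn.execute("CREATE TABLE t (x INTEGER)")
    db.conn.execute("INSERT INTO t VALUES (1), (2)")
    rows = db.get_result("SELECT x FROM t ORDER BY x")
print(rows)  # [(1,), (2,)]
```

Using a context manager guarantees the connection is closed even if a query raises.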
```python
p2db = pfs.data / 'ncbi/infer_results/cov-ncbi/testdb.db'
db = SqliteDatabase(p2db)
```
```python
db.print_schema()
```
predictions (table)
columns: id, readid, refseqid, refsource, refseq_strand, taxonomyid, lbl_true, lbl_pred, pos_true, pos_pred, top_5_lbl_pred_0, top_5_lbl_pred_1, top_5_lbl_pred_2, top_5_lbl_pred_3, top_5_lbl_pred_4
index: idx_preds
indexed columns: readid, refseqid, pos_true
label_probabilities (table)
columns: id, read_kmer_id, read_50mer_nb, prob_000, prob_001, prob_002, prob_003, prob_004, prob_005, prob_006, prob_007, prob_008, prob_009, prob_010, prob_011, prob_012, prob_013, prob_014, prob_015, prob_016, prob_017, prob_018, prob_019, prob_020, prob_021, prob_022, prob_023, prob_024, prob_025, prob_026, prob_027, prob_028, prob_029, prob_030, prob_031, prob_032, prob_033, prob_034, prob_035, prob_036, prob_037, prob_038, prob_039, prob_040, prob_041, prob_042, prob_043, prob_044, prob_045, prob_046, prob_047, prob_048, prob_049, prob_050, prob_051, prob_052, prob_053, prob_054, prob_055, prob_056, prob_057, prob_058, prob_059, prob_060, prob_061, prob_062, prob_063, prob_064, prob_065, prob_066, prob_067, prob_068, prob_069, prob_070, prob_071, prob_072, prob_073, prob_074, prob_075, prob_076, prob_077, prob_078, prob_079, prob_080, prob_081, prob_082, prob_083, prob_084, prob_085, prob_086, prob_087, prob_088, prob_089, prob_090, prob_091, prob_092, prob_093, prob_094, prob_095, prob_096, prob_097, prob_098, prob_099, prob_100, prob_101, prob_102, prob_103, prob_104, prob_105, prob_106, prob_107, prob_108, prob_109, prob_110, prob_111, prob_112, prob_113, prob_114, prob_115, prob_116, prob_117, prob_118, prob_119, prob_120, prob_121, prob_122, prob_123, prob_124, prob_125, prob_126, prob_127, prob_128, prob_129, prob_130, prob_131, prob_132, prob_133, prob_134, prob_135, prob_136, prob_137, prob_138, prob_139, prob_140, prob_141, prob_142, prob_143, prob_144, prob_145, prob_146, prob_147, prob_148, prob_149, prob_150, prob_151, prob_152, prob_153, prob_154, prob_155, prob_156, prob_157, prob_158, prob_159, prob_160, prob_161, prob_162, prob_163, prob_164, prob_165, prob_166, prob_167, prob_168, prob_169, prob_170, prob_171, prob_172, prob_173, prob_174, prob_175, prob_176, prob_177, prob_178, prob_179, prob_180, prob_181, prob_182, prob_183, prob_184, prob_185, prob_186
index: idx_probs
indexed columns: read_kmer_id, read_50mer_nb
preds_probs (view)
columns: refseqid,lbl_true,lbl_pred,pos_true,pos_pred,top_5_lbl_pred_0,top_5_lbl_pred_1,top_5_lbl_pred_2,top_5_lbl_pred_3,top_5_lbl_pred_4,top_5_lbl_pred_0:1,top_5_lbl_pred_1:1,top_5_lbl_pred_2:1,top_5_lbl_pred_3:1,top_5_lbl_pred_4:1,top_5_lbl_pred_0:2,top_5_lbl_pred_1:2,top_5_lbl_pred_2:2,top_5_lbl_pred_3:2,top_5_lbl_pred_4:2,top_5_lbl_pred_0:3,top_5_lbl_pred_1:3,top_5_lbl_pred_2:3,top_5_lbl_pred_3:3,top_5_lbl_pred_4:3,top_5_lbl_pred_0:4,top_5_lbl_pred_1:4,top_5_lbl_pred_2:4,top_5_lbl_pred_3:4,top_5_lbl_pred_4:4,read_kmer_id,read_50mer_nb,prob_000,prob_001,prob_002,prob_003,prob_004,prob_005,prob_006,prob_007,prob_008,prob_009,prob_010,prob_011,prob_012,prob_013,prob_014,prob_015,prob_016,prob_017,prob_018,prob_019,prob_020,prob_021,prob_022,prob_023,prob_024,prob_025,prob_026,prob_027,prob_028,prob_029,prob_030,prob_031,prob_032,prob_033,prob_034,prob_035,prob_036,prob_037,prob_038,prob_039,prob_040,prob_041,prob_042,prob_043,prob_044,prob_045,prob_046,prob_047,prob_048,prob_049,prob_050,prob_051,prob_052,prob_053,prob_054,prob_055,prob_056,prob_057,prob_058,prob_059,prob_060,prob_061,prob_062,prob_063,prob_064,prob_065,prob_066,prob_067,prob_068,prob_069,prob_070,prob_071,prob_072,prob_073,prob_074,prob_075,prob_076,prob_077,prob_078,prob_079,prob_080,prob_081,prob_082,prob_083,prob_084,prob_085,prob_086,prob_087,prob_088,prob_089,prob_090,prob_091,prob_092,prob_093,prob_094,prob_095,prob_096,prob_097,prob_098,prob_099,prob_100,prob_101,prob_102,prob_103,prob_104,prob_105,prob_106,prob_107,prob_108,prob_109,prob_110,prob_111,prob_112,prob_113,prob_114,prob_115,prob_116,prob_117,prob_118,prob_119,prob_120,prob_121,prob_122,prob_123,prob_124,prob_125,prob_126,prob_127,prob_128,prob_129,prob_130,prob_131,prob_132,prob_133,prob_134,prob_135,prob_136,prob_137,prob_138,prob_139,prob_140,prob_141,prob_142,prob_143,prob_144,prob_145,prob_146,prob_147,prob_148,prob_149,prob_150,prob_151,prob_152,prob_153,prob_154,prob_155,prob_156,prob_157,prob_158,prob_159,prob_160,prob_161,prob_162,prob_163,prob_164,prob_165,prob_166,prob_167,prob_168,prob_169,prob_170,prob_171,prob_172,prob_173,prob_174,prob_175,prob_176,prob_177,prob_178,prob_179,prob_180,prob_181,prob_182,prob_183,prob_184,prob_185,prob_186
Other utility classes
JsonDict
JsonDict (p2json:str|pathlib.Path, dictionary:dict|None=None)
*Dictionary whose current value is mirrored in a json file and can be initiated from a json file
JsonDict requires a path to json file at creation. An optional dict can be passed as argument.
Behavior at creation:
- `JsonDict(p2json, dict)` will create a `JsonDict` with key-values from `dict`, mirrored in `p2json`
- `JsonDict(p2json)` will create a `JsonDict` with an empty dictionary and load the json content if the file exists
Once created, JsonDict instances behave exactly as a dictionary*
| | Type | Default | Details |
|---|---|---|---|
| p2json | `str \| pathlib.Path` | | path to the json file to mirror with the dictionary |
| dictionary | `dict \| None` | `None` | optional dictionary to initialize the JsonDict |
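The mirroring behavior can be sketched as a small `dict` subclass that rewrites its JSON file on every mutation. `JsonDictSketch` is a hypothetical, simplified stand-in for `JsonDict` (the real class also loads existing file content, among other things):

```python
import json
import tempfile
from pathlib import Path

class JsonDictSketch(dict):
    """Minimal sketch of a dict mirrored to a JSON file."""

    def __init__(self, p2json, dictionary=None):
        super().__init__(dictionary or {})
        self.p2json = Path(p2json)
        self._save()

    def _save(self):
        # rewrite the whole file after every change
        self.p2json.write_text(json.dumps(dict(self)))

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._save()

    def __delitem__(self, key):
        super().__delitem__(key)
        self._save()

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "mirror.json"
    d = JsonDictSketch(p, {"a": 1})
    d["b"] = 2                      # mutation triggers a save
    on_disk = json.loads(p.read_text())
print(on_disk)  # {'a': 1, 'b': 2}
```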
Create a new dictionary mirrored to a JSON file:
```python
d = {'a': 1, 'b': 2, 'c': 3}
p2json = pfs.data / 'jsondict-test.json'
jsondict = JsonDict(p2json, d)
jsondict
```
dict mirrored in /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/jsondict-test.json
{'a': 1, 'b': 2, 'c': 3}
Once created, the `JsonDict` instance behaves exactly like a dictionary, with the added benefit that any change to the dictionary is automatically saved to the JSON file.
```python
jsondict['a'], jsondict['b'], jsondict['c']
```
(1, 2, 3)
```python
for k, v in jsondict.items():
    print(f"key: {k}; value: {v}")
```
key: a; value: 1
key: b; value: 2
key: c; value: 3
Adding or removing a value works the same way as for a normal dictionary, but the json file is automatically updated.
```python
jsondict['d'] = 4
jsondict
```
dict mirrored in /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/jsondict-test.json
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
```python
with open(p2json, 'r') as fp:
    print(fp.read())
```
{
"a": 1,
"b": 2,
"c": 3,
"d": 4
}
```python
del jsondict['a']
jsondict
```
dict mirrored in /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/jsondict-test.json
{'b': 2, 'c': 3, 'd': 4}
{'b': 2, 'c': 3, 'd': 4}
```python
with open(p2json, 'r') as fp:
    print(fp.read())
```
{
"b": 2,
"c": 3,
"d": 4
}
JsonFileReader
JsonFileReader (path:str|pathlib.Path)
Mirror a JSON file and a dictionary
| | Type | Details |
|---|---|---|
| path | `str \| pathlib.Path` | path to the json file |
```python
jd = JsonFileReader(pfs.data / 'test.json')
pprint(jd.d)
```
{'item 1': {'keys': 'key key key key', 'pattern': 'pattern 1'},
'item 2': {'keys': 'key key key key', 'pattern': 'pattern 2'},
'item 3': {'keys': 'key key key key', 'pattern': 'pattern 3'}}
Now we can add an item to the dictionary/json
```python
new_item = {'keys': 'key key key key', 'pattern': 'another pattern'}
jd.add_item(key='another item', item=new_item)
```
{'item 1': {'keys': 'key key key key', 'pattern': 'pattern 1'},
'item 2': {'keys': 'key key key key', 'pattern': 'pattern 2'},
'item 3': {'keys': 'key key key key', 'pattern': 'pattern 3'},
'another item': {'keys': 'key key key key', 'pattern': 'another pattern'}}
After saving the updated JSON file, we can load it again and see the changes.
```python
jd.save_to_file()
jd = JsonFileReader(pfs.data / 'test.json')
pprint(jd.d)
```
{'another item': {'keys': 'key key key key', 'pattern': 'another pattern'},
'item 1': {'keys': 'key key key key', 'pattern': 'pattern 1'},
'item 2': {'keys': 'key key key key', 'pattern': 'pattern 2'},
'item 3': {'keys': 'key key key key', 'pattern': 'pattern 3'}}
Other utility functions
list_available_devices
list_available_devices ()
Base Classes
File Readers
Base classes to be extended in order to create readers for specific file formats.
TextFileBaseReader
TextFileBaseReader (path:str|pathlib.Path, nlines:int=1)
*Iterator going through a text file by chunks of nlines lines. Iterator can be reset to file start.
The class is mainly intended to be extended, as it is for handling sequence files of various formats such as `FastaFileReader`.*
| | Type | Default | Details |
|---|---|---|---|
| path | `str \| pathlib.Path` | | path to the file |
| nlines | `int` | `1` | number of lines in one chunk |
Once initialized, the iterator runs over each chunk of line(s) in the text file, sequentially.
```python
pfs.data
```
Path('/home/vtec/projects/bio/metagentorch/nbs-dev/data_dev')
```python
p2textfile = pfs.data / 'CNN_Virus_data/train_short'
it = TextFileBaseReader(path=p2textfile, nlines=3)
one_iteration = next(it)
print(one_iteration)
```
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC 76 0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG 4 9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA 181 0
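The chunked iteration (and the reset behavior documented further down) can be sketched as follows. `ChunkReaderSketch` is a hypothetical minimal stand-in for `TextFileBaseReader`, not the package's actual code:

```python
import tempfile
from pathlib import Path

class ChunkReaderSketch:
    """Minimal sketch of an iterator yielding `nlines`-line chunks of a text file."""

    def __init__(self, path, nlines=1):
        self.nlines = nlines
        self.fp = open(path)

    def __iter__(self):
        return self

    def __next__(self):
        lines = [self.fp.readline() for _ in range(self.nlines)]
        if not lines[0]:          # end of file reached
            self.fp.close()
            raise StopIteration
        return "".join(lines).rstrip("\n")

    def reset_iterator(self):
        self.fp.seek(0)           # point back to the first line

# quick demo on a throwaway file
with tempfile.TemporaryDirectory() as tmpdir:
    p = Path(tmpdir) / "demo.txt"
    p.write_text("l1\nl2\nl3\nl4\n")
    it = ChunkReaderSketch(p, nlines=2)
    first = next(it)        # "l1\nl2"
    it.reset_iterator()
    again = next(it)        # same chunk once more, after the reset
    it.fp.close()
print(first == again)  # True
```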
Let’s create a new instance of the file reader, and get several iterations.
```python
reader = TextFileBaseReader(path=p2textfile, nlines=3)
one_iteration = next(reader)
print(one_iteration)
```
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC 76 0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG 4 9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA 181 0
```python
another_iteration = next(it)
print(another_iteration)
one_more_iteration = next(it)
print(one_more_iteration)
```
GGAGCGGAGCCAACCCCTATGCTCACTTGCAACCCAAGGGGCGTTCCAGT 74 3
TGGATCCTGCGCGGGACGTCCTTTGTCTACGTCCCGTCGGCGCATCCCGC 60 3
GAGAGACTTACTAAAAAGCTGGCACTTACCATCAGTGTTTCACCTACATG 44 0
ACACACGACACTAGAGATAATGTGTCAGTGGATTATAAACAAACCAAGTT 43 7
TTGTAGCATAAGAACTGGTCTTCGCTGAAATTCTTGTCTTGATCTCATCT 35 2
TGGCCCTGCGGTCTGGGGCCCAGAAGCATATGTCAAGTCCTTTGAGAAGT 73 4
If we want to access the start of the file again, we need to re-initialize the file handle.
TextFileBaseReader.reset_iterator
TextFileBaseReader.reset_iterator ()
Reset the iterator to point to the first line in the file.
```python
reader.reset_iterator()
one_iteration = next(it)
print(one_iteration)
another_iteration = next(it)
print(another_iteration)
```
TAGATTTAGTGGTTAGGTAGTAAGGCTACAATGTAAACACGTAGTGGCAA 11 6
AACCCCTGGGGCTATAAAAGGCGCGGTCTGTGCACGGGGACTTCGGTNGG 7 7
AGAATGGATAGTAAGGCAGACAGTAATAGGGGAGGCAATGAAGGAAACCA 9 2
GATCCTAAGGTCCGTCCCCGGGGTCGCTTACCACTCCCCTGAAGCATGTC 131 7
ACAAGTCTAAAACCCTTCAGGACNTGATGTTTATAAATTCTACCTGTTAT 18 6
AGCCGGTGAACAACGTTTTTCAAGAGGGGGCCGTTCCTGGAGGACGGACA 59 7
TextFileBaseReader.print_first_chunks
TextFileBaseReader.print_first_chunks (nchunks:int=3)
*Print the first nchunks chunks of text from the file.

After printing, the iterator is reset to its first line.*
| | Type | Default | Details |
|---|---|---|---|
| nchunks | `int` | `3` | number of chunks to print |
| **Returns** | **None** | | |
```python
reader = TextFileBaseReader(path=p2textfile, nlines=3)
reader.print_first_chunks(nchunks=3)
```
3-line chunk 1
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC 76 0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG 4 9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA 181 0
3-line chunk 2
GGAGCGGAGCCAACCCCTATGCTCACTTGCAACCCAAGGGGCGTTCCAGT 74 3
TGGATCCTGCGCGGGACGTCCTTTGTCTACGTCCCGTCGGCGCATCCCGC 60 3
GAGAGACTTACTAAAAAGCTGGCACTTACCATCAGTGTTTCACCTACATG 44 0
3-line chunk 3
ACACACGACACTAGAGATAATGTGTCAGTGGATTATAAACAAACCAAGTT 43 7
TTGTAGCATAAGAACTGGTCTTCGCTGAAATTCTTGTCTTGATCTCATCT 35 2
TGGCCCTGCGGTCTGGGGCCCAGAAGCATATGTCAAGTCCTTTGAGAAGT 73 4
TextFileBaseReader.parse_text
TextFileBaseReader.parse_text (txt:str, pattern:str|None=None)
Parse text using regex pattern with groups. Return a metadata dictionary.
| | Type | Default | Details |
|---|---|---|---|
| txt | `str` | | text to parse |
| pattern | `str \| None` | `None` | If None, uses the standard regex pattern to extract metadata; otherwise, uses the passed regex |
| **Returns** | **dict** | | parsed metadata in key/value format |
```python
text = '>2591237:ncbi:1'
pattern = r"^>(?P<id>\d+):(?P<source>ncbi):(?P<nb>\d*)"
reader.parse_text(text, pattern)
```
{'id': '2591237', 'source': 'ncbi', 'nb': '1'}
Extending the base class
TextFileBaseReader is a base class, intended to be extended into specific file format readers.
The following methods will typically be extended to match data files and other structured text file formats:

- `__next__` method, in order to customize how the iterator parses files into "elements". For instance, in a FASTA file, one element consists of two lines: a "definition line" and the sequence itself. Extending `TextFileBaseReader` allows reading pairs of lines sequentially and returning each element as a dictionary. For instance, `FastaFileReader` iterates over each pair of lines in a Fasta file and returns each pair as a dictionary as follows:
```python
{
    'definition line': '>2591237:ncbi:1 [MK211378]\t2591237\tncbi\t1 [MK211378] '
                       '2591237\tCoronavirus BtRs-BetaCoV/YN2018D\t\tscientific '
                       'name\n',
    'sequence': 'TATTAGGTTTTCTACCTACCCAGGA'
}
```
- Methods for parsing metadata from the file. For instance, the `parse_file` method will handle how the reader iterates over the full file and returns a dictionary for the entire file.
- Extended classes will also define specific attributes (`text_to_parse_key`, `re_pattern`, `re_keys`, …)
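The `__next__` extension described above can be sketched like this. `FastaReaderSketch` is a hypothetical, simplified illustration, not the real `FastaFileReader`:

```python
import tempfile
from pathlib import Path

class FastaReaderSketch:
    """Minimal sketch of a FASTA reader iterating over
    (definition line, sequence) pairs."""

    def __init__(self, path):
        self.fp = open(path)

    def __iter__(self):
        return self

    def __next__(self):
        # one FASTA element = two consecutive lines
        defline = self.fp.readline()
        seq = self.fp.readline()
        if not defline:
            self.fp.close()
            raise StopIteration
        return {"definition line": defline.rstrip("\n"),
                "sequence": seq.rstrip("\n")}

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "demo.fa"
    p.write_text(">seq1 demo\nACGT\n>seq2 demo\nTTGA\n")
    elements = list(FastaReaderSketch(p))
print(elements[0]["sequence"])  # ACGT
```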
TextFileBaseReader.set_parsing_rules
TextFileBaseReader.set_parsing_rules (pattern:str|None=None, verbose:bool=False)
*Set the standard regex parsing rule for the file.

Rules can be set:

- manually, by passing specific custom values for `pattern` and `keys`
- automatically, by testing all parsing rules saved in `parsing_rule.json`

Automatic selection of parsing rules works by testing each rule saved in `parsing_rule.json` on the first definition line of the fasta file, and selecting the rule that generates the most metadata matches.

Rules consist of two parameters:

- The regex pattern, including one `group` for each metadata item, e.g. `(?P<group_name>regex_code)`
- The list of keys, i.e. the list with the name of each regex group, used as keys in the metadata dictionary

This method updates the three following class attributes: `re_rule_name`, `re_pattern`, `re_keys`*
| | Type | Default | Details |
|---|---|---|---|
| pattern | `str \| None` | `None` | regex pattern to apply to parse the text; searches the parsing rules json if None |
| verbose | `bool` | `False` | when True, provides information on each rule |
| **Returns** | **None** | | |
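The automatic rule selection (test each saved rule, keep the one producing the most metadata matches) can be sketched as follows; the `rules` dictionary and `best_rule` helper are hypothetical stand-ins for the entries of `parsing_rule.json` and the class's internal logic:

```python
import re

# Hypothetical rules, mimicking entries saved in parsing_rule.json
rules = {
    "ncbi_rule": r"^>(?P<id>\d+):(?P<source>ncbi):(?P<nb>\d*)",
    "generic_rule": r"^>(?P<id>\S+)",
}

def best_rule(text, rules):
    """Pick the rule whose pattern yields the most named-group matches on `text`."""
    scores = {}
    for name, pattern in rules.items():
        m = re.search(pattern, text)
        scores[name] = len(m.groupdict()) if m else 0
    return max(scores, key=scores.get)

print(best_rule(">2591237:ncbi:1", rules))  # ncbi_rule
```

Here `ncbi_rule` wins because it extracts three metadata items from the definition line, against one for `generic_rule`.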
Important Note for Developers

The method `set_parsing_rules` is there to allow `TextFileBaseReader`'s descendant classes to automatically select a parsing rule, by applying the rules saved in a json file to a string extracted from the first element in the file. It assumes that the iterator returns its elements as dictionaries `{section_name: section, ...}` and not as pure strings. The key `self.text_to_parse_key` is then used to extract the text to parse when testing the rules. The base class iterator returns a simple string, and `self.text_to_parse_key` is set to `None`.

To enable setting a default parsing rule for the reader instance, the iterator must return a dictionary and `self.text_to_parse_key` must be set to the key in the dictionary corresponding to the text to parse. See the implementation in `FastaFileReader`.

Calling `set_parsing_rules` on a class that does not satisfy these requirements will do nothing and return a warning.
```python
reader.set_parsing_rules()
```
/tmp/ipykernel_8720/3621596767.py:133: UserWarning:
`text_to_parse_key` is not defined in this class.
It is not possible to set a parsing rule. Must be define, e.g. 'definition line'
warnings.warn(msg, category=UserWarning)
Deprecated Items
When any of the following classes and functions is called, it will raise an exception with an error message indicating how to handle the required code refactoring.
Example:
```
DeprecationWarning                        Traceback (most recent call last)
Input In [140], in <cell line: 1>()
----> 1 TextFileBaseIterator(p2textfile)

Input In [139], in TextFileBaseIterator.__init__(self, *args, **kwargs)
      4 def __init__(self, *args, **kwargs):
      5     msg = """
      6     `TextFileBaseIterator` is deprecated.
      7     Use `TextFileBaseReader` instead, with same capabilities and more."""
----> 8     raise DeprecationWarning(msg)

DeprecationWarning:
`TextFileBaseIterator` is deprecated.
Use `TextFileBaseReader` instead, with same capabilities and more.
```

TextFileBaseIterator
TextFileBaseIterator (*args, **kwargs)
TextFileBaseIterator is a deprecated class, to be replaced by TextFileBaseReader