pfs = ProjectFileSystem()
core

This module includes all base classes, functions and other objects that are used across the package; it is imported by all other modules in the package. It provides utility classes and functions that make it easier to work with the complex file systems adopted for the project, as well as base classes such as a file reader with additional functionality.
Utility Classes and Functions
Handling files and file structure
Utility classes to represent the project file system
ProjectFileSystem
ProjectFileSystem (*args, **kwargs)

*Represent a project file system, return paths to key directories, and provide methods to manage the file system.

- Paths to key directories are based on whether the code is running locally or in the cloud.
- The first time it is used on a local computer, the machine must be registered as local and a project root path must be set.
- A user configuration file is created in the user’s home directory to store the project root path and whether the machine is local or not.

Technical note: ProjectFileSystem is a singleton class*
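A singleton class returns the same instance on every instantiation, so `ProjectFileSystem()` objects created anywhere in the code all share the same state. A minimal sketch of the pattern (illustrative only, not the package's actual implementation):

```python
# Sketch of the singleton pattern: repeated instantiation returns the same
# object. Hypothetical illustration, not ProjectFileSystem's actual code.
class Singleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the one instance lazily, then always return it
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a = Singleton()
b = Singleton()
print(a is b)  # → True
```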
Reference Project File System:
This project adopts a unified file structure to make coding and collaboration easier. In addition, the code can run locally (from a project-root directory) or in the cloud (Colab, Kaggle, others).
The unified file structure when running locally is:
project-root
|--- data
| |--- CNN_Virus_data (all data from CNN Virus original paper)
| |--- saved (trained and finetuned models, saved preprocessed datasets)
| |--- .... (raw or pre-processed data from various sources, results, ... )
|
|--- nbs (all reference and work notebooks)
| |--- cnn_virus
| | |--- notebooks.ipynb
When running on Google Colab, it is assumed that a Google Drive is mounted on the Colab server instance, and that this drive's root includes a shortcut named Metagenomics pointing to the project shared directory. The project shared directory is accessible to authorized project members.
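As an illustration, a Colab runtime can typically be detected by checking whether the `google.colab` module is importable. This sketch is an assumption about one possible check, not necessarily how `ProjectFileSystem` implements `is_colab`:

```python
import importlib.util

def running_on_colab() -> bool:
    """Return True when the `google.colab` module is importable,
    which is only the case inside a Colab runtime. Sketch only."""
    return importlib.util.find_spec('google.colab') is not None

print(running_on_colab())  # False on a local machine
```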
ProjectFileSystem
at work:
If you use this class for the first time on a local computer, read the two Important Notes below.
Once created, the instance of ProjectFileSystem gives access to key directories’ paths:

- project_root: Path to the project root directory
- data: Path to the data directory
- nbs: Path to the notebooks directory
It also provides additional information regarding the computer on which the code is running:

- os: a string providing the name of the operating system the code is running on
- is_colab: True if the code is running on Google Colab
- is_kaggle: True if the code is running on a Kaggle server (NOT IMPLEMENTED YET)
- is_local: True if the code is running on a computer registered as local
for p in [pfs.project_root, pfs.data, pfs.nbs]:
print(p)
/home/vtec/projects/bio/metagentorch
/home/vtec/projects/bio/metagentorch/data
/home/vtec/projects/bio/metagentorch/nbs
print(f"Operating System: {pfs.os}")
print(f"Local Computer: {pfs.is_local}, Colab: {pfs.is_colab}, Kaggle: {pfs.is_kaggle}")
Operating System: linux
Local Computer: True, Colab: False, Kaggle: False
ProjectFileSystem.info
ProjectFileSystem.info ()
Print basic info on the file system and the device
pfs.info()
Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
- Root ........ /home/vtec/projects/bio/metagentorch
- Data Dir .... /home/vtec/projects/bio/metagentorch/data
- Notebooks ... /home/vtec/projects/bio/metagentorch/nbs
ProjectFileSystem.readme
ProjectFileSystem.readme (dir_path:pathlib.Path|None=None)
*Display the readme.md file, or any other .md file, in dir_path.

This provides a convenient way to get information on each directory's content*
Type | Default | Details | |
---|---|---|---|
dir_path | pathlib.Path | None | None | Path to the directory to inquire. If None, display readme file from project_root. |
Returns | None |
pfs.readme(Path('data_dev'))
ReadMe file for directory /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev:
Data directory for this package development
This directory includes all data required to validate and test this package code.
data_dev
|--- CNN_Virus_data
| |--- 50mer_ds_100_seq
| |--- 150mer_ds_100_seq
| |--- train_short
| |--- val_short
| |--- weight_of_classes
|--- ncbi
| |--- infer_results
| | |--- cnn_virus
| | |--- csv
| | |--- xlsx
| | |--- testdb.db
| |--- refsequences
| | |--- cov
| | | |--cov_virus_sequence_one_metadata.json
| | | |--sequences_two_no_matching_rule.fa
| | | |--another_sequence.fa
| | | |--cov_virus_sequences_two.fa
| | | |--cov_virus_sequences_two_metadata.json
| | | |--cov_virus_sequence_one.fa
| | | |--single_1seq_150bp
| | | | |--single_1seq_150bp.fq
| | | | |--single_1seq_150bp.aln
| | | |--paired_1seq_150bp
| | | | |--paired_1seq_150bp2.aln
| | | | |--paired_1seq_150bp2.fq
| | | | |--paired_1seq_150bp1.fq
| | | | |--paired_1seq_150bp1.aln
| |--- simreads
| | |--- cov
| | | |--- paired_1seq_50bp
| | | | |--- paired_1seq_50bp_1.aln
| | | | |--- paired_1seq_50bp_1.fq
| | | |--- single_1seq_50bp
| | | | |--- single_1seq_50bp_1.aln
| | | | |--- single_1seq_50bp_1.fq
| | |--- cov
| | | |--single_1seq_50bp
| | | | |--single_1seq_50bp.aln
| | | | |--single_1seq_50bp.fq
| | | |--single_1seq_150bp
| | | | |--single_1seq_150bp.fq
| | | | |--single_1seq_150bp.aln
| | | |--paired_1seq_150bp
| | | | |--paired_1seq_150bp2.aln
| | | | |--paired_1seq_150bp2.fq
| | | | |--paired_1seq_150bp1.fq
| | | | |--paired_1seq_150bp1.aln
|--- saved
|--- readme.md
Important Note 1:

When using the package on a local computer for the first time, you must register the computer as a local computer. Otherwise, ProjectFileSystem will raise an error. Once registered, the configuration file will be updated and ProjectFileSystem will detect that and run without error.
ProjectFileSystem.register_as_local
ProjectFileSystem.register_as_local ()
Update the configuration file to register the machine as a local machine
cfg = pfs.register_as_local()
Important Note 2:
When using the package on a local computer for the first time, it is also required to set the project root directory. This allows users to locate their local project folder anywhere they want. Once set, the path to the project root is saved in the configuration file.
ProjectFileSystem.set_project_root
ProjectFileSystem.set_project_root (p2project:str|pathlib.Path, data_dir:str='data')
Update the configuration file to set the project root
Type | Default | Details | |
---|---|---|---|
p2project | str | pathlib.Path | string or Path to the project directory. Can be absolute or relative to home | |
data_dir | str | data | Directory name for data under project root |
Returns | ConfigParser |
pfs.set_project_root('/home/vtec/projects/bio/metagentorch/');
Project Root set to: /home/vtec/projects/bio/metagentorch
Data directory set to: /home/vtec/projects/bio/metagentorch/data
ProjectFileSystem.read_config
ProjectFileSystem.read_config ()
Read config from the configuration file if it exists, and return an empty config if it does not
cfg = pfs.read_config()
cfg['Infra']['registered_as_local']
'True'
cfg['Infra']['project_root']
'/home/vtec/projects/bio/metagentorch'
cfg['Infra']['data_dir']
'data'
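The configuration values above follow Python's standard INI format. A sketch of how such a file can be written and read back with the standard library's configparser (the example file name below is hypothetical; the real file is created and managed by ProjectFileSystem in the user's home directory):

```python
import configparser
from pathlib import Path

# Build a config mirroring the [Infra] section shown above
cfg = configparser.ConfigParser()
cfg['Infra'] = {
    'registered_as_local': 'True',
    'project_root': '/home/vtec/projects/bio/metagentorch',
    'data_dir': 'data',
}

# Hypothetical file name, for illustration only
p2cfg = Path('metagentorch-example.cfg')
with open(p2cfg, 'w') as fp:
    cfg.write(fp)

# Read it back; configparser values are always strings
cfg2 = configparser.ConfigParser()
cfg2.read(p2cfg)
print(cfg2['Infra']['project_root'])  # → /home/vtec/projects/bio/metagentorch
```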
Technical Note for Developers

The current notebook and all other development notebooks use a minimal set of data that comes with the repository under nbs-dev/data_dev, instead of the standard data directory, which is much too large for testing and development. Therefore, when creating the instance of ProjectFileSystem, use the parameter config_fname to pass a specific development configuration, which also comes with the repository.
p2dev_cfg = PACKAGE_ROOT / 'nbs-dev/metagentorch-dev.cfg'
pfs = ProjectFileSystem(config_fname=p2dev_cfg)
pfs.info()
Running linux on local computer
Device's home directory: /home/vtec
Project file structure:
- Root ........ /home/vtec/projects/bio/metagentorch
- Data Dir .... /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev
- Notebooks ... /home/vtec/projects/bio/metagentorch/nbs
SQLite Database Helper Class
SqliteDatabase
SqliteDatabase (p2db:pathlib.Path)
*Manage a SQLite db file, execute SQL queries, return results, provide context manager functionality.

Example usage as a context manager:

db_path = Path('your_database.db')
db = SqliteDatabase(db_path)

with db as database:
    result = database.get_result("SELECT * FROM your_table")
    print(result)
*
p2db = pfs.data / 'ncbi/infer_results/cov-ncbi/testdb.db'
db = SqliteDatabase(p2db)
db.print_schema()
predictions (table)
columns: id, readid, refseqid, refsource, refseq_strand, taxonomyid, lbl_true, lbl_pred, pos_true, pos_pred, top_5_lbl_pred_0, top_5_lbl_pred_1, top_5_lbl_pred_2, top_5_lbl_pred_3, top_5_lbl_pred_4
index: idx_preds
indexed columns: readid, refseqid, pos_true
label_probabilities (table)
columns: id, read_kmer_id, read_50mer_nb, prob_000, prob_001, prob_002, prob_003, prob_004, prob_005, prob_006, prob_007, prob_008, prob_009, prob_010, prob_011, prob_012, prob_013, prob_014, prob_015, prob_016, prob_017, prob_018, prob_019, prob_020, prob_021, prob_022, prob_023, prob_024, prob_025, prob_026, prob_027, prob_028, prob_029, prob_030, prob_031, prob_032, prob_033, prob_034, prob_035, prob_036, prob_037, prob_038, prob_039, prob_040, prob_041, prob_042, prob_043, prob_044, prob_045, prob_046, prob_047, prob_048, prob_049, prob_050, prob_051, prob_052, prob_053, prob_054, prob_055, prob_056, prob_057, prob_058, prob_059, prob_060, prob_061, prob_062, prob_063, prob_064, prob_065, prob_066, prob_067, prob_068, prob_069, prob_070, prob_071, prob_072, prob_073, prob_074, prob_075, prob_076, prob_077, prob_078, prob_079, prob_080, prob_081, prob_082, prob_083, prob_084, prob_085, prob_086, prob_087, prob_088, prob_089, prob_090, prob_091, prob_092, prob_093, prob_094, prob_095, prob_096, prob_097, prob_098, prob_099, prob_100, prob_101, prob_102, prob_103, prob_104, prob_105, prob_106, prob_107, prob_108, prob_109, prob_110, prob_111, prob_112, prob_113, prob_114, prob_115, prob_116, prob_117, prob_118, prob_119, prob_120, prob_121, prob_122, prob_123, prob_124, prob_125, prob_126, prob_127, prob_128, prob_129, prob_130, prob_131, prob_132, prob_133, prob_134, prob_135, prob_136, prob_137, prob_138, prob_139, prob_140, prob_141, prob_142, prob_143, prob_144, prob_145, prob_146, prob_147, prob_148, prob_149, prob_150, prob_151, prob_152, prob_153, prob_154, prob_155, prob_156, prob_157, prob_158, prob_159, prob_160, prob_161, prob_162, prob_163, prob_164, prob_165, prob_166, prob_167, prob_168, prob_169, prob_170, prob_171, prob_172, prob_173, prob_174, prob_175, prob_176, prob_177, prob_178, prob_179, prob_180, prob_181, prob_182, prob_183, prob_184, prob_185, prob_186
index: idx_probs
indexed columns: read_kmer_id, read_50mer_nb
preds_probs (view)
columns: refseqid,lbl_true,lbl_pred,pos_true,pos_pred,top_5_lbl_pred_0,top_5_lbl_pred_1,top_5_lbl_pred_2,top_5_lbl_pred_3,top_5_lbl_pred_4,top_5_lbl_pred_0:1,top_5_lbl_pred_1:1,top_5_lbl_pred_2:1,top_5_lbl_pred_3:1,top_5_lbl_pred_4:1,top_5_lbl_pred_0:2,top_5_lbl_pred_1:2,top_5_lbl_pred_2:2,top_5_lbl_pred_3:2,top_5_lbl_pred_4:2,top_5_lbl_pred_0:3,top_5_lbl_pred_1:3,top_5_lbl_pred_2:3,top_5_lbl_pred_3:3,top_5_lbl_pred_4:3,top_5_lbl_pred_0:4,top_5_lbl_pred_1:4,top_5_lbl_pred_2:4,top_5_lbl_pred_3:4,top_5_lbl_pred_4:4,read_kmer_id,read_50mer_nb,prob_000,prob_001,prob_002,prob_003,prob_004,prob_005,prob_006,prob_007,prob_008,prob_009,prob_010,prob_011,prob_012,prob_013,prob_014,prob_015,prob_016,prob_017,prob_018,prob_019,prob_020,prob_021,prob_022,prob_023,prob_024,prob_025,prob_026,prob_027,prob_028,prob_029,prob_030,prob_031,prob_032,prob_033,prob_034,prob_035,prob_036,prob_037,prob_038,prob_039,prob_040,prob_041,prob_042,prob_043,prob_044,prob_045,prob_046,prob_047,prob_048,prob_049,prob_050,prob_051,prob_052,prob_053,prob_054,prob_055,prob_056,prob_057,prob_058,prob_059,prob_060,prob_061,prob_062,prob_063,prob_064,prob_065,prob_066,prob_067,prob_068,prob_069,prob_070,prob_071,prob_072,prob_073,prob_074,prob_075,prob_076,prob_077,prob_078,prob_079,prob_080,prob_081,prob_082,prob_083,prob_084,prob_085,prob_086,prob_087,prob_088,prob_089,prob_090,prob_091,prob_092,prob_093,prob_094,prob_095,prob_096,prob_097,prob_098,prob_099,prob_100,prob_101,prob_102,prob_103,prob_104,prob_105,prob_106,prob_107,prob_108,prob_109,prob_110,prob_111,prob_112,prob_113,prob_114,prob_115,prob_116,prob_117,prob_118,prob_119,prob_120,prob_121,prob_122,prob_123,prob_124,prob_125,prob_126,prob_127,prob_128,prob_129,prob_130,prob_131,prob_132,prob_133,prob_134,prob_135,prob_136,prob_137,prob_138,prob_139,prob_140,prob_141,prob_142,prob_143,prob_144,prob_145,prob_146,prob_147,prob_148,prob_149,prob_150,prob_151,prob_152,prob_153,prob_154,prob_155,prob_156,prob_157,prob_158,prob_159,prob_160,prob_
161,prob_162,prob_163,prob_164,prob_165,prob_166,prob_167,prob_168,prob_169,prob_170,prob_171,prob_172,prob_173,prob_174,prob_175,prob_176,prob_177,prob_178,prob_179,prob_180,prob_181,prob_182,prob_183,prob_184,prob_185,prob_186
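For illustration, the context-manager behavior described in the docstring can be sketched with the standard sqlite3 module. SimpleSqliteDb below is a hypothetical stand-in, not SqliteDatabase's actual implementation:

```python
import sqlite3
from pathlib import Path

# Hypothetical sketch of a context-manager wrapper around sqlite3,
# similar in spirit to SqliteDatabase (internals may differ).
class SimpleSqliteDb:
    def __init__(self, p2db):
        self.p2db = p2db
        self.conn = None

    def __enter__(self):
        # Open the connection when entering the `with` block
        self.conn = sqlite3.connect(self.p2db)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Always close the connection; do not swallow exceptions
        self.conn.close()
        self.conn = None
        return False

    def get_result(self, sql):
        """Execute a SQL query and return all rows."""
        return self.conn.execute(sql).fetchall()

# Usage with an in-memory database for the example
with SimpleSqliteDb(':memory:') as db:
    db.conn.execute('CREATE TABLE t (x INTEGER)')
    db.conn.execute('INSERT INTO t VALUES (1), (2)')
    rows = db.get_result('SELECT x FROM t ORDER BY x')
print(rows)  # → [(1,), (2,)]
```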
Other utility classes
JsonDict
JsonDict (p2json:str|pathlib.Path, dictionary:dict|None=None)

*Dictionary whose current value is mirrored in a json file, and which can be initiated from a json file.

JsonDict requires a path to a json file at creation. An optional dict can be passed as argument.

Behavior at creation:

- JsonDict(p2json, dict) will create a JsonDict with key-values from dict, mirrored in p2json
- JsonDict(p2json) will create a JsonDict with an empty dictionary, and load the json content if the file exists

Once created, JsonDict instances behave exactly as a dictionary*
Type | Default | Details | |
---|---|---|---|
p2json | str | pathlib.Path | path to the json file to mirror with the dictionary | |
dictionary | dict | None | None | optional dictionary to initialize the JsonDict |
Create a new dictionary mirrored to a JSON file:
d = {'a': 1, 'b': 2, 'c': 3}
p2json = pfs.data / 'jsondict-test.json'
jsondict = JsonDict(p2json, d)
jsondict
dict mirrored in /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/jsondict-test.json
{'a': 1, 'b': 2, 'c': 3}
Once created, the JsonDict instance behaves exactly like a dictionary, with the added benefit that any change to the dictionary is automatically saved to the JSON file.
jsondict['a'], jsondict['b'], jsondict['c']
(1, 2, 3)
for k, v in jsondict.items():
print(f"key: {k}; value: {v}")
key: a; value: 1
key: b; value: 2
key: c; value: 3
Adding or removing a value from the dictionary works in the same way as for a normal dictionary, but the json file is automatically updated.
jsondict['d'] = 4
jsondict
dict mirrored in /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/jsondict-test.json
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
with open(p2json, 'r') as fp:
print(fp.read())
{
"a": 1,
"b": 2,
"c": 3,
"d": 4
}
del jsondict['a']
jsondict
dict mirrored in /home/vtec/projects/bio/metagentorch/nbs-dev/data_dev/jsondict-test.json
{'b': 2, 'c': 3, 'd': 4}
with open(p2json, 'r') as fp:
print(fp.read())
{
"b": 2,
"c": 3,
"d": 4
}
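The mirroring behavior shown above can be sketched by subclassing dict and rewriting the file on every mutation. MirroredDict below is a hypothetical illustration, not JsonDict's actual code:

```python
import json
from pathlib import Path

# Hypothetical sketch of a json-mirrored dictionary: every mutation
# rewrites the backing file (JsonDict's real implementation may differ).
class MirroredDict(dict):
    def __init__(self, p2json, dictionary=None):
        super().__init__(dictionary or {})
        self.p2json = Path(p2json)
        self._save()

    def _save(self):
        # Persist the current dict content to the json file
        self.p2json.write_text(json.dumps(self, indent=1))

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._save()

    def __delitem__(self, key):
        super().__delitem__(key)
        self._save()

d = MirroredDict('mirrored-example.json', {'a': 1})
d['b'] = 2
del d['a']
print(json.loads(Path('mirrored-example.json').read_text()))  # → {'b': 2}
```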
JsonFileReader
JsonFileReader (path:str|pathlib.Path)
Mirror a JSON file and a dictionary
Type | Details | |
---|---|---|
path | str | pathlib.Path | path to the json file |
jd = JsonFileReader(pfs.data / 'test.json')
pprint(jd.d)
{'item 1': {'keys': 'key key key key', 'pattern': 'pattern 1'},
'item 2': {'keys': 'key key key key', 'pattern': 'pattern 2'},
'item 3': {'keys': 'key key key key', 'pattern': 'pattern 3'}}
Now we can add an item to the dictionary/json
new_item = {'keys': 'key key key key', 'pattern': 'another pattern'}
jd.add_item(key='another item', item=new_item)
{'item 1': {'keys': 'key key key key', 'pattern': 'pattern 1'},
'item 2': {'keys': 'key key key key', 'pattern': 'pattern 2'},
'item 3': {'keys': 'key key key key', 'pattern': 'pattern 3'},
'another item': {'keys': 'key key key key', 'pattern': 'another pattern'}}
After saving the updated JSON file, we can load it again and see the changes.
jd.save_to_file()
jd = JsonFileReader(pfs.data / 'test.json')
pprint(jd.d)
{'another item': {'keys': 'key key key key', 'pattern': 'another pattern'},
'item 1': {'keys': 'key key key key', 'pattern': 'pattern 1'},
'item 2': {'keys': 'key key key key', 'pattern': 'pattern 2'},
'item 3': {'keys': 'key key key key', 'pattern': 'pattern 3'}}
Other utility functions
list_available_devices
list_available_devices ()
Base Classes
File Readers
Base classes to be extended in order to create readers for specific file formats.
TextFileBaseReader
TextFileBaseReader (path:str|pathlib.Path, nlines:int=1)
*Iterator going through a text file by chunks of nlines lines. The iterator can be reset to the start of the file.

The class is mainly intended to be extended, as it is for handling sequence files of various formats such as FastaFileReader.*
Type | Default | Details | |
---|---|---|---|
path | str | pathlib.Path | path to the file | |
nlines | int | 1 | number of lines on one chunk |
Once initialized, the iterator runs over each chunk of line(s) in the text file, sequentially.
pfs.data
Path('/home/vtec/projects/bio/metagentorch/nbs-dev/data_dev')
p2textfile = pfs.data / 'CNN_Virus_data/train_short'
it = TextFileBaseReader(path=p2textfile, nlines=3)

one_iteration = next(it)
print(one_iteration)
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC 76 0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG 4 9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA 181 0
Let’s create a new instance of the file reader, and get several iterations.
reader = TextFileBaseReader(path=p2textfile, nlines=3)

one_iteration = next(reader)
print(one_iteration)
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC 76 0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG 4 9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA 181 0
another_iteration = next(it)
print(another_iteration)

one_more_iteration = next(it)
print(one_more_iteration)
GGAGCGGAGCCAACCCCTATGCTCACTTGCAACCCAAGGGGCGTTCCAGT 74 3
TGGATCCTGCGCGGGACGTCCTTTGTCTACGTCCCGTCGGCGCATCCCGC 60 3
GAGAGACTTACTAAAAAGCTGGCACTTACCATCAGTGTTTCACCTACATG 44 0
ACACACGACACTAGAGATAATGTGTCAGTGGATTATAAACAAACCAAGTT 43 7
TTGTAGCATAAGAACTGGTCTTCGCTGAAATTCTTGTCTTGATCTCATCT 35 2
TGGCCCTGCGGTCTGGGGCCCAGAAGCATATGTCAAGTCCTTTGAGAAGT 73 4
If we want to access the start of the file again, we need to re-initialize the file handle.
TextFileBaseReader.reset_iterator
TextFileBaseReader.reset_iterator ()
Reset the iterator to point to the first line in the file.
reader.reset_iterator()
one_iteration = next(it)
print(one_iteration)

another_iteration = next(it)
print(another_iteration)
TAGATTTAGTGGTTAGGTAGTAAGGCTACAATGTAAACACGTAGTGGCAA 11 6
AACCCCTGGGGCTATAAAAGGCGCGGTCTGTGCACGGGGACTTCGGTNGG 7 7
AGAATGGATAGTAAGGCAGACAGTAATAGGGGAGGCAATGAAGGAAACCA 9 2
GATCCTAAGGTCCGTCCCCGGGGTCGCTTACCACTCCCCTGAAGCATGTC 131 7
ACAAGTCTAAAACCCTTCAGGACNTGATGTTTATAAATTCTACCTGTTAT 18 6
AGCCGGTGAACAACGTTTTTCAAGAGGGGGCCGTTCCTGGAGGACGGACA 59 7
TextFileBaseReader.print_first_chunks
TextFileBaseReader.print_first_chunks (nchunks:int=3)
*Print the first nchunks chunks of text from the file.

After printing, the iterator is reset again to its first line.*
Type | Default | Details | |
---|---|---|---|
nchunks | int | 3 | number of chunks to print |
Returns | None |
reader = TextFileBaseReader(path=p2textfile, nlines=3)

reader.print_first_chunks(nchunks=3)
3-line chunk 1
TCAAAATAATCAGAAATGTTGAACCTAGGGTTGGACACATAATGACCAGC 76 0
ATTGTTTAACAATTTGTGCTCGTCCCGGTCACCCGCATCCAATCTTGATG 4 9
AATCTTGTCCTATCCTACCCGCAGGGGAATTGATGATAGANGTGCTTTTA 181 0
3-line chunk 2
GGAGCGGAGCCAACCCCTATGCTCACTTGCAACCCAAGGGGCGTTCCAGT 74 3
TGGATCCTGCGCGGGACGTCCTTTGTCTACGTCCCGTCGGCGCATCCCGC 60 3
GAGAGACTTACTAAAAAGCTGGCACTTACCATCAGTGTTTCACCTACATG 44 0
3-line chunk 3
ACACACGACACTAGAGATAATGTGTCAGTGGATTATAAACAAACCAAGTT 43 7
TTGTAGCATAAGAACTGGTCTTCGCTGAAATTCTTGTCTTGATCTCATCT 35 2
TGGCCCTGCGGTCTGGGGCCCAGAAGCATATGTCAAGTCCTTTGAGAAGT 73 4
TextFileBaseReader.parse_text
TextFileBaseReader.parse_text (txt:str, pattern:str|None=None)
Parse text using regex pattern with groups. Return a metadata dictionary.
Type | Default | Details | |
---|---|---|---|
txt | str | text to parse | |
pattern | str | None | None | If None, uses standard regex pattern to extract metadata, otherwise, uses passed regex |
Returns | dict | parsed metadata in key/value format |
text = '>2591237:ncbi:1'
pattern = r"^>(?P<id>\d+):(?P<source>ncbi):(?P<nb>\d*)"

reader.parse_text(text, pattern)
{'id': '2591237', 'source': 'ncbi', 'nb': '1'}
Extending the base class

TextFileBaseReader is a base class, intended to be extended into specific file format readers.

The following methods will typically be extended to match data files and other structured text file formats:

- __next__ method, in order to customize how the iterator parses files into “elements”. For instance, in a FASTA file, one element consists of two lines: a “definition line” and the sequence itself. Extending TextFileBaseReader allows reading pairs of lines sequentially and returning each element as a dictionary. For instance, FastaFileReader iterates over each pair of lines in a FASTA file and returns each pair as a dictionary as follows:
{
'definition line': '>2591237:ncbi:1 [MK211378]\t2591237\tncbi\t1 [MK211378] '
'2591237\tCoronavirus BtRs-BetaCoV/YN2018D\t\tscientific '
'name\n',
'sequence': 'TATTAGGTTTTCTACCTACCCAGGA'
}
- Methods for parsing metadata from the file. For instance, the parse_file method will handle how the reader iterates over the full file and returns a dictionary for the entire file.
- Extended classes will also define specific attributes (text_to_parse_key, re_pattern, re_keys, …)
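The __next__ extension pattern described above can be sketched with a minimal, self-contained reader that yields FASTA records as dictionaries. MiniFastaReader is a hypothetical illustration; FastaFileReader's real implementation is richer:

```python
import io

# Minimal sketch of an iterator yielding FASTA records as dictionaries,
# illustrating the `__next__` extension pattern (not the package's code).
class MiniFastaReader:
    def __init__(self, filelike):
        self.f = filelike

    def __iter__(self):
        return self

    def __next__(self):
        # Each element is a pair of lines: definition line + sequence
        defline = self.f.readline()
        seq = self.f.readline()
        if not defline or not seq:
            raise StopIteration
        return {'definition line': defline.rstrip('\n'),
                'sequence': seq.rstrip('\n')}

fasta = io.StringIO(">2591237:ncbi:1\nTATTAGGTTTTCTACCTACCCAGGA\n")
for rec in MiniFastaReader(fasta):
    print(rec['definition line'], rec['sequence'])
```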
TextFileBaseReader.set_parsing_rules
TextFileBaseReader.set_parsing_rules (pattern:str|None=None, verbose:bool=False)
*Set the standard regex parsing rule for the file.

Rules can be set:

- manually, by passing specific custom values for pattern and keys
- automatically, by testing all parsing rules saved in parsing_rule.json

Automatic selection of parsing rules works by testing each rule saved in parsing_rule.json on the first definition line of the fasta file, and selecting the one rule that generates the most metadata matches.

Rules consist of two parameters:

- The regex pattern, including one group for each metadata item, e.g. (?P<group_name>regex_code)
- The list of keys, i.e. the list with the name of each regex group, used as keys in the metadata dictionary

This method updates the three following class attributes: re_rule_name, re_pattern, re_keys*
Type | Default | Details | |
---|---|---|---|
pattern | str | None | None | regex pattern to apply to parse the text, search in parsing rules json if None |
verbose | bool | False | when True, provides information on each rule |
Returns | None |
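The automatic selection described in the docstring, picking the rule that generates the most metadata matches, can be sketched as follows (the candidate rules here are hypothetical; the package stores its rules in parsing_rule.json):

```python
import re

# Hypothetical candidate rules; the real ones live in parsing_rule.json
rules = {
    'ncbi_simple': r"^>(?P<id>\d+):(?P<source>ncbi):(?P<nb>\d*)",
    'generic':     r"^>(?P<id>\S+)",
}

def best_rule(text, rules):
    """Return the (name, pattern) pair whose named groups yield
    the most non-empty metadata matches on `text`."""
    def nmatches(pattern):
        m = re.match(pattern, text)
        return len([v for v in m.groupdict().values() if v]) if m else 0
    return max(rules.items(), key=lambda kv: nmatches(kv[1]))

name, pattern = best_rule('>2591237:ncbi:1', rules)
print(name)  # → ncbi_simple (3 metadata matches vs 1 for 'generic')
```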
Important Note to Developers

Method set_parsing_rules is there to allow TextFileBaseReader’s descendant classes to automatically select a parsing rule, by applying rules saved in a json file to a string extracted from the first element in the file.

It assumes that the iterator returns its elements as dictionaries {section_name: section, ...} and not as a pure string. The key self.text_to_parse_key will then be used to extract the text to parse when testing the rules. The base class iterator returns a simple string, and self.text_to_parse_key is set to None.

To enable setting up a default parsing rule for the reader instance, the iterator must return a dictionary and self.text_to_parse_key must be set to the key in the dictionary corresponding to the text to parse. See the implementation in FastaFileReader.

Calling set_parsing_rules on a class that does not satisfy these requirements will do nothing and return a warning.
reader.set_parsing_rules()
/tmp/ipykernel_8720/3621596767.py:133: UserWarning:
`text_to_parse_key` is not defined in this class.
It is not possible to set a parsing rule. Must be define, e.g. 'definition line'
warnings.warn(msg, category=UserWarning)
Deprecated Items
When any of the following classes and functions is called, it will raise an exception with an error message indicating how to handle the required code refactoring.
Example:
DeprecationWarning                        Traceback (most recent call last)
Input In [140], in <cell line: 1>()
----> 1 TextFileBaseIterator(p2textfile)

Input In [139], in TextFileBaseIterator.__init__(self, *args, **kwargs)
      4 def __init__(self, *args, **kwargs):
      5     msg = """
      6     `TextFileBaseIterator` is deprecated.
      7     Use `TextFileBaseReader` instead, with same capabilities and more."""
----> 8     raise DeprecationWarning(msg)

DeprecationWarning:
`TextFileBaseIterator` is deprecated.
Use `TextFileBaseReader` instead, with same capabilities and more.
TextFileBaseIterator
TextFileBaseIterator (*args, **kwargs)
TextFileBaseIterator
is a deprecated class, to be replaced by TextFileBaseReader