from pathlib import Path

p2art = Path('/bin/art_illumina')
assert p2art.exists()
p2data = Path('data_dev/ncbi/refsequences/cov')
assert p2data.exists()
art
Use ART, a next-generation read simulation tool, from within a Python notebook
ART is an open-source package for simulating next-generation sequencing reads from genomes, available on the website of the National Institute of Environmental Health Sciences. It is a command-line package; this module makes it accessible from a Jupyter notebook.
Typical usage
- read simulation with paired reads:
art_illumina -ss HS25 -sam -i file.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_seq_1
- read simulation with single reads:
art_illumina -ss HS25 -sam -i file.fa -l 150 -f 10 -o single_seq_1
Where the parameters are:
-f --fcov the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
-i --in the filename of input DNA/RNA reference
-l --len the length of reads to be simulated
-m --mflen the mean size of DNA/RNA fragments for paired-end simulations
-o --out the prefix of output filename
-p --paired indicate a paired-end read simulation or to generate reads from both ends of amplicons
NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
-s --sdev the standard deviation of DNA/RNA fragment size for paired-end simulations.
-sam --samout indicate to generate SAM alignment file
-ss --seqSys The name of Illumina sequencing system of the built-in profile used for simulation
NOTE: sequencing system ID names are:
GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp)
HS10 - HiSeq 1000 (100bp), HS20 - HiSeq 2000 (100bp), HS25 - HiSeq 2500 (125bp, 150bp)
HSXn - HiSeqX PCR free (150bp), HSXt - HiSeqX TruSeq (150bp), MinS - MiniSeq TruSeq (50bp)
MSv1 - MiSeq v1 (250bp), MSv3 - MiSeq v3 (250bp), NS50 - NextSeq500 v2 (75bp)
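The ID names above lend themselves to a small lookup table. The sketch below copies the IDs and read lengths from the note; the helper `check_read_length` is hypothetical, and treating the longest listed length as an upper bound on the requested read length is an assumption, not something stated above.

```python
# Built-in profile IDs and their listed read lengths (bp), copied from the note above.
ART_PROFILES = {
    "GA1": (36, 44),
    "GA2": (50, 75),
    "HS10": (100,),
    "HS20": (100,),
    "HS25": (125, 150),
    "HSXn": (150,),
    "HSXt": (150,),
    "MinS": (50,),
    "MSv1": (250,),
    "MSv3": (250,),
    "NS50": (75,),
}


def check_read_length(ss: str, read_length: int) -> None:
    """Raise ValueError for an unknown system ID or an over-long read.

    Assumption: the longest listed length is treated as an upper bound.
    """
    if ss not in ART_PROFILES:
        raise ValueError(f"unknown sequencing system: {ss!r}")
    longest = max(ART_PROFILES[ss])
    if read_length > longest:
        raise ValueError(f"{ss} profiles go up to {longest} bp, got {read_length}")


check_read_length("HS25", 150)  # passes silently
```

A check like this can catch a mismatched `-ss` / `-l` combination before the external program is launched.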
Notes:
- For single-end simulation, ART requires input sequence file, output file prefix, read length, and read count/fold coverage.
- For paired-end simulation (except for amplicon sequencing), ART also requires the parameter values of the mean and standard deviation of DNA/RNA fragment lengths
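As a rough illustration of how the flags above fit together, the following sketch assembles the argument list for the two typical commands. The function `build_art_command` and its parameter names are hypothetical and are not part of this module; only the flags themselves come from the list above.

```python
import subprocess  # only needed if the command is actually run


def build_art_command(art_path, input_file, output_prefix, read_length=150,
                      fold=10, paired=False, mean_frag=None, std_frag=None,
                      ss="HS25", sam=False):
    """Return the art_illumina argument list for a single or paired simulation."""
    cmd = [str(art_path), "-ss", ss]
    if sam:
        cmd.append("-sam")
    cmd += ["-i", str(input_file)]
    if paired:
        # Paired-end runs need the fragment size mean and standard deviation.
        if mean_frag is None or std_frag is None:
            raise ValueError("paired simulation needs mean_frag and std_frag")
        cmd.append("-p")
    cmd += ["-l", str(read_length), "-f", str(fold)]
    if paired:
        cmd += ["-m", str(mean_frag), "-s", str(std_frag)]
    cmd += ["-o", str(output_prefix)]
    return cmd


cmd = build_art_command("art_illumina", "file.fa", "paired_seq_1", fold=20,
                        paired=True, mean_frag=200, std_frag=10, sam=True)
print(" ".join(cmd))
# art_illumina -ss HS25 -sam -i file.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_seq_1

# To execute it (requires art_illumina to be installed):
# result = subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues when file names contain spaces.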
ArtIllumina
ArtIllumina (path2app:str|pathlib.Path, input_dir:str|pathlib.Path, output_dir:str|pathlib.Path=None, app_in_system_path:bool=False)
Class to handle all aspects of simulating sequencing with art_illumina
Parameter | Type | Default | Details
---|---|---|---
path2app | str \| pathlib.Path | | full path to art_illumina application on the system
input_dir | str \| pathlib.Path | | full path to dir where input files are
output_dir | str \| pathlib.Path | None | full path to dir where to save output files, if different from input_dir
app_in_system_path | bool | False | whether art_illumina is in the system path or not
Usage
- Create an instance of ArtIllumina
- Run a simulation
- Export output files
Create an instance of ArtIllumina with:
- the path to the application on the local system
- the directory for input files and, optionally, a separate directory for output files
art = ArtIllumina(
    path2app=p2art,
    input_dir=p2data,
    )
Ready to operate with art: /bin/art_illumina
Input files from : /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov
Output files to : /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov
nbdev.show_doc(ArtIllumina.sim_reads)
ArtIllumina.sim_reads
ArtIllumina.sim_reads (input_file:str, output_seed:str, sim_type:str='single', read_length:int=150, fold:int=10, mean_read:int=None, std_read:int=None, ss:str='HS25', overwrite:bool=False, print_output:bool=True)
Simulates reads with art_illumina. Output files are saved in a separate directory named after output_seed.
Parameter | Type | Default | Details
---|---|---|---
input_file | str | | name of the FASTA file to use as input
output_seed | str | | seed used to name the output files
sim_type | str | single | type of read simulation: 'single' or 'paired'
read_length | int | 150 | length of the reads, in bp
fold | int | 10 | fold coverage to simulate
mean_read | int | None | mean fragment length, for paired reads
std_read | int | None | standard deviation of the fragment length, for paired reads
ss | str | HS25 | sequencing system whose built-in quality profile is used for the simulation
overwrite | bool | False | overwrite existing output files if True, raise an error if False
print_output | bool | True | if True, print art_illumina's CLI output
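To make the `output_seed` and `overwrite` behaviour more concrete, here is a minimal sketch of how an output location could be prepared before a run. The helper `prepare_output_prefix` is hypothetical, and raising `FileExistsError` is an assumption; the actual error type used by `sim_reads` is not documented here.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_output_prefix(output_dir, output_seed, overwrite=False):
    """Create <output_dir>/<output_seed>/ and return the file prefix inside it."""
    target = Path(output_dir) / output_seed
    if target.exists() and any(target.iterdir()):
        if not overwrite:
            raise FileExistsError(f"{target} already contains files; pass overwrite=True")
        shutil.rmtree(target)  # discard the previous run's files
    target.mkdir(parents=True, exist_ok=True)
    # art_illumina appends extensions such as .fq and .aln to this prefix
    return target / output_seed


with tempfile.TemporaryDirectory() as tmp:
    prefix = prepare_output_prefix(tmp, "single_1seq_150bp")
    print(prefix.relative_to(tmp).as_posix())
    # single_1seq_150bp/single_1seq_150bp
```

Keeping each run in its own subdirectory means a new simulation never mixes its files with those of a run made with a different seed.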
Run a single-read simulation
- Provide an input file and a seed for the names of the output files
- The call prints out the log messages issued by art_illumina
input_fname = 'cov_virus_sequence_one.fa'
output_seed = 'single_1seq_150bp'
art.sim_reads(
    input_file=input_fname,
    output_seed=output_seed,
    sim_type="single",
    read_length=150,
    fold=100,
    overwrite=True
    )
return code: 0
====================ART====================
ART_Illumina (2008-2016)
Q Version 2.5.8 (June 6, 2016)
Contact: Weichun Huang <whduke@gmail.com>
-------------------------------------------
Single-end Simulation
Total CPU time used: 0.436844
The random seed for the run: 1738405239
Parameters used during run
Read Length: 150
Genome masking 'N' cutoff frequency: 1 in 150
Fold Coverage: 100X
Profile Type: Combined
ID Tag:
Quality Profile(s)
First Read: HiSeq 2500 Length 150 R1 (built-in profile)
Output files
FASTQ Sequence File:
/home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/single_1seq_150bp/single_1seq_150bp.fq
ALN Alignment File:
/home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/single_1seq_150bp/single_1seq_150bp.aln
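The fold coverage reported in the log relates read count, read length and reference length through the usual definition, coverage = reads × read length / reference length. The sketch below applies that definition; the 30,000 bp reference length is a hypothetical value chosen for illustration, and how ART itself rounds the read count is not specified here.

```python
def expected_read_count(genome_length: int, read_length: int, fold: float) -> int:
    """Approximate number of reads needed to reach `fold` coverage."""
    return round(fold * genome_length / read_length)


# Hypothetical 30,000 bp reference, 150 bp reads, 100X coverage
print(expected_read_count(30_000, 150, 100))  # 20000
```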
Run a paired-read simulation with the same input file.
art.print_last_output_file_excerpts()
========================================================================================================================
File Name: single_1seq_150bp.fq.
--------------------------------------------------------------------------------
@2591237:ncbi:1-20100
GTACCACAGATGTGCACTTTACGTCAGACATTTTAGACTGTACAGTAGCAACCTTGATACATGGTTTACCTCCAATACCTAACAACTTAATGTTAAGCTTGAAAGCATCAATACTACTCTTAGGAGGCAAAAGCCCCTGGGAGTTCATAT
+
CCCGGGGGG1GGCGJJGJJGJJGJGJJJJJJJGJJ=GGJGGJJGJJGCCJGJGGGJGCGCC=GGJCGCGJGJGGCC=GGCGGGGGGGGG8GG=GG8GGCCJGCCCGGCCCGG=CGGGGCGGCGGGGGGGGGGGGGGCGGGGCCGGGGGCG
@2591237:ncbi:1-20099
TACACCCTTTGCCAGCTCGCTATGAGCTGTAGCAACGAGTACCTTAAGTTTTTCCATAGGAACACTAAAAGTTGCTGAAAAGGTGTCGACATAAGCATCAAACATCTTAACAGAAACTTCAGTACTATCTCCAACATCTGATACGAGAGC
+
=CCG=GGGGGGGGJJGJJJCGJGJJJJJJGGJJJJJJGJJJJCJJGJCGGGJGGGGJGJJG(J=JGGGCG=G=CGGGGG=GGGCG8GGGGGGGC8C=GGCJ8G=CGGGGGGGGG=GGGG=1G8G==GCGGGGGCGGGGGGGGCCGCCCGC
@2591237:ncbi:1-20098
ATGTCCTGCCTGTCAAGACCCAGAGATTGGACCTGAGCATAGTGTTGCAGATTATCACAACCACTCAAACATTGAAACTCGACTCCGCAAGGGAGGTAGGACTAGATGTTTTGGAGGCTGTGTGTTTGCCTATGTCGGCTGCTATAACAA
+
CCC1CGG1GGGGGJJJJCGJJ1JGJGJGJJGJGGJGJGGJJGJGJGJJGCGJCJ=JJGCGGGCJG1CGCGC=GCGGGCGCGG=GGCGGGGCG8GCGGGCGCCG=GCGGGGG(GGGCGGGGG=CGGGCGCGGC8CGGGCGCGCCGGGGGGG
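Each record in the excerpt above spans four lines: an `@` header, the bases, a `+` separator and a quality string. The following sketch reads such records and decodes the quality characters; it assumes the common Phred+33 encoding, and the two short records in `sample` are made up for illustration rather than taken from the simulation output.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer scores (assumes Phred+33)."""
    return [ord(c) - offset for c in qual]


# Made-up records, for illustration only
sample = "@read1\nGTAC\n+\nCCGJ\n@read2\nTTAG\n+\n=C1G\n"
for read_id, seq, qual in read_fastq(StringIO(sample)):
    print(read_id, seq, phred_scores(qual))
```

The same generator works on a real file by passing `open(path)` instead of the `StringIO` object.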
input_fname = 'cov_virus_sequence_one.fa'
art.sim_reads(
    input_file=input_fname,
    output_seed='paired_1seq_150bp',
    sim_type="paired",
    read_length=150,
    fold=100,
    mean_read=200,
    std_read=10,
    overwrite=True
    )
return code: 0
====================ART====================
ART_Illumina (2008-2016)
Q Version 2.5.8 (June 6, 2016)
Contact: Weichun Huang <whduke@gmail.com>
-------------------------------------------
Paired-end sequencing simulation
Total CPU time used: 0.437712
The random seed for the run: 1738405243
Parameters used during run
Read Length: 150
Genome masking 'N' cutoff frequency: 1 in 150
Fold Coverage: 100X
Mean Fragment Length: 200
Standard Deviation: 10
Profile Type: Combined
ID Tag:
Quality Profile(s)
First Read: HiSeq 2500 Length 150 R1 (built-in profile)
First Read: HiSeq 2500 Length 150 R2 (built-in profile)
Output files
FASTQ Sequence Files:
the 1st reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp1.fq
the 2nd reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp2.fq
ALN Alignment Files:
the 1st reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp1.aln
the 2nd reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp2.aln
art.print_last_output_file_excerpts()
========================================================================================================================
File Name: paired_1seq_150bp2.fq.
--------------------------------------------------------------------------------
@2591237:ncbi:1-20100/2
TTATAGCAGCCGACATAGGCAAACACACAGCCTCCAAAACATCTAGTCCTACCTCCCTTGCGGAGTCGAGTTTCAATGTTTGAGTGGTTGTGATAATCTGCAACACTATGCTCAGGTCCAATCTCTGGGTCTTGACAGGCAGGACATGGC
+
CCCGGGGGCGGGGGJJJJJGJJJJG8JJ=GJJCGGJCJ1GGCJJGCGGJJJJGGCJGGCGJJ=JCGGG=GGG(C=GCCGC=GGGGCGCGGGGGGGGGG=GCGCCJJJCGGGGCCGCGGCCGGCCGGGCC8CGGGCGC=GGGCGCGCCCCC
@2591237:ncbi:1-20098/2
ATCATTACCGGTCTTCATCCAACACAGGCACCTACACACCTCAGCGTTGACACAAAATTTAAGACTGAGGGACTATGTGTTGACATACCAGGCATACCAAAGGACATGACCTACCGTAGACTCATCTCTATGATGGGTTTTAAAATGAAT
+
=CCGG=G1GGGGGCGJJGGJJJJ8JJJJ=JCGJJGJJGJJJJCJJGGGJJGGGGG=CJGCGGGCCJJ8CG8J=CGGGGCCGGGGGCGCCCGGGGGCGGCG=CJCJCJ=C=GGGCGGCGGGG=CGGGGGGCGGGCGGGGGGGCGCGCGGGC
@2591237:ncbi:1-20096/2
CGGTACTAGACATACCTATCAGCTTCGTGCAAGATCAGTTTCACCAAAACTTTTCATCAGACAAGAGGAAGTTCACCAAGAGCTCTACTCACCGCTTTTTCTCATTGTTGCTGCTCTAGTATTTATAATACTTTGCTTCACCATTAAGAG
+
CCCGGGGGGG=GC1JJJJJJJJ1JJJCGCJJJGCJGJCG(JGGJGJGJGJGGGCCJJCCJJGG8JCGGCGG=G=J8JCGG=8GCCGCGCGGG=C8GCGGG=CJJCJJCGGG1GGG=GG=GCGGC(GCCCGGG=GCGCCGCCGCGCG=GC=
========================================================================================================================
File Name: paired_1seq_150bp1.fq.
--------------------------------------------------------------------------------
@2591237:ncbi:1-20100/1
TGAAGGACCTACTACATGTGGGTACCTACCTACTAATGCTGTAGTGAAAATGCCATGTCCTGCCTGTCAAGACCCAGAGATTGGACCTGAGCATAGTGTTGCAGATTATCACAACCACTCAAACATTGAAACTCGACTCCGCAAGGGAGG
+
CC1CGGGGGGGGGJJJCGJGGJGJJJJGGCGJ=J1JGJJJJJGGGG1GJJGJGJGJGGCGGGGGJGGJJGGGJCCCGCGCGG=GGGCGGGGGGGGG=CG1JCGGC(GGCCGCC8GGGGGGCGGGCGGCCGCGCCCGCGGCCCGGGGC8GC
@2591237:ncbi:1-20098/1
TAGCTTCTTCGCGGGTGATAAACATATTAGCGTAACCATTGACTTGGTAATTCATTTTAAAACCCATCATAGAGATGAGTCTACGGTAGGTCATGTCCTTTGGTATGCCTGGTATGTCAACACATAGTCCCTCAGTTTAAAATTTTGTGT
+
1CCGGGGGGCGGGJJJGJGJJGGJJGJGGG(GJJJGJJCCJJJCJGJJGCGJGJGG(GGGGGGCJGGGJGGGCCGGGGGGGGGCGGCGGGGCGGGCG(GGJCCGGGGCGG=1GGCGCG=GCGCGGG=CGGGCCC1G(G(CG=GCGGGCGG
@2591237:ncbi:1-20096/1
GAATAGCAGAAAGGCTAAAAAGCACAAATAGAAGTCAATTAAAGTGAGCTCATTCATTCTGTCTTTCTCTTAATGGTGAAGCAAAGTATTATAAATACTAGAGCAGCAACAATGAGAAAAAGCGGTGAGTAGAGCTCTTGGTGAACTTCC
+
CCCGGGCGGGGGGJJGJJJJGJGJJGJGJ8JJGGJJCJGG=JJJCJC(JJGJJJGGJJGG1GCGGGJGCCJCGGCJGCJGCGCCCGGCGGGGGGG8GGGGJGGGG1GCCGGGCGGGCCGGCC=G=GGGGCCGG=CG=CGCGGGCGGGGGC
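In the excerpts above, the two mates of a fragment share a read ID and differ only in the `/1` or `/2` suffix. A minimal sketch of matching mates by that suffix follows; the helpers `split_mate` and `pair_ids` are hypothetical, and the IDs used in the example are the ones shown in the excerpts.

```python
def split_mate(read_id):
    """Split 'name/1' or 'name/2' into (name, mate_number)."""
    base, sep, mate = read_id.rpartition("/")
    if not sep or mate not in ("1", "2"):
        raise ValueError(f"not a paired read id: {read_id!r}")
    return base, int(mate)


def pair_ids(first_ids, second_ids):
    """Return (names present in both files, names present in only one file)."""
    firsts = [split_mate(i)[0] for i in first_ids]
    seconds = {split_mate(i)[0] for i in second_ids}
    paired = [name for name in firsts if name in seconds]
    unpaired = sorted(set(firsts) ^ seconds)
    return paired, unpaired


first = ["2591237:ncbi:1-20100/1", "2591237:ncbi:1-20098/1", "2591237:ncbi:1-20096/1"]
second = ["2591237:ncbi:1-20100/2", "2591237:ncbi:1-20098/2", "2591237:ncbi:1-20096/2"]
paired, unpaired = pair_ids(first, second)
print(len(paired), unpaired)  # 3 []
```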
art.list_all_output_files()
paired_1seq_150bp
- paired_1seq_150bp2.aln
- paired_1seq_150bp2.fq
- paired_1seq_150bp1.fq
- paired_1seq_150bp1.aln
single_1seq_150bp
- single_1seq_150bp.fq
- single_1seq_150bp.aln
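A listing like the one above can be produced with `pathlib` alone. The sketch below is not the implementation of `list_all_output_files`; it is a hypothetical helper that assumes one subdirectory per output seed, as in the paths printed by the simulations, and it builds a throwaway directory tree to demonstrate the result.

```python
import tempfile
from pathlib import Path


def list_output_files(output_dir):
    """Map each subdirectory name to the sorted names of the files it contains."""
    listing = {}
    for sub in sorted(p for p in Path(output_dir).iterdir() if p.is_dir()):
        listing[sub.name] = sorted(f.name for f in sub.iterdir() if f.is_file())
    return listing


with tempfile.TemporaryDirectory() as tmp:
    # Recreate a small part of the layout shown above with empty files
    for seed, names in {"single_1seq_150bp": ["single_1seq_150bp.fq", "single_1seq_150bp.aln"]}.items():
        folder = Path(tmp) / seed
        folder.mkdir()
        for name in names:
            (folder / name).touch()
    for seed, names in list_output_files(tmp).items():
        print(seed)
        for name in names:
            print(" -", name)
```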