art

Use ART, a next-generation sequencing read simulator, from within a Python notebook

ART is an open-source package that simulates next-generation sequencing reads from genomes. It is available on the website of the National Institute of Environmental Health Sciences. ART is a command-line tool; this module makes it accessible from a Jupyter notebook.

Typical usage

Read simulation with paired reads:
- art_illumina -ss HS25 -sam -i file.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_seq_1
Read simulation with single reads:
- art_illumina -ss HS25 -sam -i file.fa -l 150 -f 10 -o single_seq_1

Where the parameters are:

  -f   --fcov   the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
  -i   --in     the filename of input DNA/RNA reference
  -l   --len    the length of reads to be simulated
  -m   --mflen  the mean size of DNA/RNA fragments for paired-end simulations
  -o   --out    the prefix of output filename
  -p   --paired indicate a paired-end read simulation or to generate reads from both ends of amplicons
                NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
  -s   --sdev   the standard deviation of DNA/RNA fragment size for paired-end simulations.
  -sam --samout indicate to generate SAM alignment file
  -ss  --seqSys The name of Illumina sequencing system of the built-in profile used for simulation
                NOTE: sequencing system ID names are:
                GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp)
                HS10 - HiSeq 1000 (100bp),          HS20 - HiSeq 2000 (100bp),      HS25 - HiSeq 2500 (125bp, 150bp)
                HSXn - HiSeqX PCR free (150bp),     HSXt - HiSeqX TruSeq (150bp),   MinS - MiniSeq TruSeq (50bp)
                MSv1 - MiSeq v1 (250bp),            MSv3 - MiSeq v3 (250bp),        NS50 - NextSeq500 v2 (75bp)
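
The sequencing system IDs listed above can be checked before a run is launched. The snippet below is a minimal sketch, not part of this module: `BUILTIN_PROFILES` and `check_profile` are hypothetical names, the table is transcribed from the list above, and the check assumes that the lengths shown in parentheses are upper bounds on the read length for each profile.

```python
# Built-in profile IDs and read lengths, transcribed from the -ss parameter notes.
BUILTIN_PROFILES = {
    "GA1": (36, 44),
    "GA2": (50, 75),
    "HS10": (100,),
    "HS20": (100,),
    "HS25": (125, 150),
    "HSXn": (150,),
    "HSXt": (150,),
    "MinS": (50,),
    "MSv1": (250,),
    "MSv3": (250,),
    "NS50": (75,),
}


def check_profile(ss: str, read_length: int) -> None:
    """Raise ValueError for an unknown profile ID or an implausible read length.

    Assumption: the longest length listed for a profile is treated as the
    maximum read length that profile can simulate.
    """
    if ss not in BUILTIN_PROFILES:
        known = ", ".join(sorted(BUILTIN_PROFILES))
        raise ValueError(f"Unknown sequencing system {ss!r}; expected one of: {known}")
    if read_length <= 0:
        raise ValueError("read_length must be a positive integer")
    longest = max(BUILTIN_PROFILES[ss])
    if read_length > longest:
        raise ValueError(
            f"read_length {read_length} exceeds the longest listed length "
            f"({longest}) for profile {ss}"
        )
```

Calling `check_profile("HS25", 150)` returns silently, while `check_profile("HS10", 150)` raises a `ValueError` under the assumption stated above.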

Notes:

For single-end simulation, ART requires an input sequence file, an output file prefix, a read length, and a read count or fold coverage.
For paired-end simulation (except for amplicon sequencing), ART also requires the mean and the standard deviation of the DNA/RNA fragment lengths.
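
The requirements above can be turned into a small helper that assembles the command line. This is a sketch only: `build_art_command` is a hypothetical function, not the way this module necessarily builds its call. It uses only the flags documented above and reproduces the two commands from the typical usage section.

```python
from typing import List, Optional


def build_art_command(
    input_file: str,
    output_prefix: str,
    sim_type: str = "single",
    read_length: int = 150,
    fold: int = 10,
    mean_frag: Optional[int] = None,
    std_frag: Optional[int] = None,
    ss: str = "HS25",
    app: str = "art_illumina",
) -> List[str]:
    """Return the art_illumina argument list for a single or paired simulation."""
    if sim_type not in ("single", "paired"):
        raise ValueError("sim_type must be 'single' or 'paired'")
    paired = sim_type == "paired"
    # Paired-end runs also need the fragment length mean and standard deviation.
    if paired and (mean_frag is None or std_frag is None):
        raise ValueError("paired simulation requires mean_frag and std_frag")

    cmd = [app, "-ss", ss, "-sam", "-i", input_file]
    if paired:
        cmd.append("-p")
    cmd += ["-l", str(read_length), "-f", str(fold)]
    if paired:
        cmd += ["-m", str(mean_frag), "-s", str(std_frag)]
    cmd += ["-o", output_prefix]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. Passing a list rather than a single string avoids shell quoting problems with file names.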


ArtIllumina

 ArtIllumina (path2app:str|pathlib.Path, input_dir:str|pathlib.Path,
              output_dir:str|pathlib.Path=None,
              app_in_system_path:bool=False)

Class to handle all aspects of simulating sequencing with art_illumina

	Type	Default	Details
path2app	str | pathlib.Path		full path to the art_illumina application on the system
input_dir	str | pathlib.Path		full path to the directory where the input files are
output_dir	str | pathlib.Path	None	full path to the directory where output files are saved, if different from input_dir
app_in_system_path	bool	False	whether `art_illumina` is in the system path or not

Usage

1. Create an instance of ArtIllumina
2. Run a simulation
3. Export output files

Create an instance of ArtIllumina with:
- the path to the application on the local system
- the directory for input files and, optionally, a separate directory for output files

p2art = Path('/bin/art_illumina')
assert p2art.exists()
p2data = Path('data_dev/ncbi/refsequences/cov')
assert p2data.exists()

art = ArtIllumina(
    path2app=p2art,
    input_dir=p2data,
    )

Ready to operate with art: /bin/art_illumina
Input files from : /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov
Output files to :  /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov
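
When the location of art_illumina differs between machines, it can help to resolve the executable before creating the instance. The helper below is a sketch under stated assumptions: `locate_art` is a hypothetical name, not part of this module, and it relies only on `shutil.which` and `pathlib` from the standard library.

```python
import shutil
from pathlib import Path
from typing import Optional, Union


def locate_art(
    candidate: Optional[Union[str, Path]] = None,
    name: str = "art_illumina",
) -> Optional[Path]:
    """Return the path to the executable, or None if it cannot be found.

    If `candidate` is given, only that explicit path is checked.
    Otherwise the system PATH is searched for `name`.
    """
    if candidate is not None:
        path = Path(candidate)
        return path if path.is_file() else None
    found = shutil.which(name)
    return Path(found) if found else None
```

A result of `None` indicates that the application still needs to be installed or that a different path should be passed as `path2app`.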


ArtIllumina.sim_reads

 ArtIllumina.sim_reads (input_file:str, output_seed:str,
                        sim_type:str='single', read_length:int=150,
                        fold:int=10, mean_read:int=None,
                        std_read:int=None, ss:str='HS25',
                        overwrite:bool=False, print_output:bool=True)

Simulates reads with art_illumina. Output files are saved in a subdirectory of the output directory, named after output_seed.

	Type	Default	Details
input_file	str		name of the fasta file to use as input
output_seed	str		seed (prefix) used to name the output files and their directory
sim_type	str	single	type of read simulation: ‘single’ or ‘paired’
read_length	int	150	length of the reads in bp
fold	int	10	fold coverage to simulate
mean_read	int	None	mean fragment length, for paired reads
std_read	int	None	standard deviation of the fragment length, for paired reads
ss	str	HS25	quality profile (sequencing system) to use for the simulation
overwrite	bool	False	overwrite existing output files if True, raise an error if False
print_output	bool	True	if True, prints art_illumina’s CLI output

Run a single-read simulation

- Provide an input file and a seed for the names of the output files.
- The call prints the log messages issued by art_illumina.

input_fname = 'cov_virus_sequence_one.fa'
output_seed = 'single_1seq_150bp'

art.sim_reads(
    input_file=input_fname,
    output_seed=output_seed,
    sim_type="single",
    read_length=150,
    fold=100,
    overwrite=True
)

return code:  0 


    ====================ART====================
             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

                  Single-end Simulation

Total CPU time used: 0.436844

The random seed for the run: 1738405239

Parameters used during run
    Read Length:    150
    Genome masking 'N' cutoff frequency:    1 in 150
    Fold Coverage:            100X
    Profile Type:             Combined
    ID Tag:                   

Quality Profile(s)
    First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 

Output files

  FASTQ Sequence File:
    /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/single_1seq_150bp/single_1seq_150bp.fq

  ALN Alignment File:
    /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/single_1seq_150bp/single_1seq_150bp.aln

Print excerpts of the output files, then run a paired-read simulation with the same input file.

art.print_last_output_file_excerpts()

========================================================================================================================
File Name: single_1seq_150bp.fq.
--------------------------------------------------------------------------------
@2591237:ncbi:1-20100
GTACCACAGATGTGCACTTTACGTCAGACATTTTAGACTGTACAGTAGCAACCTTGATACATGGTTTACCTCCAATACCTAACAACTTAATGTTAAGCTTGAAAGCATCAATACTACTCTTAGGAGGCAAAAGCCCCTGGGAGTTCATAT
+
CCCGGGGGG1GGCGJJGJJGJJGJGJJJJJJJGJJ=GGJGGJJGJJGCCJGJGGGJGCGCC=GGJCGCGJGJGGCC=GGCGGGGGGGGG8GG=GG8GGCCJGCCCGGCCCGG=CGGGGCGGCGGGGGGGGGGGGGGCGGGGCCGGGGGCG
@2591237:ncbi:1-20099
TACACCCTTTGCCAGCTCGCTATGAGCTGTAGCAACGAGTACCTTAAGTTTTTCCATAGGAACACTAAAAGTTGCTGAAAAGGTGTCGACATAAGCATCAAACATCTTAACAGAAACTTCAGTACTATCTCCAACATCTGATACGAGAGC
+
=CCG=GGGGGGGGJJGJJJCGJGJJJJJJGGJJJJJJGJJJJCJJGJCGGGJGGGGJGJJG(J=JGGGCG=G=CGGGGG=GGGCG8GGGGGGGC8C=GGCJ8G=CGGGGGGGGG=GGGG=1G8G==GCGGGGGCGGGGGGGGCCGCCCGC
@2591237:ncbi:1-20098
ATGTCCTGCCTGTCAAGACCCAGAGATTGGACCTGAGCATAGTGTTGCAGATTATCACAACCACTCAAACATTGAAACTCGACTCCGCAAGGGAGGTAGGACTAGATGTTTTGGAGGCTGTGTGTTTGCCTATGTCGGCTGCTATAACAA
+
CCC1CGG1GGGGGJJJJCGJJ1JGJGJGJJGJGGJGJGGJJGJGJGJJGCGJCJ=JJGCGGGCJG1CGCGC=GCGGGCGCGG=GGCGGGGCG8GCGGGCGCCG=GCGGGGG(GGGCGGGGG=CGGGCGCGGC8CGGGCGCGCCGGGGGGG
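
Each read in the excerpt above takes four lines: an `@` header, the sequence, a `+` separator and a quality string of the same length as the sequence. The sketch below reads such records from any iterable of lines. `FastqRecord`, `parse_fastq` and `phred_scores` are hypothetical helpers, not part of this module, and the Phred+33 offset is an assumption about the quality encoding.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per four-line FASTQ entry, skipping blank lines."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence and quality lengths differ in {header}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer scores (assumes Phred+33 encoding)."""
    return [ord(char) - offset for char in quality]
```

With an open file handle `fh` on one of the `.fq` files, `for record in parse_fastq(fh)` iterates over the reads. The whole file is read into memory first, which is acceptable for small simulations but not for very large outputs.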

input_fname = 'cov_virus_sequence_one.fa'

art.sim_reads(
    input_file=input_fname,
    output_seed='paired_1seq_150bp',
    sim_type="paired",
    read_length=150,
    fold=100,
    mean_read=200,
    std_read=10,
    overwrite=True
)

return code:  0 


    ====================ART====================
             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

                  Paired-end sequencing simulation

Total CPU time used: 0.437712

The random seed for the run: 1738405243

Parameters used during run
    Read Length:    150
    Genome masking 'N' cutoff frequency:    1 in 150
    Fold Coverage:            100X
    Mean Fragment Length:     200
    Standard Deviation:       10
    Profile Type:             Combined
    ID Tag:                   

Quality Profile(s)
    First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 
    First Read:   HiSeq 2500 Length 150 R2 (built-in profile) 

Output files

  FASTQ Sequence Files:
     the 1st reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp1.fq
     the 2nd reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp2.fq

  ALN Alignment Files:
     the 1st reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp1.aln
     the 2nd reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp2.aln

art.print_last_output_file_excerpts()

========================================================================================================================
File Name: paired_1seq_150bp2.fq.
--------------------------------------------------------------------------------
@2591237:ncbi:1-20100/2
TTATAGCAGCCGACATAGGCAAACACACAGCCTCCAAAACATCTAGTCCTACCTCCCTTGCGGAGTCGAGTTTCAATGTTTGAGTGGTTGTGATAATCTGCAACACTATGCTCAGGTCCAATCTCTGGGTCTTGACAGGCAGGACATGGC
+
CCCGGGGGCGGGGGJJJJJGJJJJG8JJ=GJJCGGJCJ1GGCJJGCGGJJJJGGCJGGCGJJ=JCGGG=GGG(C=GCCGC=GGGGCGCGGGGGGGGGG=GCGCCJJJCGGGGCCGCGGCCGGCCGGGCC8CGGGCGC=GGGCGCGCCCCC
@2591237:ncbi:1-20098/2
ATCATTACCGGTCTTCATCCAACACAGGCACCTACACACCTCAGCGTTGACACAAAATTTAAGACTGAGGGACTATGTGTTGACATACCAGGCATACCAAAGGACATGACCTACCGTAGACTCATCTCTATGATGGGTTTTAAAATGAAT
+
=CCGG=G1GGGGGCGJJGGJJJJ8JJJJ=JCGJJGJJGJJJJCJJGGGJJGGGGG=CJGCGGGCCJJ8CG8J=CGGGGCCGGGGGCGCCCGGGGGCGGCG=CJCJCJ=C=GGGCGGCGGGG=CGGGGGGCGGGCGGGGGGGCGCGCGGGC
@2591237:ncbi:1-20096/2
CGGTACTAGACATACCTATCAGCTTCGTGCAAGATCAGTTTCACCAAAACTTTTCATCAGACAAGAGGAAGTTCACCAAGAGCTCTACTCACCGCTTTTTCTCATTGTTGCTGCTCTAGTATTTATAATACTTTGCTTCACCATTAAGAG
+
CCCGGGGGGG=GC1JJJJJJJJ1JJJCGCJJJGCJGJCG(JGGJGJGJGJGGGCCJJCCJJGG8JCGGCGG=G=J8JCGG=8GCCGCGCGGG=C8GCGGG=CJJCJJCGGG1GGG=GG=GCGGC(GCCCGGG=GCGCCGCCGCGCG=GC=

========================================================================================================================
File Name: paired_1seq_150bp1.fq.
--------------------------------------------------------------------------------
@2591237:ncbi:1-20100/1
TGAAGGACCTACTACATGTGGGTACCTACCTACTAATGCTGTAGTGAAAATGCCATGTCCTGCCTGTCAAGACCCAGAGATTGGACCTGAGCATAGTGTTGCAGATTATCACAACCACTCAAACATTGAAACTCGACTCCGCAAGGGAGG
+
CC1CGGGGGGGGGJJJCGJGGJGJJJJGGCGJ=J1JGJJJJJGGGG1GJJGJGJGJGGCGGGGGJGGJJGGGJCCCGCGCGG=GGGCGGGGGGGGG=CG1JCGGC(GGCCGCC8GGGGGGCGGGCGGCCGCGCCCGCGGCCCGGGGC8GC
@2591237:ncbi:1-20098/1
TAGCTTCTTCGCGGGTGATAAACATATTAGCGTAACCATTGACTTGGTAATTCATTTTAAAACCCATCATAGAGATGAGTCTACGGTAGGTCATGTCCTTTGGTATGCCTGGTATGTCAACACATAGTCCCTCAGTTTAAAATTTTGTGT
+
1CCGGGGGGCGGGJJJGJGJJGGJJGJGGG(GJJJGJJCCJJJCJGJJGCGJGJGG(GGGGGGCJGGGJGGGCCGGGGGGGGGCGGCGGGGCGGGCG(GGJCCGGGGCGG=1GGCGCG=GCGCGGG=CGGGCCC1G(G(CG=GCGGGCGG
@2591237:ncbi:1-20096/1
GAATAGCAGAAAGGCTAAAAAGCACAAATAGAAGTCAATTAAAGTGAGCTCATTCATTCTGTCTTTCTCTTAATGGTGAAGCAAAGTATTATAAATACTAGAGCAGCAACAATGAGAAAAAGCGGTGAGTAGAGCTCTTGGTGAACTTCC
+
CCCGGGCGGGGGGJJGJJJJGJGJJGJGJ8JJGGJJCJGG=JJJCJC(JJGJJJGGJJGG1GCGGGJGCCJCGGCJGCJGCGCCCGGCGGGGGGG8GGGGJGGGG1GCCGGGCGGGCCGGCC=G=GGGGCCGG=CG=CGCGGGCGGGGGC

art.list_all_output_files()

paired_1seq_150bp
- paired_1seq_150bp2.aln
- paired_1seq_150bp2.fq
- paired_1seq_150bp1.fq
- paired_1seq_150bp1.aln
single_1seq_150bp
- single_1seq_150bp.fq
- single_1seq_150bp.aln
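
A listing like the one above can also be produced with `pathlib` alone, for instance when the output directory is inspected outside the notebook. This is a sketch that assumes the layout shown in the logs, one subdirectory per output seed. `list_output_files` is a hypothetical helper and is not the implementation of `list_all_output_files`.

```python
from pathlib import Path
from typing import Dict, List, Union


def list_output_files(output_dir: Union[str, Path]) -> Dict[str, List[str]]:
    """Map each subdirectory of output_dir to the sorted names of its files."""
    root = Path(output_dir)
    listing: Dict[str, List[str]] = {}
    for subdir in sorted(p for p in root.iterdir() if p.is_dir()):
        listing[subdir.name] = sorted(f.name for f in subdir.iterdir() if f.is_file())
    return listing


def print_listing(listing: Dict[str, List[str]]) -> None:
    """Print the listing in the same 'directory, then - file' layout."""
    for dirname, filenames in listing.items():
        print(dirname)
        for filename in filenames:
            print(f"- {filename}")
```

Unlike the output above, this version sorts the file names, so the order is stable between runs.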