art

Use ART, a next-generation sequencing read simulator, from within a Python notebook

ART is an open-source package that simulates next-generation sequencing reads from genomes. It is available from the website of the National Institute of Environmental Health Sciences. ART is a command-line tool; this module makes it accessible from a Jupyter notebook.

Typical usage

Where the parameters are:

  -f   --fcov   the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
  -i   --in     the filename of input DNA/RNA reference
  -l   --len    the length of reads to be simulated
  -m   --mflen  the mean size of DNA/RNA fragments for paired-end simulations
  -o   --out    the prefix of output filename
  -p   --paired indicate a paired-end read simulation or to generate reads from both ends of amplicons
                NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
  -s   --sdev   the standard deviation of DNA/RNA fragment size for paired-end simulations.
  -sam --samout indicate to generate SAM alignment file
  -ss  --seqSys The name of Illumina sequencing system of the built-in profile used for simulation
                NOTE: sequencing system ID names are:
                GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp)
                HS10 - HiSeq 1000 (100bp),          HS20 - HiSeq 2000 (100bp),      HS25 - HiSeq 2500 (125bp, 150bp)
                HSXn - HiSeqX PCR free (150bp),     HSXt - HiSeqX TruSeq (150bp),   MinS - MiniSeq TruSeq (50bp)
                MSv1 - MiSeq v1 (250bp),            MSv3 - MiSeq v3 (250bp),        NS50 - NextSeq500 v2 (75bp)
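
The options above map naturally onto an argument list. The following is a minimal sketch of how a wrapper might assemble such a list before handing it to `subprocess.run`. It is not the implementation used by this module: `build_art_command` and its argument names are hypothetical, and only flags from the list above are used. The sketch also assumes that a paired-end run should always be given both a mean fragment size and a standard deviation.

```python
def build_art_command(app, ref, out_prefix, read_length=150, fold=10,
                      paired=False, mean_frag=None, std_frag=None, ss="HS25"):
    """Assemble an art_illumina argument list (hypothetical helper)."""
    cmd = [
        str(app),
        "-ss", ss,                # built-in sequencing system profile
        "-i", str(ref),           # input reference file
        "-l", str(read_length),   # read length
        "-f", str(fold),          # fold coverage
        "-o", str(out_prefix),    # prefix of the output files
    ]
    if paired:
        # Assumption of this sketch: require both fragment-size parameters.
        if mean_frag is None or std_frag is None:
            raise ValueError("paired-end simulation needs mean_frag and std_frag")
        cmd += ["-p", "-m", str(mean_frag), "-s", str(std_frag)]
    return cmd


cmd = build_art_command("art_illumina", "ref.fa", "paired_out",
                        fold=100, paired=True, mean_frag=200, std_frag=10)
print(" ".join(cmd))

# To actually run the command (requires art_illumina to be installed):
# import subprocess
# result = subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.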


ArtIllumina

 ArtIllumina (path2app:str|pathlib.Path, input_dir:str|pathlib.Path,
              output_dir:str|pathlib.Path=None,
              app_in_system_path:bool=False)

Class to handle all aspects of simulating sequencing with art_illumina

  Parameter            Type                 Default   Details
  path2app             str | pathlib.Path             full path to art_illumina application on the system
  input_dir            str | pathlib.Path             full path to dir where input files are
  output_dir           str | pathlib.Path   None      full path to dir where to save output files, if different from input_dir
  app_in_system_path   bool                 False     whether art_illumina is in the system path or not

Usage

  1. Create an instance of ArtIllumina
  2. Run a simulation
  3. Export output files

Create an instance of ArtIllumina with:

  • the path to the application on the local system
  • the directory for input files and, optionally, a separate directory for output files

from pathlib import Path

p2art = Path('/bin/art_illumina')
assert p2art.exists()
p2data = Path('data_dev/ncbi/refsequences/cov')
assert p2data.exists()
art = ArtIllumina(
    path2app=p2art,
    input_dir=p2data,
    )
Ready to operate with art: /bin/art_illumina
Input files from : /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov
Output files to :  /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov

ArtIllumina.sim_reads

 ArtIllumina.sim_reads (input_file:str, output_seed:str,
                        sim_type:str='single', read_length:int=150,
                        fold:int=10, mean_read:int=None,
                        std_read:int=None, ss:str='HS25',
                        overwrite:bool=False, print_output:bool=True)

Simulates reads with art_illumina. Output files are saved in a separate directory named after output_seed.

  Parameter      Type   Default   Details
  input_file     str              name of the fasta file to use as input
  output_seed    str              seed (prefix) used to name the output files
  sim_type       str    single    type of read simulation: 'single' or 'paired'
  read_length    int    150       length of the reads in bp
  fold           int    10        fold coverage to simulate
  mean_read      int    None      mean fragment length, for paired reads
  std_read       int    None      standard deviation of the fragment length, for paired reads
  ss             str    HS25      quality profile (sequencing system) to use for the simulation
  overwrite      bool   False     overwrite existing output files if True, raise an error if False
  print_output   bool   True      if True, prints art_illumina's CLI output

Run a single-read simulation

  • Provide an input file and a seed for the names of the output files
  • Prints out the log messages issued by art_illumina
input_fname = 'cov_virus_sequence_one.fa'
output_seed = 'single_1seq_150bp'

art.sim_reads(
    input_file=input_fname,
    output_seed=output_seed,
    sim_type="single",
    read_length=150,
    fold=100,
    overwrite=True
)
return code:  0 


    ====================ART====================
             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

                  Single-end Simulation

Total CPU time used: 0.436844

The random seed for the run: 1738405239

Parameters used during run
    Read Length:    150
    Genome masking 'N' cutoff frequency:    1 in 150
    Fold Coverage:            100X
    Profile Type:             Combined
    ID Tag:                   

Quality Profile(s)
    First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 

Output files

  FASTQ Sequence File:
    /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/single_1seq_150bp/single_1seq_150bp.fq

  ALN Alignment File:
    /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/single_1seq_150bp/single_1seq_150bp.aln

Print an excerpt of the output file from the single-read run, then run a paired-read simulation with the same input file.

art.print_last_output_file_excerpts()
========================================================================================================================
File Name: single_1seq_150bp.fq.
--------------------------------------------------------------------------------
@2591237:ncbi:1-20100
GTACCACAGATGTGCACTTTACGTCAGACATTTTAGACTGTACAGTAGCAACCTTGATACATGGTTTACCTCCAATACCTAACAACTTAATGTTAAGCTTGAAAGCATCAATACTACTCTTAGGAGGCAAAAGCCCCTGGGAGTTCATAT
+
CCCGGGGGG1GGCGJJGJJGJJGJGJJJJJJJGJJ=GGJGGJJGJJGCCJGJGGGJGCGCC=GGJCGCGJGJGGCC=GGCGGGGGGGGG8GG=GG8GGCCJGCCCGGCCCGG=CGGGGCGGCGGGGGGGGGGGGGGCGGGGCCGGGGGCG
@2591237:ncbi:1-20099
TACACCCTTTGCCAGCTCGCTATGAGCTGTAGCAACGAGTACCTTAAGTTTTTCCATAGGAACACTAAAAGTTGCTGAAAAGGTGTCGACATAAGCATCAAACATCTTAACAGAAACTTCAGTACTATCTCCAACATCTGATACGAGAGC
+
=CCG=GGGGGGGGJJGJJJCGJGJJJJJJGGJJJJJJGJJJJCJJGJCGGGJGGGGJGJJG(J=JGGGCG=G=CGGGGG=GGGCG8GGGGGGGC8C=GGCJ8G=CGGGGGGGGG=GGGG=1G8G==GCGGGGGCGGGGGGGGCCGCCCGC
@2591237:ncbi:1-20098
ATGTCCTGCCTGTCAAGACCCAGAGATTGGACCTGAGCATAGTGTTGCAGATTATCACAACCACTCAAACATTGAAACTCGACTCCGCAAGGGAGGTAGGACTAGATGTTTTGGAGGCTGTGTGTTTGCCTATGTCGGCTGCTATAACAA
+
CCC1CGG1GGGGGJJJJCGJJ1JGJGJGJJGJGGJGJGGJJGJGJGJJGCGJCJ=JJGCGGGCJG1CGCGC=GCGGGCGCGG=GGCGGGGCG8GCGGGCGCCG=GCGGGGG(GGGCGGGGG=CGGGCGCGGC8CGGGCGCGCCGGGGGGG
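
Each read in the excerpt above occupies four lines: an `@` header, the sequence, a `+` separator and a quality string with one character per base. The following is a minimal sketch of how such records could be read and their quality characters converted to scores. `read_fastq` and `phred_scores` are hypothetical helpers, not part of this module, and the Phred+33 offset is an assumption about the encoding of the files.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:                  # end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()               # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual     # drop the leading '@'


def phred_scores(qual, offset=33):
    """Convert a quality string to integer scores (assumes Phred+33 encoding)."""
    return [ord(c) - offset for c in qual]


# Tiny in-memory record, so the example runs without any file on disk
sample = io.StringIO("@read1\nGTAC\n+\nCCGJ\n")
for read_id, seq, qual in read_fastq(sample):
    print(read_id, len(seq), phred_scores(qual))
# prints: read1 4 [34, 34, 38, 41]
```

To read a real file, an open file handle can be passed in place of the `io.StringIO` object.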
input_fname = 'cov_virus_sequence_one.fa'

art.sim_reads(
    input_file=input_fname,
    output_seed='paired_1seq_150bp',
    sim_type="paired",
    read_length=150,
    fold=100,
    mean_read=200,
    std_read=10,
    overwrite=True
)
return code:  0 


    ====================ART====================
             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

                  Paired-end sequencing simulation

Total CPU time used: 0.437712

The random seed for the run: 1738405243

Parameters used during run
    Read Length:    150
    Genome masking 'N' cutoff frequency:    1 in 150
    Fold Coverage:            100X
    Mean Fragment Length:     200
    Standard Deviation:       10
    Profile Type:             Combined
    ID Tag:                   

Quality Profile(s)
    First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 
    First Read:   HiSeq 2500 Length 150 R2 (built-in profile) 

Output files

  FASTQ Sequence Files:
     the 1st reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp1.fq
     the 2nd reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp2.fq

  ALN Alignment Files:
     the 1st reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp1.aln
     the 2nd reads: /home/vtec/projects/bio/metagentools/nbs-dev/data_dev/ncbi/refsequences/cov/paired_1seq_150bp/paired_1seq_150bp2.aln

art.print_last_output_file_excerpts()
========================================================================================================================
File Name: paired_1seq_150bp2.fq.
--------------------------------------------------------------------------------
@2591237:ncbi:1-20100/2
TTATAGCAGCCGACATAGGCAAACACACAGCCTCCAAAACATCTAGTCCTACCTCCCTTGCGGAGTCGAGTTTCAATGTTTGAGTGGTTGTGATAATCTGCAACACTATGCTCAGGTCCAATCTCTGGGTCTTGACAGGCAGGACATGGC
+
CCCGGGGGCGGGGGJJJJJGJJJJG8JJ=GJJCGGJCJ1GGCJJGCGGJJJJGGCJGGCGJJ=JCGGG=GGG(C=GCCGC=GGGGCGCGGGGGGGGGG=GCGCCJJJCGGGGCCGCGGCCGGCCGGGCC8CGGGCGC=GGGCGCGCCCCC
@2591237:ncbi:1-20098/2
ATCATTACCGGTCTTCATCCAACACAGGCACCTACACACCTCAGCGTTGACACAAAATTTAAGACTGAGGGACTATGTGTTGACATACCAGGCATACCAAAGGACATGACCTACCGTAGACTCATCTCTATGATGGGTTTTAAAATGAAT
+
=CCGG=G1GGGGGCGJJGGJJJJ8JJJJ=JCGJJGJJGJJJJCJJGGGJJGGGGG=CJGCGGGCCJJ8CG8J=CGGGGCCGGGGGCGCCCGGGGGCGGCG=CJCJCJ=C=GGGCGGCGGGG=CGGGGGGCGGGCGGGGGGGCGCGCGGGC
@2591237:ncbi:1-20096/2
CGGTACTAGACATACCTATCAGCTTCGTGCAAGATCAGTTTCACCAAAACTTTTCATCAGACAAGAGGAAGTTCACCAAGAGCTCTACTCACCGCTTTTTCTCATTGTTGCTGCTCTAGTATTTATAATACTTTGCTTCACCATTAAGAG
+
CCCGGGGGGG=GC1JJJJJJJJ1JJJCGCJJJGCJGJCG(JGGJGJGJGJGGGCCJJCCJJGG8JCGGCGG=G=J8JCGG=8GCCGCGCGGG=C8GCGGG=CJJCJJCGGG1GGG=GG=GCGGC(GCCCGGG=GCGCCGCCGCGCG=GC=

========================================================================================================================
File Name: paired_1seq_150bp1.fq.
--------------------------------------------------------------------------------
@2591237:ncbi:1-20100/1
TGAAGGACCTACTACATGTGGGTACCTACCTACTAATGCTGTAGTGAAAATGCCATGTCCTGCCTGTCAAGACCCAGAGATTGGACCTGAGCATAGTGTTGCAGATTATCACAACCACTCAAACATTGAAACTCGACTCCGCAAGGGAGG
+
CC1CGGGGGGGGGJJJCGJGGJGJJJJGGCGJ=J1JGJJJJJGGGG1GJJGJGJGJGGCGGGGGJGGJJGGGJCCCGCGCGG=GGGCGGGGGGGGG=CG1JCGGC(GGCCGCC8GGGGGGCGGGCGGCCGCGCCCGCGGCCCGGGGC8GC
@2591237:ncbi:1-20098/1
TAGCTTCTTCGCGGGTGATAAACATATTAGCGTAACCATTGACTTGGTAATTCATTTTAAAACCCATCATAGAGATGAGTCTACGGTAGGTCATGTCCTTTGGTATGCCTGGTATGTCAACACATAGTCCCTCAGTTTAAAATTTTGTGT
+
1CCGGGGGGCGGGJJJGJGJJGGJJGJGGG(GJJJGJJCCJJJCJGJJGCGJGJGG(GGGGGGCJGGGJGGGCCGGGGGGGGGCGGCGGGGCGGGCG(GGJCCGGGGCGG=1GGCGCG=GCGCGGG=CGGGCCC1G(G(CG=GCGGGCGG
@2591237:ncbi:1-20096/1
GAATAGCAGAAAGGCTAAAAAGCACAAATAGAAGTCAATTAAAGTGAGCTCATTCATTCTGTCTTTCTCTTAATGGTGAAGCAAAGTATTATAAATACTAGAGCAGCAACAATGAGAAAAAGCGGTGAGTAGAGCTCTTGGTGAACTTCC
+
CCCGGGCGGGGGGJJGJJJJGJGJJGJGJ8JJGGJJCJGG=JJJCJC(JJGJJJGGJJGG1GCGGGJGCCJCGGCJGCJGCGCCCGGCGGGGGGG8GGGGJGGGG1GCCGGGCGGGCCGGCC=G=GGGGCCGG=CG=CGCGGGCGGGGGC
art.list_all_output_files()
paired_1seq_150bp
- paired_1seq_150bp2.aln
- paired_1seq_150bp2.fq
- paired_1seq_150bp1.fq
- paired_1seq_150bp1.aln
single_1seq_150bp
- single_1seq_150bp.fq
- single_1seq_150bp.aln
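
The listing above groups output files by the directory created for each output seed. A comparable listing can be produced with `pathlib` alone, which may be handy when post-processing the files outside of the class. This is only a sketch: `list_output_files` is a hypothetical helper, not the code behind `list_all_output_files`, and the demo builds a throwaway directory so that it runs without ART being installed.

```python
import tempfile
from pathlib import Path


def list_output_files(output_dir):
    """Map each sub-directory name to the sorted names of the files it contains."""
    root = Path(output_dir)
    return {
        d.name: sorted(f.name for f in d.iterdir() if f.is_file())
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


# Demo on a temporary directory that mimics the layout shown above
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "single_1seq_150bp"
    run_dir.mkdir()
    for name in ("single_1seq_150bp.fq", "single_1seq_150bp.aln"):
        (run_dir / name).touch()
    listing = list_output_files(tmp)

for directory, files in listing.items():
    print(directory)
    for name in files:
        print(f"- {name}")
```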