Manual for running MULTo

The MULTo package runs most smoothly if all files are saved in a pre-determined file structure in the MULTo folder. The scripts must be run from the MULTo folder to work.

Observe that although it is an optional argument, it can be a good idea to choose the number of processors (option -p) according to the processing power you have avaliable (default is 25 processors).

Standard runs

Genomic MULTo run

The minimal input for a genome run will be following:

python src/MULTo1.0.py -s <speciesName> -a <assembly>

The genomic fasta files will automatically be downloaded to the folder MULTo/files/<speciesName>/<assembly>/fastaFiles/genomeFasta/noRandomChrom/ from the UCSC downloads webpage, unless they are already downloaded to this folder. The fasta files can also be transferred manually to this directory if the user already have them downloaded and stored elsewhere. Alternatively, the original folder containing fasta files can be specified to the program using the option -g.

Bowtie index will automatically created, if it does not already exist in the folder MULTo/files/<speciesName>/<assembly>/bowtie_indexes/<assembly>_genome.

The output will be created in MULTo/files/<speciesName>/<assembly>/MULfiles/<assembly>_<kMin>-<kMax>/.

Trascriptome MULTo run

The transcriptome MULTo run can be either on gene level (reads mapping to overlapping transcripts can be considered unique) or on transcript level (only reads mappin to a single transcript are unique).

Gene level command:

python src/MULTo1.0.py -s <speciesName> -a <assembly> -t <annotationType> -G

Transcript level command:

python src/MULTo1.0.py -s <speciesName> -a <assembly> -t <annotationType> -T

The genomic fasta files for the assembly are needed to create transcriptome fasta files, and a bowtie-index for the genome and transcriptome combined. They will be downloaded as above. The annotation for the given annotation type will be downloaded to the folder MULTo/files/<speciesName>/<assembly>/fastaFiles/annotationFiles from UCSC download webpage. Annotation types can be refGene, ensGene or knownGene. In some gene annotations, gene copies occuring in different locations are described by the same identifier. These identifiers will be changed before running through MULTo, to make sure every annotated gene has a unique identifier. Reads that might be unique to a set of gene copies will therefore be counted as non-unique due to mapping to several genomic locations.

The output (for gene level) will be created in MULTo/files/<speciesName>/<assembly>/MULfiles/<assembly>_<annotationType>_geneLevel_<kMin>-<kMax>/.

Countonly (-C) that gives only an estimation of number of unique positions in each transcript can be used for transcriptome runs.

Bisulfite MULTo run

python src/MULTo1.0.py -s <speciesName> -a <assembly> -B

The bisulfite genome will not have a reverse strand that is complementary to the forward strand, since all C:s are converted to T:s. Therefore the default bisulfite run creates separate uniqueness files for the forward and reverse strand. The option -r can be used to change the run to only forward or only reverse strand.

Bisulfite genome fasta files are automatically created from the genom fasta files. One bowtie-index is created for the forward strand and one for reverse strand.

The output will be created in MULTo/files/<speciesName>/<assembly>/MULfiles/<assembly>_bisulfite_<kMin>-<kMax>/.

Paired end MULTo run

Gene level command:

python src/MULTo1.0.py -s <speciesName> -a <assembly> -t <annotationType> -G -P -C -l <insertLength>,<std>,<noRand> -k <kMin> -m <kMax>

Paired end mode is only supported for transcriptome runs, and will not give positional information about uniqueness, but rather an estimation of the number of unique positions at different read lengths in every transcript given a fragment length distribution (countonly, option -C). Fragment length distribution is specified by the -l option. This option requires a comma-separated list of 3 numbers. The first number is the mean fragment length, the second one is the standard deviation from this mean, and the third one is the number of randomized fragments that should be created for each position in the transcript. The randomized fragments are drawn from a gaussian distribution with the given mean and standard deviation. Default is -l 250,25,5.

Observe that you may need to change kMax from default (255), since too long reads will create mate pairs that overlap each other. If one mate is contained entirely within the other bowtie will consider the mapping invalid.

The output (for gene level) will be created in MULTo/files/<speciesName>/<assembly>/MULfiles/<assembly>_pairedEnd<insertLength>-<std>-<noRand>_<annotationType>_geneLevel_<kMin>-<kMax>/.

Examples

Genome

command:

python src/MULTo1.0.py -s mmu -a mm9 -p 30 -k 20 -m 255

output folder:

MULTo/files/mmu/mm9/MULfiles/mm9_20-255/

Gene level transcriptome

command:

python src/MULTo1.0.py -s mmu -a mm9 -p 30 -k 20 -m 255 -t refGene -G

output folder:

MULTo/files/mmu/mm9/MULfiles/mm9_refGene_geneLevel_20-255/

Transcript level transcriptome

command:

python src/MULTo1.0.py -s mmu -a mm9 -p 30 -k 20 -m 255 -t refGene -T

output folder:

MULTo/files/mmu/mm9/MULfiles/mm9_refGene_transLevel_20-255/

Bisulfite genome

command:

python src/MULTo1.0.py -s mmu -a mm9 -p 30 -k 20 -m 255 -B

output folder:

MULTo/files/mmu/mm9/MULfiles/mm9_bisulfite_20-255/

Paired end gene level transcriptome

command:

python src/MULTo1.0.py -s mmu -a mm9 -p 30 -k 20 -m 255 -t refGene -T -P -C -l 500,50,5

output folder:

MULTo/files/mmu/mm9/MULfiles/mm9_pairedEnd500-50-5_refGene_geneLevel_20-255/

Options

Non-optional arguments for uniqueness query from file

-a <assembly>	The name of the genome assembly to be used (e.g. hg19 for Human Genome version 19)
-s <species>	KEGG three letter abbreviation for the species (e.g. hsa for homo sapiens, mmu for mus musculus)
-t <annotationType>	This option is required for transcriptome runs. The supported annotation types are refGene, ensGene, and knownGene.

Inputs to change the type of run

-G	Activates gene level transcriptome runs.
-T	Activates transcript level transcriptome runs.
-P	Activates paired end runs. This option must be used together with either -G or -T (i.e. a transcriptome run), and the output of a paired end run can only be in the countonly (-C) format.
-C	Activates the countonly mode.
-B	Activates the bisulfite run mode. This is only supported for genome runs, not for transcriptome (-T, -G) runs.
-r <bisulfiteRunMode>	Bisulfite run mode can be 0, 1 or 2. Use this option to choose if the bisulfite genome for forward (0), reverse (1) or both strands (2) should be created. Due to the conversion of all C:s to T:s, the forward and reverse strand are not reverse complementary and can differ some in uniqueness. Default value is 2, i.e. Uniqueness files will be created for both strands.
-S	This option can be used to bin small chromosomes together to increase the efficiency of the mapping. In genome runs, each chromosomes will by default be processed separately. If the chromosomes are very small, it is more efficient to bin several chromosomes together.

Input for changing properties of the run

-v <missmatches>	Number of missmatches allowed in the bowtie mapping. Default is 0.
-p <processors>	Number of processors MULTo is allowed to use. If -p is higher than 15, 15 processors will be reserved for running bowtie and the rest will be used to run blocks of chromosomes or transcripts in parallell. If -p is lower than 15, all but 1 processor will be reserved for running bowtie, and no parallell processing will be allowed. Default is 25.
-k <kMin>	Minimum read length queried. Default is 20.
-m <kMax>	Maximum read length queried. Read lengths up to 255 are supported. Default is 255.
-l <insertLength>, <std>,<noRand>	Parameters needed for paired end runs, specifying a gaussian distribution of fragment lengths. Should be a comma separated list of 3 elements. The first one is mean insert length and the second is the standard deviation. The third parameter specifies how many fragments should be sampled from each position in each transcript. Default is 250,25,5
-z <blocksize>	Blocksize, i.e. the number of reads that should be run together through bowtie. The number is optimized to 10 000 000. Bowties mapping speed per read increases with the number of read run simultaneously, reaching a plateau around 10 000 000 reads. Using bigger blocks can slow the processing down, due having files too large to keep in memory.

Specifying files from non-default directories

-g <genomeFastaDir>	The directory of the input genome fasta files. Can be used if the files are not in the default MULTo directory. However, it is recommended to move the files to the default directory instead.
-o <outputDir>	Output directory, if the user wants the output in another directory than the default.
-b <bowtie-index>	Bowtie-index, if it is not in the default folder.
-n <annotationFile>	Annotation file, if it is not in the default folder. Only needed for transcriptome runs.
-j <exonPositionFile>	A json file created by mappingConversionSimple.py. The file contains exon-position info for transcripts, to enable matching transcriptome positions to genomic positions. This file is by default automatically created by MULTo, and does usually not need manual input. Only needed for transcriptome runs.
-c <chromFasta>	Chromosome fasta file. Use as input if only one chromosome should be processed.
-e <blockDir>	Block directory, containing temporary files needed internally for MULTo runs. The directory is automatically created by MULTo, and should never need manual input.

Optional arguments for overwriting files and resuming runs

-O	Overwrite existing uniqueness files. By default, MULTo will not try to create files that already exists.
-D	Use this option if the fasta-dictionary and output-dictionary files are already created. This files are created by MULTo in at the start of the run. The fasta dictionary files contains all the reads with maximum read length. Using this option will make a re-run faster. Can be used if the script was interrupted after these files had been created.
-R	Use this option to indicate that this is a re-run, and some uniqueness files may already be created. Using this option avoids having to start the script from the beginning in case of an interruption.
-I	Use to create new bowtie indexes even if they are already created.
-F	To create new fasta files (transcriptome or bisulfite), even if they are already created.

Help and debugging

-K	Debugging option, will keep intermediate files.
--print-help	Will show the help message. The help message will also print if the script is ran with insufficient input.

Manual MULToSearch

MULToSearch.py can be used to retrieve uniqueness values for particular regions in the genome. The input to the script can be either a single region, or a file containing several regions. When a single region is queried the script can return either the uniqueness at a single read length or at a range of read lengths. When the input is a file of several regions, only a single read length can be queried.

The output can be either a count of the number of unique positions, or a positional list of the uniqueness in each position, or both.

By default, the script will only return uniqueness information for reads that fit within the region, which means that no uniqueness information can be given for positions closer than kMin to the given end point. If the uniqueness of the whole region is requested, use the -W flag.

The script supports searches in both normal genome and bisulfite genome uniqueness files.

Single regions queries

python src/MULToSearch.py -p <species> -a <assembly> -s <startPosition> -e <endPosition> -c <chromosome> -t <strand> -m <kMin> -x <kMax>

By default, the output of a single region query will be both positional and counts of unique positions. These queries can be done either with a single read length (given by -k ) or a range of read lengths (given by -m and -x). The output will be printed to standard output.

Query from file

python src/MULToSearch.py -p <species> -a <assembly> -i <inputFile> -o <outputFile> -g <fileType> -k <kmereTreshold>

By default, the output of file queries will be only counts of unique positions. Positional information can be obtained by adding the --all flag. These queries can only be done at a single read length. The output will be saved to the specified output file.

Options

Non-optional arguments

-a <assembly>	The assembly to be queried.
-p <species>	The species to be queried.
-u <multoDir>	A link to the folder where the MULTo-files to be queried are located. If assembly and species are given, and the MULTo-files are created in the default folders, this option can be omitted. Observe that the folder that should be given is the “MULTo_files”-folder and not the “fw” or “rv” folders for bisulfite genome.

Options for single region queries

-c <chromosome>	Chromosome for single region query.
-s <startPosition>	Start position for single region query. The position should be 1-based (i.e. first position in the chromosome is position 1), and the given position will be included in the search.
-e <endPosition>	End position for single region query. The position should be 1-based (i.e. first position in the chromosome is position 1), and the given position will be included in the search.
-t <strand>	Strand for single region query. Should be + or -.

Options for file queries

-i <inputFile>	Input file in either gtf, gff or bed format, specifying regions to be queried.
-o <outputFile>	Output file, where the results will be saved in case of a query from an input file.
-g <fileType>	The type of input file (bed, gtf and gff is supported). If the file ending is .gtf, .gff or .bed, this option can be omitted.

Run type options

-k <kmereTreshold>	The read length to query. Default is 50.
-m <kMin>	The minimum read length to query. Default is 20.
-x <kMax>	The maximum read length to query. Default is 255.
-W	Use this option to get the uniqueness for all position within the region, not considering that the read needs to fit both start and end positions within the region. By default, this option is off.
-B	Use for querying bisulfite uniqueness files.

Output options

--counter	Will give a count of number of unique positions, total number of positions and the fraction of unique positions as output. This option is default when the input is a file with regions.
--positions	Will output positional information about uniqueness. If the kmereTreshold is used, it will ouput a 1 for a unique position or a 0 for a non-unique. If kMin and kMax is used, the output will be a list of MUL-values.
--all	Both --counter and --positions output will be obtained. Default for single region queries.