Summary statistics for a fasta file

In my previous post (Calculate length of all sequences in a multi-fasta file), one of the users commented that it would be useful to have summary statistics for the sequence length distribution, or a histogram. This simple script can be used along with the previous script to get those summary statistics.

#!/bin/sh
# reads stdin and prints summary statistics
# total, count, mean, median, min and max
# you can pipe output from other commands into this script, or redirect input with <
# modified from http://unix.stackexchange.com/a/13779/27194
sort -n | awk '
  $1 ~ /^[0-9]+(\.[0-9]+)?$/ {
    a[c++] = $1;
    sum += $1;
  }
  END {
    ave = sum / c;
    if ( (c % 2) == 1 ) {
      median = a[ int(c/2) ];
    } else {
      median = ( a[c/2] + a[c/2-1] ) / 2;
    }
    # the %'"'"'d format prints grouped thousands (GNU awk)
    printf ("Total:\t%'"'"'d\n", sum);
    printf ("Count:\t%'"'"'d\n", c);
    printf ("Mean:\t%'"'"'d\n", ave);
    printf ("Median:\t%'"'"'d\n", median);
    printf ("Min:\t%'"'"'d\n", a[0]);
    printf ("Max:\t%'"'"'d\n", a[c-1]);
  }
'

To use this script, first make it executable, then pipe the output of the previous script (seq_length.py, see post: Calculate length of all sequences in a multi-fasta file) into it.

chmod +x summary_stats.sh
seq_length.py input_file.fasta | cut -f 2 | summary_stats.sh

You will see formatted output like the following:

Total:  1,825,408
Count:  1,155
Mean:   1,580
Median: 1,360
Min:    1,001
Max:    12,972
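
Note that seq_length.py simply prints a sequence name and its length on each line, which is why the pipeline above keeps only column 2. If you don’t have that script handy, a rough awk equivalent (a sketch, not the original script) that produces the same tab-separated output is:

# print each sequence name and its length, then keep just the lengths for the stats script
awk '/^>/ {if (name) print name "\t" len; name = substr($1, 2); len = 0; next}
     {len += length($0)}
     END {if (name) print name "\t" len}' input_file.fasta | cut -f 2 | summary_stats.sh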

For getting a histogram, I suggest using a pre-existing package called data_hacks (on GitHub). The installation is fairly simple (see its README), and it can be used with the above scripts. Once you have downloaded and installed the package, run the above script to get the min/max values. These can be plugged into the data_hacks script to get the histogram (you can also save the data to be plotted later in any other way).

seq_length.py input_file.fasta | cut -f 2 | histogram.py --percentage --max=12972 --min=1001

The output you will get is:

# NumSamples = 1155; Min = 1001.00; Max = 12972.00
# Mean = 1580.439827; Variance = 624229.340751; SD = 790.081857; Median 1360.000000
# each ∎ represents a count of 13
 1001.0000 -  2198.1000 [  1019]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (88.23%)
 2198.1000 -  3395.2000 [   108]: ∎∎∎∎∎∎∎∎ (9.35%)
 3395.2000 -  4592.3000 [    17]: ∎ (1.47%)
 4592.3000 -  5789.4000 [     6]:  (0.52%)
 5789.4000 -  6986.5000 [     2]:  (0.17%)
 6986.5000 -  8183.6000 [     0]:  (0.00%)
 8183.6000 -  9380.7000 [     1]:  (0.09%)
 9380.7000 - 10577.8000 [     1]:  (0.09%)
10577.8000 - 11774.9000 [     0]:  (0.00%)
11774.9000 - 12972.0000 [     1]:  (0.09%)

I hope this post helps!

PBS: How to submit jobs that depend on previously submitted jobs?

To submit jobs one after the other (i.e., run the second job after the completion of the first), we can use the depend option of qsub.

First, submit the first job as normal:

qsub first_job.sub

You will get the job ID as output:

1234567.hpc

Next, submit the second job in the following way, using that job ID:

qsub -W depend=afterok:1234567 second_job.sub

Both jobs will be queued, but the second job won’t start until the first job has finished successfully.

You can also automate this step using a simple bash script

#!/bin/bash
FIRST=$(qsub first_job.sub)
SECOND=$(qsub -W depend=afterok:$FIRST second_job.sub)
THIRD=$(qsub -W depend=afterok:$SECOND third_job.sub)
FOURTH=$(qsub -W depend=afterok:$THIRD fourth_job.sub)
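
If you have a longer chain of jobs, the same idea can be written as a loop. Here is a minimal sketch, assuming your submission scripts are named job1.sub through job4.sub:

#!/bin/bash
# submit a chain of jobs, each one waiting for the previous one to finish successfully
PREV=""
for script in job1.sub job2.sub job3.sub job4.sub; do
    if [ -z "$PREV" ]; then
        PREV=$(qsub "$script")
    else
        PREV=$(qsub -W depend=afterok:"$PREV" "$script")
    fi
    echo "submitted $script as job $PREV"
done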


Running interactive jobs on clusters

Sometimes you might have to run jobs on clusters interactively (programs that aren’t optimized to run on their own, programs that need user input at various steps, or simply for debugging purposes). In those cases, it is better to run them in virtual terminals using the screen command. I normally use this to order contigs with Mauve, where I need a lot of memory/processors; running this on the head node would get me an email from the sysadmin. Here is an example:

After you SSH into the head node, open up a new screen session; assume that you will be running program1 here.

screen -S program1

This will start a new virtual terminal that doesn’t need your local machine to stay connected to keep it alive (so it will keep running even after you close all your windows or turn off your PC). You can detach from this screen at any time by pressing Ctrl+a followed by d. This will bring you back to the first terminal you SSH’ed into.

You can open several of these screen sessions, for program2, program3, and so on:

screen -S program2
#press 'Ctrl+a' followed by 'd'
[detached]
screen -S program3
#press 'Ctrl+a' followed by 'd'
[detached]

Once you are back on the head node, you can see all the running screen sessions by typing:

screen -ls
There are screens on:
        1234.program1   (Detached)
        1235.program2   (Detached)
        1236.program3   (Detached)

You can re-attach to any screen you want by typing:

screen -r 1234.program1

To close a particular screen session, simply attach to that session, press Ctrl+a followed by :, and type quit.
To get help, press Ctrl+a followed by ?.

Once you open a screen session, you can request an interactive run via PBS using:

qsub -I -l mem=256Gb,nodes=1:ppn=32,walltime=48:00:00 -N program1
# this might depend on your cluster settings

After your job gets accepted, you can start running any program just like you would on the head node, without it being killed by the sysadmin.

Alternatively, you can add this as an alias to your .bashrc file, so that you can quickly open an interactive session just by typing qlive:

alias qlive="qsub -I -l mem=256Gb,nodes=1:ppn=32,walltime=48:00:00 -N stdin"
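
After reloading your .bashrc, you can use the alias from inside a screen session, for example (the session name is just illustrative):

source ~/.bashrc    # pick up the new alias
screen -S program1  # open a screen session for this run
qlive               # request the interactive allocation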

I hope this helps!


Simple script to count the number of reads in FASTQ file

If you want to quickly count the number of reads in a FASTQ file, you can count the total number of lines and divide it by 4. However, when you want to generate a nice table of read counts while writing a report, this becomes an inconvenience. So, here is my simple bash script that does the job:

#!/bin/bash
if [ $# -lt 1 ] ; then
	echo ""
	echo "usage: count_fastq.sh fastq_file1 [fastq_file2] ... (or *.fastq)"
	echo "counts the number of reads in each fastq file"
	echo ""
	exit 0
fi

for i in "$@"
do
	# each read occupies 4 lines in a fastq file
	lines=$(wc -l < "$i")
	count=$((lines / 4))
	echo -n -e "\t$i : "
	# insert commas as thousands separators (GNU sed)
	echo "$count" | \
	sed -r '
	  :L
	  s=([0-9]+)([0-9]{3})=\1,\2=
	  t L'
done
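
To use it, make the script executable and pass one or more FASTQ files (the file names below are just placeholders):

chmod +x count_fastq.sh
./count_fastq.sh sample1.fastq sample2.fastq
# or count every fastq file in the current directory
./count_fastq.sh *.fastq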

Extract sequences from a FASTA file given the co-ordinates

Many times, it might be essential to grab a portion of a FASTA sequence to perform downstream analyses. One such case would be to extract all the genes from a whole genome (for example, all chloroplast and mitochondrial genes using just the GFF file). There are several ways to do this.

1. If you are familiar with Galaxy, there is an “Extract Genomic DNA” tool that can do this job. You can use either GFF or GTF datasets (if your data is not in one of these formats, you can easily convert it). This is pretty straightforward.


2. Using BEDTools, a utility called ‘fastaFromBed’ can do this (an example BED file is shown after this list). Simply:

fastaFromBed -fi in.fasta -bed regions.bed -fo out.regions.fa
3. The GLIMMER package has two scripts that can be used to extract sequences based on coordinates:
extract [options] sequence coords

This program reads a genome sequence and a list of coordinates for it, and outputs a multi-fasta file of the regions specified by the coordinates.

multi-extract [options] sequences coords

This program is a multi-fasta version of the preceding program.
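
For the BEDTools option above, regions.bed is just a tab-separated file listing the sequence name, a 0-based start, an end coordinate, and an optional feature name; a minimal example (the names and coordinates are made up):

chr1	100	250	gene1
chr1	900	1500	gene2
chrM	0	300	trnF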


How to install an RPM package

If you have access to a large computing cluster (HPC), chances are it will be running the Red Hat Enterprise Linux operating system, and you will be a regular user without privileges to install programs system-wide. Sometimes it might still be necessary to install some programs. If you have a good support team running the cluster, installation can be as easy as sending them a request email. But if, like me, you maintain programs for your group, then you’ll more likely end up installing one package or another for the users in your group. With RHEL, some packages are only available as RPMs (all CentOS RPM packages are compatible with RHEL). You can download them (either source or binaries) from various locations such as rpmfind.net, pkgs.org, etc. To install them, follow these steps:

rpm2cpio package.rpm | cpio -idv
# extracts the contents of the RPM into the current directory
# for a source RPM you will most likely find a tar.gz file, and sometimes a few patch files
# first extract the tar.gz file and cd into it
tar -xzf package.tar.gz
cd package
# apply the patch(es), if any
patch -Np1 -i ../name_of_the_file.patch
# then build and install as normal, into a directory you can write to
./configure --prefix=/path/to/install/programs
make
make install
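
If the RPM you downloaded is a binary package rather than a source package, there is usually nothing to build; the extracted tree can be used directly (the package name below is just an example):

rpm2cpio some_tool.rpm | cpio -idv
# the files end up under ./usr/..., so point your PATH at them
export PATH=$PWD/usr/bin:$PATH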

Convert FASTQ to FASTA

Using SED

sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta

NOTE: This is, by far, the fastest way to convert FASTQ to FASTA (the 1~4 address syntax requires GNU sed)

Using PASTE

cat INFILE.fastq | paste - - - - | \
cut -f 1,2 | sed 's/^@/>/' | \
tr "\t" "\n" > OUTFILE.fasta

EMBOSS:seqret

seqret -sequence reads.fastq -outseq reads.fasta

Using AWK

cat infile.fq | \
awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa

FASTX-toolkit

fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
# Remember to use -Q33 for illumina reads!
version 0.0.6
       [-h]         = This helpful help screen.
       [-r]         = Rename sequence identifiers to numbers.
       [-n]         = keep sequences with unknown (N) nucleotides.
                    Default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                    If [-o] is specified, report will be printed to STDOUT.
                    If [-o] is not specified (and output goes to STDOUT),
                    report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA output file. default is STDOUT.
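
So a typical invocation (with placeholder file names) would look like:

# -Q33 for newer Illumina quality encoding, -n to keep reads containing Ns
fastq_to_fasta -Q33 -n -i INFILE.fastq -o OUTFILE.fasta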

Split multi-fasta sequence file

Sometimes it is necessary to split a large file containing several sequences (FASTA format) into individual files. I do this with a simple awk command, where I separate sequences based on a regular expression match and then write each one to a sequentially numbered file. It is easy and quick!

awk '/^>/{s=++d".fasta"} {print > s}' <inputFile>
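
If you would rather have the output files named after the sequence IDs instead of numbers, a small variation works (a sketch, assuming the IDs contain no characters that are invalid in file names; with very many sequences you may also hit the open-file limit of some awk implementations):

awk '/^>/{s=substr($1,2)".fasta"} {print > s}' <inputFile>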

Bash loops for productivity

Working in bash is so much fun! If you spend enough time in the terminal, you might get addicted to it and never like GUI windows again. There are commands (especially loops) that can save you a lot of time; they are very useful for routine tasks. My favorite loops are as follows:
Loop through all the files with the txt extension and perform an action on each:

for f in *.txt; do yourcommand $f >$f.out; done

Read a file line by line and run a command on each line:

while read line; do yourcommand $line; done<FileToRead.txt

Another variation of the above command (extremely useful when you have to read arguments from a file):

while read fld1 fld2 fld3; do YourCommand -a $fld1 -b $fld2 -c $fld3 > $fld1.$fld2.$fld3.txt; done<FileToRead.txt

A simple loop over a set of numbers (you can also use {a..z} etc., or mix them):

for i in {1..10}; do echo $i; done

Another variation of the above command

for i in {1..22} X Y; do echo "human chromosome $i"; done

I hope these will help you too!


Serving HTTP from your Linux terminal

There is a very useful Python module called SimpleHTTPServer that lets you create a web server instantly. Say you have a huge file (50-60 GB) and you want to send it to someone across the campus; then you have a use for SimpleHTTPServer.

First, get the hostname of your server: just type hostname and press enter, and it will return yourhostname.

Then type,

python -m SimpleHTTPServer

That’s it! You can now access all the files in that directory through a browser using this URL:

http://yourhostname:8000/

It will stay up as long as you let that command run (to kill it, just type Ctrl + C).
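
By default the module listens on port 8000. You can pass a different port number, and on Python 3 the same module is called http.server:

python -m SimpleHTTPServer 8080
# Python 3 equivalent
python3 -m http.server 8080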