Calculate the length of all sequences in a multi-FASTA file

Sometimes it is essential to know the length distribution of your sequences. They may be newly assembled scaffolds, a genome whose chromosome sizes you want to know, or any other multi-FASTA sequence file. A simple way to do it is with Biopython.

For example, save this script (e.g., as sequence_length.py):

from Bio import SeqIO
import sys

# print each sequence ID and its length, tab-separated
for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    output_line = '%s\t%i' % (seq_record.id, len(seq_record))
    print(output_line)

To run it:

python sequence_length.py input_file.fasta

This will print the length of every sequence in that file.
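If Biopython is not available, the same idea takes only a few lines of plain Python; here is a minimal sketch (the function name fasta_lengths is my own):

```python
def fasta_lengths(fasta_text):
    """Return (sequence_id, length) for every record in multi-FASTA text."""
    lengths = []
    seq_id, length = None, 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if seq_id is not None:
                lengths.append((seq_id, length))
            seq_id, length = line[1:].split()[0], 0  # ID is the first word of the header
        else:
            length += len(line.strip())
    if seq_id is not None:
        lengths.append((seq_id, length))
    return lengths
```

Reading the file with open() and passing its contents to this function gives the same ID/length table as the Biopython script.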

PBS: How to submit jobs that depend on previously submitted jobs?

To submit jobs one after another (i.e., to run a second job after the first one completes), we can use the depend option of qsub.

First, submit the first job as normal:

qsub first_job.sub

You will get the job ID (jobid#) as output.


Next, submit the second job the following way:

qsub -W depend=afterok:1234567 second_job.sub

Both jobs will be queued, but the second job won’t start until the first job has finished.

You can also automate this step using a simple bash script

FIRST=$(qsub first_job.sub)
SECOND=$(qsub -W depend=afterok:$FIRST second_job.sub)
THIRD=$(qsub -W depend=afterok:$SECOND third_job.sub)
FOURTH=$(qsub -W depend=afterok:$THIRD fourth_job.sub)


Calculating genome size from flow cytometry results

Flow cytometry (FCM) is a laser-based biophysical technology employed in various molecular biology studies. Applications of FCM include cell counting, cell sorting, biomarker detection and protein engineering. It is also now being used to measure nuclear DNA content in plants (using DNA-selective fluorochromes), and it is replacing older methods of estimating DNA content, such as Feulgen densitometry, because of its simple sample preparation and high throughput. Since polyploidy is common in plants, this method is also ideal for estimating ploidy level. It works by suspending cells in a stream of fluid and passing them by an electronic detection apparatus (Wikipedia). You usually get the results as pg (picograms) of DNA. To convert to base pairs (bp), which gives a more practical measure of DNA content, you can use these conversions:

genome size (bp) = (0.978 x 10^9) x DNA content (pg)
so, roughly:
1 pg = 978 Mb
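The conversion is simple enough to wrap in a pair of helper functions; a minimal Python sketch (the function names are my own):

```python
PG_TO_BP = 0.978e9  # 1 pg of DNA is roughly 0.978 x 10^9 bp

def pg_to_bp(pg):
    """Genome size in base pairs from DNA content in picograms."""
    return PG_TO_BP * pg

def bp_to_pg(bp):
    """DNA content in picograms from genome size in base pairs."""
    return bp / PG_TO_BP
```

For example, a 2C value of 2.5 pg corresponds to roughly 2,445 Mb.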

Running interactive jobs on clusters

Sometimes you might have to run jobs on clusters interactively (programs that aren’t optimized to run on their own, programs that need user input at various steps, or simply for debugging purposes). In those cases, it is better to run them in virtual terminals using the screen command. I normally use this to order contigs with Mauve, where I need a lot of memory/processors; running this on the head node would get me an email from the sysadmin. Here is an example:

After you SSH into the head node, open a new screen session; let’s assume you will be running program1 here.

screen -S program1

This will start a new virtual terminal that lives on the server, so it doesn’t need your computer to stay on (it keeps running even after you close all your windows or turn off your PC). You can detach from this screen at any time by pressing Ctrl+a followed by d. This brings you back to the terminal you SSH’ed into originally.

You can open several of these screen sessions, for program2, program3 etc.,

screen -S program2
#press 'Ctrl+a' followed by 'd'
screen -S program3
#press 'Ctrl+a' followed by 'd'

Once you are back on head node, you can see all the running screen sessions by typing:

screen -ls
There are screens on:
        1234.program1   (Detached)
        1235.program2   (Detached)
        1236.program3   (Detached)

You can re-attach to any screen you want, by:

screen -r 1234.program1

To close a particular screen session, simply attach to that session and enter Ctrl+a followed by : and type quit.
To get help, enter Ctrl+a followed by ?

Once you open a screen session, you can request an interactive run via a PBS job using:

qsub -I -l mem=256Gb,nodes=1:ppn=32,walltime=48:00:00 -N program1
# this might depend on your cluster settings

After your job is accepted, you can run any program just as you would on the head node, without the sysadmin killing it.

Alternatively, you can add this as an alias to your .bashrc file, so that you can quickly open an interactive session just by typing qlive:

alias qlive="qsub -I -l mem=256Gb,nodes=1:ppn=32,walltime=48:00:00 -N stdin"

I hope this helps!


Simple script to count the number of reads in FASTQ file

If you want to quickly count the number of reads in a FASTQ file, you can count the total number of lines and divide by 4. However, when you want to generate a nice table of read counts while writing a report, this becomes an inconvenience. So, here is my simple bash script that does the job:

#!/bin/bash
if [ $# -lt 1 ] ; then
	echo ""
	echo "usage: $(basename $0) fastq_file1 [fastq_file2 ..] or *.fastq"
	echo "counts the number of reads in a fastq file"
	echo ""
	exit 0
fi
for i in "$@"; do
	lines=$(wc -l "$i" | cut -d " " -f 1)
	count=$(($lines / 4))
	echo -n -e "\t$i : "
	# print the count with commas as thousands separators
	echo "$count" | sed -r ':L;s=\b([0-9]+)([0-9]{3})\b=\1,\2=g;t L'
done

Extract sequences from a FASTA file given the co-ordinates

Many times, it might be essential to grab a portion of a FASTA sequence to perform downstream analyses. One such case would be to extract all the genes from a whole genome (like all chloroplast and mitochondrial genes, using just the GFF file). There are several ways to do this.

1. If you are familiar with Galaxy, there is an “Extract Genomic DNA” tool that can do this job. You can use either GFF or GTF datasets (if your data is not in one of these formats, you can easily convert it). This is pretty straightforward.

2. Using BEDTools, a utility called ‘fastaFromBed’ can do this. Simply:

fastaFromBed -fi in.fasta -bed regions.bed -fo out.regions.fa
3. The GLIMMER package has two scripts that can be used to extract sequences based on coordinates:
extract [options] sequence coords

This program reads a genome sequence and a list of coordinates for it and outputs a multifasta file of the regions specified by the coordinates

multi-extract [options] sequences coords

This program is a multi-fasta version of the preceding program
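If you only need a handful of regions and don’t want to install any of these tools, plain Python string slicing works too. A minimal sketch (the function name and the 1-based, inclusive coordinate convention are my own choices, matching GFF-style coordinates; BED files, by contrast, are 0-based and half-open):

```python
def extract_region(sequence, start, end, strand="+"):
    """Return the 1-based, inclusive region [start, end] of a sequence.
    For the minus strand, return the reverse complement instead."""
    sub = sequence[start - 1:end]
    if strand == "-":
        comp = str.maketrans("ACGTacgt", "TGCAtgca")
        sub = sub.translate(comp)[::-1]
    return sub
```

For example, extract_region(genome, 100, 250, "-") would give the reverse-complemented region spanning positions 100 to 250.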


How to install an RPM package

If you have access to a large computing cluster (HPC), chances are it runs the Red Hat Enterprise Linux operating system and you are a regular user without privileges to install programs system-wide. Sometimes it might be necessary to install some programs yourself. If you have a good support team running the cluster, installation will be as easy as sending them a request email. But if you are like me, maintaining programs for your group, then you’ll more likely end up installing one package or another for the users in your group. With RHEL, some packages are only found as RPM packages (all CentOS RPM packages are compatible with RHEL). You can download them (either source or binaries) from various locations such as rpmfind, pkgs.org, etc. To install them, follow these steps:

rpm2cpio package.rpm |cpio -idv
# will extract the RPM into different files
# you will most likely find a tar.gz file, and sometimes few patch files
# first extract the tar.gz file and cd into it
tar -xzf package.tar.gz
cd package
# apply patch
patch -Np1 -i ../name_of_the_file.patch
# build and install as normal
./configure --prefix=/path/to/install/programs
make
make install

Convert FASTQ to FASTA

Using SED

sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta

NOTE: This is, by far, the fastest way to convert FASTQ to FASTA


Using paste and cut

cat INFILE.fastq | paste - - - - | \
cut -f 1,2 | sed 's/^@/>/' | \
tr "\t" "\n" > OUTFILE.fasta


Using seqret (from EMBOSS)

seqret -sequence reads.fastq -outseq reads.fasta

Using AWK

cat infile.fq | \
awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa


Using fastq_to_fasta (from the FASTX-Toolkit)

fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
# Remember to use -Q33 for illumina reads!
version 0.0.6
       [-h]         = This helpful help screen.
       [-r]         = Rename sequence identifiers to numbers.
       [-n]         = keep sequences with unknown (N) nucleotides.
                    Default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                    If [-o] is specified, report will be printed to STDOUT.
                    If [-o] is not specified (and output goes to STDOUT),
                    report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA output file. default is STDOUT.
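The line-number arithmetic behind the one-liners above (record lines 1 and 2 of every group of 4 are the header and the sequence) can also be written out in plain Python; a minimal sketch (the function name is my own):

```python
def fastq_to_fasta(fastq_text):
    """Convert FASTQ text (4 lines per record) to FASTA text,
    keeping only the header and sequence lines."""
    lines = fastq_text.splitlines()
    out = []
    for i in range(0, len(lines), 4):
        out.append(">" + lines[i][1:])  # '@header' becomes '>header'
        out.append(lines[i + 1])        # the sequence line
    return "\n".join(out) + "\n"
```

This assumes uncompressed, unwrapped FASTQ (exactly four lines per record), just like the sed and awk versions.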

Split multi-fasta sequence file

Sometimes it is necessary to split a large file containing several sequences (FASTA format) into individual files. I do this with a simple ‘awk’ command, where I separate sequences based on a regular-expression match and then write each one to a sequentially numbered file. It is easy and quick!

awk '/^>/{s=++d".fasta"} {print > s}' <inputFile>

Bash loops for productivity

Working in bash is so much fun! If you spend enough time in the terminal, you might get addicted to it and never want to go back to GUI windows. There are commands (especially loops) that can save you a lot of time; they are very useful for routine tasks. My favorite loops are as follows:
Loop through all files with the .txt extension and perform an action on each:

for f in *.txt; do yourcommand $f >$f.out; done

Read a file line by line and run a command on each line:

while read line; do yourcommand $line; done<FileToRead.txt

Another variation of the above command (extremely useful when you have to read arguments from a file):

while read fld1 fld2 fld3; do YourCommand -a $fld1 -b $fld2 -c $fld3 > $fld1.$fld2.$fld3.txt; done<FileToRead.txt

A simple loop for a set of numbers (you can also use {a..z} etc., or mix them)

for i in {1..10}; do echo $i; done

Another variation of the above command

for i in {1..22} X Y; do echo "human chromosome $i"; done

I hope these will help you too!