Highlight words enclosed within parentheses in MS Word

Office 2013 offers wildcard pattern matching (a limited form of regular expressions) in Find and Replace, which comes in handy for basic tasks. For example, I was recently given a 10-page document to edit and asked to incorporate citations: find the author citations within the text and add the correct references to the “References” section. One problem: there were far too many citations, and it was hard to distinguish authors from normal text. I decided to highlight all the authors using wildcards. This is how I did it:


In the Find and Replace dialog, enter this in the “Find what” box:

\((*)\)

Here the first "\" is an escape character; it tells Word to treat the next character, "(", as-is, i.e., find an opening parenthesis in the document. Next, "(*)" tells Word to match any run of characters after the opening parenthesis, and finally "\)" ends the pattern when it encounters a closing parenthesis (which again requires an escape character). So the whole pattern matches anything within parentheses.

Note that you need to have "Use wildcards" checked. For the next part (to highlight), simply click the "Format" button and select "Highlight". When done, just click "Replace All". All the text within parentheses in the document will be highlighted! You can change "\(" and "\)" to other delimiters as well, e.g., "\{" and "\}" for text within curly braces, and so on.


Calculate the length of all sequences in a multi-FASTA file

Sometimes it is essential to know the length distribution of your sequences. They may be newly assembled scaffolds, a genome whose chromosome sizes you want to know, or just any multi-FASTA sequence file. A simple way to do this is with Biopython.

For example, save this script as seq_length.py:

#!/usr/bin/env python
from Bio import SeqIO
import sys

# print "id<TAB>length" for every record in the FASTA file given as the first argument
for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    print('%s\t%i' % (seq_record.id, len(seq_record)))

To run:

chmod +x seq_length.py
./seq_length.py input_file.fasta

This will print the length of every sequence in the file.
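If Biopython is not available, an awk one-liner can produce the same id/length table (a sketch; it handles sequences that span multiple lines):

```shell
# Print "id<TAB>length" for each record in a (possibly multi-line) FASTA file
fasta_lengths () {
    awk '/^>/ { if (id != "") print id "\t" len      # flush the previous record
                id = substr($1, 2); len = 0; next }  # strip the leading ">"
         { len += length($0) }                       # accumulate sequence length
         END { if (id != "") print id "\t" len }' "$1"
}
```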

PBS: How to submit jobs that depend on previously submitted jobs?

To submit jobs one after the other (i.e., run the second job after the completion of the first), we can use the depend option of qsub.

First, submit the first job as normal:

qsub first_job.sub

You will get the job id (e.g., 1234567) as output.


Then submit the second job in the following way:

qsub -W depend=afterok:1234567 second_job.sub

Both jobs will be queued, but the second job won’t start until the first job has finished successfully (that is what afterok means).

You can also automate this with a simple bash script:

FIRST=$(qsub first_job.sub)
SECOND=$(qsub -W depend=afterok:$FIRST second_job.sub)
THIRD=$(qsub -W depend=afterok:$SECOND third_job.sub)
FOURTH=$(qsub -W depend=afterok:$THIRD fourth_job.sub)
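On many clusters qsub prints a full job id such as 1234567.hostname. depend=afterok usually accepts it as-is, but if your scheduler wants only the numeric part, bash parameter expansion can strip the suffix (the hostname below is a made-up example):

```shell
# Stand-in for: FIRST=$(qsub first_job.sub)
FIRST="1234567.hpc.example.edu"   # hypothetical full job id

# "%%.*" removes everything from the first "." onward, leaving the bare number
FIRST_ID=${FIRST%%.*}
echo "$FIRST_ID"
# then: qsub -W depend=afterok:$FIRST_ID second_job.sub
```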


Calculating genome size from flow cytometry results

Flow cytometry (FCM) is a laser-based biophysical technology employed in various molecular biology studies. Applications of FCM include cell counting, cell sorting, biomarker detection and protein engineering. It is also now being used to measure nuclear DNA content in plants (using DNA-selective fluorochromes), and it is replacing older methods of estimating DNA content, such as Feulgen densitometry, because of its simple sample preparation and high throughput. Since polyploidy is common in plants, FCM is also ideal for estimating ploidy level. It works by suspending cells in a stream of fluid and passing them by an electronic detection apparatus (Wikipedia). You usually get the results as pg (picograms) of DNA. To convert this into base pairs (bp), which gives a more realistic measure of DNA content, you can use these conversions:

genome size (bp) = (0.978 × 10^9) × DNA content (pg)
so roughly,
1 pg = 978 Mb
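As a quick sanity check, the conversion can be done with awk; the 2.5 pg value below is just a made-up reading:

```shell
# Convert a DNA content reading (pg) into an approximate genome size (bp)
pg=2.5    # hypothetical flow cytometry value
bp=$(awk -v p="$pg" 'BEGIN { printf "%.0f", p * 0.978e9 }')
echo "$bp bp"
```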

Running interactive jobs on clusters

Sometimes you might have to run jobs on a cluster interactively (programs that aren’t optimized to run on their own, programs that need user input at various steps, or simply for debugging purposes). In those cases, it is better to run them in virtual terminals using the screen command. I normally use this to order contigs with Mauve, where I need a lot of memory/processors; running this on the head node would get me an email from the sysadmin. Here is an example:

After you SSH into the head node, open a new screen session; assume that you will be running program1 here.

screen -S program1

This starts a new virtual terminal that stays alive on the server (and thus keeps running even after you close all your windows or turn off your PC). You can detach from this screen at any time by pressing Ctrl+a followed by d. This brings you back to the terminal you SSH’ed into.

You can open several of these screen sessions, for program2, program3, etc.:

screen -S program2
#press 'Ctrl+a' followed by 'd'
screen -S program3
#press 'Ctrl+a' followed by 'd'

Once you are back on the head node, you can see all the running screen sessions by typing:

screen -ls
There are screens on:
        1234.program1   (Detached)
        1235.program2   (Detached)
        1236.program3   (Detached)

You can re-attach to any screen you want, by:

screen -r 1234.program1

To close a particular screen session, simply attach to that session, press Ctrl+a followed by :, and type quit.
To get help, press Ctrl+a followed by ?.

Once you are inside a screen session, you can request an interactive run via PBS using:

qsub -I -l mem=256Gb,nodes=1:ppn=32,walltime=48:00:00 -N program1
# this might depend on your cluster settings

After your job is accepted, you can start running any program just like you would on the head node, without being killed by the sysadmin.

Alternatively, you can add this as an alias in your .bashrc file, so that you can quickly open an interactive session just by typing qlive:

alias qlive="qsub -I -l mem=256Gb,nodes=1:ppn=32,walltime=48:00:00 -N stdin"

I hope this helps!


Simple script to count the number of reads in FASTQ file

If you want to quickly count the number of reads in a FASTQ file, you can count the total number of lines and divide by 4. However, when you want to generate a nice table of read counts while writing a report, this becomes an inconvenience. So, here is a simple bash script that does the job:

#!/bin/bash
if [ $# -lt 1 ] ; then
	echo ""
	echo "usage: count_fastq.sh fastq_file1 [fastq_file2 ...] || *.fastq"
	echo "counts the number of reads in a fastq file"
	echo ""
	exit 0
fi

for i in "$@"
do
	lines=$(wc -l < "$i")
	count=$((lines / 4))
	echo -n -e "\t$i : "
	# print the count with thousands separators (e.g., 1,234,567)
	echo "$count" | \
	sed -r ':L; s/([0-9]+)([0-9]{3})/\1,\2/; t L'
done
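Reads often arrive gzipped; the same lines-divided-by-4 trick works on compressed files without writing a temporary file (a sketch, assuming gzip is installed):

```shell
# Count reads in a gzipped FASTQ by streaming it through gzip -cd
count_fastq_gz () {
    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}
```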

Extract sequences from a FASTA file given the co-ordinates

Many times, it might be essential to grab a portion of a FASTA sequence to perform downstream analyses. One such case would be to extract all the genes from a whole genome (for example, all chloroplast and mitochondrial genes using just the GFF file). There are several ways to do this:

  1. If you are familiar with Galaxy, there is an “Extract Genomic DNA” tool that can do this job. You can use either GFF or GTF datasets (if your data is not in one of these formats, you can easily convert it). This is pretty straightforward.



2. Using BEDTools, a utility called ‘fastaFromBed’ can do this. Simply,

fastaFromBed -fi in.fasta -bed regions.bed -fo out.regions.fa
3. The GLIMMER package has 2 scripts that can be used to extract sequences based on co-ordinates:
extract [options] sequence coords

This program reads a genome sequence and a list of coordinates for it and outputs a multifasta file of the regions specified by the coordinates

multi-extract [options] sequences coords

This program is a multi-fasta version of the preceding program
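If none of these tools are installed, a quick-and-dirty version for a FASTA file containing a single sequence can be written in plain shell (a sketch; co-ordinates are 1-based and inclusive, and the whole sequence must fit in memory):

```shell
# Extract positions start..end (1-based, inclusive) from a single-sequence FASTA
extract_region () {
    fasta=$1; start=$2; end=$3
    # drop the header line, join the sequence lines, then cut out the substring
    tail -n +2 "$fasta" | tr -d '\n' | cut -c "${start}-${end}"
}
```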


How to install an RPM package

If you have access to a large computing cluster (HPC), chances are it runs the Red Hat Enterprise Linux (RHEL) operating system, and you are a regular user without privileges to install programs system-wide. Sometimes it might still be necessary to install some programs. If you have a good support team running the cluster, installation will be as easy as sending them a request email. But if you are like me, maintaining programs for your group, then you’ll more likely end up installing one package or another for the users in your group. With RHEL, some packages are only available as RPMs (all CentOS RPM packages are compatible with RHEL). You can download them (either source or binaries) from various locations (such as RPM find, PKGS, etc.). To install them without root, follow these steps:

rpm2cpio package.rpm |cpio -idv
# will extract the RPM into different files
# you will most likely find a tar.gz file, and sometimes few patch files
# first extract the tar.gz file and cd into it
tar -xzf package.tar.gz
cd package
# apply patch
patch -Np1 -i ../name_of_the_file.patch
# install as normal
./configure --prefix=/path/to/install/programs
make
make install
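Since the program lands in a non-standard prefix, the last step is to put it on your PATH (and LD_LIBRARY_PATH if it ships shared libraries); /path/to/install/programs is the same placeholder prefix used above:

```shell
# Make the freshly installed program visible to your shell;
# append these lines to ~/.bashrc to make the change permanent
export PATH="/path/to/install/programs/bin:$PATH"
export LD_LIBRARY_PATH="/path/to/install/programs/lib:${LD_LIBRARY_PATH:-}"
```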

Convert FASTQ to FASTA

Using SED

sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta

NOTE: This is, by far, the fastest way I have found to convert FASTQ to FASTA (the 1~4 step addresses require GNU sed).
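Wrapped in a function for reuse:

```shell
# FASTQ -> FASTA: print line 1 of every 4 (header, "@" -> ">") and line 2 (sequence)
fastq2fasta () {
    sed -n '1~4s/^@/>/p;2~4p' "$1"
}
```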


Using paste and cut

cat INFILE.fastq | paste - - - - | \
cut -f 1,2 | sed 's/^@/>/' | \
tr "\t" "\n" > OUTFILE.fasta


Using EMBOSS seqret

seqret -sequence reads.fastq -outseq reads.fasta

Using AWK

cat infile.fq | \
awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa


Using FASTX-Toolkit

fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
# Remember to use -Q33 for illumina reads!
version 0.0.6
       [-h]         = This helpful help screen.
       [-r]         = Rename sequence identifiers to numbers.
       [-n]         = keep sequences with unknown (N) nucleotides.
                    Default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                    If [-o] is specified, report will be printed to STDOUT.
                    If [-o] is not specified (and output goes to STDOUT),
                    report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA output file. default is STDOUT.