Extract sequences from a FASTA file given the co-ordinates

It is often necessary to grab a portion of a FASTA sequence for downstream analyses. One such case is extracting all the genes from a whole genome (for example, all chloroplast and mitochondrial genes, using just the GFF file). There are several ways to do this.

1. If you are familiar with Galaxy, there is an “Extract Genomic DNA” tool that can do this job. It accepts either a GFF or a GTF dataset (if your co-ordinates are in another format, you can easily convert them). This is pretty straightforward.


2. BEDTools has a utility called ‘fastaFromBed’ that can do this. Simply run:

fastaFromBed -fi in.fasta -bed regions.bed -fo out.regions.fa

3. The GLIMMER package has two scripts that can be used to extract sequences based on co-ordinates.

extract [options] sequence coords

This program reads a genome sequence and a list of co-ordinates for it, and outputs a multi-FASTA file of the regions specified by the co-ordinates.

multi-extract [options] sequences coords

This program is a multi-FASTA version of the preceding program: it takes a file containing several sequences rather than a single genome.
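
If none of these tools is at hand, the same idea can be sketched in a few lines of plain Python. This is only a minimal illustration, not how any of the tools above are implemented. It assumes BED-style co-ordinates (0-based start, end not included), which is where most off-by-one mistakes come from when starting from a GFF file (1-based, end included). The function names `read_fasta`, `extract_bed` and `gff_to_bed_interval`, and the `use_strand` switch, are made up for this example.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {sequence_id: sequence}."""
    chunks = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence id.
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def gff_to_bed_interval(start, end):
    """Convert a GFF interval (1-based, inclusive) to BED style (0-based, end exclusive)."""
    return start - 1, end


def extract_bed(records, bed_lines, use_strand=False):
    """Yield (header, subsequence) for every BED interval.

    With use_strand=True, intervals whose 6th column is '-' are
    reverse-complemented (hypothetical switch for this sketch).
    """
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue  # skip comments, headers and malformed lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        subseq = records[chrom][start:end]
        if use_strand and len(fields) >= 6 and fields[5] == "-":
            subseq = reverse_complement(subseq)
        yield f"{chrom}:{start}-{end}", subseq


if __name__ == "__main__":
    fasta = [">chr1 demo genome", "ACGTACGTAC", "GGTTAACC"]
    bed = ["chr1\t0\t4", "chr1\t10\t14"]
    records = read_fasta(fasta)
    for header, subseq in extract_bed(records, bed):
        print(f">{header}")
        print(subseq)
```

For a GFF feature, pass its start and end through `gff_to_bed_interval` first, so that a gene annotated at 1..4 becomes the slice 0..4 and no base is lost at either end.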

 
