Calculate the length of all sequences in a multi-FASTA file

Sometimes it is essential to know the length distribution of your sequences. They may be your newly assembled scaffolds, the chromosomes of a genome whose sizes you want to know, or just any multi-FASTA sequence file. A simple way to get the lengths is to use Biopython.

For example, save this script as seq_length.py:

#!/usr/bin/env python3
# Print the ID and length of every sequence in a FASTA file, tab-separated.
import sys
from Bio import SeqIO

# sys.argv[1] is the path to the input FASTA file
for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    print('%s\t%i' % (seq_record.id, len(seq_record)))

To run it, make the script executable and pass the FASTA file as the first argument:

chmod +x seq_length.py
./seq_length.py input_file.fasta

This prints the ID and length of every sequence in the file, one sequence per line, separated by a tab.
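If you want a quick picture of the length distribution rather than the raw list, the tab-separated output can be summarized with the standard library alone. The following is only a rough sketch: the function names (parse_lengths, summarize, text_histogram), the 1000 bp bin size, and the sample contig lines are all made up for illustration and are not part of the script above.

```python
import statistics
from collections import Counter


def parse_lengths(lines):
    """Pull the integer length out of each 'id<TAB>length' line."""
    lengths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _seq_id, length = line.split("\t")
        lengths.append(int(length))
    return lengths


def summarize(lengths):
    """Return basic summary statistics for a list of sequence lengths."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


def text_histogram(lengths, bin_size=1000):
    """Count lengths per fixed-width bin and draw one '#' per sequence."""
    bins = Counter((n // bin_size) * bin_size for n in lengths)
    return [
        "%d-%d\t%s" % (start, start + bin_size - 1, "#" * bins[start])
        for start in sorted(bins)
    ]


# Hypothetical lines in the same format that seq_length.py prints
sample = [
    "contig_1\t1500",
    "contig_2\t320",
    "contig_3\t2750",
    "contig_4\t1210",
]

lengths = parse_lengths(sample)
print(summarize(lengths))
print("\n".join(text_histogram(lengths)))
```

For a real file you would feed the lines printed by seq_length.py into parse_lengths instead of the sample list, and pick a bin size that suits your data (contigs and chromosomes differ by orders of magnitude).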

9 thoughts on “Calculate length of all sequences in a multi-FASTA file”

  1. Thanks, this is very useful. Is there a way to summarize the information on sequence lengths, for example to get the mean and range of the lengths in the file? And maybe to summarize the size distribution in a histogram?

  2. Just discovered your blog and I think it will save my life. This little script just helped me a lot to check the length of the contigs I obtained from a sequencing platform.

    Thank you so much for sharing your little tips. They are useful and perfectly explained. Straight to the point, I love it!
