Bioinformatics Encyclopedia
 
 


Stockholm Format




Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to disseminate protein and RNA sequence alignments[1][2][3]. The alignment editors RALEE and Belvu support Stockholm format, as do the probabilistic database search tools Infernal and HMMER. A simple example of an Rfam alignment (UPSK RNA) in Stockholm format is shown below:

# STOCKHOLM 1.0

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//

A minimal well-formed Stockholm file contains a header stating the format and version identifier, currently '# STOCKHOLM 1.0', followed by the sequences and their corresponding unique sequence names:

<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>

'<seqname>' stands for "sequence name", typically in the form "name/start-end" or just "name". Finally, the "//" line indicates the end of the alignment. Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
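The rules above are enough to sketch a minimal reader for the sequence lines. The following Python is only an illustration under stated assumptions: the function names `parse_stockholm_sequences` and `strip_gaps` are hypothetical, all lines starting with "#" are skipped, and repeated names are concatenated on the assumption that an alignment may be split into several blocks.

```python
def parse_stockholm_sequences(text):
    """Return {seqname: aligned sequence} for the first alignment in text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("missing '# STOCKHOLM' header line")
    sequences = {}
    for line in lines[1:]:
        if line == "//":          # end of the alignment
            break
        if line.startswith("#"):  # mark-up lines are ignored in this sketch
            continue
        fields = line.split()
        if len(fields) != 2:
            raise ValueError("expected '<seqname> <aligned sequence>': " + line)
        name, aligned = fields
        # Assumption: a repeated name continues the same sequence in a later block.
        sequences[name] = sequences.get(name, "") + aligned
    return sequences


def strip_gaps(aligned):
    """Remove the two gap characters, '.' and '-', from an aligned sequence."""
    return aligned.replace(".", "").replace("-", "")
```

Applied to the UPSK example, `parse_stockholm_sequences` would return the four named sequences and skip the `#=GC SS_cons` line.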

The alignment mark-up:

Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.

#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Sequence AND per-Column markup, exactly 1 char per column>
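
The four mark-up line types differ only in whether a sequence name precedes the feature. A minimal sketch of splitting such a line into its parts follows; the function name `parse_markup_line` and the returned tuple layout are hypothetical choices for illustration.

```python
def parse_markup_line(line):
    """Split a '#=G?' line into (kind, seqname, feature, value).

    seqname is None for #=GF and #=GC, which carry no sequence name.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    kind = fields[0]
    if kind in ("#=GF", "#=GC"):
        parts = line.split(None, 2)   # tag, feature, rest of line
        if len(parts) != 3:
            raise ValueError("expected '<feature> <annotation>': " + line)
        return kind[2:], None, parts[1], parts[2].strip()
    if kind in ("#=GS", "#=GR"):
        parts = line.split(None, 3)   # tag, seqname, feature, rest of line
        if len(parts) != 4:
            raise ValueError("expected '<seqname> <feature> <annotation>': " + line)
        return kind[2:], parts[1], parts[2], parts[3].strip()
    raise ValueError("not a mark-up line: " + line)
```

For the `#=GC SS_cons` line of the UPSK example this would give `("GC", None, "SS_cons", ".AAA....<<<<aaa....>>>>")`.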

Magic or recommended features:

#=GF

(See Pfam documentation, under "Description of fields")

For embedding trees:

#=GF NH <tree in New Hampshire eXtended format>
#=GF TN <Unique identifier for the next tree>
  • Notes: A tree may be stored on multiple #=GF NH lines.
  • If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.
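
The tree rules above can be sketched as a small collector that joins consecutive #=GF NH lines and starts a new tree at each #=GF TN line. This is only an illustration: the function name `collect_trees` is hypothetical, and a single tree without a #=GF TN line is stored under the key `None`, which is a choice made for this sketch.

```python
def collect_trees(lines):
    """Return {tree identifier: tree string} from #=GF TN / #=GF NH lines."""
    trees = {}
    current = None  # identifier from the most recent #=GF TN line, if any
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) < 3 or fields[0] != "#=GF":
            continue
        feature, value = fields[1], fields[2].strip()
        if feature == "TN":
            if value in trees:
                raise ValueError("duplicate tree identifier: " + value)
            current = value
            trees[current] = ""
        elif feature == "NH":
            # A tree may span several NH lines, so the pieces are concatenated.
            trees[current] = trees.get(current, "") + value
    return trees
```

The tree strings themselves are not parsed here; they are returned as stored in the file.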

#=GC

The same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".

#=GS

Rfam and Pfam use these features:

      Feature                    Description
      ---------------------      -----------
      AC <accession>             ACcession number
      DE <freetext>              DEscription
      DR <db>; <accession>;      Database Reference
      OS <organism>              OrganiSm (species)
      OC <clade>                 Organism Classification (clade, etc.)
      LO <look>                  Look (Color, etc.)

#=GR

      Feature   Description            Markup letters
      -------   -----------            --------------
      SS        Secondary Structure    For RNA [.,;<>(){}[]AaBb...], 
                                       For protein [HGIEBTSCX]
      SA        Surface Accessibility  [0-9X]
                                       (0=0%-10%; ...; 9=90%-100%)
      TM        TransMembrane          [Mio]
      PP        Posterior Probability  [0-9*]
                                       (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
      LI        LIgand binding         [*]
      AS        Active Site            [*]
      IN        INtron (in or after)   [0-2]
  • Note: Do not use multiple lines with the same #=GR label. Only one unique feature assignment can be made for each sequence.
  • "X" in SA and SS means "residue with unknown structure".
  • In SS the letters are taken from DSSP: H=alpha-helix, G=3/10-helix, I=pi-helix, E=extended strand, B=residue in isolated beta-bridge, T=turn, S=bend, C=coil/loop.
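
As an illustration of per-column markup, the sketch below decodes one SA character into a percentage range and checks that a #=GR string has exactly one character per alignment column. The function names `sa_range` and `has_one_char_per_column` are hypothetical, and the ranges for digits 1 to 8 are an assumption: they are interpolated as even 10% bins from the two endpoints given in the table.

```python
def sa_range(ch):
    """Map one SA markup character to (low %, high %), or None for 'X'."""
    if ch == "X":
        return None  # residue with unknown structure
    if len(ch) != 1 or ch not in "0123456789":
        raise ValueError("not an SA markup character: %r" % ch)
    # Assumption: digits 1-8 follow the same 10% bins as 0 and 9.
    low = int(ch) * 10
    return (low, low + 10)


def has_one_char_per_column(aligned, markup):
    """True if a #=GR or #=GC string covers every alignment column exactly once."""
    return len(markup) == len(aligned)
```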

Recommended placements:

  • #=GF Above the alignment
  • #=GC Below the alignment
  • #=GS Above the alignment or just below the corresponding sequence
  • #=GR Just below the corresponding sequence

Size limits:

  • No size limits on any field.
  • However, a simple parser that uses fixed field sizes should work safely on Pfam alignments with these limits:
    • Line length: 10000.
    • <seqname>: 255.
    • <feature>: 255.
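
A fixed-size parser could first check each line against these limits. The following is a minimal sketch; the constant names and the function `exceeded_limits` are hypothetical, and the position of `<seqname>` and `<feature>` on each line type follows the mark-up layouts described earlier.

```python
MAX_LINE_LENGTH = 10000
MAX_SEQNAME_LENGTH = 255
MAX_FEATURE_LENGTH = 255


def exceeded_limits(line):
    """Return the names of the limits ('line', 'seqname', 'feature') a line exceeds."""
    problems = []
    if len(line) > MAX_LINE_LENGTH:
        problems.append("line")
    fields = line.split()
    if not fields or fields[0] in ("//", "#"):
        return problems  # blank line, terminator, or '# STOCKHOLM' header
    seqname = feature = None
    if fields[0] in ("#=GF", "#=GC"):
        feature = fields[1] if len(fields) > 1 else None
    elif fields[0] in ("#=GS", "#=GR"):
        seqname = fields[1] if len(fields) > 1 else None
        feature = fields[2] if len(fields) > 2 else None
    elif not fields[0].startswith("#"):
        seqname = fields[0]  # ordinary '<seqname> <aligned sequence>' line
    if seqname is not None and len(seqname) > MAX_SEQNAME_LENGTH:
        problems.append("seqname")
    if feature is not None and len(feature) > MAX_FEATURE_LENGTH:
        problems.append("feature")
    return problems
```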

References

  1. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005). "Rfam: annotating non-coding RNAs in complete genomes". Nucleic Acids Res 33 (Database issue): D121-4. PMID 15608160.
  2. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003). "Rfam: an RNA family database". Nucleic Acids Res 31 (1): 439-41. PMID 12520045.
  3. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A (2008). "The Pfam protein families database". Nucleic Acids Res 36 (Database issue): D281-8. PMID 18039703.


This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Stockholm Format"


Last updated: July 2008
Copyright © 2003-2008 Julian Rubin