How to use the DAS-TMfilter prediction tool


Read this page when you are using the DAS-TMfilter prediction server for the first time or when the server has been updated since your last visit. (See the time-stamp on the main page.) Read about the theory behind the server here.

Input format

The DAS-TMfilter prediction tool takes protein queries in FASTA format as input. Each entry starts with a single header line that begins with the ">" character, followed by an arbitrary comment running to the end of the line. This text identifies your query. The header is followed by the protein sequence itself in single-letter, upper-case residue codes. You can break the sequence into as many lines as you like. The sequence must not contain any characters other than the residue codes (no numbering, no spaces, etc.). An entry ends at the header of the next entry or at the last line of the input.
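
The format rules above can be checked locally before a query is submitted. The following is a minimal sketch, not part of the server: the function name parse_fasta is hypothetical, and the accepted alphabet is assumed to be the 20 standard one-letter amino-acid codes, which this page does not list explicitly.

```python
# Assumption: only the 20 standard one-letter amino-acid codes are accepted.
RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs, validating each line."""
    entries = []
    header, chunks = None, []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.startswith(">"):
            # A new header closes the previous entry, if there is one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif line:
            # Empty lines are skipped; this leniency is a choice of the sketch.
            if header is None:
                raise ValueError(f"line {lineno}: sequence before first header")
            bad = set(line) - RESIDUES
            if bad:
                raise ValueError(f"line {lineno}: invalid characters {sorted(bad)}")
            chunks.append(line)
    # The last entry is closed by the end of the input.
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries
```

Calling parse_fasta on the text of a query returns one (header, sequence) pair per entry and raises ValueError at the first line that contains numbering, spaces, lower-case letters or any other character outside the assumed alphabet.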

This is a sample query with five entries (the output of this query is used later in this documentation to demonstrate various features of the server):

> My first protein ID: prot0001
MSSNAQVKTPLPPAPAPKKESNFLIDFLMGGVSAAVAKTAASPIERVKLLIQNQDEMLKQ
GTLDRKYAGILDCFKRTATQEGVISFWRGNTANVIRYFPTQALNFAFKDKIKAMFGFKKE
EGYAKWFAGNLASGGAAGALSLLFVYSLDYARTRLAADSKSSKKGGARQFNGLIDVYKKT
LKSDGVAGLYRGFLPSVVGIVVYRGLYFGMYDSLKPLLLTGSLEGSFLASFLLGWVVTTG
ASTCSYPLDTVRRRMMMTSGQAVKYDGAFDCLRKIVAAEGVGSLFKGCGANILRGVAGAG
VISMYDQLQMILFGKKFK
> Here goes the second one ID: prot0002
MSTSKSENYLSELRKIIWPIEQYENKKFLPLAFMMFCILLNYSTLRSIKDGFVVTDIGTE
SISFLKTYIVLPSAVIAMIIYVKLCDILKQENVFYVITSFFLGYFALFAFVLYPYPDLVH
PDHKTIESLSLAYPNFKWFIKIVGKWSFASFYTIAELWGTMMLSLLFWQFANQITKIAEA
KRFYSMFGLLANLALPVTSVVIGYFLHEKTQIVAEHLKFVPLFVIMITSSFLIILTYRWM
NKNVLTDPRLYDPALVKEKKTKAKLSFIESLKMIFTSKYVGYIALLIIAYGVSVNLVEGV
WKSKVKELYPTKEAYTIYMGQFQFYQGWVAIAFMLIGSNILRKVSWLTAAMITPLMMFIT
GAAFFSFIFFDSVIAMNLTGILASSPLTLAVMIGMIQNVLSKGVKYSLFDATKNMAYIPL
DKDLRVKGQAAVEVIGGRLGKSGGAIIQSTFFILFPVFGFIEATPYFASIFFIIVILWIF
AVKGLNKEYQVLVNKNEK
> Non-TM protein with a fasle positive peak ID: glob0001
MKDNTVPLKLIALLANGEFHSGEQLGETLGMSRAAINKHIQTLRDWGVDVFTVPGKGYSL
PEPIQLLNAKQILGQLDGGSVAVLPVIDSTNQYLLDRIGELKSGDACIAEYQQAGSPFGA
NLYLSMFWRLEQPAAAIGLSLVIGIVMAEVLRKLGADKVRVKWPNDLYLQDRKLAGILVE
LTGAAQIVIGAGINMAMWITLQEAGINLDRNTLAAMLIRELRAALELFEQEGLAPYLSRW
EKLDNFINRPVKLIIGDKEIFGISRGIDKQGALLLEQDGIIKPWMGGEISLR
> One more globular ID: glob0002
SVGTSCIPGMAIPHNPLDSCRWYVSTRTCGVGPRLATQEMKARCCRQLEAIPAYCRCEAV
RILMDGVVTSSGQHEGRLLQDLPGCPRQVQRAFAPKLVTEVECNLATIHGGPFCLSLLGA
GE
> The last one, globular ID: glob0003
MLDKIVIANRGEIALRILRACKELGIKTVAVHSSADRDLKHVLLADETVCIGPAPSVKSY
LNIPAIISAAEITGAVAIHPGYGFLSENANFAEQVERSGFIFIGPKAETIRLMGDKVSAI
AAMKKAGVPCVPGSDGDDMDKNRAIAKRIGYPVIIKRVVRGDAELAQSISMTRAYMEKYL
ENPRHVEIQVLADGQGNAIYLAERDCSMQRRHQKVVEEAPAPGITPELRRYIGERCAKAC
VDIGYRGAGTFEFLFENGEFYFIEMNTRIQVEHPVTEMITGVDLIKEQLRIAAGQPLSIK
QEEVHVRGHAVECRINAEDPNTFLPSPGKITRFHAPGGFGVRWESHIYAGYTVPPYYDSM
IGKLICYGENRDVAIARMKNALQELIIDGIKTNVDLQIRIMNDENFQHGGTNIHYLEKKL
GLQE

Output switches

The operation of the server is controlled by three pairs of alternative switches. The first pair controls the format: you can have "short" output or a more detailed "long" output with a graphical representation of the DAS profile. The second pair controls the scale of the plot for the "long" option: the "free" scale is adjusted to the maximal value of the actual DAS profile, while the "fixed" scale is cut at y = 5.0, so peaks above this limit are off the scale. The third pair affects the evaluation of the query.

The "Library size" switch defines the number of sequences included as internal references. The setting of this switch has minimal effect on the quality of the result, while the CPU time scales linearly with it. The switch is kept mainly for compatibility reasons; changing the default value is not recommended.

The list of predicted non-TM-protein codes is echoed first, followed by the list of TM-protein codes (if any). A horizontal bar marks the start of the list of results. For each entry the header line appears first, with the user-defined comment in it. It is followed by the number of detected TM-helix segments and the Q-score of the entry. Then the peaks of the DAS-curve above the empirical cutoff limit are listed, one peak per line. Each peak line starts with "@", followed by the position of the peak in the sequence, the value of the DAS-curve at this point, the core segment (i.e. the portion of the curve above the cutoff) and the "E-value" of the peak. (The E-value is the probability that the peak is a false positive hit.) The lines may contain warnings too. The entries in the query are separated by horizontal bars.
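
To make the terms "peak" and "core segment" concrete, the sketch below extracts them from a list of per-residue profile values. It is an illustration only and not the server's algorithm: the function name core_segments is hypothetical, the cutoff of 2.5 is the empirical value quoted in this documentation, and each contiguous run above the cutoff is reported with a single peak (the server can report two peaks in one core segment, as in its "Twin peaks" warning).

```python
CUTOFF = 2.5  # empirical cutoff quoted in this documentation


def core_segments(profile, cutoff=CUTOFF):
    """Return (peak_pos, peak_value, start, end) for each run above the cutoff.

    Positions are 1-based, matching the numbering used in the server output.
    """
    profile = list(profile)
    segments = []
    start = None
    for i, value in enumerate(profile):
        if value > cutoff and start is None:
            start = i  # a new run above the cutoff begins here
        at_end = i == len(profile) - 1
        if start is not None and (value <= cutoff or at_end):
            # The run ends at i if the last value is still above the cutoff,
            # otherwise at the previous position.
            end = i if value > cutoff else i - 1
            run = profile[start:end + 1]
            peak = start + run.index(max(run))
            segments.append((peak + 1, profile[peak], start + 1, end + 1))
            start = None
    return segments
```

For example, the toy profile [0, 1, 3, 4, 3, 1, 0, 2.6, 2.7] yields two segments: a peak of 4 at position 4 with core 3 .. 5, and a peak of 2.7 at position 9 with core 8 .. 9.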

The short output for the sample query above looks like this:


*** List of predicted non-TM-protein codes ***

> Non-TM protein with a fasle positive peak ID: glob0001
> One more globular ID: glob0002
> The last one, globular ID: glob0003

*** List of predicted TM-protein codes ***

> My first protein ID: prot0001
> Here goes the second one ID: prot0002


> My first protein ID: prot0001
# TMH: 3 Q: trusted
@ 142 3.014 core: 138 .. 146 1.097e-02
@ 200 3.148 core: 194 .. 205 6.850e-03
@ 232 3.317 core: 225 .. 238 3.773e-03

> Here goes the second one ID: prot0002
# TMH: 12 Q: trusted
@ 37 3.932 core: 31 .. 41 6.741e-04
@ 76 4.348 core: 67 .. 83 1.551e-04
@ 104 5.351 core: 95 .. 114 4.504e-06
@ 163 2.886 core: 158 .. 166 2.705e-02
@ 195 3.842 core: 188 .. 203 9.274e-04
@ 227 5.103 core: 219 .. 236 1.082e-05
@ 286 4.880 core: 279 .. 294 2.374e-05
@ 334 3.146 core: 330 .. 337 1.080e-02
@ 364 4.052 core: 349 .. 394 4.405e-04 Twin peaks - two TMH with a short linker
@ 389 3.344 core: 349 .. 394 5.376e-03
@ 454 4.140 core: 449 .. 482 3.234e-04 Twin peaks - two TMH with a short linker
@ 474 6.781 core: 449 .. 482 2.893e-08

> Non-TM protein with a fasle positive peak ID: glob0001
# TMH: 1 Q: 0.53 !!! Warning! Non-TM protein!
@ 143 3.014 core: 139 .. 147 1.009e-02

> One more globular ID: glob0002
# TMH: 0 Q: trusted !!! Warning! Non-TM protein!

> The last one, globular ID: glob0003
# TMH: 0 Q: trusted !!! Warning! Non-TM protein!

The long output format contains all the information of the short format plus the graphic representation of the DAS-curve (PNG). The empirical cutoff value - 2.5 - is marked by red dots.
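
The per-entry records of the short format can be read back into structured data with a small parser. The sketch below is hypothetical and not supplied with the server: the function name parse_entry and the dictionary keys are illustrative choices, and the regular expressions are written from the field order described on this page. It expects the text of one entry (the "# TMH" line and the "@" peak lines) and ignores any warning text.

```python
import re

# "# TMH: <count> Q: <score or 'trusted'>"
HEADER_RE = re.compile(r"#\s*TMH:\s*(\d+)\s+Q:\s*(\S+)")
# "@ <position> <value> core: <start> .. <end> <E-value>"
PEAK_RE = re.compile(
    r"@\s*(\d+)\s+([\d.]+)\s+core:\s*(\d+)\s*\.\.\s*(\d+)\s+([\d.]+e[+-]\d+)"
)


def parse_entry(text):
    """Parse the result lines of one entry into a dictionary."""
    m = HEADER_RE.search(text)
    if m is None:
        raise ValueError("no '# TMH: ... Q: ...' line found")
    q = m.group(2)
    return {
        "tmh": int(m.group(1)),
        # "trusted" means the score was not computed.
        "q": None if q == "trusted" else float(q),
        "peaks": [
            {
                "pos": int(pos),
                "value": float(value),
                "core": (int(start), int(end)),
                "e_value": float(e_value),
            }
            for pos, value, start, end, e_value in PEAK_RE.findall(text)
        ],
    }
```

Because the patterns match on tokens rather than on line breaks, the same function works whether the peaks arrive one per line or run together on a single line.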

Evaluation

The evaluation of the query is controlled by the "trusted"/"unconditional" pair of switches. A quality score, computed for an entry of the query against a library of known TM-proteins, is used to judge whether the entry is a TM-protein or not. In the "trusted" mode of operation the server computes this quality score only for questionable cases, not for clear ones. In the current implementation the questionable cases are those in which only one TM-helix segment is detected in the entry. The server then decides the type of the entry from the value of the score.

In the "unconditional" mode of operation the server is forced to calculate the quality score even for trivial cases. In the output, "Q: trusted" appears when the score has not been computed; when it has, the value of the score, a real number between 0 and 1 (the higher the better), replaces the "trusted" string.
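
The rule for when the quality score is computed can be summarised in a few lines. This is a sketch of the behaviour as described on this page, not the server's code; the function name score_is_computed is hypothetical, and the score threshold the server uses for its final decision is not stated here, so it is left out.

```python
def score_is_computed(n_tmh, mode):
    """Return True if the quality score would be computed for an entry.

    n_tmh is the number of detected TM-helix segments; mode is either
    "trusted" or "unconditional".
    """
    if mode == "unconditional":
        # The score is calculated for every entry, even trivial ones.
        return True
    if mode == "trusted":
        # Per the text, only the single-helix case counts as questionable.
        return n_tmh == 1
    raise ValueError(f"unknown mode: {mode!r}")
```

This agrees with the sample outputs: in "trusted" mode only glob0001, with one detected segment, carries a numeric Q value, while the entries with 0, 3 or 12 segments show "Q: trusted".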

Here is the output of the sample query with "unconditional" evaluation:

> My first protein ID: prot0001
# TMH:  3 Q:  0.93
@  142   2.796 core:  139 ..  145 2.367e-02
@  200   2.906 core:  196 ..  203 1.606e-02
@  232   3.085 core:  226 ..  237 8.544e-03


> Here goes the second one ID: prot0002
# TMH: 12 Q: 0.99
@ 37 3.805 core: 31 .. 41 1.056e-03
@ 76 4.211 core: 68 .. 82 2.515e-04
@ 104 5.186 core: 95 .. 114 8.047e-06
@ 163 2.787 core: 159 .. 166 3.829e-02
@ 195 3.719 core: 189 .. 203 1.430e-03
@ 227 4.946 core: 219 .. 236 1.883e-05
@ 286 4.730 core: 279 .. 294 4.031e-05
@ 334 3.043 core: 330 .. 337 1.551e-02
@ 364 3.922 core: 349 .. 394 6.973e-04 Twin peaks - two TMH with a short linker
@ 389 3.237 core: 349 .. 394 7.836e-03
@ 454 4.013 core: 449 .. 482 5.070e-04 Twin peaks - two TMH with a short linker
@ 474 6.581 core: 449 .. 482 5.854e-08

> Non-TM protein with a fasle positive peak ID: glob0001
# TMH: 1 Q: 0.61 !!! Warning! Non-TM protein!
@ 143 3.050 core: 139 .. 147 8.898e-03

> One more globular ID: glob0002
# TMH: 0 Q: 0.00 !!! Warning! Non-TM protein!

> The last one, globular ID: glob0003
# TMH: 0 Q: 0.00 !!! Warning! Non-TM protein!

Limitations

The DAS-TMfilter server can process approximately 5000 sequences per hour. (Several users may be using the server at the same time, so do not expect its full capacity to be available to you.)

Warning messages

Here is the list of warnings and error messages, what they mean and how to interpret them.

Practical suggestions

Follow these guidelines for the efficient use of the DAS-TMfilter server:

Miklos Cserzo, cserzo.miklos@ext.semmelweis.hu (Nov 7 2001; May 18 2023)