How to use the DAS-TMfilter prediction tool

Read this page when you are using the DAS-TMfilter prediction server for the first time or when the server has been updated since your last visit. (See the time-stamp on the main page.) Read about the theory behind the server here.

Sections on this page:

Input format
Output switches
Evaluation
Limitations
Warning messages
Practical suggestions

Input format

The DAS-TMfilter prediction tool takes fasta-format protein queries as input. Each entry starts with one header line: the ">" character at the beginning of the line, followed by an arbitrary comment that runs to the end of the line. This text identifies your query. The header is followed by the protein sequence itself, written in the single-letter, upper-case code for the residues. You can break the sequence into as many lines as you like. The sequence must not contain any characters other than the residue codes (no numbering, no spaces, etc.). The end of an entry is marked by the header of the next entry or by the last line of the query.

This is a sample query with five entries (the output of this query is used later in this documentation to demonstrate various features of the server):

> My first protein ID: prot0001
MSSNAQVKTPLPPAPAPKKESNFLIDFLMGGVSAAVAKTAASPIERVKLLIQNQDEMLKQ
GTLDRKYAGILDCFKRTATQEGVISFWRGNTANVIRYFPTQALNFAFKDKIKAMFGFKKE
EGYAKWFAGNLASGGAAGALSLLFVYSLDYARTRLAADSKSSKKGGARQFNGLIDVYKKT
LKSDGVAGLYRGFLPSVVGIVVYRGLYFGMYDSLKPLLLTGSLEGSFLASFLLGWVVTTG
ASTCSYPLDTVRRRMMMTSGQAVKYDGAFDCLRKIVAAEGVGSLFKGCGANILRGVAGAG
VISMYDQLQMILFGKKFK
> Here goes the second one ID: prot0002
MSTSKSENYLSELRKIIWPIEQYENKKFLPLAFMMFCILLNYSTLRSIKDGFVVTDIGTE
SISFLKTYIVLPSAVIAMIIYVKLCDILKQENVFYVITSFFLGYFALFAFVLYPYPDLVH
PDHKTIESLSLAYPNFKWFIKIVGKWSFASFYTIAELWGTMMLSLLFWQFANQITKIAEA
KRFYSMFGLLANLALPVTSVVIGYFLHEKTQIVAEHLKFVPLFVIMITSSFLIILTYRWM
NKNVLTDPRLYDPALVKEKKTKAKLSFIESLKMIFTSKYVGYIALLIIAYGVSVNLVEGV
WKSKVKELYPTKEAYTIYMGQFQFYQGWVAIAFMLIGSNILRKVSWLTAAMITPLMMFIT
GAAFFSFIFFDSVIAMNLTGILASSPLTLAVMIGMIQNVLSKGVKYSLFDATKNMAYIPL
DKDLRVKGQAAVEVIGGRLGKSGGAIIQSTFFILFPVFGFIEATPYFASIFFIIVILWIF
AVKGLNKEYQVLVNKNEK
> Non-TM protein with a fasle positive peak ID: glob0001
MKDNTVPLKLIALLANGEFHSGEQLGETLGMSRAAINKHIQTLRDWGVDVFTVPGKGYSL
PEPIQLLNAKQILGQLDGGSVAVLPVIDSTNQYLLDRIGELKSGDACIAEYQQAGSPFGA
NLYLSMFWRLEQPAAAIGLSLVIGIVMAEVLRKLGADKVRVKWPNDLYLQDRKLAGILVE
LTGAAQIVIGAGINMAMWITLQEAGINLDRNTLAAMLIRELRAALELFEQEGLAPYLSRW
EKLDNFINRPVKLIIGDKEIFGISRGIDKQGALLLEQDGIIKPWMGGEISLR
> One more globular ID: glob0002
SVGTSCIPGMAIPHNPLDSCRWYVSTRTCGVGPRLATQEMKARCCRQLEAIPAYCRCEAV
RILMDGVVTSSGQHEGRLLQDLPGCPRQVQRAFAPKLVTEVECNLATIHGGPFCLSLLGA
GE
> The last one, globular ID: glob0003
MLDKIVIANRGEIALRILRACKELGIKTVAVHSSADRDLKHVLLADETVCIGPAPSVKSY
LNIPAIISAAEITGAVAIHPGYGFLSENANFAEQVERSGFIFIGPKAETIRLMGDKVSAI
AAMKKAGVPCVPGSDGDDMDKNRAIAKRIGYPVIIKRVVRGDAELAQSISMTRAYMEKYL
ENPRHVEIQVLADGQGNAIYLAERDCSMQRRHQKVVEEAPAPGITPELRRYIGERCAKAC
VDIGYRGAGTFEFLFENGEFYFIEMNTRIQVEHPVTEMITGVDLIKEQLRIAAGQPLSIK
QEEVHVRGHAVECRINAEDPNTFLPSPGKITRFHAPGGFGVRWESHIYAGYTVPPYYDSM
IGKLICYGENRDVAIARMKNALQELIIDGIKTNVDLQIRIMNDENFQHGGTNIHYLEKKL
GLQE
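
The format rules above can be checked locally before a query is submitted. The following is a minimal sketch, not the server's own parser: the function name `parse_fasta` is hypothetical, and the assumptions that only the 20 standard residue codes are accepted and that blank lines and surrounding whitespace are tolerated are mine, not statements about the server.

```python
# Assumption: only the 20 standard one-letter residue codes are allowed.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Split a fasta-format query into (header, sequence) pairs.

    Raises ValueError when the query breaks the format rules described above.
    """
    entries = []
    header, chunks = None, []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()  # assumption: surrounding whitespace is tolerated
        if not line:
            continue  # assumption: blank lines are skipped, not rejected
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError(f"line {line_no}: sequence data before any header")
            bad = set(line) - VALID_RESIDUES
            if bad:
                raise ValueError(f"line {line_no}: unexpected characters {sorted(bad)}")
            chunks.append(line)
    # The last line of the query closes the final entry.
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries
```

Calling `parse_fasta` on the sample query above should give five pairs, with each sequence joined into a single string regardless of how it was broken into lines.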


Output switches

The operation of the server is controlled by three pairs of alternative switches. The first pair controls the format: you can have "short" format output or a more detailed "long" one with a graphical representation of the DAS profile. The second pair controls the scale of the plot for the "long" option: the "free" scale is adjusted to the maximal value of the actual DAS profile; the "fixed" scale is cut at y = 5.0, so peaks above this limit are off the scale. The third pair affects the evaluation of the query.

The "Library size" switch defines the number of sequences included as internal references. The setting of this switch has minimal effect on the quality of the result, while the CPU time scales linearly with it. The switch is kept mainly for compatibility reasons - changing the default value is not recommended.

The list of predicted non-TM-protein codes is echoed first, followed by the list of TM-protein codes (if any). The horizontal bar marks the start of the list of results. For each entry the header line appears first with the user-defined comment in it. It is followed by the number of detected TM-helix segments and the Q-score of the entry. Then the peaks of the DAS-curve above the empirical cutoff limit are listed, one peak per line. Each line holds the "@" sign followed by the position of the peak in the sequence, the value of the DAS-curve at this point, the core segment (i.e. the portion of the curve above the cutoff) and the "E-value" of the peak. (The E-value is the probability that the peak is a false positive hit.) The lines may contain warnings too. The entries in the query are separated by horizontal bars.

The short output for the sample query above looks like this:


*** List of predicted non-TM-protein codes ***

> Non-TM protein with a fasle positive peak ID: glob0001
> One more globular ID: glob0002
> The last one, globular ID: glob0003

*** List of predicted TM-protein codes ***

> My first protein ID: prot0001
> Here goes the second one ID: prot0002


> My first protein ID: prot0001
# TMH:  3 Q: trusted
@  142   3.014 core:  138 ..  146 1.097e-02
@  200   3.148 core:  194 ..  205 6.850e-03
@  232   3.317 core:  225 ..  238 3.773e-03


> Here goes the second one ID: prot0002
# TMH: 12 Q: trusted
@   37   3.932 core:   31 ..   41 6.741e-04
@   76   4.348 core:   67 ..   83 1.551e-04
@  104   5.351 core:   95 ..  114 4.504e-06
@  163   2.886 core:  158 ..  166 2.705e-02
@  195   3.842 core:  188 ..  203 9.274e-04
@  227   5.103 core:  219 ..  236 1.082e-05
@  286   4.880 core:  279 ..  294 2.374e-05
@  334   3.146 core:  330 ..  337 1.080e-02
@  364   4.052 core:  349 ..  394 4.405e-04 Twin peaks - two TMH with a short linker
@  389   3.344 core:  349 ..  394 5.376e-03
@  454   4.140 core:  449 ..  482 3.234e-04 Twin peaks - two TMH with a short linker
@  474   6.781 core:  449 ..  482 2.893e-08


> Non-TM protein with a fasle positive peak ID: glob0001
# TMH:  1 Q:  0.53 !!! Warning! Non-TM protein!
@  143   3.014 core:  139 ..  147 1.009e-02


> One more globular ID: glob0002
# TMH:  0 Q: trusted !!! Warning! Non-TM protein!


> The last one, globular ID: glob0003
# TMH:  0 Q: trusted !!! Warning! Non-TM protein!
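
For further processing, the short report can be turned into structured data. The sketch below is only an illustration: the function name `parse_short_output` and both regular expressions are hypothetical and were inferred from the sample report above, so the real output may contain layouts they do not cover. Header lines in the two summary lists at the top are ignored because no "# TMH:" line follows them.

```python
import re

# Patterns inferred from the sample report above (an assumption, not a spec).
SUMMARY_RE = re.compile(r"^#\s*TMH:\s*(\d+)\s+Q:\s*(\S+)\s*(.*)$")
PEAK_RE = re.compile(
    r"^@\s*(\d+)\s+([\d.]+)\s+core:\s*(\d+)\s*\.\.\s*(\d+)\s+(\S+)\s*(.*)$"
)


def parse_short_output(text):
    """Return one dict per entry in the detailed part of a short-format report."""
    entries = []
    pending_header = None  # last ">" line seen, not yet followed by "# TMH:"
    current = None         # entry that peak lines are attached to
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            pending_header = line[1:].strip()
            current = None
            continue
        summary = SUMMARY_RE.match(line)
        if summary and pending_header is not None:
            q_text = summary.group(2)
            current = {
                "header": pending_header,
                "tmh": int(summary.group(1)),
                # None stands for "Q: trusted", i.e. the score was not computed.
                "q": None if q_text == "trusted" else float(q_text),
                "warning": summary.group(3),
                "peaks": [],
            }
            entries.append(current)
            pending_header = None
            continue
        peak = PEAK_RE.match(line)
        if peak and current is not None:
            current["peaks"].append({
                "position": int(peak.group(1)),
                "value": float(peak.group(2)),
                "core": (int(peak.group(3)), int(peak.group(4))),
                "e_value": float(peak.group(5)),
                "note": peak.group(6),  # e.g. the "Twin peaks" remark
            })
    return entries
```

Warning lines that start with "!!!" on a line of their own are skipped by this sketch; extend the loop if you need them.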

The long output format contains all the information of the short format plus the graphic representation of the DAS-curve (PNG). The empirical cutoff value - 2.5 - is marked by red dots.


Evaluation

The evaluation of the query is controlled by the third pair of alternative switches. A quality score of an entry of the query against a library of known TM-proteins is used to judge whether the entry is a TM-protein or not. In the "trusted" mode of operation the server does not compute this quality score for clear cases, only for questionable ones. In the current implementation this applies only to cases where exactly one TM-helix segment is detected in the query. The server then decides the type of the entry in question based on the value of the score.

In the "unconditional" mode of operation the server is forced to calculate the quality score even for the trivial cases. In the output "Q: trusted" appears when the score is not actually computed, while the value of the score - a real number between 0 and 1 (the higher the better) - replaces the "trusted" string when it is evaluated.

Here is the output of the sample query with "unconditional" evaluation:

> My first protein ID: prot0001
# TMH:  3 Q:  0.93
@  142   2.796 core:  139 ..  145 2.367e-02
@  200   2.906 core:  196 ..  203 1.606e-02
@  232   3.085 core:  226 ..  237 8.544e-03



> Here goes the second one ID: prot0002
# TMH: 12 Q:  0.99
@   37   3.805 core:   31 ..   41 1.056e-03
@   76   4.211 core:   68 ..   82 2.515e-04
@  104   5.186 core:   95 ..  114 8.047e-06
@  163   2.787 core:  159 ..  166 3.829e-02
@  195   3.719 core:  189 ..  203 1.430e-03
@  227   4.946 core:  219 ..  236 1.883e-05
@  286   4.730 core:  279 ..  294 4.031e-05
@  334   3.043 core:  330 ..  337 1.551e-02
@  364   3.922 core:  349 ..  394 6.973e-04 Twin peaks - two TMH with a short linker
@  389   3.237 core:  349 ..  394 7.836e-03
@  454   4.013 core:  449 ..  482 5.070e-04 Twin peaks - two TMH with a short linker
@  474   6.581 core:  449 ..  482 5.854e-08



> Non-TM protein with a fasle positive peak ID: glob0001
# TMH:  1 Q:  0.61 !!! Warning! Non-TM protein!
@  143   3.050 core:  139 ..  147 8.898e-03



> One more globular ID: glob0002
# TMH:  0 Q:  0.00 !!! Warning! Non-TM protein!



> The last one, globular ID: glob0003
# TMH:  0 Q:  0.00 !!! Warning! Non-TM protein!
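
The rule for when the quality score appears in the output can be summarised in a few lines. This is a sketch of the behaviour as described in this section, not the server's code: the function names are hypothetical, and the score itself is taken as an input because its calculation is not covered on this page.

```python
def q_score_is_computed(n_tmh, mode="trusted"):
    """Return True when, per the description above, the Q-score is computed."""
    if mode == "unconditional":
        return True
    if mode == "trusted":
        # Only entries with exactly one detected TM-helix count as questionable.
        return n_tmh == 1
    raise ValueError(f"unknown evaluation mode: {mode!r}")


def format_q_field(n_tmh, mode, score=None):
    """Render the text shown after 'Q:' for an entry.

    `score` is the quality score, a number between 0 and 1; it is only
    needed when the score is actually evaluated.
    """
    if not q_score_is_computed(n_tmh, mode):
        return "trusted"
    if score is None or not 0.0 <= score <= 1.0:
        raise ValueError("a score between 0 and 1 is required when Q is evaluated")
    return f"{score:.2f}"
```

With the sample query, `format_q_field(3, "trusted")` gives "trusted" for prot0001, while `format_q_field(1, "trusted", 0.53)` gives "0.53" for glob0001, matching the two reports shown on this page.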


Limitations

The server deals with proper fasta-format queries only. Each sequence must be preceded by one and only one header line. The first character of the header must be the ">" sign. The length of the sequence lines is not relevant. Invalid queries result in an error message and are not processed at all.
The number of entries in a query is limited to 50. This is to prevent naive or careless users from submitting a huge number of sequences and overloading the server with a single click of the mouse. For such queries only the first 50 sequences are processed and the rest are skipped without any further notice. The list of the header lines of the processed sequences is echoed at the top of the result page.
If you need to process a large number of sequences regularly, consider a local installation of the DAS-TMfilter code on your system.
The server does not process sequences shorter than 30 residues; it gives a warning message in the output instead. Other valid sequences in the query are processed.
The server does not take sequences longer than 5000 residues. For these entries only the first 5000 residues are processed; the rest of the sequence is ignored and a warning is echoed in the output.
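
The limits listed above can be previewed before submission. The sketch below takes a list of (header, sequence) pairs, however they were obtained, and reports how each one would be treated. The function name `preview_query` and the status strings are hypothetical, and the assumption that too-short entries still count towards the 50-entry limit is mine.

```python
MAX_ENTRIES = 50   # only the first 50 entries of a query are processed
MIN_LENGTH = 30    # shorter sequences are skipped with a warning
MAX_LENGTH = 5000  # only the first 5000 residues of a sequence are processed


def preview_query(entries):
    """Return (header, status) pairs predicting how each entry is handled."""
    report = []
    for index, (header, sequence) in enumerate(entries):
        if index >= MAX_ENTRIES:
            # Assumption: the limit counts entries by their position in the query.
            status = "dropped: beyond the first 50 entries"
        elif len(sequence) < MIN_LENGTH:
            status = "skipped: too short"
        elif len(sequence) > MAX_LENGTH:
            status = "truncated: first 5000 residues only"
        else:
            status = "processed"
        report.append((header, status))
    return report
```

Entries reported as dropped are the ones the server skips without notice, so they are worth moving into a separate query.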

The DAS-TMfilter server can process approximately 5000 sequences per hour. (Several users may use the server at the same time, so do not expect that the full power is yours.)


Warning messages

Here is the list of warnings and error messages, what they mean and how to interpret them.

```
"!!! Warning! Too short sequence. Skipped."
```
The sequence is shorter than 30 residues and is not processed.
```
"!!! Sequence is too long! Only the first 5000 AA. loaded!"
```
This is the warning for sequences that are too long. The fragment after the 5000th residue is not processed. Longer sequences should be submitted in fragments of less than 5000 AA. Leave an overlap of at least 20 residues between consecutive fragments. Note: the residue numbers in the second fragment will start at 1 again.
```
"!!! Warning! Potential signal peptide."
```
This is a reminder that DAS-TMfilter cannot distinguish between permanent TM-helices and temporary ones (signal peptides). You get this message if there is a peak within the first 20 residues of a sequence.
```
"!!! Warning! Non-TM protein!"
```
This message is generated when a peak is detected in the DAS-curve but the quality score of the query sequence against the TM-library is too low, or when no peak is detected in the curve at all. The position of the peak is listed in the output but it is most likely a false positive. This message is generated in both the "trusted" and the "unconditional" mode of operation.
```
"Twin peaks - two TMH with a short linker"
```
Two peaks are detected over a long fragment above the cutoff of 2.5.
```
"!!! Warning! Weak twin peak. Ignored. @ xxxx ..."
```
Two peaks are detected over a section of the DAS-curve above the 2.5 cutoff, but one of the peaks is more likely to be a shoulder of the other one than a real peak. The position of the false peak is listed in the output after that warning.
If there is a "weak twin" in the DAS-curve, the number of detected TM-helices is corrected at the end of the list of peaks with the following message:
```
"!!! Warning! Peak assignment overruled. Corrected no. of TMH: zzz"
```
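
The advice given with the "Sequence is too long" warning (fragments of less than 5000 AA, at least 20 residues of overlap, residue numbers restarting at 1) can be sketched as follows. The function names and the `min_len` parameter are hypothetical. `min_len` is my own addition: it pulls the start of the last fragment back so that the fragment does not fall below the 30-residue limit of the server.

```python
def split_with_overlap(sequence, max_len=4999, overlap=20, min_len=30):
    """Cut `sequence` into overlapping fragments of at most `max_len` residues.

    Returns (offset, fragment) pairs, where `offset` is the number of
    residues of the full sequence that precede the fragment.
    """
    if not 0 <= overlap < max_len or max_len < min_len:
        raise ValueError("need 0 <= overlap < max_len and max_len >= min_len")
    fragments = []
    start = 0
    step = max_len - overlap
    # Emit full-size fragments while more sequence remains beyond them.
    while start + max_len < len(sequence):
        fragments.append((start, sequence[start:start + max_len]))
        start += step
    # Last fragment: move its start back if it would be shorter than min_len.
    start = max(0, min(start, len(sequence) - min_len))
    fragments.append((start, sequence[start:]))
    return fragments


def to_full_length_position(offset, position_in_fragment):
    """Map a 1-based residue number reported for a fragment to the full sequence."""
    return offset + position_in_fragment
```

For a 12000-residue sequence this gives three fragments with offsets 0, 4979 and 9958, so a peak reported at position 1 of the second fragment corresponds to residue 4980 of the full sequence. Peaks that fall inside an overlap may be reported for both neighbouring fragments.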


Practical suggestions

Follow these guidelines for the efficient use of the DAS-TMfilter server:

Pay attention to the proper format of the query. Do not submit peptides shorter than 30 residues.
Use the default set of switches first (i.e. short output and trusted evaluation). That will give you a reasonable size of output and fast results. If the output contains no warnings other than "Non-TM protein", "Too short sequence" or "Potential signal peptide", there is no reason to ask for a more detailed run.
Repeated submission of a sequence may give a marginally different DAS-curve. This is due to the pseudo-random number generator used in the internal reference procedure. When a peak changes its position relative to the empirical cutoff (2.5) from submission to submission, the prediction for that peak is not reliable - the method has reached its limits.
Warnings are generated when the internal decision-making mechanism thinks they are relevant. However, the warnings do not affect the shape of the DAS-curve, and all the detected peaks are listed in the output. Repeated submission of a problematic sequence with different switch settings will not eliminate the warnings. It is your ultimate responsibility to accept or reject them. Asking for long output may assist that decision in problematic cases. You may want to check the results of a test run on the learning set of 128 well characterised TM-proteins to judge the weaknesses and the power of the method.
The core TMH segments (i.e. the sections of the DAS-curve above the 2.5 cutoff) are not identical to the points where the TM-helices enter or leave the membrane. The experimental database is not accurate enough for this kind of prediction.
If you are after TMH end-points, ask for the long output format and extend the core around the peaks according to the shape of the DAS-curve. Keep in mind: there is no guarantee that the actual end-points are correct!
If you have experimental evidence of the location of a TMH in a particular sequence and the DAS server misses it I would love to hear about this! Also contact me with any suggestions, comments or remarks about the operation of this server.
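
The point about repeated submissions can be checked mechanically. Because only peaks above the cutoff are listed in the output, a peak that moves across the cutoff shows up in some runs and is missing from others. The sketch below flags such peaks. The function name `unstable_peaks` and the `tolerance` parameter (how far apart two reported positions may be and still count as the same peak) are hypothetical choices of mine, not part of the server.

```python
def unstable_peaks(runs, tolerance=3):
    """Return the peak positions that are not reported in every run.

    `runs` holds one list of reported peak positions per submission of the
    same sequence. Positions within `tolerance` residues of each other are
    treated as the same peak.
    """
    unstable = []
    seen = []  # representative position of every peak examined so far
    for run in runs:
        for pos in run:
            if any(abs(pos - known) <= tolerance for known in seen):
                continue  # this peak was already examined
            seen.append(pos)
            in_every_run = all(
                any(abs(pos - other) <= tolerance for other in other_run)
                for other_run in runs
            )
            if not in_every_run:
                unstable.append(pos)
    return sorted(unstable)
```

For example, `unstable_peaks([[142, 200, 232], [142, 200, 232], [143, 232]])` returns `[200]`: the peak near residue 200 is missing from the third run, so its prediction should be treated as unreliable.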


Miklos Cserzo, cserzo.miklos@ext.semmelweis.hu (Nov 7 2001; May 18 2023)