Read this page when you are using the DAS-TMfilter prediction server for the first time or when the server has been updated since your last visit. (See the time-stamp on the main page.) Read about the theory behind the server here.
The DAS-TMfilter prediction tool takes protein queries in FASTA format as input. Each entry starts with one header line that begins with the ">" character, followed by an arbitrary comment up to the end of the line. This text serves to identify your query. The header is followed by the protein sequence itself, written in the single-letter, all upper-case code for the residues. You can break your sequence into as many lines as you like. The sequence should not contain any character other than the residue codes (no numbering, no spaces, etc.). The end of an entry is marked by the header of the next one or by the last line.
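The format rules above can be checked locally before a query is submitted. The following is a minimal sketch, not part of the server: the function names `parse_fasta` and `check_entry` are hypothetical, and accepting any upper-case ASCII letter as a residue code is an assumption, since the exact alphabet the server accepts is not stated here. The 30- and 5000-residue limits are taken from the list of warnings further down this page.

```python
import re

MIN_LENGTH = 30    # shorter sequences are skipped (see the list of warnings)
MAX_LENGTH = 5000  # only the first 5000 residues are loaded


def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs."""
    entries = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored in this sketch
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def check_entry(header, sequence):
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    # Assumption: any upper-case ASCII letter counts as a residue code.
    if not re.fullmatch(r"[A-Z]*", sequence):
        problems.append("sequence contains characters other than upper-case letters")
    if len(sequence) < MIN_LENGTH:
        problems.append("shorter than 30 residues")
    if len(sequence) > MAX_LENGTH:
        problems.append("longer than 5000 residues")
    return problems
```

Running `check_entry` on every pair returned by `parse_fasta` gives a quick list of entries that the server would be expected to skip or truncate.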
This is a sample query with five entries (the output of this query is used later in this documentation to demonstrate various features of the server):

> My first protein ID: prot0001
MSSNAQVKTPLPPAPAPKKESNFLIDFLMGGVSAAVAKTAASPIERVKLLIQNQDEMLKQ
GTLDRKYAGILDCFKRTATQEGVISFWRGNTANVIRYFPTQALNFAFKDKIKAMFGFKKE
EGYAKWFAGNLASGGAAGALSLLFVYSLDYARTRLAADSKSSKKGGARQFNGLIDVYKKT
LKSDGVAGLYRGFLPSVVGIVVYRGLYFGMYDSLKPLLLTGSLEGSFLASFLLGWVVTTG
ASTCSYPLDTVRRRMMMTSGQAVKYDGAFDCLRKIVAAEGVGSLFKGCGANILRGVAGAG
VISMYDQLQMILFGKKFK
> Here goes the second one ID: prot0002
MSTSKSENYLSELRKIIWPIEQYENKKFLPLAFMMFCILLNYSTLRSIKDGFVVTDIGTE
SISFLKTYIVLPSAVIAMIIYVKLCDILKQENVFYVITSFFLGYFALFAFVLYPYPDLVH
PDHKTIESLSLAYPNFKWFIKIVGKWSFASFYTIAELWGTMMLSLLFWQFANQITKIAEA
KRFYSMFGLLANLALPVTSVVIGYFLHEKTQIVAEHLKFVPLFVIMITSSFLIILTYRWM
NKNVLTDPRLYDPALVKEKKTKAKLSFIESLKMIFTSKYVGYIALLIIAYGVSVNLVEGV
WKSKVKELYPTKEAYTIYMGQFQFYQGWVAIAFMLIGSNILRKVSWLTAAMITPLMMFIT
GAAFFSFIFFDSVIAMNLTGILASSPLTLAVMIGMIQNVLSKGVKYSLFDATKNMAYIPL
DKDLRVKGQAAVEVIGGRLGKSGGAIIQSTFFILFPVFGFIEATPYFASIFFIIVILWIF
AVKGLNKEYQVLVNKNEK
> Non-TM protein with a fasle positive peak ID: glob0001
MKDNTVPLKLIALLANGEFHSGEQLGETLGMSRAAINKHIQTLRDWGVDVFTVPGKGYSL
PEPIQLLNAKQILGQLDGGSVAVLPVIDSTNQYLLDRIGELKSGDACIAEYQQAGSPFGA
NLYLSMFWRLEQPAAAIGLSLVIGIVMAEVLRKLGADKVRVKWPNDLYLQDRKLAGILVE
LTGAAQIVIGAGINMAMWITLQEAGINLDRNTLAAMLIRELRAALELFEQEGLAPYLSRW
EKLDNFINRPVKLIIGDKEIFGISRGIDKQGALLLEQDGIIKPWMGGEISLR
> One more globular ID: glob0002
SVGTSCIPGMAIPHNPLDSCRWYVSTRTCGVGPRLATQEMKARCCRQLEAIPAYCRCEAV
RILMDGVVTSSGQHEGRLLQDLPGCPRQVQRAFAPKLVTEVECNLATIHGGPFCLSLLGA
GE
> The last one, globular ID: glob0003
MLDKIVIANRGEIALRILRACKELGIKTVAVHSSADRDLKHVLLADETVCIGPAPSVKSY
LNIPAIISAAEITGAVAIHPGYGFLSENANFAEQVERSGFIFIGPKAETIRLMGDKVSAI
AAMKKAGVPCVPGSDGDDMDKNRAIAKRIGYPVIIKRVVRGDAELAQSISMTRAYMEKYL
ENPRHVEIQVLADGQGNAIYLAERDCSMQRRHQKVVEEAPAPGITPELRRYIGERCAKAC
VDIGYRGAGTFEFLFENGEFYFIEMNTRIQVEHPVTEMITGVDLIKEQLRIAAGQPLSIK
QEEVHVRGHAVECRINAEDPNTFLPSPGKITRFHAPGGFGVRWESHIYAGYTVPPYYDSM
IGKLICYGENRDVAIARMKNALQELIIDGIKTNVDLQIRIMNDENFQHGGTNIHYLEKKL
GLQE
The operation of the server is controlled by three pairs of alternative switches. The first controls the output format: you can have the "short" format or a more detailed "long" one with a graphical representation of the DAS profile. The second controls the scale of the plot when the "long" option is selected: the "free" scale is adjusted to the maximal value of the actual DAS profile, while the "fixed" scale is cut at y = 5.0, so peaks above this limit are off the scale. The third pair affects the evaluation of the query.
The "Library size" switch defines the number of sequences included as internal references. Its setting has minimal effect on the quality of the result, while the CPU time scales linearly with it. The switch is kept mainly for compatibility reasons; changing the default value is not recommended.
The list of predicted non-TM-protein codes is echoed first, followed by the list of TM-protein codes (if any). A horizontal bar marks the start of the list of results. For each entry the header line appears first, with the user-defined comment in it. It is followed by the number of detected TM-helix segments and the Q-score of the entry. Then the peaks of the DAS-curve above the empirical cutoff limit are listed, one peak per line: the "@" character followed by the position of the peak in the sequence, the value of the DAS-curve at this point, the core segment (i.e. the portion of the curve above the cutoff) and the "E-value" of the peak. (The E-value is the probability that the peak is a false positive hit.) The lines may contain warnings too. The entries in the query are separated by horizontal bars.
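For automated post-processing, the peak lines described above can be read with a regular expression. This is a minimal sketch under assumptions: the name `parse_peak` is hypothetical, the exact column spacing of the real server output is not known here (so the pattern tolerates variable whitespace), and any warning is assumed to trail the numbers on the same line.

```python
import re

# "@ <position> <DAS value> core: <start> .. <end> <E-value> [warning text]"
PEAK_RE = re.compile(
    r"@\s*(?P<pos>\d+)\s+(?P<value>\d+\.\d+)\s+"
    r"core:\s*(?P<start>\d+)\s*\.\.\s*(?P<end>\d+)\s+"
    r"(?P<evalue>\d\.\d+e[+-]\d+)\s*(?P<note>.*)"
)


def parse_peak(line):
    """Parse one peak line; return a dict, or None if the line is not a peak."""
    m = PEAK_RE.match(line.strip())
    if m is None:
        return None
    return {
        "position": int(m["pos"]),           # peak position in the sequence
        "value": float(m["value"]),          # DAS-curve value at the peak
        "core": (int(m["start"]), int(m["end"])),
        "e_value": float(m["evalue"]),       # chance of a false positive hit
        "note": m["note"].strip(),           # trailing warning, if any
    }
```

Lines that do not start with "@", such as the header and summary lines, yield `None` and can be handled separately.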
The short output for the sample query above looks like this:
*** List of predicted non-TM-protein codes ***
> Non-TM protein with a fasle positive peak ID: glob0001
> One more globular ID: glob0002
> The last one, globular ID: glob0003
*** List of predicted TM-protein codes ***
> My first protein ID: prot0001
> Here goes the second one ID: prot0002

> My first protein ID: prot0001
# TMH: 3 Q: trusted
@ 142 3.014 core: 138 .. 146 1.097e-02
@ 200 3.148 core: 194 .. 205 6.850e-03
@ 232 3.317 core: 225 .. 238 3.773e-03

> Here goes the second one ID: prot0002
# TMH: 12 Q: trusted
@ 37 3.932 core: 31 .. 41 6.741e-04
@ 76 4.348 core: 67 .. 83 1.551e-04
@ 104 5.351 core: 95 .. 114 4.504e-06
@ 163 2.886 core: 158 .. 166 2.705e-02
@ 195 3.842 core: 188 .. 203 9.274e-04
@ 227 5.103 core: 219 .. 236 1.082e-05
@ 286 4.880 core: 279 .. 294 2.374e-05
@ 334 3.146 core: 330 .. 337 1.080e-02
@ 364 4.052 core: 349 .. 394 4.405e-04 Twin peaks - two TMH with a short linker
@ 389 3.344 core: 349 .. 394 5.376e-03
@ 454 4.140 core: 449 .. 482 3.234e-04 Twin peaks - two TMH with a short linker
@ 474 6.781 core: 449 .. 482 2.893e-08

> Non-TM protein with a fasle positive peak ID: glob0001
# TMH: 1 Q: 0.53 !!! Warning! Non-TM protein!
@ 143 3.014 core: 139 .. 147 1.009e-02

> One more globular ID: glob0002
# TMH: 0 Q: trusted !!! Warning! Non-TM protein!

> The last one, globular ID: glob0003
# TMH: 0 Q: trusted !!! Warning! Non-TM protein!
The long output format contains all the information of the short format plus the graphic representation of the DAS-curve (PNG). The empirical cutoff value - 2.5 - is marked by red dots.
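The core segments in the output are described as the portions of the DAS-curve above the cutoff. That idea can be sketched as follows, assuming the profile is available as a plain list of per-residue values. This is an illustration only, not the server's code: the function names are hypothetical, the profile values in the example are invented, treating "above" as strictly greater than 2.5 is an assumption, and the twin-peak handling is not reproduced (only the highest point of each segment is reported).

```python
CUTOFF = 2.5  # empirical cutoff value quoted in this documentation


def _summarise(profile, start, end):
    """Describe one run of residues (1-based, inclusive) above the cutoff."""
    window = profile[start - 1:end]
    peak_value = max(window)
    peak_pos = start + window.index(peak_value)
    return (start, end, peak_pos, peak_value)


def core_segments(profile, cutoff=CUTOFF):
    """Return (start, end, peak_position, peak_value) for every contiguous
    stretch of the profile that lies above the cutoff."""
    segments = []
    start = None
    for i, value in enumerate(profile, start=1):
        if value > cutoff:
            if start is None:
                start = i          # a new segment begins here
        elif start is not None:
            segments.append(_summarise(profile, start, i - 1))
            start = None
    if start is not None:          # segment runs to the end of the profile
        segments.append(_summarise(profile, start, len(profile)))
    return segments


# Invented example profile with two stretches above 2.5
example = [0.5, 1.0, 2.6, 3.1, 2.7, 1.2, 0.3, 2.9, 3.4]
print(core_segments(example))  # [(3, 5, 4, 3.1), (8, 9, 9, 3.4)]
```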
The evaluation of the query is controlled by the third pair of alternative switches. A quality score of each entry against a library of known TM-proteins is used to judge whether the entry is a TM-protein or not. In the "trusted" mode of operation the server does not compute this quality score for clear cases, only for questionable ones. In the current implementation this applies only when exactly one TM-helix segment is detected in the entry. The server then decides the type of that entry based on the value of the score.
In the "unconditional" mode of operation the server is forced to calculate the quality score even for the trivial cases. In the output, "Q: trusted" appears when the score has not been computed; when it has been evaluated, the value of the score - a real number between 0 and 1 (the higher the better) - replaces the "trusted" string.
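A script that reads the results has to handle both forms of the Q field. The sketch below parses the summary line of an entry and returns `None` for the score when the server printed "trusted". The name `parse_summary` is hypothetical, and the pattern assumes that a warning, if present, follows on the same line.

```python
import re

# "# TMH: <count> Q: <trusted | score> [warning text]"
SUMMARY_RE = re.compile(
    r"#\s*TMH:\s*(?P<tmh>\d+)\s+Q:\s*(?P<q>trusted|\d+(?:\.\d+)?)\s*(?P<note>.*)"
)


def parse_summary(line):
    """Parse a '# TMH: ... Q: ...' line; return None if it does not match."""
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        return None
    # "trusted" means the quality score was not computed for this entry.
    q_score = None if m["q"] == "trusted" else float(m["q"])
    return {"tmh": int(m["tmh"]), "q_score": q_score, "note": m["note"].strip()}
```

Keeping "not computed" distinct from a numeric score avoids mistaking a trusted entry for one with a score of zero.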
Here is the output of the sample query with "unconditional" evaluation:

> My first protein ID: prot0001
# TMH: 3 Q: 0.93
@ 142 2.796 core: 139 .. 145 2.367e-02
@ 200 2.906 core: 196 .. 203 1.606e-02
@ 232 3.085 core: 226 .. 237 8.544e-03

> Here goes the second one ID: prot0002
# TMH: 12 Q: 0.99
@ 37 3.805 core: 31 .. 41 1.056e-03
@ 76 4.211 core: 68 .. 82 2.515e-04
@ 104 5.186 core: 95 .. 114 8.047e-06
@ 163 2.787 core: 159 .. 166 3.829e-02
@ 195 3.719 core: 189 .. 203 1.430e-03
@ 227 4.946 core: 219 .. 236 1.883e-05
@ 286 4.730 core: 279 .. 294 4.031e-05
@ 334 3.043 core: 330 .. 337 1.551e-02
@ 364 3.922 core: 349 .. 394 6.973e-04 Twin peaks - two TMH with a short linker
@ 389 3.237 core: 349 .. 394 7.836e-03
@ 454 4.013 core: 449 .. 482 5.070e-04 Twin peaks - two TMH with a short linker
@ 474 6.581 core: 449 .. 482 5.854e-08

> Non-TM protein with a fasle positive peak ID: glob0001
# TMH: 1 Q: 0.61 !!! Warning! Non-TM protein!
@ 143 3.050 core: 139 .. 147 8.898e-03

> One more globular ID: glob0002
# TMH: 0 Q: 0.00 !!! Warning! Non-TM protein!

> The last one, globular ID: glob0003
# TMH: 0 Q: 0.00 !!! Warning! Non-TM protein!
If you need to process a large number of sequences regularly, you may consider installing the DAS-TMfilter code locally on your system.
The DAS-TMfilter server can process approximately 5000 sequences per hour. (Several users may use the server at the same time, so do not expect that the full power is yours.)
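The throughput figure gives a rough lower bound on the waiting time for a large query. The arithmetic is sketched below; the function name and the `share` parameter (the fraction of the server's capacity assumed to be available to you) are hypothetical and only illustrate the effect of concurrent users.

```python
SEQUENCES_PER_HOUR = 5000  # approximate throughput quoted above


def estimated_hours(n_sequences, share=1.0):
    """Rough lower bound on the processing time in hours.

    share is a hypothetical fraction (0 < share <= 1) of the server's
    capacity available to this query when other users are active.
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in the interval (0, 1]")
    return n_sequences / (SEQUENCES_PER_HOUR * share)


print(estimated_hours(20000))             # 4.0
print(estimated_hours(20000, share=0.5))  # 8.0
```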
Here is the list of warnings and error messages, what they mean and how to interpret them.

"!!! Warning! Too short sequence. Skipped."
The sequence is shorter than 30 residues and is not processed.

"!!! Sequence is too long! Only the first 5000 AA. loaded!"
This is the warning for sequences that are too long. The part after the 5000th residue is not processed. Longer sequences should be submitted as fragments of fewer than 5000 residues each. Leave an overlap of at least 20 residues between consecutive fragments. Note: the residue numbering of the second fragment starts again at 1.

"!!! Warning! Potential signal peptide."
This is a reminder that DAS-TMfilter cannot distinguish between permanent TM-helices and temporary ones (signal peptides). You get this message if there is a peak within the first 20 residues of a sequence.

"!!! Warning! Non-TM protein!"
This message is generated when a peak is detected in the DAS-curve but the quality score of the query sequence against the TM-library is too low, or when no peak is detected in the curve at all. The position of the peak is listed in the output, but it is most likely a false positive. This message is generated in both the "trusted" and the "unconditional" mode of operation.

"Twin peaks - two TMH with a short linker"
Two peaks are detected over a long fragment above the cutoff of 2.5.

"!!! Warning! Weak twin peak. Ignored. @ xxxx ..."
Two peaks are detected over a section of the DAS-curve above the 2.5 cutoff, but one of them is more likely to be a shoulder of the other than a real peak. The position of the false peak is listed in the output after the warning.
If there is a "weak twin" in the DAS-curve, the number of detected TM-helices is corrected at the end of the list of peaks with the following message:
"!!! Warning! Peak assignment overruled. Corrected no. of TMH: zzz"
Follow these guidelines for the efficient use of the DAS-TMfilter server:
If you are after TMH end-points, ask for the long output format and extend the core around the peaks according to the shape of the DAS-curve. Keep in mind: there is no guarantee that the actual end-points are correct!
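For sequences that exceed the 5000-residue limit described in the list of warnings, the fragments can be prepared with a small helper. This is a minimal sketch, not a tool supplied with the server: the function names are hypothetical, and the default fragment length of 4999 is simply a literal reading of "fewer than 5000 residues". The recorded offset allows peak positions reported for a fragment (which start again at 1) to be mapped back to the numbering of the full sequence.

```python
MAX_FRAGMENT = 4999  # fragments must be shorter than 5000 residues
MIN_OVERLAP = 20     # minimum overlap between consecutive fragments


def split_sequence(sequence, max_len=MAX_FRAGMENT, overlap=MIN_OVERLAP):
    """Return a list of (offset, fragment) pairs.

    offset is the 0-based index of the fragment's first residue in the
    full sequence; consecutive fragments share `overlap` residues.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    fragments = []
    start = 0
    while True:
        end = min(start + max_len, len(sequence))
        fragments.append((start, sequence[start:end]))
        if end == len(sequence):
            break
        start = end - overlap  # step back to create the overlap
    return fragments


def to_full_position(offset, fragment_position):
    """Map a 1-based residue number in a fragment to the full sequence."""
    return offset + fragment_position


print(split_sequence("ABCDEFGHIJ", max_len=4, overlap=1))
# [(0, 'ABCD'), (3, 'DEFG'), (6, 'GHIJ')]
```

A final fragment shorter than 30 residues would be skipped by the server, so a larger overlap may be needed in that case. Peaks that fall inside an overlap can appear in both fragments and should be counted once.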