TSTMP: Target Selection for human TransMembrane Proteins

Description of the database

Figure 1:
Definition of 'modelable' proteins

TSTMP is a database designed to support the target selection of transmembrane proteins for structural genomics projects. Proteins with two or more transmembrane helices were extracted from the HTP database [1]. We categorized a protein as '3D' if a transmembrane chain in a cross-referenced PDB entry overlapped all of its transmembrane regions (the DBREF lines of the PDB entries were used to check this criterion). The other proteins marked as '3D' in HTP were classified directly as 'modelable' in TSTMP if all transmembrane segments (TMS) of at least one assigned PDBTM structure overlapped those of the protein.

All proteins were searched with two methods, TMFoldRec [2] and HHBlits [3], against a non-redundant PDBTM [4, 5] database containing alpha-helical transmembrane PDB chains. TMFoldRec checks the topography of the query and target sequences, and its hits were accepted if their reliability exceeded 0.5. For HHBlits we used an E-value threshold of 10^-10 and a sequence identity threshold of 25%; since HHBlits does not check topography internally, we accepted a result only if the membrane segments of the query and hit proteins overlapped and no transmembrane segment was left out of the alignment. In our definition, a protein can be modeled on an existing structure if both methods returned at least one valid hit. Figure 1 summarizes the process of determining homologues.
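The acceptance rules above can be illustrated with a short Python sketch. It is not the TSTMP implementation: the `Hit` record, its field names, and the representation of transmembrane segments as inclusive (start, end) ranges in query coordinates are assumptions made for the example. The thresholds are read as "E-value at most 10^-10" and "identity at least 25%", and the topography check is simplified to "every query segment overlaps a hit segment and vice versa".

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple

Segment = Tuple[int, int]  # inclusive (start, end), query coordinates (assumed)


@dataclass
class Hit:
    """Hypothetical search hit; field names are illustrative only."""
    method: str                  # "TMFoldRec" or "HHBlits"
    reliability: float = 0.0     # used for TMFoldRec hits
    evalue: float = 1.0          # used for HHBlits hits
    identity: float = 0.0        # percent identity, used for HHBlits hits
    # TM segments of the hit, mapped onto the query through the alignment
    hit_segments: Tuple[Segment, ...] = ()


def overlaps(a: Segment, b: Segment) -> bool:
    """True if two inclusive ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def all_segments_matched(query: Sequence[Segment], hit: Sequence[Segment]) -> bool:
    """Simplified topography check: no segment on either side is left unmatched."""
    if not query or not hit:
        return False
    query_ok = all(any(overlaps(q, h) for h in hit) for q in query)
    hit_ok = all(any(overlaps(h, q) for q in query) for h in hit)
    return query_ok and hit_ok


def is_valid_hit(hit: Hit, query_segments: Sequence[Segment]) -> bool:
    """Apply the per-method acceptance criteria described in the text."""
    if hit.method == "TMFoldRec":
        return hit.reliability > 0.5
    if hit.method == "HHBlits":
        return (hit.evalue <= 1e-10
                and hit.identity >= 25.0
                and all_segments_matched(query_segments, hit.hit_segments))
    return False


def is_modelable(hits: Iterable[Hit], query_segments: Sequence[Segment]) -> bool:
    """A protein counts as modelable only if both methods give a valid hit."""
    methods = {h.method for h in hits if is_valid_hit(h, query_segments)}
    return {"TMFoldRec", "HHBlits"} <= methods
```

In this sketch a single strong HHBlits hit is not sufficient on its own: `is_modelable` returns `True` only when an accepted TMFoldRec hit is present as well.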

All proteins were also searched against the HTP database with HHBlits. The acceptance criteria were the same as above, and we determined the number of homologues for each protein. This search yields a graph in which two nodes (proteins) are connected if they are homologous to each other; clusters were defined as the connected components of this graph. ‘The Most Wanted’ targets were chosen on the basis of the number of their (target) homologues.
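The clustering step can be sketched in Python as a breadth-first search for connected components. The function names and the input format (a list of protein identifiers plus a list of accepted homologous pairs) are assumptions for illustration. The ranking helper reads "(target) homologues" as homologues that themselves have evidence type 'target', which is one possible interpretation of the text.

```python
from collections import defaultdict, deque


def cluster_homologues(proteins, homolog_pairs):
    """Return the connected components of the undirected homology graph."""
    graph = defaultdict(set)
    for a, b in homolog_pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    clusters = []
    for start in proteins:
        if start in seen:
            continue
        # Breadth-first search collects every protein reachable from `start`.
        seen.add(start)
        component = []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(component))
    return clusters


def target_homologue_counts(evidence, homolog_pairs):
    """Count, for each 'target' protein, its homologues that are also 'target'."""
    counts = {p: 0 for p, level in evidence.items() if level == "target"}
    # Deduplicate pairs so that (A, B) and (B, A) are counted once.
    unique_pairs = {frozenset(pair) for pair in homolog_pairs if pair[0] != pair[1]}
    for pair in unique_pairs:
        a, b = tuple(pair)
        if a in counts and b in counts:
            counts[a] += 1
            counts[b] += 1
    return counts
```

A protein without any accepted homologue forms a cluster of size one. Sorting the output of `target_homologue_counts` by value in descending order would give a candidate 'most wanted' ordering under the interpretation stated above.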

To facilitate monitoring the progress of crystallization for each protein, we also incorporated homologous entries from TargetTrack [6], where available. A Blast search was performed against the trial sequences of TargetTrack entries, and hits were accepted if the E-value was below 10^-10, the sequence identity was 95% or higher, and the alignment covered all transmembrane segments of the query protein.
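The TargetTrack filter can be written as a single predicate. The following Python sketch assumes that the alignment is given as an inclusive range in query coordinates and that "covered" means each transmembrane segment lies entirely inside that range; the function and parameter names are hypothetical.

```python
def accept_targettrack_hit(evalue, identity, aln_start, aln_end, tm_segments):
    """Return True if a Blast hit meets the three criteria from the text.

    evalue      -- Blast E-value of the hit
    identity    -- percent sequence identity of the alignment
    aln_start   -- first aligned query position (inclusive, assumed)
    aln_end     -- last aligned query position (inclusive, assumed)
    tm_segments -- list of (start, end) TM segments of the query protein
    """
    covers_all = bool(tm_segments) and all(
        aln_start <= start and end <= aln_end for start, end in tm_segments
    )
    return evalue < 1e-10 and identity >= 95.0 and covers_all
```

For a query with transmembrane segments at 20-40 and 60-80, an alignment spanning positions 10-90 passes the coverage test, while one spanning 10-70 does not, because the second segment is only partly aligned.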

Figure 2:
Flowchart of the pipeline used.

The whole database can be rebuilt automatically from new releases of the source databases with the developed pipeline (Figure 2).

XML files of all entries are available for download from the web server. Every XML file contains cross-references to UniProt [7] and HTP databases. Cross-references to TargetTrack were also included when available, together with the identity and overlap of Blast search results and the status of the trial sequences.

In addition, we defined three types of evidence for the entries. Proteins that could be matched with at least one PDB entry were categorized as '3D'. Proteins that could be assigned to a PDBTM structure by both TMFoldRec and HHBlits received the evidence type ‘modelable’. The remaining entries have the evidence type ‘target’. For every sequence, the XML files also list its homologues in the human transmembrane proteome, if any, and the identifier of its cluster.
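The order in which the three evidence types are assigned can be made explicit with a small Python sketch. The function and its boolean inputs are hypothetical and cover only the rules stated in this paragraph; it assumes that the PDB match and the validity of the search hits have already been determined.

```python
def evidence_type(has_pdb_match, tmfoldrec_hit, hhblits_hit):
    """Assign an evidence type, checking the strongest evidence first."""
    if has_pdb_match:
        return "3D"
    if tmfoldrec_hit and hhblits_hit:
        # Both methods must agree for an entry to count as modelable.
        return "modelable"
    return "target"
```

Because the checks are ordered, an entry with a matching PDB structure is always reported as '3D', even if both search methods also return valid hits.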

  1. Dobson L, Reményi I and Tusnády GE (2015) The human transmembrane proteome. Biol. Direct 10, 31.
  2. Kozma D and Tusnády GE (2015) TMFoldRec: a statistical potential-based transmembrane protein fold recognition tool. BMC Bioinformatics 16, 201.
  3. Remmert M, Biegert A, Hauser A and Söding J (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–175.
  4. Kozma D, Simon I and Tusnády GE (2013) PDBTM: Protein Data Bank of transmembrane proteins after 8 years. Nucleic Acids Res. 41(Database issue), D524–D529.
  5. Tusnády GE, Dosztányi Zs and Simon I (2004) Transmembrane proteins in protein data bank: identification and classification. Bioinformatics 20, 2964–2972.
  6. Gabanyi MJ et al. (2011) The Structural Biology Knowledgebase: a portal to protein structures, sequences, functions, and methods. J. Struct. Funct. Genomics 12, 45–54.
  7. The UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res. 43(Database issue), D204–D212.

Evidence levels

  3D
  Modelable
  Target

TargetTrack statuses

 selected, cloned or expressed
 solubilized or purified
 crystallized or HSQC satisfactory
 XRAY, NMR or ERAY data collected
 model fitted
 in structure database