
Features

1) Introduction

  • Domains are the building blocks of proteins and one of the most useful characteristics for determining protein function. The functions of the individual domains of a multidomain protein contribute to our understanding of the properties of the protein as a whole. The sequential order of protein domains is known as the domain architecture. Architectures are useful for classifying evolutionarily related proteins, detecting evolutionarily distant homologs, and comparing multidomain proteins.

  • DAhunter is a new web-based server that identifies homologous proteins by comparing the sequence of domains (domain architecture). DAhunter considers promiscuous domains (domains that typically carry out auxiliary functions and appear in many unrelated proteins), which are not directly related to homology.

  • To detect promiscuous domains, we assigned each domain extracted from RefSeq proteins a weight score based on its abundance and versatility. We used a domain's score to represent its importance in the protein world. In scoring domains, we considered domain combinations as well as single domains. We use (1) the cosine similarity, (2) the Goodman-Kruskal gamma function, and (3) the domain duplication index to measure the similarity of a pair of domain architectures.

2) Datasets

3) How to assign a weight score to each domain unit

  • To measure the abundance and versatility of each domain unit, we use the IAF (Inverse Abundance Frequency) and the IVF (Inverse Versatility Frequency) of the domain unit. The basic idea of the IAF and IVF is derived from the IDF (Inverse Document Frequency).

                 Domain unit weight score = IAF*IVF
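
The page does not give the exact IAF and IVF formulas, so the following is only a minimal sketch under stated assumptions: IAF is taken as an IDF-style log ratio of all proteins to the proteins containing the domain, and IVF as the reciprocal of the number of distinct neighbouring domains. The function name `domain_weights` and both formulas are hypothetical illustrations, not DAhunter's implementation.

```python
import math
from collections import defaultdict

def domain_weights(architectures):
    """Return a hypothetical IAF*IVF weight for every domain.

    architectures: list of domain-name lists, one per protein,
    in N- to C-terminal order.
    """
    total = len(architectures)
    containing = defaultdict(int)   # number of proteins that contain the domain
    neighbours = defaultdict(set)   # distinct domains seen adjacent to the domain

    for arch in architectures:
        for domain in set(arch):
            containing[domain] += 1
        for left, right in zip(arch, arch[1:]):
            if left != right:
                neighbours[left].add(right)
                neighbours[right].add(left)

    weights = {}
    for domain, count in containing.items():
        # ASSUMPTION: IDF-style abundance term (rare domains score high).
        iaf = math.log2(total / count)
        # ASSUMPTION: versatility term (domains with many partners score low).
        ivf = 1.0 / max(1, len(neighbours[domain]))
        weights[domain] = iaf * ivf
    return weights
```

Under these assumptions, a domain that occurs in many proteins with many different partners (a promiscuous domain) receives a low weight, while a rare domain with few partners receives a high one.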

4) How to compare two domain architectures
  • DAhunter searches for the best-matched domain architecture in the domain architecture database, which is built from RefSeq proteins, UniProtKB/Swiss-Prot, and UniProtKB/TrEMBL.

  • DAhunter compares three features of domain architectures.

          - domain unit content (x): the Cosine similarity.

          - domain order (y): the Goodman-Kruskal gamma function.

          - domain unit copy (z): the domain duplication index, whose definition is similar to that of the IDF.

  • Similarity score of two domain architectures (S) = x + a*y + b*z  (a=0.8, b=0.3)

  • The domain architecture with the maximum score is taken as the most similar domain architecture.
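
The scoring scheme above can be sketched as follows. This is an illustration under assumptions, not DAhunter's code: domain unit weights are passed in as plain dictionaries, the gamma statistic is computed over the first occurrence of each shared domain, and the duplication index z is supplied by the caller because its formula is not given on this page. All function names are hypothetical.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity of two weight vectors (dict: domain unit -> weight)."""
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def goodman_kruskal_gamma(arch1, arch2):
    """Gamma = (C - D) / (C + D) over pairs of shared domains.

    A pair is concordant (C) if both architectures list the two domains
    in the same relative order, discordant (D) otherwise.
    ASSUMPTION: only the first occurrence of each domain is used.
    """
    shared = [d for d in dict.fromkeys(arch1) if d in arch2]
    concordant = discordant = 0
    for p, q in combinations(shared, 2):
        order1 = arch1.index(p) - arch1.index(q)
        order2 = arch2.index(p) - arch2.index(q)
        if order1 * order2 > 0:
            concordant += 1
        elif order1 * order2 < 0:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0

def similarity_score(x, y, z, a=0.8, b=0.3):
    """S = x + a*y + b*z with the parameter values stated above."""
    return x + a * y + b * z
```

For example, `goodman_kruskal_gamma(["A", "B", "C"], ["A", "C", "B"])` has two concordant pairs and one discordant pair, giving (2 - 1) / 3.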

5) How to fix the parameters and evaluate DAhunter
  • Determining the parameters a and b

- To fix the parameters a and b of the similarity score, we used HomoloGene DB release 61, which contains 44,481 groups. (1) From these groups, we obtained 8,290 domain architectures from 5,215 groups having more than 2 architectures. (2) We carried out 8,290 tests. In each test, one of the 8,290 domain architectures was compared to the other 8,289 while a and b were varied from 1.0 to 0.0 in steps of 0.1. (3) We chose 0.8 for a and 0.3 for b because these values produce the maximum number of best matches that fall in the same group as the query. To obtain the test results of DAhunter with HomoloGene DB for each a and b value, click here. We also tested with the COG database in a similar manner. The user can download the test results from here.
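
The parameter scan described above can be sketched as a simple grid search. This is a hedged illustration only: it assumes the three components (x, y, z) have already been computed for every pair of architectures, and the function name `grid_search`, the data layout, and the tie-breaking rule (first pair of values wins) are all hypothetical.

```python
def grid_search(components, groups):
    """Scan a and b over 0.0..1.0 in steps of 0.1.

    components[i][j] = (x, y, z) for architectures i and j.
    groups[i]        = homology group label of architecture i.
    Returns (hits, a, b) for the first setting with the most queries
    whose best-scoring match belongs to the query's own group.
    """
    n = len(groups)
    best = None
    for a_step in range(11):
        for b_step in range(11):
            a, b = a_step / 10, b_step / 10   # integer steps avoid float drift
            hits = 0
            for i in range(n):
                # Best match for query i among all other architectures.
                j = max(
                    (k for k in range(n) if k != i),
                    key=lambda k: (components[i][k][0]
                                   + a * components[i][k][1]
                                   + b * components[i][k][2]),
                )
                hits += groups[j] == groups[i]
            if best is None or hits > best[0]:
                best = (hits, a, b)
    return best
```

With 8,290 architectures this brute-force form performs roughly 121 full all-against-all rankings, so a real run would likely cache the (x, y, z) components once and only recombine them per setting, as the sketch does.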

  • Evaluation

- To evaluate the DAhunter algorithm, we compared the DAhunter results (a=0.8, b=0.3) with the PDART results (a=0.36, b=0.01, c=0.63).

 

52 Eoeun-dong,Yuseong-gu,Daejeon, 305-333, Korean Bioinformation Center (KOBIC)
TEL: 82-42-879-8511 FAX: 82-42-879-8535 bulee@kribb.re.kr