Database establishment
Homologous families and superfamilies were named according to the Cytochrome P450 Homepage [6] and populated with consistently named CYP sequences from the first version of the CYPED. Thus, seed sequences for almost 400 superfamilies could be identified. Positions 1-499 were annotated as the P450 domain to avoid loading reductases into the CYPED when updating fusion enzymes. For each seed sequence, a BLAST search [8] was performed against the non-redundant sequence database at NCBI (http://www.ncbi.nlm.nih.gov) with an E-value cutoff of 10^-100. For each hit, information on the sequence, position-specific annotations, functional descriptions, and the source organism was extracted and loaded by an automated retrieval system into an in-house relational database system [7]. In 28% of the entries, the correct CYP name according to Nelson's classification [9] was provided in the NCBI database entry. In 1% of the entries, the name contradicted the sequence similarity, and the protein was therefore re-assigned. A further 1% of the proteins had a name that does not exist in Nelson's classification and were therefore assigned to the most similar existing family. Entries lacking information on the CYP name were assigned to a family by sequence similarity. Thus, 64% of the proteins, which had not been classified before, could be assigned. All sequences that were assigned based only on sequence similarity were labelled "homologous protein of family X (BY SIMILARITY)". In total, 218 proteins without CYP name information and without sequence similarity to existing families, as well as 279 protein fragments, were discarded. Following this procedure, the entries of the CYPED are consistent with the recommendations of the nomenclature committee.
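The naming rules described above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not the retrieval system used for the CYPED. The function names, the status strings, and the handling of the two cases the text leaves open (a known name without any similarity hit, and an unknown name without any similarity hit) are assumptions.

```python
def similarity_label(family):
    """Label used for entries assigned only by sequence similarity."""
    return f"homologous protein of family {family} (BY SIMILARITY)"


def assign_family(declared_name, most_similar_family, known_families,
                  is_fragment=False):
    """Return (family, status), or None if the entry is discarded.

    declared_name       -- CYP name found in the source entry, or None
    most_similar_family -- family of the best similarity hit, or None
    known_families      -- set of names that exist in the classification
    All parameter names and status strings are hypothetical.
    """
    if is_fragment:
        return None  # protein fragments are discarded
    if declared_name is None:
        if most_similar_family is None:
            return None  # no name and no similarity: discarded
        return most_similar_family, "by_similarity"
    if declared_name not in known_families:
        if most_similar_family is None:
            return None  # assumption: nothing to fall back on
        return most_similar_family, "reassigned"
    if most_similar_family is None:
        return declared_name, "name_confirmed"  # assumption: keep the name
    if declared_name != most_similar_family:
        return most_similar_family, "reassigned"  # name contradicts similarity
    return declared_name, "name_confirmed"
```

For example, an entry without a CYP name whose best hit belongs to a hypothetical family "CYP3A" would be returned as ("CYP3A", "by_similarity") and could then be labelled with similarity_label("CYP3A").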
Sequence entries that originate from the same organism and share a sequence identity of at least 98% were assigned to a single protein entry. For proteins with multiple sequence entries, the longest sequence was defined as the reference sequence of the respective protein. Protein structures were downloaded from the Protein Data Bank (PDB) [17] and stored as structural monomers. Secondary structure information was calculated using DSSP [18]. Information on structurally or functionally relevant residues was extracted from GenBank and annotated in the CYPED.
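The grouping of sequence entries into protein entries can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify the clustering strategy, so a greedy comparison against the reference sequence is assumed here, and the identity function is a stand-in based on Python's difflib rather than an alignment-based sequence identity. All function and field names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def identity(a, b):
    """Stand-in similarity in [0, 1]; NOT an alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def group_into_proteins(entries, threshold=0.98, identity_fn=identity):
    """Group (entry_id, organism, sequence) tuples into protein entries.

    Entries from the same organism whose identity to a cluster's reference
    sequence is at least `threshold` join that cluster. The longest sequence
    of each cluster is its reference (assumed greedy strategy).
    """
    by_organism = defaultdict(list)
    for entry_id, organism, seq in entries:
        by_organism[organism].append((entry_id, seq))

    proteins = []
    for organism, items in by_organism.items():
        clusters = []
        # Longest first, so the first member of a cluster is its reference.
        for entry_id, seq in sorted(items, key=lambda x: len(x[1]),
                                    reverse=True):
            for cluster in clusters:
                if identity_fn(seq, cluster["reference_seq"]) >= threshold:
                    cluster["members"].append(entry_id)
                    break
            else:
                clusters.append({
                    "organism": organism,
                    "reference": entry_id,
                    "reference_seq": seq,
                    "members": [entry_id],
                })
        proteins.extend(clusters)
    return proteins
```

Because sequences are processed from longest to shortest, the reference sequence of a cluster never changes after the cluster is created. A production version would replace the stand-in identity function with an identity computed from a proper pairwise alignment.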
New features and functionalities
The current version of the CYPED also provides a feature page for each protein entry where the sequence is displayed and annotations are highlighted. A newly developed dynamic web interface directly incorporates changes in the database.
The integration of protein databases is based on a common, unique key. A database-independent attribute of a protein that can be used like a primary key is the protein sequence itself. However, while a primary key has to be unique, sequences that belong to the same protein entry can vary slightly. To overcome this problem, an algorithm was implemented (figure 1) that allows sequences to be used directly as primary keys without requiring them to be completely identical; this was termed a metric primary key. For each CYPED entry, a BLAST search against the CPK database was performed. The BLAST hits were ranked by E-value, and a global pairwise alignment was performed [19]. The CPK entries with a sequence identity of more than 90% are displayed on the protein feature page, linking the corresponding CYPED sequence to the respective entries in the CPK. Thus, the sequences can be used as a common attribute of protein entries and serve as primary keys.
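The metric primary key idea can be illustrated with a short Python sketch. It assumes that BLAST hits are already available as (target_id, evalue, target_sequence) tuples, and it uses a plain Needleman-Wunsch alignment with unit scores (match +1, mismatch -1, gap -1) in place of the alignment program cited above. Identity is taken here as identical columns divided by alignment length, which is one of several common definitions. All names and scoring parameters are assumptions for illustration.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of a and b.

    Returns identical columns / alignment length (0.0 for empty input).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0
    # score[i][j]: best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal alignment, counting columns and matches.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return matches / columns


def link_entries(query_seq, hits, min_identity=0.90):
    """Return [(target_id, identity)] for hits above the identity threshold.

    hits: iterable of (target_id, evalue, target_seq); ranked by E-value here.
    """
    links = []
    for target_id, _evalue, target_seq in sorted(hits, key=lambda h: h[1]):
        ident = global_identity(query_seq, target_seq)
        if ident > min_identity:
            links.append((target_id, ident))
    return links
```

In this reading, two database entries are treated as referring to the same protein when their sequences lie within a chosen distance of each other (here, more than 90% identity), rather than when they are character-for-character equal. The quadratic memory use of this simple implementation makes it suitable only for illustration.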
For all sequence entries, the conserved secondary structures were predicted by a structure-based HMM profile embedded in an automated annotation program. The predictions are stored as annotations in the DWARF system and are displayed on the protein feature pages and within the multisequence alignments.
Information on human CYP alleles was extracted from the "Home Page of the Human Cytochrome P450 (CYP) Allele Nomenclature Committee" [15] and stored in dedicated tables in the database. The mutations and their effects, that is, whether the enzyme lacks activity or shows increased activity, are listed on the protein feature page.
Contents
The CYPED contains 11193 sequence entries for 8613 protein entries. The proteins have been assigned to 249 superfamilies and 619 homologous families. Structure information for 47 different proteins which originate from 36 different homologous families was extracted from 228 PDB entries.
In total, 3575 CYPED proteins matched the respective CPK entries with a sequence identity of more than 90%. These matches provided the links to 3257 different compounds (1699 substrates, 723 inducers and 1227 inhibitors). This information has been extracted from more than 10000 research papers cited in PubMed [14].
For each family, a multisequence alignment and a phylogenetic tree were generated by CLUSTALW [20]. The annotated version of the alignment is colour-coded and highlights functionally relevant sites and the predicted secondary structure. For each alignment, the degree of conservation of each column, as calculated by PLOTCON [21], is indicated at the bottom as a coloured chart. For each homologous family and superfamily, family-specific HMM profiles (http://hmmer.janelia.org/) are supplied.