Protein Analysis Toolkit

If you use this server for research projects, please cite:
Gracy J. and Chiche L. (2005) PAT: a protein analysis toolkit for integrated biocomputing on the web. Nucleic Acids Research. Web server issue, 33, W65-71.

PAT help



1. Quick start

  • All server functionalities are accessible by clicking on the five buttons located in the top bar.
  • Click the Input button to choose the first tool you want to launch and its associated input data.
    From there you will be able to create pipelines and complex queries (macros).
  • Click the Output button to get the results of your input queries.
  • Click the Macro button to retrieve standard or previously defined macros.
  • Click the Help button to retrieve this help file.
  • Click the Mail button to send bugs and suggestions to the author.


 

2. Overview

PAT (for Protein Analysis Toolkit) is an integrated bio-computing server. The main goal of its design was to facilitate the combination of different processing tools for more complex protein analyses. To this end, the PAT implementation has the following characteristics:

  • PAT is able to retrieve protein entries from many databases using specific identifier or accession number indexes.
  • PAT is able to launch many processing tools dedicated to 1D, 2D and 3D protein analysis using specific wrappers.
  • PAT is able to read and write biological data in many bioinformatic formats using specific parsers and dumpers.
  • PAT is able to redirect the output of one tool to the input of another tool by a seamless data format translation using appropriate (parser, dumper) pairs.
  • PAT is able to perform complex analyses by the combination of different processing tools via checkboxes and drop-down menus, or using a dedicated macro language.

A typical PAT query is composed of two main steps:

  1. Click the Input button to choose an analysis tool and its associated input data and options. Depending on the tool, checkboxes may be available to select similar tools for parallel processing. A redirection menu lets you send the output to other data-compatible tools or formatting options. Several redirections can be selected in succession this way. Processing of the corresponding script (i.e. user-defined macro) is then launched by clicking the Run button at the bottom of the form. The Macro button provides direct access to previous user-defined macros as well as a few standard macros.
  2. Click the Output button to retrieve the results from a table where the history of the successive analyses is stored.



3. Input

  • Analysis tools

    All available protein analysis tools can be accessed by clicking the top Input button.
    (Table 1) lists all available tools including literature references or related URLs.
    Tools currently cover the following topics:
    • Primary sequence analysis
    • Sequence similarity search
    • Sequence alignment
    • Sequence motif search
    • Phylogeny
    • Secondary structure prediction
    • Non globular structure prediction
    • Solvent accessibility prediction
    • Tertiary structure analysis
    • Tertiary structure display
    • Tertiary structure superposition
    • Tertiary structure modeling
    • Tertiary structure evaluation
    • Cellular localization
    • Output format

    Each tool form can be accessed by clicking its name in this tool table. The tool form is composed of three parts. A top Tool header gives a short description of the tool and internet links to the native web site and help file. The Input form lets you type, paste, upload and edit the protein query. The Option form is for selecting the main options of the current analysis tool. It is complemented by a text input area for specifying additional options using the syntax "-option1 value1 -option2 value2". The optional Parallel processing form lets you choose additional similar tools to launch at the same time on the current input. The output of this processing can be sent to further calculations or formatting by using the Output redirection menu. In the latter case, the Input page will be recreated with additional forms for option selections or further redirection.

    When no more redirection is desired, the whole processing can be launched by clicking the Run button. The analysis is then started using an independent process launched on the server and one or several processing lines are added on top of the Output table. Don't forget to click the "Output" button in the top bar to refresh the output table.

     

  • Input data fields

    The Input form lets you specify the protein data that should be analysed. It includes one multi-line text field where you can type or paste your input according to the syntax described below and one file field from which you can choose and upload one of your local files as input data. Filling at least one of these two fields is mandatory.
  • Text field

    The first field of the input form is a text field where you can type protein identifiers or accession numbers, each one on a different line, or some text following one of the formats (FASTA, MSF, PDB, ...) described in the format section, or old output identifiers. You can mix data of different formats in the text field but you should separate them by lines consisting of the single word '#!PAT'. This text field is not case sensitive.
  • Protein identifiers or accession numbers

    In the text field, you can type protein or family identifiers (e.g., EGF_HUMAN or tRNA-synt_1) or accession numbers (e.g., P01133 or PF00133) from the following databases: SWISSPROT, TREMBL, PDB, SCOP, PFAM or DOMO. PAT has indexes for automated retrieval of entries from each of the above databases and appropriate parsers for extracting the sequences and other fields from these entries. Protein chains from the PDB can be specified by concatenating the protein and chain identifiers (e.g., 1reiA). If you want to restrict the processing to a protein segment, you can specify its first and last positions by appending "/begin-end" to the protein name (e.g., EGF_HUMAN/15-42 or 1reiA/35-65).
  • Formatted data

    If your protein data does not belong to the indexed databases, you can type or paste your data (i.e. your sequence, your alignment or your PDB file) directly in the text field using one of the formats readable by PAT. PAT can parse input data in any format listed in the "Output format" category of the tool table (click the top Input button to see the list of readable formats). PAT can also parse the output of most available tools (e.g. PSIBLAST, HMM, PREDATOR, STRIDE, etc).
  • Pipelines

    A very effective feature of PAT is its ability to automatically redirect outputs, i.e. to reuse results previously computed by PAT as input data of other tools. This can be achieved in several ways:
    • using the "Output redirection" menu in the tool form
    • using the menus in the "Redirection" column of the output table
    • using the "Use any previous input or output" menu in a tool input area (if any)
    • or by directly typing previous input or output labels (i.e. Ix or Ox, where x is the input/output number) in the text input area of a tool.
  • Uploading a file

    Instead of directly typing or pasting data in the text field, you can upload a file from your local computer using the file field below the text field. You can either type your full local file path in the input line or select the file through interactive directory browsing. The content of the uploaded file should conform to the syntax of the text field explained above.
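
As a rough illustration of the text-field syntax described in this section, the Python sketch below splits pasted input into blocks at '#!PAT' separator lines and recognises previous input/output labels (Ix, Ox). The function names are hypothetical and not part of PAT, and the case-insensitive matching is an assumption based on the statement that the text field is not case sensitive.

```python
import re

SEPARATOR = "#!PAT"  # separator line between data of different formats
LABEL_RE = re.compile(r"^[IO]\d+$", re.IGNORECASE)  # Ix / Ox history labels


def split_blocks(text):
    """Split text-field content into blocks separated by '#!PAT' lines."""
    blocks, current = [], []
    # A trailing sentinel separator flushes the last block.
    for line in text.splitlines() + [SEPARATOR]:
        if line.strip().upper() == SEPARATOR:
            block = "\n".join(current).strip("\n")
            if block.strip():
                blocks.append(block)
            current = []
        else:
            current.append(line.rstrip())
    return blocks


def is_history_label(line):
    """Return True if the line looks like a previous input/output label."""
    return bool(LABEL_RE.match(line.strip()))
```

Each returned block can then be handled according to its own format (identifier list, FASTA, PDB, and so on).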

4. Output table

The output table stores the session history of all your queries. When starting a new session, this table is empty. Each time you send a new query, one or several lines are added on top of previous outputs. The table reminds you of all the protein analyses you have already done, informs you of the status of ongoing processes and lets you access any input/output data at any moment. You can check whether a running process is completed by clicking the 'Output' button in the top bar. The output table has seven columns:

  • The first column provides checkboxes to select/deselect lines for deletion or synthetic reports.
  • The "Input" column lists all the input data you have provided to the PAT server. Each input is described by an identifier ("I" concatenated with a number). Clicking on the input identifier will display, from top to bottom, the full data input, some tool information, and the corresponding output.
  • The "Tool" column lists every tool processed by PAT with optional parameter values.
  • The "Output" column lists every query result. If the query is completed and the process output available, the output is described by an identifier ("O" concatenated with a number). Clicking on the output identifier will display, from top to bottom, the results, some tool information, and the corresponding input. If the tool query is not completed, the "Output" column indicates "Processing..." or "Queued..." depending on the current process status. It is then possible to check whether the query result is available by clicking the top "Output" button again. It is possible to launch multiple queries in parallel (our current web server has four processors) before the current process has ended.
  • The "Bytes" column lists the number of bytes of each result file once the process has been completed.
  • The "Time" column lists the number of CPU seconds consumed by each process.
  • The "Redirection" column provides drop-down menus to redirect outputs to data-compatible tools, if any.

When clicked, the "Result synthesis" button below the output table generates an HTML page collecting all selected analysis results. This page can then be saved on the client's disk using the Save option of the internet browser. The "Delete rows" button lets you delete the selected analysis results from the output table.


5. Macros

The redirection and parallel processing facilities of the Input or Output menus let the user build complex queries simply with mouse clicks. By doing this, the user actually creates scripts (called macros) which are based on a dedicated macro language. The macro language is simple yet powerful. It is based on two operators: the concatenation symbol ',' and the pipelining symbol '|'. The concatenation symbol ',' lets you perform simultaneous analyses with different tools and collect them into a single global output. Each tool can be parametrized by setting some options with the syntax "tool -option1 value1 -option2 value2 ...". The pipeline symbol '|' asks for the transfer of the output of the tool(s) launched before the symbol '|' to the input of the tool(s) specified after this symbol. It should be noted that this symbol has a similar meaning as in the Unix world, although in our case the transfer seamlessly involves automated data reformatting and/or index-based protein sequence or structure retrievals.

The Macro menu lets the user directly access all previously defined macros without going through the drop-down menus again. In addition, several standard macros that correspond to typical protein analyses are available. Any selected macro can be hand-edited by the user before execution for fine tuning.

  • Macro 1: dsc, simpa96, predator, psipred, seg, ncoils, tmpred, signalp | consensus | color
    This macro launches different local structure prediction tools, then adds a consensus from obtained secondary structure predictions and transmembrane segment predictions, then formats the resulting output using the coloring tool COLOR.
  • Macro 2: wublast2 -d pdb_seq | sim2ali | mview
    This macro looks for sequence homologs in the PDB database with WUBLAST2, collecting many PDB sequences, then builds a multiple alignment from all pairwise sequence alignments found using SIM2ALI, then creates an HTML page using the MVIEW formatting tool.
  • Macro 3: clustalw | bionj | atv
    This macro aligns the input protein sequences using CLUSTALW, then builds a phylogenetic tree from this alignment using BIONJ, then displays the resulting tree using the applet viewer ATV.
  • Macro 4: ce | profit -out pdb | jmol
    This macro builds a structural alignment of two input protein structures using CE, then fits them in 3D using PROFIT, then displays the superimposed structures using the applet JMOL. It should be noted that if the input data are protein sequences, the closest homologs found in the Protein Data Bank will be used as input instead.
  • Macro 5: pdbgeo, eval23d, verify3d | color
    This macro extracts 3D features using the home-made software PDBGEO, evaluates the 1D-3D compatibility using the statistical potentials from EVAL23D and VERIFY3D, and then tabulates the gathered information in COLOR format. 3D data can be automatically inferred as done in macro 4.
  • Macro 6: wublast2 | cdhit | muscle | mview
    This macro searches query similarities using WUBLAST2, selects representative homologs with CDHIT, aligns them with MUSCLE, then displays the resulting multiple alignment using MVIEW.
  • Macro 7: wublast2 | cdhit | seqname | dsc | selex
    This macro searches query similarities using WUBLAST2, selects representative homologs with CDHIT, retrieves the corresponding whole sequences, predicts their secondary structures with DSC, and then prints all sequences and predictions using the SELEX format.
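
To make the two operators concrete, here is a minimal Python sketch that breaks a macro string into pipeline stages (split on '|') and parallel tool calls within each stage (split on ','). It is only an illustration of the syntax described above, not PAT's actual parser: the function name is hypothetical, and it assumes that every option takes exactly one value and that values contain no ',' or '|' characters.

```python
def parse_macro(macro):
    """Parse a macro into a list of stages.

    Each stage is a list of (tool, options) tuples, where options maps
    option names (without the leading '-') to their values.
    """
    stages = []
    for stage_text in macro.split("|"):          # '|' separates pipeline stages
        stage = []
        for call_text in stage_text.split(","):  # ',' separates parallel tools
            tokens = call_text.split()
            if not tokens:
                raise ValueError(f"empty tool call in {stage_text!r}")
            tool, args = tokens[0], tokens[1:]
            if len(args) % 2 != 0:
                raise ValueError(f"option without value in {call_text!r}")
            options = {}
            for name, value in zip(args[0::2], args[1::2]):
                if not name.startswith("-"):
                    raise ValueError(f"expected an option, got {name!r}")
                options[name[1:]] = value
            stage.append((tool, options))
        stages.append(stage)
    return stages
```

For example, parsing Macro 2 yields three stages, the first of which is the single call to wublast2 with the option d set to pdb_seq.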

6. Restrictions

This server is still in development. Given the variety of tasks and combinations performed by this server, we are conscious that bugs remain to be corrected. For this reason, bug reports and suggestions are very welcome and can be submitted to the author by e-mail. If you detect a bug, please indicate your session number (check the web address of your last server call and look for the "dir" value) and the number of the query concerned in the Output table.

We log each server query for debugging purposes only. Each session history and its associated data (input, queries, output) are stored in a temporary server workspace and are periodically deleted to prevent saturation. The typical lifetime of one session file on our server is one day but it may vary depending on the server load. It is therefore safer to copy and save your session results on your own disks.

Some tasks may require high resource consumption. To prevent the saturation of our server, we set different limits to the resources consumed by each process. The processing of your query will be automatically aborted if any one of the following conditions is met:

  • CPU time is larger than 10 minutes.
  • The process requires more than 500 MB of RAM.
  • The output files are bigger than 10 MB.

7. Data formats

PAT accepts many data formats in the input text area of each analysis tool. The major standard bioinformatic formats described below can be read by PAT. PAT is also able to read as input the output of many analysis tools, which is necessary for output redirections.


7.1 SEQNAME format:

Each line includes a single protein or family name. The corresponding protein sequences will be automatically retrieved from the databases indexed by PAT (currently SWISSPROT, TREMBL, PDB or PFAM). If one protein is missing or out of date, please e-mail me.

Example:

 2ETI
 4CPAI

7.2 SEGMENT format:

Each line includes a protein name, followed by a '/' symbol, followed by the first residue sequence position, followed by a '-' symbol, followed by the last residue sequence position.

Example:

 2ETI/1-15
 4CPAI/10-28
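
The SEQNAME and SEGMENT syntaxes can be checked with a small regular expression. The Python sketch below is an illustration only: the function name is hypothetical, and it assumes that positions are positive integers with begin not greater than end, which the text above does not state explicitly.

```python
import re

# name, optionally followed by "/begin-end"
SEGMENT_RE = re.compile(r"^(?P<name>[^/\s]+)(?:/(?P<begin>\d+)-(?P<end>\d+))?$")


def parse_segment(line):
    """Return (name, begin, end); begin and end are None for a bare name."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a SEQNAME or SEGMENT line: {line!r}")
    name = match.group("name")
    if match.group("begin") is None:
        return name, None, None
    begin, end = int(match.group("begin")), int(match.group("end"))
    if begin < 1 or end < begin:
        raise ValueError(f"invalid segment range in {line!r}")
    return name, begin, end
```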

7.3 FASTA format:

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.

Reference: http://tigrblast.tigr.org/web-hmm/fasta.html

Example:

 >PRTZ_BOVIN VITAMIN K-DEPENDENT PROTEIN Z.
 AGSYLLEELFEGHLEKECWEEICVYEEAREVFEDDETTDEFWRTYMGGSPCASQPCLNNGSCQDSIRGYACTCAPGYEGP
 NCAFAESECHPLRLDGCQHFCYPGPESYTCSCARGHKLGQDRRSCLPHDRCACGTLGPECCQRPQGSQQNLLPFPWQVKL
 TNSEGKDFCGGVLIQDNFVLTTATCSLLYANISVKTRSHFRLHVRGVHVHTRFEADTGHNDVALLDLARPVRCPDAGRPV
 CTADADFADSVLLPQPGVLGGWTLRGREMVPLRLRVTHVEPAECGRALNATVTTRTSCERGAAAGAARWVAGGAVVREHR
 GAWFLTGLLGAAPPEGPGPLLLIKVPRYALWLRQVTQQPSRASPRGDRGQGRDGEPVPGDRGGRWAPTALPPGPLV

Blank lines are not allowed in the middle of FASTA input. Sequences are expected to be represented in the standard IUB/IUPAC amino acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length. Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., X for unknown amino acid residue). The accepted amino acid codes are:

       A  alanine                         P  proline
       B  aspartate or asparagine         Q  glutamine
       C  cysteine                        R  arginine
       D  aspartate                       S  serine
       E  glutamate                       T  threonine
       F  phenylalanine                   
       G  glycine                         V  valine
       H  histidine                       W  tryptophan
       I  isoleucine                      Y  tyrosine
       K  lysine                          
       L  leucine                         X  any
       M  methionine                      
       N  asparagine                      -  gap of indeterminate length
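
The FASTA rules above (no blank lines inside the input, lower case mapped to upper case, no digits, only the listed codes) can be checked before submitting a query. The following Python sketch is one possible way to do so; the function name is hypothetical, and the accepted alphabet is simply copied from the table above rather than taken from PAT's own parser.

```python
# One-letter codes listed in the table above, plus '-' for a gap.
ACCEPTED_CODES = set("ABCDEFGHIKLMNPQRSTVWYX-")


def read_fasta(text):
    """Parse FASTA text into a list of (description, sequence) pairs."""
    records = []
    description, chunks = None, []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            raise ValueError("blank lines are not allowed inside FASTA input")
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        else:
            if description is None:
                raise ValueError("sequence data found before a '>' line")
            sequence = line.upper()  # lower-case letters are accepted
            bad = set(sequence) - ACCEPTED_CODES
            if bad:
                raise ValueError(f"unsupported characters: {sorted(bad)}")
            chunks.append(sequence)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records
```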

7.4 PDB format:

A PDB entry consists of many field types. The two required field types are the HEADER field, which should include the protein identifier, and the ATOM fields, which describe the spatial coordinates of each protein atom.

Reference: http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html

Example (fragment):

 HEADER    PROTEIN INHIBITOR                       15-JUL-91   2ETI      2ETI   2
 ATOM      1  N   GLY     1      10.340   0.136  -1.967  1.00  0.00      2ETI  88
 ATOM      2  CA  GLY     1       9.711  -1.190  -1.866  1.00  0.00      2ETI  89
 ATOM      3  C   GLY     1       8.303  -1.083  -1.299  1.00  0.00      2ETI  90
 ATOM      4  O   GLY     1       7.612  -0.086  -1.493  1.00  0.00      2ETI  91
 ATOM      5 1H   GLY     1      10.288   0.521  -1.034  1.00  0.00      2ETI  92
 TER
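
ATOM fields use fixed column positions, so they are best read by slicing rather than by splitting on whitespace. The Python sketch below extracts a few values from one ATOM line using the standard PDB columns (serial 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z 31-54). The function name is hypothetical, and the line must start at column 1, i.e. without the leading space used to indent the example above.

```python
def parse_atom_line(line):
    """Extract selected fields from one PDB ATOM record (fixed columns)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21].strip(),        # column 22 (may be blank)
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }
```

Applied to the second ATOM line of the fragment above, this returns the CA atom of GLY 1 with a blank chain label.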

7.5 SELEX format:

Each line includes a protein name followed by its one-letter coded amino acid sequence. Protein sequences can be aligned or not.

Examples:

 2ETI                         GCPRILMRCKQDSDCLAGCVCGPNGFCG
 4CPAI                        ZHADPICNKPCKTHDDCSGAWFCQACWNSARTCGPYV
 or
 ITR2_MOMCH/1-28              ....RICPRIWMECKRDSDCMAQ..C.ICVD..GHCG...
 4CPAI/1-37                   ZHADPICN...KPCKTHDDCSGAWFCQACWNSARTCGPYV
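
A SELEX-like block of this kind is simple to read line by line. The Python sketch below is a hedged illustration with hypothetical function names: it assumes one "name sequence" pair per line and treats the input as aligned when all sequences, gap characters included, have the same length.

```python
def parse_selex(text):
    """Return a list of (name, sequence) pairs from SELEX-formatted text."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # ignore empty lines
        if len(parts) < 2:
            raise ValueError(f"missing sequence on line: {line!r}")
        # Join any further columns in case the sequence contains spaces.
        records.append((parts[0], "".join(parts[1:])))
    return records


def is_aligned(records):
    """True if all sequences share one length (gaps included)."""
    return len({len(sequence) for _, sequence in records}) <= 1
```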

7.6 PIR format:

Each entry has a header line beginning with '>', followed by a single description line, followed by an arbitrary number of lines describing the whole protein sequence. The last sequence line is terminated with a '*' symbol. After the '>' symbol, the header line continues with P1 (for protein), followed by the ';' symbol, followed by the protein name. The protein description line is composed of fields separated by ':' symbols. As in the example below, the fields are, in order: the entry type, the structure file path, the first amino acid number and its chain label, the last amino acid number and its chain label, and a short protein description.

Example:

 >P1;2ETI
 structure:/tmp/2ETI.pdb:1: :28: :trypsin inhibitor II - Ecballium elaterium
 GCPRILMRCKQDSDCLAGCVCGPNGFCG*
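
The layout of a PIR entry can be illustrated with a short parser. The Python sketch below follows the example above, in which the description line carries seven ':'-separated fields (a chain label follows both the first and the last residue number). The function name and the dictionary keys are hypothetical, and only a single entry is handled.

```python
def parse_pir(text):
    """Parse a single PIR entry into a dictionary."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) < 3 or not lines[0].startswith(">P1;"):
        raise ValueError("expected a '>P1;name' header, a description and a sequence")
    # Split at most 6 times so that ':' may appear in the free-text description.
    fields = [field.strip() for field in lines[1].split(":", 6)]
    if len(fields) != 7:
        raise ValueError("expected 7 ':'-separated description fields")
    sequence = "".join(lines[2:])
    if not sequence.endswith("*"):
        raise ValueError("sequence must be terminated with '*'")
    return {
        "name": lines[0][len(">P1;"):],
        "type": fields[0],
        "path": fields[1],
        "first": int(fields[2]),
        "first_chain": fields[3],
        "last": int(fields[4]),
        "last_chain": fields[5],
        "description": fields[6],
        "sequence": sequence[:-1],  # drop the terminating '*'
    }
```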

7.7 MSF format:

The entry consists of a header section describing each protein name and length followed by a block-based multiple sequence alignment.

Description URL: http://www.embl-heidelberg.de/predictprotein/Dexa/optin_msfDes.html

Example:

 PileUp
 
 
 
    MSF:  115  Type: P    Check:  1324   .. 
 
  Name: 1REIA oo  Len:  115  Check:  7925  Weight:  50.0
  Name: 2RHE oo  Len:  115  Check:  3399  Weight:  50.0
 
 //
 
 1REIA           DIQMTQSPSS LSASVGDRVT ITCQASQDII ..KYLNWYQQ TPGKAPKLLI 
 2RHE            ESVLTQPPS. ASGTPGQRVT ISCTGSATDI GSNSVIWYQQ VPGKAPKLLI 
 
 
 1REIA           YEASNLQAGV PSRFSGSGSG TDYTFTISSL QPEDIATYYC QQY.QSLPYT 
 2RHE            YYNDLLPSGV SDRFSASKSG TSASLAISGL ESEDEADYYC AAWNDSLDEP 
 
 
 1REIA           .FGQGTKLQI T....
 2RHE            GFGGGTKLTV LGQPK
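
The block structure of the alignment section can be reassembled into one gapped sequence per protein. The Python sketch below is a simplified illustration with a hypothetical function name: it ignores the header, starts reading after the '//' line, and assumes that every non-empty alignment line begins with a sequence name, as in the example above (files containing extra position-number lines would need additional handling).

```python
def parse_msf(text):
    """Return {name: gapped sequence} from the alignment part of an MSF entry."""
    sequences = {}
    in_alignment = False
    for line in text.splitlines():
        line = line.strip()
        if line == "//":          # end of header, start of alignment blocks
            in_alignment = True
            continue
        if not in_alignment or not line:
            continue
        name, *chunks = line.split()
        # Append this block's residues to what was read in earlier blocks.
        sequences[name] = sequences.get(name, "") + "".join(chunks)
    return sequences
```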

7.8 TREE format:

The entry describes a classification tree. Each subtree is enclosed in a pair of parentheses. The tree branches or edges are separated by the symbol ','. Each edge length is given after the symbol ':'. Each leaf is labeled with its associated protein segment name. The line ends with the ';' symbol.

Example:

 (1EGF/1-54:459.5,(2ETI/1-54:56.5,2CTI/1-54:56.5):93.5,4CPAI/1-54:217.5);
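
The nesting rules above can be followed with a small recursive parser. The Python sketch below is an illustration only, with hypothetical function names; it assumes a single-line tree without whitespace or quoted labels, as in the example above, and represents each node as a (name, length, children) tuple.

```python
def parse_tree(text):
    """Parse a TREE line into nested (name, length, children) tuples."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a tree must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":              # a subtree in parentheses
            pos += 1
            children.append(node())
            while text[pos] == ",":       # sibling branches
                pos += 1
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos
        while text[pos] not in ",():;":   # optional node label
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":              # optional edge length
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaves(tree):
    """List (name, length) for every leaf, in left-to-right order."""
    name, length, children = tree
    if not children:
        return [(name, length)]
    return [leaf for child in children for leaf in leaves(child)]
```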

7.9 XML format:

Each top-level <set> entry is a container for one or more <seg> entries, each one describing a protein segment. Each <seg> entry is associated with a <seq> entry that includes the protein identifier, accession number, source database, short description, and sequence length (number of amino acids), with respective tags <id>, <acc>, <db>, <des> and <len>. The <beg> and <end> tags give the first and last amino acid sequence positions of the protein segment. The protein sequence is listed in the <str> field.

Example:

 <set>
   <seg>
     <seq>
       <id>2ETI</id>
       <acc>2ETI</acc>
       <db>nrl3d</db>
       <des>trypsin inhibitor II - Ecballium elaterium</des>
       <len>28</len>
     </seq>
     <beg>1</beg>
     <end>28</end>
     <str>GCPRILMRCKQDSDCLAGCVCGPNGFCG</str>
   </seg>
 </set>
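
Entries in this XML format can be read with any standard XML library. The Python sketch below uses the standard-library module xml.etree.ElementTree and the tag names described above; the function name and the returned dictionary layout are hypothetical, and all tags are assumed to be present in every <seg> entry.

```python
import xml.etree.ElementTree as ET


def parse_segments(xml_text):
    """Return one dictionary per <seg> entry of a <set> document."""
    root = ET.fromstring(xml_text.strip())
    segments = []
    for seg in root.iter("seg"):
        seq = seg.find("seq")
        segments.append({
            "id": seq.findtext("id"),
            "acc": seq.findtext("acc"),
            "db": seq.findtext("db"),
            "des": seq.findtext("des"),
            "len": int(seq.findtext("len")),
            "beg": int(seg.findtext("beg")),
            "end": int(seg.findtext("end")),
            "str": seg.findtext("str"),
        })
    return segments
```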

8. Session examples

The links below will direct you to "Output" tables corresponding to short session examples. Since these are examples that should not be modified, additional queries or redirections cannot be executed from there.
Click on the input (Ix) or output (Ox) identifiers in the output tables to display the corresponding input or output, respectively.

  • Example 1:

    This session computes and displays several structure predictions on one protein.

  • Example 2:

    This session aligns several proteins, applies two prediction methods on each protein, makes sequence and prediction consensus, then displays all results as a colored multiple alignment.

  • Example 3:

    This session aligns two protein sequences with CLUSTALW, evaluates the compatibility between the aligned sequences and the structure 1reiA with EVAL23D, EVDTREE and VERIFY3D, and displays an aligned and colored output of all structural evaluations.

  • Example 4:

    This session applies the predefined macro 4 from the "Macro" menu to two protein structures.
    The two structures are first structurally aligned with CE. Then the structures are fitted by PROFIT using the CE alignment, and the superimposed structures are displayed with the interactive applet Jmol.

  • Example 5:

    This session performs a similarity search for a Rossmann motif with WUBLAST2.
    As many similar sequences were found, a representative sequence subset was selected with CDHIT. Sequences were then aligned using CLUSTALW, and PREDATOR was used to predict secondary structure. Sequences, predictions and consensus are displayed using the COLOR format (Output O6).
    The CLUSTALW result (Output O3) was then sent to BIONJ and ATV using the popup menu in the corresponding line of the Redirection column. This created outputs O7 (the BIONJ result) and O8. Clicking the O8 link will display the calculated phylogenetic tree with the ATV applet.
    To get sequence similarities, the CLUSTALW result was also sent to MATRIX (Output O9).

  • Example 6:

    This session predicts non-globular regions of a protein (signal peptide, transmembrane segments, coiled-coils, low complexity regions), then displays all results aligned under the sequence. Transmembrane segments detected by a consensus of all transmembrane predictors are indicated in the Consensus.mm segment and regions detected as potentially non globular by any tool are marked by 'X' in the gb.GLOBULAR prediction segment.