Help
If you use this server for research projects, please cite:
Gracy J. and Chiche L. (2005) PAT: a protein analysis toolkit for integrated biocomputing on the web. Nucleic Acids Research. Web server issue, 33, W65-71.
PAT help
1. Quick start
- All server functionalities are accessible
by clicking on the five buttons located in the top bar.
- Click the Input button to choose the
first tool you want to launch and its associated input data.
From there you will be able to create pipelines and complex queries
(macros).
- Click the Output button to get the
results of your input queries.
- Click the Macro button to retrieve
standard or previously defined macros.
- Click the Help button to retrieve this
help file.
- Click the Mail button to send bug reports
and suggestions to the author.
2. Overview
PAT (for Protein Analysis Toolkit) is an integrated bio-computing
server. The main goal of its design was to facilitate the combination
of different processing tools for more complex protein analyses.
To this end, the PAT implementation has the following characteristics:
- PAT is able to retrieve protein entries from many databases
using specific identifier or accession number indexes.
- PAT is able to launch many processing tools
dedicated to 1D, 2D and 3D protein analysis using specific wrappers.
- PAT is able to read and write biological data in many bioinformatic
formats using specific parsers and dumpers.
- PAT is able to redirect the output of
one tool to the input of another tool by a seamless data format
translation using appropriate (parser, dumper) pairs.
- PAT is able to perform complex analyses by the combination
of different processing tools via checkboxes and drop-down menus,
or using a dedicated macro language.
A typical PAT query is composed of two main steps:
- Click the Input button to choose an analysis
tool and its associated input data and options. Depending on
the tool, checkboxes may be available to select similar tools
for parallel processing. A redirection menu lets you send the
output to other data-compatible tools or formatting options.
Several redirections can be selected successively in this way.
Processing of the corresponding script (i.e. user-defined macro)
is then launched by clicking the Run button at the bottom of
the form. The Macro button provides direct
access to previous user-defined macros as well as a few standard
macros.
- Click the Output button to retrieve the
results from a table where the history of the successive analyses
is stored.
3. Input
Analysis tools All available
protein analysis tools can be accessed by clicking the top Input
button.
Table 1 lists all available
tools, including literature references or related URLs.
Tools currently cover the following topics:
- Primary sequence analysis
- Sequence similarity search
- Sequence alignment
- Sequence motif search
- Phylogeny
- Secondary structure prediction
- Non globular structure prediction
- Solvent accessibility prediction
- Tertiary structure analysis
- Tertiary structure display
- Tertiary structure superposition
- Tertiary structure modeling
- Tertiary structure evaluation
- Cellular localization
- Output format
Each tool form can be accessed by clicking its name in this
tool table. The tool form is composed of three parts. A top Tool
header gives a short description of the tool and internet links
to the native web site and help file. The Input form lets
you type, paste, upload and edit the protein
query. The Option form lets you select the main
options of the current analysis tool. It is complemented by a
text input area for specifying additional options using the
syntax "-option1 value1 -option2 value2". The optional
Parallel processing form lets you choose additional similar
tools to launch at the same time on the current input. The output
of this processing can be sent to further calculations or formatting
by using the Output redirection menu. In that case,
the Input page will be recreated with additional forms for option
selections or further redirection.
When no more redirection is desired, the whole processing can
be launched by clicking the Run button. The analysis is then
started using an independent process launched on the server and
one or several processing lines are added on top of the Output
table. Don't forget to click the "Output" button in
the top bar to refresh the output table.
Input data fields The Input
form lets you specify the protein data that should be analysed.
It includes one multi-line text field where
you can type or paste your input according to the syntax described
below and one file field from which you
can choose and upload one of your local files as input data.
Filling at least one of these two fields is mandatory.
Text field The first field
of the input form is a text field where you can type protein
identifiers or accession numbers (one per line),
text in one of the formats (FASTA, MSF, PDB, ...)
described in the format section, or identifiers of previous
outputs. You can mix data of different formats in
the text field, but you must separate them by lines consisting
of the single word '#!PAT'. This text field is not case sensitive.
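The separator rule above can be illustrated with a minimal Python sketch. It only shows how a mixed input could be cut into blocks on the client side; the function name split_pat_blocks is hypothetical, and how the server itself splits the text is an assumption.

```python
def split_pat_blocks(text):
    """Split a mixed-format input into blocks separated by '#!PAT' lines."""
    blocks, current = [], []
    for line in text.splitlines():
        # The text field is described as not case sensitive,
        # so the separator is compared in upper case.
        if line.strip().upper() == "#!PAT":
            if current:
                blocks.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


example = "EGF_HUMAN\n1reiA\n#!PAT\n>myseq\nGCPRILMRCKQDSDCLAGCVCGPNGFCG"
for block in split_pat_blocks(example):
    print(repr(block))
```

Here the first block holds two database names and the second block holds one FASTA entry.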
Protein identifiers or accession numbers
In the text field, you can type protein or family identifiers
(e.g., EGF_HUMAN or tRNA-synt_1) or accession numbers (e.g., P01133 or
PF00133) from the following databases: SWISSPROT, TREMBL,
PDB, SCOP, PFAM or DOMO. PAT has indexes for automated retrieval
of entries from each of the above databases and appropriate parsers
for extracting the sequences and other fields from these entries.
Protein chains from the PDB can be specified by concatenating
the protein and chain identifiers (e.g., 1reiA). If you want
to restrict the processing to a protein segment, you can specify
its first and last positions by appending "/begin-end"
to the protein name (e.g., EGF_HUMAN/15-42 or 1reiA/35-65).
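As an illustration of the "/begin-end" syntax, the following Python sketch splits a query line into a name and an optional segment. The pattern and the function name parse_query are assumptions made for this example, not the parser used by PAT.

```python
import re

# Hypothetical pattern: a name without '/' or spaces,
# optionally followed by "/begin-end".
QUERY_RE = re.compile(r"^(?P<name>[^/\s]+)(?:/(?P<begin>\d+)-(?P<end>\d+))?$")


def parse_query(line):
    """Return (name, None) or (name, (begin, end)) for one query line."""
    match = QUERY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid query name: {line!r}")
    name = match.group("name")
    if match.group("begin") is None:
        return name, None
    begin, end = int(match.group("begin")), int(match.group("end"))
    if begin > end:
        raise ValueError(f"segment start is after segment end: {line!r}")
    return name, (begin, end)


for query in ("EGF_HUMAN", "EGF_HUMAN/15-42", "1reiA/35-65"):
    print(parse_query(query))
```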
Formatted data If your protein
data does not belong to the indexed databases, you can type or
paste your data (i.e. your sequence, your alignment or your PDB
file) directly in the text field using one of the formats readable
by PAT. PAT can parse input data in any format listed in the
"Output format" category of the tool table (click the
top Input button to see the list of readable formats). PAT can
also parse the output of most available tools (e.g. PSIBLAST,
HMM, PREDATOR, STRIDE, etc).
Pipelines A very effective feature
of PAT is its ability to automatically redirect outputs, i.e.
to reuse results previously computed by PAT as input data of
other tools. This can be achieved in several ways:
- using the "Output redirection" menu in the tool
form
- using the menus in the "Redirection" column of
the output table
- using the "Use any previous input or output" menu
in a tool input area (if any)
- or by directly typing previous input or output labels (i.e.
Ix or Ox, where x is the input/output number) in the text input
area of a tool.
Uploading a file Instead of
directly typing or pasting data in the text field, you can upload
a file from your local computer using the file field below the
text field. You can either type your full local file path in
the input line or select the file through interactive directory
browsing. The content of the uploaded file should conform to
the syntax of the text field explained above.
4. Output table
The output table stores the session history of all your queries.
When starting a new session, this table is empty. Each time you
send a new query, one or several lines are added on top of previous
outputs. The table reminds you of all the protein analyses you have
already done, informs you of the status of ongoing processes and
lets you access any input/output data at any moment. You can
check whether a running process is completed by clicking the 'Output'
button in the top bar. The output table has seven columns:
- The first column provides checkboxes to select/deselect lines
for deletion or synthetic reports.
- The "Input" column lists all input data you have
provided to the PAT server. Each input is described by an identifier
("I" concatenated with a number). Clicking on the input
identifier will display, from top to bottom, the full input data,
some tool information, and the corresponding output.
- The "Tool" column lists every tool processed by
PAT with optional parameter values.
- The "Output" column lists every query result. If
the query is completed and the process output available, the
output is described by an identifier ("O" concatenated
with a number). Clicking on the output identifier will display,
from top to bottom, the results, some tool information, and the
corresponding input. If the tool query is not completed, the
"Output" column indicates "Processing..."
or "Queued..." depending on the current process status.
It is then possible to check whether the query result is available
by clicking the top "Output" button again. Multiple queries can
be launched in parallel (our current web server
has four processors) before the current process has ended.
- The "Bytes" column lists the number of bytes of
each result file once the process has been completed.
- The "Time" column lists the number of CPU seconds
consumed by each process.
- The "Redirection" column provides drop-down menus
to redirect outputs to data-compatible tools, if any.
When clicked, the "Result synthesis" button below
the output table generates an HTML page collecting all selected
analysis results. This page can then be saved on the client's
disk using the Save option of the internet browser. The "Delete
rows" button lets you delete the selected analysis results
from the output table.
5. Macros
The redirection and parallel processing facilities of the Input
or Output menus let the user build complex queries simply with
mouse clicks. By doing this, the user actually creates scripts
(called macros) which are based on a dedicated macro language.
The macro language is simple yet powerful. It is based on two
operators: the concatenation symbol ',' and the pipelining symbol
'|'. The concatenation symbol ',' lets you run several tools
simultaneously and collect their results into a single global
output. Each tool can be parameterized by setting some options
with the syntax "tool -option1 value1 -option2 value2 ...".
The pipeline symbol '|' transfers the output of
the tool(s) launched before the symbol '|' to the input of the
tool(s) specified after this symbol. It should be noted that this
symbol has a meaning similar to the Unix pipe, although
in our case the transfer seamlessly involves automated data reformatting
and/or index-based protein sequence or structure retrievals.
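To make the two operators concrete, here is a minimal Python sketch that breaks a macro into pipeline stages. It assumes that ',' binds more tightly than '|' and that option values contain no spaces, commas or '|' symbols. The function name parse_macro is hypothetical and this is not the server's own parser.

```python
def parse_macro(macro):
    """Split a macro into pipeline stages of (tool, options) pairs."""
    stages = []
    for stage_text in macro.split("|"):
        tools = []
        for tool_text in stage_text.split(","):
            words = tool_text.split()
            if not words:
                raise ValueError(f"empty tool name in macro: {macro!r}")
            name, rest = words[0], words[1:]
            # Options are expected as "-option value" pairs.
            if len(rest) % 2 != 0:
                raise ValueError(f"option without a value for tool {name!r}")
            options = {}
            for flag, value in zip(rest[0::2], rest[1::2]):
                if not flag.startswith("-"):
                    raise ValueError(f"expected an option flag, got {flag!r}")
                options[flag[1:]] = value
            tools.append((name, options))
        stages.append(tools)
    return stages


for stage in parse_macro("wublast2 -d pdb_seq | sim2ali | mview"):
    print(stage)
```

Each stage is a list of tools that run on the same input. The output of one stage becomes the input of the next one.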
The Macro menu lets the user directly access all previously
defined macros without going through the drop-down menus again.
Besides, several standard macros that correspond to typical protein
analyses are also available. Any selected macro can be hand-edited
by the user before execution for fine tuning.
- Macro 1: dsc, simpa96, predator, psipred, seg, ncoils, tmpred,
signalp | consensus | color
This macro launches different local structure prediction tools,
then adds a consensus from obtained secondary structure predictions
and transmembrane segment predictions, then formats the resulting
output using the coloring tool COLOR.
- Macro 2: wublast2 -d pdb_seq | sim2ali | mview
This macro looks for sequence homologs in the PDB database with
WUBLAST2, collecting many PDB sequences, then builds a multiple
alignment from all pairwise sequence alignments found using
SIM2ALI, then creates an HTML page using the MVIEW formatting
tool.
- Macro 3: clustalw | bionj | atv
This macro aligns the input protein sequences using CLUSTALW,
then builds a phylogenetic tree from this alignment using BIONJ,
then displays the resulting tree using the applet viewer ATV.
- Macro 4: ce | profit -out pdb | jmol
This macro builds a structural alignment of two input protein
structures using CE, then fits them in 3D using PROFIT, then
displays the superimposed structures using the applet JMOL. It
should be noted that if the input data are protein sequences,
the closest homologs found in the Protein Data Bank will be used
as input instead.
- Macro 5: pdbgeo, eval23d, verify3d | color
This macro extracts 3D features using the in-house software PDBGEO,
evaluates the 1D-3D compatibility using the statistical potentials
from EVAL23D and VERIFY3D, and then tabulates the gathered information
in COLOR format. 3D data can be automatically inferred as done
in macro 4.
- Macro 6: wublast2 | cdhit | muscle | mview
This macro searches query similarities using WUBLAST2, selects
representative homologs with CDHIT, aligns them with MUSCLE,
then displays the resulting multiple alignment using MVIEW.
- Macro 7: wublast2 | cdhit | seqname | dsc | selex
This macro searches query similarities using WUBLAST2, selects
representative homologs with CDHIT, retrieves the corresponding
whole sequences, predicts their secondary structures with DSC,
and then prints all sequences and predictions using the SELEX
format.
6. Restrictions
This server is still in development. Due to the variety of
tasks and combinations performed by this server, we are aware
that bugs remain to be corrected. For this reason, bug reports
and suggestions are very welcome and can be submitted to the author
by e-mail. If you detect a bug, please indicate your session number
(check the web address of your last server call and look for the "dir"
value) and the number of the query concerned in the Output table.
We log each server query for debugging purposes only. Each session
history and associated data (input, queries, output) are stored
in a temporary server workspace and are periodically deleted to
prevent saturation. The typical lifetime of one session file on
our server is one day but it may vary depending on the server
load. It is therefore safer to copy and save your session results
on your own disks.
Some tasks may require high resource consumption. To prevent
the saturation of our server, we set different limits to the resources
consumed by each process. The processing of your query will be
automatically aborted if any one of the following conditions is
met:
- CPU time exceeds 10 minutes.
- The process requires more than 500 MB of RAM.
- The output files are larger than 10 MB.
7. Data formats
PAT accepts many data formats in the input text area of each
analysis tool. The major standard bioinformatic formats described
below can be read by PAT. PAT is also able to read as input the
output of many analysis tools, which is necessary for output redirections.
7.1 SEQNAME format:
Each line includes a single protein or family name. The corresponding
protein sequences will be automatically retrieved from the databases
indexed by PAT (currently SWISSPROT, TREMBL, PDB or PFAM). If
one protein is missing or out of date, please e-mail me.
Example:
2ETI
4CPAI
7.2 SEGMENT format:
Each line includes a protein name, followed by a '/' symbol,
followed by the first residue sequence position, followed by a
'-' symbol, followed by the last residue sequence position.
Example:
2ETI/1-15
4CPAI/10-28
7.3 FASTA format:
A sequence in FASTA format begins with a single-line description,
followed by lines of sequence data. The description line is distinguished
from the sequence data by a greater-than (">") symbol
in the first column. It is recommended that all lines of text
be shorter than 80 characters in length.
Reference: http://tigrblast.tigr.org/web-hmm/fasta.html
Example:
>PRTZ_BOVIN VITAMIN K-DEPENDENT PROTEIN Z.
AGSYLLEELFEGHLEKECWEEICVYEEAREVFEDDETTDEFWRTYMGGSPCASQPCLNNGSCQDSIRGYACTCAPGYEGP
NCAFAESECHPLRLDGCQHFCYPGPESYTCSCARGHKLGQDRRSCLPHDRCACGTLGPECCQRPQGSQQNLLPFPWQVKL
TNSEGKDFCGGVLIQDNFVLTTATCSLLYANISVKTRSHFRLHVRGVHVHTRFEADTGHNDVALLDLARPVRCPDAGRPV
CTADADFADSVLLPQPGVLGGWTLRGREMVPLRLRVTHVEPAECGRALNATVTTRTSCERGAAAGAARWVAGGAVVREHR
GAWFLTGLLGAAPPEGPGPLLLIKVPRYALWLRQVTQQPSRASPRGDRGQGRDGEPVPGDRGGRWAPTALPPGPLV
Blank lines are not allowed in the middle of FASTA input. Sequences
are expected to be represented in the standard IUB/IUPAC amino
acid codes, with these exceptions: lower-case letters are accepted
and are mapped into upper-case; a single hyphen or dash can be
used to represent a gap of indeterminate length. Before submitting
a request, any numerical digits in the query sequence should either
be removed or replaced by appropriate letter codes (e.g., X for
unknown amino acid residue). The accepted amino acid codes are:
A  alanine                    P  proline
B  aspartate or asparagine    Q  glutamine
C  cysteine                   R  arginine
D  aspartate                  S  serine
E  glutamate                  T  threonine
F  phenylalanine
G  glycine                    V  valine
H  histidine                  W  tryptophan
I  isoleucine                 Y  tyrosine
K  lysine
L  leucine                    X  any
M  methionine
N  asparagine                 -  gap of indeterminate length
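The FASTA rules above (description line starting with ">", no blank lines, lower case mapped to upper case, only the listed codes) can be checked before submission with a short Python sketch. The function name parse_fasta is hypothetical. The accepted alphabet is copied from the table above, and the server may apply different checks.

```python
# Codes listed in the table above, plus '-' for a gap.
ACCEPTED_CODES = set("ABCDEFGHIKLMNPQRSTVWXY-")


def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.strip().splitlines():
        if not line.strip():
            raise ValueError("blank lines are not allowed inside FASTA input")
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
            continue
        if header is None:
            raise ValueError("sequence data found before the first '>' line")
        # Lower-case letters are accepted and mapped to upper case.
        residues = line.strip().upper()
        bad = set(residues) - ACCEPTED_CODES
        if bad:
            raise ValueError(f"unsupported characters: {sorted(bad)}")
        chunks.append(residues)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


print(parse_fasta(">2ETI trypsin inhibitor\ngcprilmrckqdsd\nCLAGCVCGPNGFCG"))
```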
7.4 PDB format:
A PDB entry consists of many field types. The two required
field types are the HEADER field, which should include the protein
identifier, and all ATOM fields which describe the spatial coordinates
of each protein atom.
Reference: http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html
Example (fragment):
HEADER    PROTEIN INHIBITOR                       15-JUL-91   2ETI      2ETI   2
ATOM      1  N   GLY     1      10.340   0.136  -1.967  1.00  0.00      2ETI  88
ATOM      2  CA  GLY     1       9.711  -1.190  -1.866  1.00  0.00      2ETI  89
ATOM      3  C   GLY     1       8.303  -1.083  -1.299  1.00  0.00      2ETI  90
ATOM      4  O   GLY     1       7.612  -0.086  -1.493  1.00  0.00      2ETI  91
ATOM      5 1H   GLY     1      10.288   0.521  -1.034  1.00  0.00      2ETI  92
TER
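PDB records are laid out in fixed columns, so ATOM lines are usually read by column position rather than by splitting on spaces. The Python sketch below assumes the standard ATOM column positions of the PDB format guide. The sample line is built inside the code so that its spacing is exact, and both function names are hypothetical.

```python
def format_atom_record(serial, name, res_name, chain, res_seq, x, y, z):
    """Build the first 54 columns of an ATOM line (hypothetical helper)."""
    return "ATOM  {:>5} {:<4} {:>3} {:1}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, res_name, chain, res_seq, x, y, z
    )


def parse_atom_record(line):
    """Read the main fields of an ATOM line by column position."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain": line[21].strip(),       # column 22, may be blank
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
    }


# One-letter element names such as N start in column 14, hence " N".
sample = format_atom_record(1, " N", "GLY", "", 1, 10.340, 0.136, -1.967)
print(repr(sample))
print(parse_atom_record(sample))
```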
7.5 SELEX format:
Each line includes a protein name followed by its one-letter
coded amino acid sequence. Protein sequences can be aligned or
not.
Examples:
2ETI   GCPRILMRCKQDSDCLAGCVCGPNGFCG
4CPAI  ZHADPICNKPCKTHDDCSGAWFCQACWNSARTCGPYV
or
ITR2_MOMCH/1-28  ....RICPRIWMECKRDSDCMAQ..C.ICVD..GHCG...
4CPAI/1-37       ZHADPICN...KPCKTHDDCSGAWFCQACWNSARTCGPYV
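A SELEX entry can be read with a few lines of Python, as sketched below. The sketch assumes that the name and the sequence are separated by whitespace and that '.' (as in the example) or '-' marks a gap. The function names are hypothetical.

```python
GAP_CHARS = ".-"  # '.' appears in the example above; '-' is an assumption


def parse_selex(text):
    """Return a list of (name, sequence) pairs, one per non-empty line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, sequence = line.split(None, 1)
        entries.append((name, sequence.strip()))
    return entries


def ungap(sequence):
    """Remove gap characters from an aligned sequence."""
    return "".join(c for c in sequence if c not in GAP_CHARS)


alignment = """\
ITR2_MOMCH/1-28 ....RICPRIWMECKRDSDCMAQ..C.ICVD..GHCG...
4CPAI/1-37 ZHADPICN...KPCKTHDDCSGAWFCQACWNSARTCGPYV
"""
entries = parse_selex(alignment)
# Sequences are aligned when they all have the same length with gaps included.
aligned = len({len(seq) for _, seq in entries}) == 1
for name, seq in entries:
    print(name, len(ungap(seq)))
print("aligned:", aligned)
```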
7.6 PIR format:
Each entry has a header line beginning with '>' followed
by a single description line followed by an arbitrary number of
lines describing the whole protein sequence. The last entry line
is terminated with a '*' symbol. The header line begins with P1
(for protein) followed by the ';' symbol, followed by the protein
name. The protein description line is composed of fields separated
by ':' symbols. The fields are, in order, the entry type, the structure file
path, the first amino acid number, its chain label, the last amino
acid number, its chain label, and a short protein description.
Example:
>P1;2ETI
structure:/tmp/2ETI.pdb:1: :28: :trypsin inhibitor II - Ecballium elaterium
GCPRILMRCKQDSDCLAGCVCGPNGFCG*
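The following Python sketch reads one PIR entry of this kind. The field order is taken from the example line above, which carries a chain label after each of the two residue numbers. The function name parse_pir and the dictionary keys are hypothetical.

```python
def parse_pir(text):
    """Parse a single PIR entry into a dictionary."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) < 3 or not lines[0].startswith(">P1;"):
        raise ValueError("expected '>P1;name', a description line and a sequence")
    fields = [field.strip() for field in lines[1].split(":")]
    if len(fields) < 7:
        raise ValueError("description line has too few ':'-separated fields")
    sequence = "".join(lines[2:])
    if not sequence.endswith("*"):
        raise ValueError("the sequence must be terminated with '*'")
    return {
        "name": lines[0][len(">P1;"):],
        "type": fields[0],
        "path": fields[1],
        "first": int(fields[2]),
        "first_chain": fields[3],
        "last": int(fields[4]),
        "last_chain": fields[5],
        # Keep any further ':' inside the free-text description.
        "description": ":".join(fields[6:]),
        "sequence": sequence[:-1],
    }


entry = parse_pir(
    ">P1;2ETI\n"
    "structure:/tmp/2ETI.pdb:1: :28: :trypsin inhibitor II - Ecballium elaterium\n"
    "GCPRILMRCKQDSDCLAGCVCGPNGFCG*\n"
)
print(entry["name"], entry["first"], entry["last"], entry["sequence"])
```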
7.7 MSF format:
The entry consists of a header section describing each protein
name and length followed by a block-based multiple sequence alignment.
Description URL: http://www.embl-heidelberg.de/predictprotein/Dexa/optin_msfDes.html
Example:
PileUp
MSF: 115 Type: P Check: 1324 ..
 Name: 1REIA oo Len: 115 Check: 7925 Weight: 50.0
 Name: 2RHE  oo Len: 115 Check: 3399 Weight: 50.0
//

1REIA  DIQMTQSPSS LSASVGDRVT ITCQASQDII ..KYLNWYQQ TPGKAPKLLI
2RHE   ESVLTQPPS. ASGTPGQRVT ISCTGSATDI GSNSVIWYQQ VPGKAPKLLI

1REIA  YEASNLQAGV PSRFSGSGSG TDYTFTISSL QPEDIATYYC QQY.QSLPYT
2RHE   YYNDLLPSGV SDRFSASKSG TSASLAISGL ESEDEADYYC AAWNDSLDEP

1REIA  .FGQGTKLQI T....
2RHE   GFGGGTKLTV LGQPK
7.8 TREE format:
The entry describes a classification tree. Each subtree is
enclosed in a pair of parentheses. The tree branches or edges are
separated by the symbol ','. Each edge length is given after the
symbol ':'. Each leaf is labeled
with its associated protein segment name. The line ends with the
';' symbol.
Example:
(1EGF/1-54:459.5,(2ETI/1-54:56.5,2CTI/1-54:56.5):93.5,4CPAI/1-54:217.5);
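The nested structure of the TREE format can be read with a small recursive parser. The Python sketch below covers only the elements described above (parentheses, ',', ':' lengths, leaf names and the final ';'). The function names are hypothetical and quoted labels or comments are not handled.

```python
def parse_tree(text):
    """Parse a TREE line into nested dictionaries (name, length, children)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("the tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if pos != len(text) - 1:
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """List the leaf labels from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_tree(
    "(1EGF/1-54:459.5,(2ETI/1-54:56.5,2CTI/1-54:56.5):93.5,4CPAI/1-54:217.5);"
)
print(leaf_names(tree))
```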
7.9 XML format:
Each top-level <set> entry is a container for one or more
<seg> entries, each one describing a protein segment. Each
<seg> entry is associated with a <seq> entry including
the protein identifier, accession number, source database, short
description, and number of amino acids with respective tags <id>,
<acc>, <db>, <des> and <len>. The <beg>
and <end> tags delineate the first and last amino acid sequence
positions of the considered protein segment. The protein sequence
is listed in the <str> field.
Example:
<set>
  <seg>
    <seq>
      <id>2ETI</id>
      <acc>2ETI</acc>
      <db>nrl3d</db>
      <des>trypsin inhibitor II - Ecballium elaterium</des>
      <len>28</len>
    </seq>
    <beg>1</beg>
    <end>28</end>
    <str>GCPRILMRCKQDSDCLAGCVCGPNGFCG</str>
  </seg>
</set>
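Output in this XML format can be read with a standard XML library. The Python sketch below uses xml.etree.ElementTree and the tag names described above. The function name parse_pat_xml and the returned dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_pat_xml(text):
    """Return one dictionary per <seg> entry found in a <set> document."""
    root = ET.fromstring(text)
    segments = []
    for seg in root.iter("seg"):
        seq = seg.find("seq")
        segments.append({
            "id": seq.findtext("id"),
            "acc": seq.findtext("acc"),
            "db": seq.findtext("db"),
            "des": seq.findtext("des"),
            "len": int(seq.findtext("len")),
            "beg": int(seg.findtext("beg")),
            "end": int(seg.findtext("end")),
            "str": seg.findtext("str"),
        })
    return segments


document = """
<set>
<seg>
<seq>
<id>2ETI</id>
<acc>2ETI</acc>
<db>nrl3d</db>
<des>trypsin inhibitor II - Ecballium elaterium</des>
<len>28</len>
</seq>
<beg>1</beg>
<end>28</end>
<str>GCPRILMRCKQDSDCLAGCVCGPNGFCG</str>
</seg>
</set>
"""
for segment in parse_pat_xml(document):
    print(segment["id"], segment["beg"], segment["end"], segment["str"])
```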
8. Session examples
The links below will direct you to "Output" tables
corresponding to short session examples. Since these are examples
that should not be modified, additional queries or redirections
cannot be executed from there.
Click on the input (Ix) or output (Ox) identifiers in the output
tables to display the corresponding input or output, respectively.
This session computes and displays several structure
predictions on one protein.
This session aligns several proteins, applies
two prediction methods on each protein, makes sequence and prediction
consensus, then displays all results as a colored multiple alignment.
This session aligns two protein sequences with
CLUSTALW, evaluates the compatibility between the aligned sequences
and the structure 1reiA with EVAL23D, EVDTREE and VERIFY3D, and
displays an aligned and colored output of all structural evaluations.
This session applies the predefined macro 4 from
the "Macro" menu to two protein structures.
The two structures are first structurally aligned with CE. Then
the structures are fitted by PROFIT using the CE alignment, and
the superimposed structures are displayed with the interactive
applet Jmol.
This session performs a similarity search for
a Rossmann motif with WUBLAST2.
As many similar sequences were found, a representative sequence
subset was selected with CDHIT. Sequences were then aligned using
CLUSTALW, and PREDATOR was used to predict secondary structure.
Sequences, predictions and consensus are displayed using the
COLOR format (Output O6).
The CLUSTALW result (Output O3) was then sent to BIONJ and
ATV using the popup menu in the corresponding line of the Redirection
column. This created output O7 (the BIONJ result) and O8. Clicking
the O8 link will display the calculated phylogenetic tree with
the ATV applet.
To get sequence similarities, the CLUSTALW result was also sent
to MATRIX (Output O9).
This session predicts non-globular regions of
a protein (signal peptide, transmembrane segments, coiled-coils,
low complexity regions), then displays all results aligned under
the sequence. Transmembrane segments detected by a consensus
of all transmembrane predictors are indicated in the Consensus.mm
segment and regions detected as potentially non globular by any
tool are marked by 'X' in the gb.GLOBULAR prediction segment.