Initial Bioinformatic Investigation
Using Bioinformatic Tools to Strategically Design Expression/Purification Projects

The various bioinformatic tools are sorted by: Rationale Project Design, Troubleshooting, Additional Tools

Your comments are most welcome.

Entries since November 2003

Bioinformatics Tools Sorted according to Rationale Project Design

**Preliminary Search** of Databases **Before** Starting a Project
Using Keywords: Most recommended: NCBI, EBI, Genecards, Brenda. Additional options: Nucleic Acids Research List.
Using Sequence: Blast; Sequence alignments (pairwise & multiple alignment).

**Sequence** Analysis Using Software & Databases
DNA/RNA Sequences
A. Secondary structure of mRNA
B. Problems due to differences in genetic code (1. Rare codons, 2. Different genetic codes)
C. Alternative splicing
D. Ribosome binding site

Protein Sequences
A. Motifs and Repeats in Proteins
B. Physicochemical properties
C. Experimentally determined protein structures
D. Structure and function predictions
E. Subcellular localization & signal peptides
F. Co- and Post-translational Modifications (including S-S bonds)
G. Protein degradation (1. Protein turnover, 2. Proteolytic cleavage)
H. 1. Protein-protein interactions, 2. Biological Pathways
I. Additional Tools


Bioinformatics Tools Sorted by Expression Problems

Problem	Possible Causes	Bioinformatics Tools
Very low amounts of expressed proteins	Secondary structure of mRNA; Rare codons; Low t1/2; Secretion signal	Secondary Structure of mRNA; Codon Usage and ORFs; Protein Turnover; Sub-cellular Localization and Signal Peptides
Truncated forms	Rare codons; Genetic code differences [trp-stop]; Additional RBS [consider GUG too] {alternative reading frame}; Proteases during induction or lysis; Cloning out of frame	Codon Usage and ORFs; Problems due to differences in genetic code; Define motif and check; Proteolytic cleavage
Insoluble protein	Post-translational modifications; S-S bonds; Partners in complex; Localization signals; Transmembrane domains; In-frame mutation/s due to rare codons or non-standard genetic code	Co- and Post-translational Modifications; S-S bonds; Protein-protein interactions, General search; Sub-cellular localization & signal peptides; Structure and function predictions; Problems due to differences in genetic code


Additional Tools & Databases

ExPASy Bioinformatics Resource Portal
SRS Available Analysis Tools
MyBio - the biologist's wiki workbench
GenomeWeb - (check the Proteins and Nucleic Acids sections)
ONLINE ANALYSIS TOOLS

NAR (Nucleic Acids Research) Database Categories List
Toolbox compiled by the Bioinformatics & Biological Computing Unit, Weizmann Institute of Science.


Preliminary Search of Databases Before Starting a Project

Preliminary Search Using Keywords

NCBI Resources
EBI bioinformatics services
Genecards - A database of human genes, their products and their involvement in diseases.
Brenda - The Comprehensive Enzyme Information System.
UCSC Genome Bioinformatics

Sequence-based Preliminary Search & Sequence Alignments

Blast (Blast program selection guide)
Fasta (EBI, U Virginia)
Multiple sequence alignment tools in EBI, T Coffee, M Coffee and others (CNRS, SIB) [T-Coffee: A novel method for fast and accurate multiple sequence alignment. Notredame C, Higgins DG, Heringa J. J Mol Biol. 2000 Sep 8;302(1):205-17.]
Pairwise alignments - Emboss Pairwise Alignment Algorithms, Lalign (EmbNet), Align two sequences using BLAST (nucleotides, proteins).

DNA & RNA Sequence Analysis

Secondary Structure of mRNA

Codon Usage and Translation Frames

Rare Codon Caltor - For expression in E. coli
GenScript Rare Codon Analysis Tool
Search of rare codons in nucleotide sequence - For expression in 6 organisms.
Optimizer - optimizes the codon usage of a DNA sequence to increase its expression level.
Translation tools allowing change of genetic code: Transeq, DNA to protein translation, another DNA to protein translation, Virtual Ribosome - version 1.1. Examine the different genetic codes (NCBI).
Translate a DNA Sequence - paste a nucleic acid sequence and obtain graphical and textual depictions of its possible translations (various ORFs) in all 6 reading frames.
A Graphical Codon Usage Analyser.
Codon usage database.
GenScript Rare Codon Analysis Tool (codon optimization)
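
As a rough illustration of what the rare-codon tools above report, the sketch below scans an in-frame coding sequence for codons from a small set. The set `RARE_CODONS_ECOLI` and the function names are assumptions made for this example; they are not taken from any of the listed tools, which use their own codon tables and thresholds.

```python
# Illustrative set of codons often reported as rare in E. coli.
# ASSUMPTION: this list is for demonstration only; each tool above
# uses its own codon usage table and cut-offs.
RARE_CODONS_ECOLI = {"AGG", "AGA", "CGA", "CTA", "ATA", "CCC"}


def find_rare_codons(cds, rare=RARE_CODONS_ECOLI):
    """Return (codon_number, codon) pairs for rare codons in an in-frame CDS.

    Codon numbers are 1-based. RNA input (U) is converted to DNA (T).
    Trailing bases that do not form a full codon are ignored.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    hits = []
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        if codon in rare:
            hits.append((i // 3 + 1, codon))
    return hits


def has_consecutive_rare(hits):
    """True if at least two rare codons sit directly next to each other."""
    positions = [pos for pos, _ in hits]
    return any(b - a == 1 for a, b in zip(positions, positions[1:]))


if __name__ == "__main__":
    hits = find_rare_codons("ATGAGGAGACTGTAA")
    print(hits, has_consecutive_rare(hits))
```

Runs of adjacent rare codons are flagged separately because clusters are generally considered more problematic for expression than isolated rare codons.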

Alternative Splicing

Known information - Gene database
Splice Prediction Tools (PhenoSystems)
Splicing prediction tools (GeneInfinity)
Intron and exon databases (GeneInfinity)
WASP - Nature 2010 Website for Alternative Splicing Prediction
Alternative Splice Site Predictor (ASSP)
SpliceNest
Human splicing finder
GeneSplicer
Astalavista - Alternative Splicing transcriptional landscape visualization tool
altanalyze - alternative splicing and functional prediction analysis tool

Ribosome Binding Site (Background)

RBS Calculator (Salis lab)

Protein Sequence Analysis

Motifs and Repeats in Proteins

Interpro - A database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
CDD [Conserved Domain Database] - a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. (NCBI) [also appears as part of Blast output]. Run a BLAST search against the CDD.
ELM - Eukaryotic Linear Motif Resource for Functional Sites in Proteins.
Prosite - A database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
ProTeUs - (PROtein TErminUS) - a tool for the identification of short linear signatures in protein termini. (About)
QuasiMotiFinder - A server for the identification of motifs and signature-like patterns in protein sequences (based on Prosite and multiple alignments)
Motif Search - for DNA or protein sequences.
Radar - Rapid Automatic Detection and Alignment of Repeats in protein sequences.
SAPS - Statistical Analysis of Protein Sequences.
Additional sites (Expasy list - "Pattern & Profiles Searches").
Protein patterns and motifs - various tools (GeneInfinity)
Protein domains and families - various tools (GeneInfinity)
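
Prosite-style patterns such as `N-{P}-[ST]-{P}` can be tried locally by converting them to regular expressions. The sketch below handles only a common subset of the pattern syntax (`x`, `[...]`, `{...}`, repeat counts, and `<`/`>` anchors outside brackets); the function names are hypothetical and this is not the scanning code used by Prosite itself.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple Prosite pattern into a Python regular expression.

    ASSUMPTION: only the common subset of the syntax is supported.
    """
    p = pattern.strip().rstrip(".")              # patterns end with a period
    p = p.replace("-", "")                       # '-' only separates positions
    p = p.replace("x", ".")                      # x = any residue
    p = p.replace("{", "[^").replace("}", "]")   # {P} = anything but P
    p = p.replace("(", "{").replace(")", "}")    # x(2,4) = 2 to 4 repeats
    p = p.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return p


def find_motif(pattern, sequence):
    """Return 1-based start positions of all (overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() + 1 for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_motif("N-{P}-[ST]-{P}.", "AANGSA"))
```

A lookahead is used so that overlapping occurrences are all reported, which a plain `finditer` would miss.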

Physicochemical properties

Protparam - Physico-chemical parameters of a protein sequence (amino-acid and atomic compositions, pI, extinction coefficient, etc.)
Protein Calculator - Generates molecular weight information (including scanning mass spectrometry results), estimated charges (including pI estimation), uv absorption coefficients, crystallographic solvent content percentage and Vm, and counts atoms and residues based on the protein sequence.
EMBOSS sequence statistics - Pepinfo / Pepwindow / Pepstats.
ProtScale (several hydrophobicity scales)
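
One of the parameters reported by tools such as Protparam, the molar extinction coefficient at 280 nm, can be estimated directly from residue counts. The sketch below uses the widely quoted per-residue values for Trp, Tyr and cystine in water; the function name and the `cystines_formed` switch are assumptions for this example, and the result is only an estimate for a protein without other chromophores.

```python
# Per-residue molar extinction coefficients at 280 nm (M^-1 cm^-1), in water.
EXT_TRP = 5500
EXT_TYR = 1490
EXT_CYSTINE = 125  # per disulfide bond (pair of cysteines)


def extinction_coefficient_280(sequence, cystines_formed=True):
    """Estimate the molar extinction coefficient at 280 nm.

    cystines_formed=True assumes all cysteines pair into disulfides
    (an odd cysteine is left unpaired); False assumes all are reduced.
    """
    seq = sequence.upper()
    n_trp = seq.count("W")
    n_tyr = seq.count("Y")
    n_cystine = seq.count("C") // 2 if cystines_formed else 0
    return EXT_TRP * n_trp + EXT_TYR * n_tyr + EXT_CYSTINE * n_cystine


if __name__ == "__main__":
    print(extinction_coefficient_280("MWYCCK"))
    print(extinction_coefficient_280("MWYCCK", cystines_formed=False))
```

Sequences lacking Trp and Tyr give a value close to zero, in which case absorbance at 280 nm is a poor way to quantify the protein.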

Antigenicity
Emboss - Antigenic - finds antigenic sites in proteins.
Predicting antigenic peptides
Tools for computational immunology
Immunology related web tools (scroll down the page)

Protein Solubility
Aggrescan - a server for the prediction and evaluation of "hot spots" of aggregation in polypeptides.
Recombinant Protein Solubility Prediction - Predicts protein solubility assuming the protein is being overexpressed in Escherichia coli. (Based on: Wilkinson DL, Harrison RG., Predicting the solubility of recombinant proteins in Escherichia coli., Biotechnology (N Y). 1991 May;9(5):443-8.)
SOLPro predicts the propensity of a protein to be soluble upon overexpression in E. coli, part of Scratch protein predictor
PROSO II and PROSO predict protein solubility upon heterologous expression
ESSPRESSO (estimation of protein expression and solubility)
SPpred (soluble protein prediction)
FoldIndex - predicts whether a given protein sequence is intrinsically unfolded, based on the average residue hydrophobicity and net charge of the sequence. Look at additional prediction tools for disordered proteins.
Refold - identifying optimal conditions and methodology for refolding.
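
The idea behind FoldIndex, combining average residue hydrophobicity with net charge, can be sketched as a single whole-sequence score. The coefficients below are the commonly quoted ones for the charge-hydropathy boundary and should be checked against the FoldIndex documentation before being relied on; the hydrophobicity values are the Kyte-Doolittle scale rescaled to 0..1. The real server also evaluates sliding windows, which this sketch does not.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def fold_index(sequence):
    """Whole-sequence charge-hydropathy score.

    ASSUMPTION: score = 2.785 * <H> - |<R>| - 1.151, where <H> is the mean
    hydropathy rescaled to 0..1 and <R> the mean net charge (K, R = +1;
    D, E = -1). Positive suggests folded, negative suggests unfolded.
    Non-standard residue codes are ignored.
    """
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    n = len(residues)
    mean_h = sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in residues) / n
    positive = sum(aa in "KR" for aa in residues)
    negative = sum(aa in "DE" for aa in residues)
    mean_r = abs(positive - negative) / n
    return 2.785 * mean_h - mean_r - 1.151


if __name__ == "__main__":
    print(fold_index("EEEEEEEEEE"))   # highly charged, hydrophilic
    print(fold_index("LIVLIVLIVL"))   # hydrophobic, uncharged
```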

Protein Turnover

PESTfind - Polypeptide sequences enriched in proline (P), glutamic acid (E), serine (S) and threonine (T) target proteins for rapid destruction. PESTfind produces a score ranging from about -50 to +50. By definition, a score above zero denotes a possible PEST region, but a value greater than +5 sparks real interest.
Destruction Box (D box) Finder - the D box characterizes some proteins destined for proteolysis via the ubiquitin and 26S proteasome pathway.
SProtP Human - Short-lived protein prediction in human
Protparam - references to N-end rule (scroll down)
I-Mutant2.0 - a tool for predicting protein stability upon single site mutation.
MUpro - Prediction of protein stability changes for single site mutations from sequences.
Additional tools for assessing protein stability following mutagenesis.
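
The PESTfind score thresholds described above (above zero: possible PEST region; above +5: of real interest) can be applied to a list of scores when screening several candidate regions. This is a minimal sketch; the function name and the category labels are made up for this example and are not PESTfind output.

```python
def interpret_pest_score(score):
    """Classify a PESTfind score using the thresholds quoted above."""
    if score > 5:
        return "strong candidate"
    if score > 0:
        return "possible PEST region"
    return "unlikely PEST region"


if __name__ == "__main__":
    # Hypothetical scores for three candidate regions.
    for s in (-12.3, 2.4, 8.1):
        print(s, interpret_pest_score(s))
```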

Proteolytic Cleavage

Peptide Cutter - predicts potential protease and cleavage sites and sites cleaved by chemicals in a given protein sequence.
Protease database of: E coli,
MEROPS the peptidase database [Various search options.]
PROSPER - protease specificity prediction server
Classification of peptidase families and index of peptidase entries in Swiss-Prot [Alan Barrett and Neil Rawlings]
InBase - The Intein Database (NEB). Inteins are self-catalytic protein splicing elements. Blast your sequence against InBase. (Read about InBase)
ProP - predicts arginine and lysine propeptide cleavage sites in eukaryotic protein sequences.
GPS-CCD - prediction of calpain (family of the Ca2+-dependent cysteine proteases) cleavage sites.
PMAP - the proteolysis map
Additional tools

Co- and Post-translational Modifications (A short summary)

Phosphorylation

Netphos - Predicts serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins.
KinasePhos - computes the location of the phosphorylation sites and the corresponding catalytic protein kinases.
NetPhosK - produces predictions of kinase specific eukaryotic protein phosphorylation sites. Currently NetPhosK covers the following kinases: PKA, PKC, PKG, CKII, Cdc2, CaM-II, ATM, DNA PK, Cdk5, p38 MAPK, GSK3, CKI, PKB, RSK, INSR, EGFR and Src.
NetPhosYeast - predicts serine and threonine phosphorylation sites in yeast proteins.
Scansite3.0 - searches for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains.
Phospho.ELM - a database of S/T/Y phosphorylation sites
PhosphoSitePlus - covers phosphorylation, acetylation, ubiquitination and additional PTMs
GPS: Group-based Phosphorylation Scoring Method - in-silico prediction of phosphorylation sites of specific kinases.
PPSP - Prediction of Protein Kinases-specific Phosphorylation sites.
PhosphoSVM - a non-kinase-specific protein phosphorylation site prediction method that integrates eight different sequence level scores: Shannon entropy (SE), relative entropy (RE), predicted protein secondary structure (SS), predicted protein disorder (PD), accessible surface area (ASA), overlapping properties (OP), averaged cumulative hydrophobicity (ACH), and k-nearest neighbor (KNN).
DisPhos - predicts S, T and Y phosphorylation. The observation that amino acid composition, sequence complexity, hydrophobicity, charge and other sequence attributes of regions adjacent to phosphorylation sites are very similar to those of intrinsically disordered protein regions suggests that disorder in and around the potential phosphorylation target site is an important prerequisite for phosphorylation. Thus, DISPHOS uses disorder information to improve the discrimination between phosphorylation and non-phosphorylation sites.
Phosida - posttranslational modifications database - covers phosphorylation, acetylation & N-glycosylation

Glycosylation (Uniprot background)

GlycoEP - prediction of glycosites in eukaryotic glycoproteins.
YinOYang - Predicts O-β-GlcNAc attachment sites in eukaryotic protein sequences.
NetOGlyc - Predicts mucin type GalNAc O-glycosylation sites in mammalian proteins.
NetNGlyc - Predicts N-Glycosylation sites in human proteins.
DictyOGlyc - Predicts GlcNAc O-glycosylation sites in Dictyostelium discoideum proteins.
GPP - glycosylation predictor (N- & O- linked)
EnsembleGly - O-, N- & C-linked glycosylation sites
O-glycosylation prediction
ISOGlyP - Isoform Specific O-Glycosylation Prediction
NetCGlyc - predictions of C-mannosylation sites in mammalian proteins.
NetGlycate - predicts glycation of epsilon amino groups of lysines in mammalian proteins.
Phosida - posttranslational modifications database - covers phosphorylation, acetylation & N-glycosylation

Addition of Lipid Moieties

NMT - Predicts N-terminal N-Myristoylation by MyristoylCoA:Protein N-Myristoyltransferase.
Myristoylator - predicts N-terminal myristoylation of proteins by neural networks.
PlantsP - predicts plant specific myristoylation.
PrePS - Prenylation Prediction Suite (additional details).
CSS-Palm - palmitoylation sites prediction
NBA-Palm - prediction of palmitoylation sites.
GPI lipid anchor predictor - animals, plants.
GPI-SOM: Identification of GPI-anchor signals.
PredGPI
Big-PI predictor - GPI modification site prediction.

S-S bonds or metal binding sites

CYSPRED - Predicts cysteines that are likely to be partners in cysteine bridges. (Program described in: Fariselli P, Riccobelli P, Casadio R PROTEINS(1999) 36:340-346)
DCON - Predictor of Disulfide Connectivity in Proteins.
Disulfind - Cysteines Bonding State and Connectivity Predictor.
Dianna - Cysteine state and Disulfide Bond partner prediction.
EDBCP - Ensemble-based disulfide bonding connectivity pattern
GDAP - disulfide bond prediction by sequence to structure mapping
CysRedox - predicting the redox state of cysteines in proteins from multiple sequence alignments.
CysState - cysteines disulfide bonding state prediction from protein sequence
DiPro - protein disulfide bond prediction
MetalDetector - cysteine and histidine metal binding sites predictor
Dinosolve - disulfide bonding prediction server

Ubiquitination & Sumoylation

BDM-PUB
UbPred
CKSAAP UbSite prediction
SumoSP
PCI-Sumo
SumoPlot analysis program
GPS-Sumo
JASSA - Joined Advanced Sumoylation Site and Sim Analyser

Others

TermiNator - predicts N-terminal methionine excision, N-terminal acetylation, N-terminal myristoylation and S-palmitoylation of either prokaryotic or eukaryotic proteins originating from organellar or nuclear genomes.
The Sulfinator - Predicts tyrosine sulfation sites in protein sequences
SulfoSite - predicts sulfation sites.
GPS-YNO2 - prediction of tyrosine nitration sites.
NetAcet - predicts substrates of N-acetyltransferase A (NatA). The method was trained on yeast data but it obtains similar performance values on mammalian substrates acetylated by NatA orthologs.
GPS-SNO - prediction of S-nitrosylation sites.
MeMo - predicts arginine and lysine methylation sites in proteins
PhosphoSitePlus - covers phosphorylation, acetylation, ubiquitination and additional PTMs
PTMs peptide scanner - phosphorylation, sumoylation, palmitoylation, methylation and acetylation.
Additional sites of interest (Expasy list, Gene Infinity list)

Sub-cellular Localization and Signal Peptides

Subcellular Compartments

psort - Several programs for subcellular localization prediction (eukaryotic sequences, plant and Gram-positive bacterial sequences, Gram-negative bacterial sequences)
SoftBerry Protein Location Finding - offers different programs for animal/fungi, plant and bacterial proteins.
ESLPred - prediction of subcellular localization of eukaryotic proteins.
PSLPred - for prokaryotic proteins
Cell PLoc 2.0 - a package of web servers for predicting subcellular localization of proteins in different organisms (prokaryotes & eukaryotes including plants)
LocDB - protein localization database for human & arabidopsis
MultiLoc2 - for eukaryotes (University of Tubingen)
eSLDB - covers human, mouse, C. elegans, S. cerevisiae and A. thaliana.
Locate - subcellular localization database for mouse and human
OrganelleDB - for some eukaryotes
BaCello - for eukaryotes
TargetP Server - Predicts the subcellular location of eukaryotic protein sequences. The subcellular location assignment is based on the predicted presence of any of the N-terminal presequences chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP) or secretory pathway signal peptide (SP).
SubLoc 1.0 - for prokaryotes and eukaryotes (contains fewer subcellular targets than other programs, and so may bias results)
SecretomeP - Prediction of non-classical and leaderless protein secretion. Produces ab initio predictions of non-classical i.e. not signal peptide triggered protein secretion. The method queries a large number of other feature prediction servers to obtain information on various post-translational and localizational aspects of the protein, which are integrated into the final secretion prediction. (Paper)
Golgi transmembrane predictor - predicts Golgi membrane proteins based on their transmembrane domains. This prediction method is only valid for Type II transmembrane proteins, and output from the method is simply predicted to be Golgi localised or predicted to transit through the Golgi (post-Golgi localisation).
PTS1 predictor - predicts the peroxisomal targeting signal 1.
Additional tools (compiled by GeneInfinity)

Targeting Peptides

psort - various tools for prediction of protein sorting signals in different groups of organisms
SignalP - Predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms (Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes).
Sigcleave - Reports protein signal cleavage sites.
Signal-3L - animals, plants and bacteria
SPEPLip - Predictor of Signal Peptide and Lipoprotein Cleavage Sites in Proteins
LipoP - prediction of lipoproteins and signal peptides in Gram negative bacteria
PredLipo - prediction of lipoproteins and signal peptides in Gram positive bacteria
cNLS mapper - Prediction of importin α-dependent nuclear localization signals
NucPred
NetNES - predicts leucine-rich nuclear export signals (NES) in eukaryotic proteins.
ChloroP - Prediction of chloroplast transit peptides
MitoProt - Prediction of mitochondrial targeting sequences.
Predotar - Prediction of mitochondrial and plastid targeting sequences.
PTS1 predictor - for Peroxisomal Targeting Signal 1
SecretomeP - predicts non-classical i.e. not signal peptide triggered protein secretion in eukaryotes. The method queries a large number of other feature prediction servers to obtain information on various post-translational and localizational aspects of the protein, which are integrated into the final secretion prediction.
TatP - predicts the presence and location of Twin-arginine signal peptide cleavage sites in bacteria.
ProP - predicts arginine and lysine propeptide cleavage sites in eukaryotic protein sequences.
Additional tools (compiled by Gene Infinity)

Protein-Protein Interactions

STRING - a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources: 1. Genomic Context 2. High-throughput Experiments 3. (Conserved) Coexpression 4. Previous Knowledge.
IntAct - all interactions are derived from literature curation or direct user submissions
HPRD (Human Protein Reference Database) - examine the sections: "interactions" & "PTMs and Substrates".
DIP - Database of interacting proteins.
MINT - Molecular Interactions Database.
Bind - Biomolecular Interaction Network Database.
iPfam - describes domain-domain interactions that are observed in PDB entries.
InterDom - a database of putative interacting protein domains derived from multiple sources, ranging from domain fusions (Rosetta Stone), protein interactions (DIP and BIND), protein complexes (PDB), to scientific literature (MEDLINE).
Additional tools (compiled by Gene Infinity)

Biological Pathways

KEGG Pathway Database
Reactome
PathGuide - the pathway resource list
BioCyc

Experimentally Determined Protein Structures

Search for known structures using:

PDB - RCSB protein data bank.
MMDB - Entrez structures (molecular modeling database)
PDBe
OCA - provides rich content annotation on structure and function, generating dynamic links to several external sources.
PDBSum - is a pictorial database providing an at-a-glance overview of every macromolecular structure deposited in the PDB. It provides schematic diagrams of the molecules in each structure and of the interactions between them.
iPfam - describes domain-domain interactions that are observed in PDB entries.

Structure Classification

SCOP - structural classification of proteins
CATH - a hierarchical domain classification of protein structures in the Protein Data Bank.

Structure & Function Predictions

Secondary Structure Prediction

Domain Prediction

Disordered Proteins

Topology Prediction
α helices, β sheets

Helical Wheels

Function Prediction

Function Prediction

ProtFun - predicts protein function from sequence. The method queries a large number of other feature prediction servers to obtain information on various post-translational and localizational aspects of the protein, which are integrated into final predictions of the cellular role, enzyme class (if any), and selected Gene Ontology categories of the submitted sequence; Paper 1; Paper 2;
PFP - protein function prediction
ConSurf - Server for the identification of functional regions in proteins with or without known structures.
ProFunc - prediction of function from protein structure
GOtcha - a function prediction for your sequence

Secondary Structure Prediction

PsiPred - Protein Structure Prediction Server
Proteus2 - bundles signal peptide identification, transmembrane helix prediction, transmembrane beta-strand prediction, secondary structure prediction (for soluble proteins) and homology modeling (i.e. 3D structure generation) into a single prediction pipeline.
JPred3 - a secondary structure prediction server powered by the Jnet algorithm.
PredictProtein - offers the following: generation of multiple sequence alignments (MaxHom), detection of functional motifs (PROSITE), detection of composition-bias (SEG), detection of protein domains (PRODOM), fold recognition by prediction-based threading (TOPITS), predictions of: secondary structure (PHDsec, and PROFsec), residue solvent accessibility (PHDacc, and PROFacc), transmembrane helix location and topology (PHDhtm, PHDtopology), protein globularity (GLOBE), coiled-coil regions (COILS), cysteine bonds (CYSPRED), structural switching regions (ASP)
Scratch protein predictor - offers the following predictions: secondary structure, solvent accessibility, transmembrane regions, disordered regions, disulfide bonds, domains, antigenicity.
Expasy tools
Additional tools (compiled by Gene Infinity)

Disordered Proteins

Domain & domain-linker Prediction

DomPred
GlobPlot
Scooby domain prediction
DoBo - domain boundary prediction
DomPro
DomCut - prediction of the linker region between functional domains
Domain linker prediction
ThreadDom

Topology Prediction

α helices

Psipred - You may select one of three prediction methods to apply to your sequence: PSIPRED - a highly accurate method for protein secondary structure prediction, MEMSAT - a widely used transmembrane topology prediction method and GenTHREADER - a sequence profile based fold recognition method.
TMHMM - Prediction of transmembrane helices in proteins, nice graphics
Phobius - a combined transmembrane topology and signal peptide predictor.
SOSUI - Predicts transmembrane helices in proteins and includes the helical wheels in the graphic presentation. Checks for presence of signal peptide to avoid the risk of signal peptides being predicted as putative TM as well.
TMPred - presence of transmembrane helices and their orientation.
MemBrain
TOPCONS - consensus prediction of membrane protein topology
Octopus
MINNOU

Topo2 - transmembrane protein display software - user has to supply the data about TMDs. Residues of interest can be highlighted.

SACS list of transmembrane prediction sites.
Expasy's tools for topology prediction.

β sheets

ConBBPRED: Consensus Prediction of TransMembrane Beta-Barrel Proteins (about)
TMB-Web - Transmembrane Barrel-web resources.
Boctopus
Pred-TMBB
TMBeta-Net
TBBPred
PROFtmb - a prediction service for Bacterial Transmembrane Beta Barrels - part of PredictProtein

Helical Wheels

Additional Tools

Suggest an expression system - developed at the Weizmann Institute of Science. SuggestES takes the protein sequence you provide and scans a large database of protein sequences with known results in different expression systems. When generating a suggestion, SuggestES takes several parameters into consideration:

Similarity: how similar is your sequence to the existing data in the database? The expression systems used on sequences similar to yours are preferred when creating the list of suggestions.
Recentness: how recently was a given expression system used? The older the record of the usage of a given expression system, the less this system will influence the final result. This gives visibility to recently introduced systems.
Frequency: how frequently has a given expression system been used?
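
The three parameters above could be combined along the following lines. This is only a sketch of the general idea: the actual SuggestES weighting is not described here, so the `Record` fields, the exponential recency decay and the `half_life_years` parameter are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One hypothetical database entry with a known expression result."""
    system: str        # expression system that was used
    similarity: float  # 0..1 similarity of this entry to the query sequence
    year: int          # year the result was recorded


def rank_systems(records, current_year, half_life_years=5.0):
    """Rank expression systems by summed, recency-weighted similarity.

    Similarity: each record contributes in proportion to its similarity.
    Recentness: contributions halve every `half_life_years` (assumed decay).
    Frequency: systems with more records accumulate more contributions.
    """
    scores = defaultdict(float)
    for rec in records:
        age = max(0, current_year - rec.year)
        recency = 0.5 ** (age / half_life_years)
        scores[rec.system] += rec.similarity * recency
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    records = [
        Record("E. coli", 0.9, 2020),
        Record("E. coli", 0.8, 2019),
        Record("Pichia pastoris", 0.9, 2005),
    ]
    for system, score in rank_systems(records, current_year=2020):
        print(f"{system}: {score:.3f}")
```

Summing per-record contributions lets frequency emerge naturally, while the decay term keeps an old but once-popular system from dominating the ranking.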

SECRET predicts the chance that a soluble protein will crystallize.
ESSPRESSO (estimation of protein expression and solubility)
SERp aims to aid identification of sites that are most suitable for mutation designed to enhance crystallizability by a Surface Entropy Reduction approach.
CRYSTALP2 - for in-silico prediction of protein crystallization propensity.

This site is maintained by Dr. Nurit Doron. Your comments are most welcome.
