Identifying which residues determine a protein's specificity (for example, why the LacI repressor protein is regulated by lactose, whereas its homolog FruR is regulated by fructose) is an important problem in biology. If we could uncover from sequence alone which residues determine specificity, we could: 1) understand the effect or importance of mutations, 2) build predictions for ligands, or 3) understand the complexity of evolutionary divergence. In this work, we build on a previously developed idea from the sequence alignment field: a specificity-determining position is likely to be one that is conserved within one specificity group but conserved as a different residue in other specificity groups. In this software implementation, we make two important changes:
- We relax the requirement that a position must determine specificity in all groups. This relaxation greatly increases our ability to identify specificity-determining positions, because degeneracy (non-conservation) in some groups can be explained by that particular family relying less on the residue.
- We use ensemble alignments to build statistical distributions of specificity-determining positions (SDPs). As the number of sequences in an alignment increases, the quality of the alignment tends to decrease. Therefore, we limit each individual alignment to a small number of sequences and resample from a pool of thousands of sequences to improve the estimation of SDPs.
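
The two changes above can be illustrated with a minimal sketch. This is not the implementation shipped with this software; it is a toy under stated assumptions. The function names (`conserved_residue`, `is_candidate_sdp`, `ensemble_sdp_frequency`), the 0.9 conservation threshold, and the `per_group` sample size are all hypothetical. The sketch also assumes the sequences already share one column coordinate system, so the realignment of each resampled subset is omitted.

```python
import random
from collections import Counter


def conserved_residue(residues, min_conservation=0.9):
    """Return the dominant residue if it meets the threshold, else None.

    The 0.9 threshold is an illustrative assumption, not a value from the tool.
    """
    if not residues:
        return None
    residue, count = Counter(residues).most_common(1)[0]
    # A column dominated by gaps is not treated as conserved.
    if residue != "-" and count / len(residues) >= min_conservation:
        return residue
    return None


def is_candidate_sdp(column, min_conservation=0.9, min_groups=2):
    """column maps a group label to the residues observed at one position.

    Relaxed criterion: degenerate (non-conserved) groups are ignored, and the
    position qualifies if at least `min_groups` groups are conserved and the
    conserved groups do not all share the same residue.
    """
    conserved = [conserved_residue(r, min_conservation) for r in column.values()]
    conserved = [c for c in conserved if c is not None]
    return len(conserved) >= min_groups and len(set(conserved)) > 1


def ensemble_sdp_frequency(aligned, n_samples=1000, per_group=5, seed=0):
    """aligned maps a group label to a list of equal-length aligned sequences.

    Draws `n_samples` small subsamples and returns, per position, the fraction
    of subsamples in which the position was called a candidate SDP.
    Assumption: columns are comparable across subsamples (no realignment here).
    """
    rng = random.Random(seed)
    length = len(next(iter(aligned.values()))[0])
    hits = [0] * length
    for _ in range(n_samples):
        sample = {
            group: rng.sample(seqs, min(per_group, len(seqs)))
            for group, seqs in aligned.items()
        }
        for pos in range(length):
            column = {g: [s[pos] for s in seqs] for g, seqs in sample.items()}
            if is_candidate_sdp(column):
                hits[pos] += 1
    return [h / n_samples for h in hits]


# Toy data (made up for illustration): position 1 differs between two
# conserved groups, while the third group is degenerate at positions 1 and 2.
toy = {
    "group_a": ["MKA", "MKA", "MKA", "MKA"],
    "group_b": ["MRA", "MRA", "MRA", "MRA"],
    "group_c": ["MQC", "MTD", "MSE", "MWF"],
}
frequencies = ensemble_sdp_frequency(toy, n_samples=50, per_group=3)
print(frequencies)
```

On this toy input only the middle position is called in every subsample. A strict criterion requiring conservation in all three groups would miss it, because `group_c` is degenerate there. In a real run each subsample would be realigned and its columns mapped back to a reference sequence before counting.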