|
Research Overview
Human Variation
Biological Sequences
Algorithms
My current research focuses on the development of computational tools
that may help in studies of the genetics of complex diseases such as
cancer, Alzheimer's disease or Parkinson's disease. In order to
understand the genetic factors of complex diseases, disease
association studies are performed. In these studies, a set of cases
(individuals carrying the disease) and controls (healthy individuals)
are genotyped, and the genetic variation of the two populations is
compared. The information collected from each individual usually
consists of a set of positions in the genome called single nucleotide
polymorphisms (SNPs). An excellent introduction for non-biologists on
what are SNPs and related issues can be found in the HapMap website .
There are more than three billion nucleotide bases in the human
genome, and as of today, it is not within our reach to sequence every
individual. Therefore, disease association studies usually restrict
themselves to SNPs. Out of the three billion nucleotide bases, only a
few millions are common SNPs. The technology that extracts the SNP
information from an individual is called 'genotyping'. In fact,
current genotyping technology can only extract a subset of the SNPs
for a realistic price.
The question is whether such a partial picture of the genome would be
enough in order to find associations between genetic variants and
disease. The answer to this question is that to some extent we could
learn something about the 'hidden' parts of the genome, or about the
SNPs that we do not genotype, since there are correlations between
SNPs that are in close proximity on the chromosome. Thus, it is
desirable to genotype a subset of the SNPs, such that other SNPs in
the genome are highly correlated with these SNPs. Such SNPs are
usually referred to as Tag SNPs .
Together with Ron Shamir and Gad Kimmel from Tel-Aviv University, we
developed a method, called STAMPA, which
finds a set of tag SNPs that can be used to predict the other hidden
SNPs. We predict the allele value of each SNP based on the two tag
SNPs closet to it on the chromosome. The tag SNPs are chosen to
minimize the prediction error, based on a training dataset. We show
that a simple dynamic programming can be applied in order to find
such a set of tag SNPs.
- Eran Halperin, Gad Kimmel, Ron Shamir
Tag SNP Selection in Genotype Data for Maximizing SNP Prediction
Accuracy, ISMB (Supplement of Bioinformatics), 2005: 195-203.
[pdf]
A haplotype of an individual is the sequence of nucleotide bases
occurring at the SNPs along one chromosome (see the HapMap website for
more information). Haplotypes can be used as proxy for hidden
SNPs, and they play an increasingly important role in association
studies. In STAMPA, we use pairs of SNP to infer the alleles of the
hidden SNPs, but one could potentially use the haplotype information
for such inference. Indeed, such a strategy was suggested by Paul
de Bakker and friends, in the context of their widely used tag SNP
selection method
Tagger.
Unfortunately, genotyping technology provides the genotype
information and not the haplotype information.
As opposed to haplotypes, the genotype gives the bases at each
SNP for both copies of the chromosome, but loses the information
as to the chromosome on which each base appears. The task of
inferring the haplotype information from the genotype information
is called phasing . Inspired by a beautiful paper by Dan
Gusfield, (RECOMB,
2002), we have designed and implemented a phasing algorithm, HAP, which is
currently used by hundreds of researchers worldwide.
|
|
- Eran Halperin and Eleazar Eskin
Haplotype Reconstruction from Genotype Data using Imperfect Phylogeny,
Bioinformatics 20(12): 1842-1849 (2004).
[pdf]
The main advantage of HAP, compared to other phasing methods is its
speed. HAP is very efficient, and it can easily run on genome scale
datasets. For this reason, when the first genome-wide dataset was
released by Perlegen Sciences,
HAP was used to phase the dataset, which facilitated the study of the
structure of haplotype and human variation across three populations.
- David A. Hinds, Laura L. Stuve, Geoffrey B. Nilsen, Eran Halperin, Eleazar Eskin, Dennis G. Ballinger, Kelly A. Frazer, David R. Cox
Whole-Genome Patterns of Common DNA Variation in Three Human Populations, Science, 18 February 2005: 1072-1079.
[Science Website].
Shortly after that, the HapMap project released another large dataset,
that contained a whole-genome coverage of 269 individuals from four
different populations. We thought that this would be an important
opportunity to study the potential of haplotype analysis when
combining the two datasets. In fact, we phased all the datasets in dbSNP (the
largest depository of SNPs), and submitted the haplotypes to
dbSNP so that they could be used by the community. That was the first
submission of haplotypes into dbSNP.
- Noah A. Zaitlen, Hyun Min Kang, Michael L. Feolo, Stephen T. Sherry, Eran Halperin, and Eleazar Eskin
Inference and analysis of haplotypes from combined genotyping
studies deposited in dbSNP, Genome
Research, 2005, 15:1594-1600. [Genome
Research].
Technically, HAP is using a perfect phylogeny model. In this model, we
assume that the ancestral history of the haplotypes follows a
simplistic model, in which the haplotypes can be placed on a
phylogenetic tree. Each vertex of the tree corresponds to either an
existing or ancestral haplotype, and each edge corresponds to a
mutation in one of the positions. In the perfect phylogeny model, we
assume that each SNP has mutated exactly once through history, and
that there were no other events such as recombinations, deletions,
etc. Even though this seems like an oversimplifying assumption, we
found that it holds almost perfectly in practice.
|
|
The implementation of HAP is based on a algorithm that finds a
set of haplotypes that can explain the genotypes, under the constraint
that this set of haplotypes could be placed on a perfect phylogeny.
- Eleazar Eskin, Eran Halperin and Richard M. Karp
Efficient Reconstruction of Haplotype Structure via Perfect Phylogeny, JBCB , 1 (1) 1-20, 2003.
[pdf]
Clearly, restricting the model to the case where each mutation occurs
only once is very restrictive. In HAP, we approach this using a simple
heuristic. It is natural to ask whether a more rigorous approach can
be taken. Therefore, together with a group from CMU, we developed an
algorithm that finds a tree with the minimum number of possible
mutations. This is also called maximum parsimony. The maximum
parsimony problem is of independent interest, since it has been
studied by the bioinformatics community for a while now. It is
NP-hard, so the running time of our algorithm is
exponential in the number of recurrent mutations q. Previously, the
best algorithm known for this problem had an impractical running time
of O(m^q poly(m,n)), where m is the number of SNPs, n is the number of
individuals. We improve this to O(q^q poly(m,n)), and so our algorithm
is practical even if the number of SNPs is large.
- Guy E. Blelloch, Heda Dhamdhere, Eran Halperin, R. Ravi, Russell
Schwartz, and Srinath Sridhar
Fixed Parameter tractability of Binary Near-Perfect Phylogenetic
Tree Reconstructions, ICALP,
2006. [pdf]
Together with Richard Karp, we further explored the possible
reconstructions of trees from haplotypes with missing data or
from genotype data. We showed that under some probabilistic
assumptions, the perfect phylogeny is unique and it can be found
efficiently.
- Eran Halperin and Richard M. Karp
Perfect Phylogeny and Haplotype Assignment, RECOMB , 2004.
[pdf]
Phasing is only needed if one wants to find a haplotype explanation
for every given genotype. In practice, in many case-control studies,
it is much more important to find the distribution of haplotypes in
each of the populations than to find the haplotype composition of
every individual. There is a natural likelihood function that is used
for this scenario, which has been extensively studied, and is
generally solved using an EM algorithm. Clearly, the EM is not
guaranteed to converge to the correct solution, so we came up with a new
likelihood function, in which we assign a vector to each of the
haplotypes, and the frequency of this haplotype is interpreted as the
projection of the vector on a constant unit vector. We show that
asymptotically the two likelihood functions converge to the same
solution, and we developed an efficient algorithm to find the maximum
likelihood solution (HAPLOFREQ). HAPLOFREQ is extremely fast in practice, and we are
currently incorporating it to the HAP webserver.
- Eran Halperin and Elad Hazan
HAPLOFREQ - Estimating Haplotype Frequencies Efficiently,
RECOMB,
2005. Also in a special issue of
Journal of
Computational Biology,March
2006, Vol. 13, No. 2: 481-500.
[pdf]
The maximum likelihood function for phasing is a somewhat
different. We show that maximizing this function is equivalent to
finding a set of haplotypes that explain the genotypes, and such that
the entropy of the distribution of haplotypes in the population is
minimized. This is also true when considering haplotypes with missing data.
In the following paper we show that finding a solution to this problem
is NP-hard in general, but we find that a simple greedy algorithm
gives a constant additive approximation.
- Eran Halperin and Richard M. Karp
The Minimum-Entropy Set Cover Problem, ICALP
2004, 733--744. [pdf]
Another interesting issue is how to infer haplotypes over long
regions. We use a general framework that takes the solutions of any
phasing method over short overlapping windows, and combines these
solutions using dynamic programming.
- Eleazar Eskin, Roded Sharan and Eran Halperin
Optimally Phasing Long Genomic Regions using
Local Haplotype Predictions, JBCB
,4, pp. 639-647, 2006. [pdf]
Given the correct phase, and the haplotype information, it is still
important to make efficient and correct use of the haplotype
information in association studies. Together with Noah Zaitlen, Hyun
Min Kang and Eleazar Eskin from UCLA, we found an optimal haplotype
test for association, which leverages the HapMap information in order
to 'prioritize' the HapMap SNPs for consideration in a followup
study. Our test is based on a weighted combination of the haplotypes,
vs. the standard tests that test one haplotype, or a group of
haplotypes that are given the same weight. We show
analytically and empirically that our test is more powerful than
single SNP tests or single haplotype tests for association. To
facilitate the analysis, we set up a web-server (WHAP) that analyzes and
graphically presents the results.
- Noah Zaitlen, Hyun Min Kang, Eleazar Eskin, Eran Halperin
Leveraging the HapMap Correlation Structure in Association Studies
The American Jounal of Human
Genetics, 80:683-691, 2007. [AJHG]
One technology which is not applied to its full potential is DNA
pools. In this technique, the DNA of a group of individuals is pooled,
and then genotyped. The result is the allele frequency in the group of
individuals. Seemingly, the individual genotype information is
lost. We found though, that when mother-father and child are pooled
using two DNA pools, one can extract the information of the three
genotypes, thus reducing the cost of genotyping the trio by 50%. The
method (TrioPhase)
is based on a combinatorial observation, combined with a clustering
algorithm to deal with noise in the allele frequency reads.
Together with Kenny Beckman from Children Hospital Oakland's Research
Institute, we performed wet experiments on the Sequenom genotyping
platform, and demonstrated that our suggested protocol provides the
genotyping with an error rate comparable to the one of standard genotyping.
- Kenneth B. Beckman, Kenneth A. Abel, Andreas Braun and Eran Halperin
Using DNA Pools for Genotyping Trios, Nucleic Acids Research, 2006; doi:
10.1093/nar/gkl700. Nucleic Acids Research.
When DNA pools are used, the individual information is lost, and in
particular, the haplotype information of each individual is not
given. In case-control studies that utilize imputation and haplotype
analysis, it is critical to know the haplotype frequencies in the case
and control populations. In the following paper, we introduce a method that
uses perfect phylogeny and linear regression to estimate the haplotype
frequencies in a population based on datasets consisting of pools of
two or three individuals at a time.
- Bonnie Kirkpatrick, Carlos Santos Armendariz, Richard M Karp,
and Eran Halperin
HAPLOPOOL: Improving Haplotype Frequency Estimation through DNA
Pools and Phylogenetic Modeling, Bioinformatics, in press.[Bioinformatics
Advanced Access].
Disease association studies may erroneously find links between SNPs and
a disease, simply because there are some confounding effects such as
population substructure. If the cases and the controls are collected
from different population, different ethnic groups, etc., there is a
risk that the discrepancies observed in the genetic variation between
the cases and the controls is due to the genetic difference between
the populations, and not due to the difference in the disease status.
It is therefore extremely important to find population substructure,
if one exists.
In order to rigorously answer this question, we considered the problem
in which the genotypes of a set of individuals from two populations are
given, and the goal is to classify the individuals based on their
genetic data. We assume a model in which the SNPs are independent and
we show bounds of the number of individuals and SNPs needed in order
to classify the set of genotypes correctly.
- Kamalika Chaudhuri, Eran Halperin, Satish Rao and Shuheng Zhou
A Rigorous Analysis of Population Stratification with Limited
Data, appeared in SODA ,
2007.
[pdf]
In practice, many of the population stratification algorithms do not
perform well, in particular when two populations of different sizes
are mixed together. In the following paper, we suggest a new
dissimilarity measure between individuals, based on their SNP
information and on the size of the sub-populations. We leverage on
this measure to build a very efficient method that clusters
populations based on their SNP data. The efficiency of the method
allows us to run permutation tests in order to measure the
significance of the partition.
- Srinath Sridhar, Satish Rao, Eran Halperin
An Efficient and Accurate Graph-based Approach to Detect
Population Substructure, appeared in
RECOMB, 2007.
[pdf]
Figuring out the populatiopn structure is not the end of the story. An
impotant issue is how to adjust the p-values based on the population
structure. In the following paper, we suggest a method that
incorporates the population structure into a permutation test.
This way we are able to adjust for population structure and for
multiple hypothesis at the same time.
- Gad Kimmel, Michael I. Jordan, Eran Halperin, Ron Shamir and
Richard M. Karp
A randomization test for controlling population stratification in
whole-genome association studies, The
American Jounal of Human Genetics, 81:895-905, 2007. [AJHG].
Finding population structure is especially difficult when admixed
populations are considered. These populations have originated from two
or more ancestral populations who have been mixing for the last couple
of generations (e.g., African Americans, or Latin Americans). In such
populations, every position in each of the chromosomes may originate
from a different ancestral population. Methods for 'locus specific
ancestry' are currently not very accurate. We have developed a method
called LAMP to infer the ancestry of each locus on a genome-wide
scale. We demonstrate on HapMap data how LAMP is more accurate and
efficient than other methods used to infer locus-specific ancestral
information.
- Sriram Sankararaman, Srinath Sridhar, Gad Kimmel, and Eran
Halperin,
LAMP: Local Ancestry in adMixed Populations, The American Journal of Human Genetics,
Volume 82, Issue 2, 290-303, 2008. [AJHG]
LAMP is very different from previous methods that tried to find the
locus specific ancestry of recently admixed populations in that it
does not use a Hidden Markov Model (HMM) and Markov Chain Monte Carlo (MCMC)
techniques in order to optimize the model. Since these methods try to
optimize the model directly, it is surprising that LAMP is performing
so much better. We tried to understand this phenomenon by implementing
different MCMC and HMM models, and measuring their performance. One
such implementation, called SWITCH, achieves slightly better
performance than LAMP, when starting an EM optimization of the HMM
from a starting point provided by LAMP.
Sriram Sankararaman, Gad Kimmel, Eran Halperin, and Michael
I. Jordan,
On the inference of ancestries in admixed populations, Genome Research, (special issue of RECOMB, 2008). In press.
DNA, RNA and protein sequences are the building blocks of computational
biology. In the pre-human genome project era, many of the problems
that involved these sequences were related to the assembly and
clustering of DNA sequences. My first bioinformatics project, in
Compugen's
research team, was to cluster an assemble large databases of mRNA. This
task is much easier today, when a genome reference is available. At
the time, the simple task of clustering these huge databases was
challenging, not to mention the many obstacles such as repeats,
chimera sequences, alternative splicing, and noise. With a team work,
using innovative hashing techniques and graph algorithms, we managed
to address all these challenges, and produced an algorithm which is
the infrastructure of today's LEADs, the main product
of Compugen.
- Mor Amitai, Raveh Gill-More, Eran Halperin, Avner Magen, Sarah
Pollock.
United States Patent 6,625,545: Method and apparatus for mRNA assembly. (1998)
The main idea in the mRNA clustering algorithm is in hashing the
sequences in a clever way. In the following paper, we suggest a method
to hash protein sequences into bins, using metric embedding. We embed
the protein space into the Euclidean space, and show that this
embedding may facilitate the applications of known hashing techniques
for DNA sequences (in this case random projections).
- Eran Halperin, Jeremy Buhler,Richard Karp, Robert Krauthgamer and Ben Westover
Detecting Protein Sequences Via Metric Embeddings, ISMB 2003, 122-199.
[pdf]
My formal background is in math, and in theory of computer science. I
spent many years developing algorithms, and trying to understand the
limitations of computational tools. I highlight briefly a few
papers which I hope you will find interesting mathematically.
In the following paper we study an extension of the
coupon collector problem. In the coupon collector scenario, a
person collects random coupons (numbers between 1 and n), until he
has all n coupons - then he wins a prize. It is known that such a
process ends after O(nlog n ) steps. We analyze a variant of this
process which is applicable to peer to peer networks.
- Micah Adler, Eran Halperin, Richard Karp and Vijay Vazirani
A stochastic process on the hypercube with applications to peer to peer networks,
STOC , 2003, 575--584. [ps]
In the following paper I show that the vertex cover problem can be
better approximated using semidefinite programming. In the vertex
cover problem we are given a graph, and we search for a set of
vertices that is incident with all the edges. The bound I give in
this paper improves the previous bound that was set in 1985 (15 years
prior to this paper). In 2005, this bound was improved
again.
- Eran Halperin
Improved approximation algorithms for the vertex cover
problem in graphs and hypergraphs, SICOMP , 31, 1608-1623, (2002).
Also, in SODA , 2000, 329-337. [ps]
As opposed to algorithms design, complexity theory tries to show what
the limits of computation are. I started studying the asymmetric
k-center problem in an attempt to develop an efficient constant
approximation algorithm. In this problem, an asymmetric metric is
given (the distance from a to b could be different than the distance
from b to a - think about flight times), and the goal is to find a
set of k points such that the distance from these points to all
other points is minimized (same idea as k-means). After thinking
about it for a while, we proved that such an efficient algorithm is
impossible, and in fact, the best you can do is approximate it with
a factor of log*k (log* is the number of times that a log should be
applied to a number until you get a constant). That was the first
time that log* was shown to be an approximation threshold for a
natural problem.
- Julia Chuzhoy, Sudipto Guha, Eran Halperin, Sanjeev Khanna, Guy
Kortsarz, Robert Krauthgamer and Seffi Naor
Tight lower bounds for the asymmetric k-center
problem, Journal of the ACM,
52(4):538-551, 2005. Also, in
STOC ,
2004. [ps].
|