Research - Eran Halperin

Research Overview

Human Variation Biological Sequences
Algorithms

Human Variation

My current research focuses on the development of computational tools that may help in studies of the genetics of complex diseases such as cancer, Alzheimer's disease or Parkinson's disease. In order to understand the genetic factors of complex diseases, disease association studies are performed. In these studies, a set of cases (individuals carrying the disease) and controls (healthy individuals) are genotyped, and the genetic variation of the two populations is compared. The information collected from each individual usually consists of a set of positions in the genome called single nucleotide polymorphisms (SNPs). An excellent introduction for non-biologists on what are SNPs and related issues can be found in the HapMap website .

Which SNPs should we look at?

There are more than three billion nucleotide bases in the human genome, and as of today, it is not within our reach to sequence every individual. Therefore, disease association studies usually restrict themselves to SNPs. Out of the three billion nucleotide bases, only a few millions are common SNPs. The technology that extracts the SNP information from an individual is called 'genotyping'. In fact, current genotyping technology can only extract a subset of the SNPs for a realistic price.

The question is whether such a partial picture of the genome would be enough in order to find associations between genetic variants and disease. The answer to this question is that to some extent we could learn something about the 'hidden' parts of the genome, or about the SNPs that we do not genotype, since there are correlations between SNPs that are in close proximity on the chromosome. Thus, it is desirable to genotype a subset of the SNPs, such that other SNPs in the genome are highly correlated with these SNPs. Such SNPs are usually referred to as Tag SNPs .

Together with Ron Shamir and Gad Kimmel from Tel-Aviv University, we developed a method, called STAMPA, which finds a set of tag SNPs that can be used to predict the other hidden SNPs. We predict the allele value of each SNP based on the two tag SNPs closet to it on the chromosome. The tag SNPs are chosen to minimize the prediction error, based on a training dataset. We show that a simple dynamic programming can be applied in order to find such a set of tag SNPs.
  • Eran Halperin, Gad Kimmel, Ron Shamir
    Tag SNP Selection in Genotype Data for Maximizing SNP Prediction Accuracy, ISMB (Supplement of Bioinformatics), 2005: 195-203. [pdf]

Haplotypes and Phasing

A haplotype of an individual is the sequence of nucleotide bases occurring at the SNPs along one chromosome (see the HapMap website for more information). Haplotypes can be used as proxy for hidden SNPs, and they play an increasingly important role in association studies. In STAMPA, we use pairs of SNP to infer the alleles of the hidden SNPs, but one could potentially use the haplotype information for such inference. Indeed, such a strategy was suggested by Paul de Bakker and friends, in the context of their widely used tag SNP selection method Tagger.

Unfortunately, genotyping technology provides the genotype information and not the haplotype information. As opposed to haplotypes, the genotype gives the bases at each SNP for both copies of the chromosome, but loses the information as to the chromosome on which each base appears. The task of inferring the haplotype information from the genotype information is called phasing . Inspired by a beautiful paper by Dan Gusfield, (RECOMB, 2002), we have designed and implemented a phasing algorithm, HAP, which is currently used by hundreds of researchers worldwide.

  • Eran Halperin and Eleazar Eskin
    Haplotype Reconstruction from Genotype Data using Imperfect Phylogeny, Bioinformatics 20(12): 1842-1849 (2004). [pdf]

The main advantage of HAP, compared to other phasing methods is its speed. HAP is very efficient, and it can easily run on genome scale datasets. For this reason, when the first genome-wide dataset was released by Perlegen Sciences, HAP was used to phase the dataset, which facilitated the study of the structure of haplotype and human variation across three populations.

  • David A. Hinds, Laura L. Stuve, Geoffrey B. Nilsen, Eran Halperin, Eleazar Eskin, Dennis G. Ballinger, Kelly A. Frazer, David R. Cox
    Whole-Genome Patterns of Common DNA Variation in Three Human Populations, Science, 18 February 2005: 1072-1079. [Science Website].

Shortly after that, the HapMap project released another large dataset, that contained a whole-genome coverage of 269 individuals from four different populations. We thought that this would be an important opportunity to study the potential of haplotype analysis when combining the two datasets. In fact, we phased all the datasets in dbSNP (the largest depository of SNPs), and submitted the haplotypes to dbSNP so that they could be used by the community. That was the first submission of haplotypes into dbSNP.
  • Noah A. Zaitlen, Hyun Min Kang, Michael L. Feolo, Stephen T. Sherry, Eran Halperin, and Eleazar Eskin
    Inference and analysis of haplotypes from combined genotyping studies deposited in dbSNP, Genome Research, 2005, 15:1594-1600. [Genome Research].

Phasing - how does it work?

Technically, HAP is using a perfect phylogeny model. In this model, we assume that the ancestral history of the haplotypes follows a simplistic model, in which the haplotypes can be placed on a phylogenetic tree. Each vertex of the tree corresponds to either an existing or ancestral haplotype, and each edge corresponds to a mutation in one of the positions. In the perfect phylogeny model, we assume that each SNP has mutated exactly once through history, and that there were no other events such as recombinations, deletions, etc. Even though this seems like an oversimplifying assumption, we found that it holds almost perfectly in practice.

The implementation of HAP is based on a algorithm that finds a set of haplotypes that can explain the genotypes, under the constraint that this set of haplotypes could be placed on a perfect phylogeny.
  • Eleazar Eskin, Eran Halperin and Richard M. Karp
    Efficient Reconstruction of Haplotype Structure via Perfect Phylogeny, JBCB , 1 (1) 1-20, 2003. [pdf]

Variants on the perfect phylogeny model

Clearly, restricting the model to the case where each mutation occurs only once is very restrictive. In HAP, we approach this using a simple heuristic. It is natural to ask whether a more rigorous approach can be taken. Therefore, together with a group from CMU, we developed an algorithm that finds a tree with the minimum number of possible mutations. This is also called maximum parsimony. The maximum parsimony problem is of independent interest, since it has been studied by the bioinformatics community for a while now. It is NP-hard, so the running time of our algorithm is exponential in the number of recurrent mutations q. Previously, the best algorithm known for this problem had an impractical running time of O(m^q poly(m,n)), where m is the number of SNPs, n is the number of individuals. We improve this to O(q^q poly(m,n)), and so our algorithm is practical even if the number of SNPs is large.
  • Guy E. Blelloch, Heda Dhamdhere, Eran Halperin, R. Ravi, Russell Schwartz, and Srinath Sridhar
    Fixed Parameter tractability of Binary Near-Perfect Phylogenetic Tree Reconstructions, ICALP, 2006. [pdf]

Together with Richard Karp, we further explored the possible reconstructions of trees from haplotypes with missing data or from genotype data. We showed that under some probabilistic assumptions, the perfect phylogeny is unique and it can be found efficiently.
  • Eran Halperin and Richard M. Karp
    Perfect Phylogeny and Haplotype Assignment, RECOMB , 2004. [pdf]

Other haplotype inference issues.

Phasing is only needed if one wants to find a haplotype explanation for every given genotype. In practice, in many case-control studies, it is much more important to find the distribution of haplotypes in each of the populations than to find the haplotype composition of every individual. There is a natural likelihood function that is used for this scenario, which has been extensively studied, and is generally solved using an EM algorithm. Clearly, the EM is not guaranteed to converge to the correct solution, so we came up with a new likelihood function, in which we assign a vector to each of the haplotypes, and the frequency of this haplotype is interpreted as the projection of the vector on a constant unit vector. We show that asymptotically the two likelihood functions converge to the same solution, and we developed an efficient algorithm to find the maximum likelihood solution (HAPLOFREQ). HAPLOFREQ is extremely fast in practice, and we are currently incorporating it to the HAP webserver.
  • Eran Halperin and Elad Hazan
    HAPLOFREQ - Estimating Haplotype Frequencies Efficiently, RECOMB, 2005. Also in a special issue of Journal of Computational Biology,March 2006, Vol. 13, No. 2: 481-500. [pdf]
The maximum likelihood function for phasing is a somewhat different. We show that maximizing this function is equivalent to finding a set of haplotypes that explain the genotypes, and such that the entropy of the distribution of haplotypes in the population is minimized. This is also true when considering haplotypes with missing data. In the following paper we show that finding a solution to this problem is NP-hard in general, but we find that a simple greedy algorithm gives a constant additive approximation.
  • Eran Halperin and Richard M. Karp
    The Minimum-Entropy Set Cover Problem, ICALP 2004, 733--744. [pdf]

Another interesting issue is how to infer haplotypes over long regions. We use a general framework that takes the solutions of any phasing method over short overlapping windows, and combines these solutions using dynamic programming.
  • Eleazar Eskin, Roded Sharan and Eran Halperin
    Optimally Phasing Long Genomic Regions using Local Haplotype Predictions, JBCB ,4, pp. 639-647, 2006. [pdf]

Downstream analysis of haplotypes in association studies.

Given the correct phase, and the haplotype information, it is still important to make efficient and correct use of the haplotype information in association studies. Together with Noah Zaitlen, Hyun Min Kang and Eleazar Eskin from UCLA, we found an optimal haplotype test for association, which leverages the HapMap information in order to 'prioritize' the HapMap SNPs for consideration in a followup study. Our test is based on a weighted combination of the haplotypes, vs. the standard tests that test one haplotype, or a group of haplotypes that are given the same weight. We show analytically and empirically that our test is more powerful than single SNP tests or single haplotype tests for association. To facilitate the analysis, we set up a web-server (WHAP) that analyzes and graphically presents the results.
  • Noah Zaitlen, Hyun Min Kang, Eleazar Eskin, Eran Halperin
    Leveraging the HapMap Correlation Structure in Association Studies The American Jounal of Human Genetics, 80:683-691, 2007. [AJHG]  [new]

Genotyping technology - DNA pools.

One technology which is not applied to its full potential is DNA pools. In this technique, the DNA of a group of individuals is pooled, and then genotyped. The result is the allele frequency in the group of individuals. Seemingly, the individual genotype information is lost. We found though, that when mother-father and child are pooled using two DNA pools, one can extract the information of the three genotypes, thus reducing the cost of genotyping the trio by 50%. The method (TrioPhase) is based on a combinatorial observation, combined with a clustering algorithm to deal with noise in the allele frequency reads. Together with Kenny Beckman from Children Hospital Oakland's Research Institute, we performed wet experiments on the Sequenom genotyping platform, and demonstrated that our suggested protocol provides the genotyping with an error rate comparable to the one of standard genotyping.
  • Kenneth B. Beckman, Kenneth A. Abel, Andreas Braun and Eran Halperin
    Using DNA Pools for Genotyping Trios, Nucleic Acids Research, 2006; doi: 10.1093/nar/gkl700. Nucleic Acids Research.  [new]

When DNA pools are used, the individual information is lost, and in particular, the haplotype information of each individual is not given. In case-control studies that utilize imputation and haplotype analysis, it is critical to know the haplotype frequencies in the case and control populations. In the following paper, we introduce a method that uses perfect phylogeny and linear regression to estimate the haplotype frequencies in a population based on datasets consisting of pools of two or three individuals at a time.
  • Bonnie Kirkpatrick, Carlos Santos Armendariz, Richard M Karp, and Eran Halperin
    HAPLOPOOL: Improving Haplotype Frequency Estimation through DNA Pools and Phylogenetic Modeling, Bioinformatics, in press.[Bioinformatics Advanced Access].  [new]

Population Stratification.

Disease association studies may erroneously find links between SNPs and a disease, simply because there are some confounding effects such as population substructure. If the cases and the controls are collected from different population, different ethnic groups, etc., there is a risk that the discrepancies observed in the genetic variation between the cases and the controls is due to the genetic difference between the populations, and not due to the difference in the disease status. It is therefore extremely important to find population substructure, if one exists.
In order to rigorously answer this question, we considered the problem in which the genotypes of a set of individuals from two populations are given, and the goal is to classify the individuals based on their genetic data. We assume a model in which the SNPs are independent and we show bounds of the number of individuals and SNPs needed in order to classify the set of genotypes correctly.
  • Kamalika Chaudhuri, Eran Halperin, Satish Rao and Shuheng Zhou
    A Rigorous Analysis of Population Stratification with Limited Data, appeared in SODA , 2007. [pdf]  [new]

In practice, many of the population stratification algorithms do not perform well, in particular when two populations of different sizes are mixed together. In the following paper, we suggest a new dissimilarity measure between individuals, based on their SNP information and on the size of the sub-populations. We leverage on this measure to build a very efficient method that clusters populations based on their SNP data. The efficiency of the method allows us to run permutation tests in order to measure the significance of the partition.
  • Srinath Sridhar, Satish Rao, Eran Halperin
    An Efficient and Accurate Graph-based Approach to Detect Population Substructure, appeared in RECOMB, 2007. [pdf]  [new]

Figuring out the populatiopn structure is not the end of the story. An impotant issue is how to adjust the p-values based on the population structure. In the following paper, we suggest a method that incorporates the population structure into a permutation test. This way we are able to adjust for population structure and for multiple hypothesis at the same time.
  • Gad Kimmel, Michael I. Jordan, Eran Halperin, Ron Shamir and Richard M. Karp
    A randomization test for controlling population stratification in whole-genome association studies, The American Jounal of Human Genetics, 81:895-905, 2007. [AJHG].  [new]

Finding population structure is especially difficult when admixed populations are considered. These populations have originated from two or more ancestral populations who have been mixing for the last couple of generations (e.g., African Americans, or Latin Americans). In such populations, every position in each of the chromosomes may originate from a different ancestral population. Methods for 'locus specific ancestry' are currently not very accurate. We have developed a method called LAMP to infer the ancestry of each locus on a genome-wide scale. We demonstrate on HapMap data how LAMP is more accurate and efficient than other methods used to infer locus-specific ancestral information.
  • Sriram Sankararaman, Srinath Sridhar, Gad Kimmel, and Eran Halperin,
    LAMP: Local Ancestry in adMixed Populations, The American Journal of Human Genetics, Volume 82, Issue 2, 290-303, 2008. [AJHG]  [new]

LAMP is very different from previous methods that tried to find the locus specific ancestry of recently admixed populations in that it does not use a Hidden Markov Model (HMM) and Markov Chain Monte Carlo (MCMC) techniques in order to optimize the model. Since these methods try to optimize the model directly, it is surprising that LAMP is performing so much better. We tried to understand this phenomenon by implementing different MCMC and HMM models, and measuring their performance. One such implementation, called SWITCH, achieves slightly better performance than LAMP, when starting an EM optimization of the HMM from a starting point provided by LAMP.
  • Sriram Sankararaman, Gad Kimmel, Eran Halperin, and Michael I. Jordan,
    On the inference of ancestries in admixed populations, Genome Research, (special issue of RECOMB, 2008). In press.  [new]

  • Biological Sequences

    DNA, RNA and protein sequences are the building blocks of computational biology. In the pre-human genome project era, many of the problems that involved these sequences were related to the assembly and clustering of DNA sequences. My first bioinformatics project, in Compugen's research team, was to cluster an assemble large databases of mRNA. This task is much easier today, when a genome reference is available. At the time, the simple task of clustering these huge databases was challenging, not to mention the many obstacles such as repeats, chimera sequences, alternative splicing, and noise. With a team work, using innovative hashing techniques and graph algorithms, we managed to address all these challenges, and produced an algorithm which is the infrastructure of today's LEADs, the main product of Compugen.
    • Mor Amitai, Raveh Gill-More, Eran Halperin, Avner Magen, Sarah Pollock. United States Patent 6,625,545: Method and apparatus for mRNA assembly. (1998)
    The main idea in the mRNA clustering algorithm is in hashing the sequences in a clever way. In the following paper, we suggest a method to hash protein sequences into bins, using metric embedding. We embed the protein space into the Euclidean space, and show that this embedding may facilitate the applications of known hashing techniques for DNA sequences (in this case random projections).
    • Eran Halperin, Jeremy Buhler,Richard Karp, Robert Krauthgamer and Ben Westover
      Detecting Protein Sequences Via Metric Embeddings, ISMB 2003, 122-199. [pdf]

    Algorithms

    My formal background is in math, and in theory of computer science. I spent many years developing algorithms, and trying to understand the limitations of computational tools. I highlight briefly a few papers which I hope you will find interesting mathematically.

    In the following paper we study an extension of the coupon collector problem. In the coupon collector scenario, a person collects random coupons (numbers between 1 and n), until he has all n coupons - then he wins a prize. It is known that such a process ends after O(nlog n ) steps. We analyze a variant of this process which is applicable to peer to peer networks.
    • Micah Adler, Eran Halperin, Richard Karp and Vijay Vazirani
      A stochastic process on the hypercube with applications to peer to peer networks, STOC , 2003, 575--584. [ps]

    In the following paper I show that the vertex cover problem can be better approximated using semidefinite programming. In the vertex cover problem we are given a graph, and we search for a set of vertices that is incident with all the edges. The bound I give in this paper improves the previous bound that was set in 1985 (15 years prior to this paper). In 2005, this bound was improved again.
    • Eran Halperin
      Improved approximation algorithms for the vertex cover problem in graphs and hypergraphs, SICOMP , 31, 1608-1623, (2002). Also, in SODA , 2000, 329-337. [ps]

    As opposed to algorithms design, complexity theory tries to show what the limits of computation are. I started studying the asymmetric k-center problem in an attempt to develop an efficient constant approximation algorithm. In this problem, an asymmetric metric is given (the distance from a to b could be different than the distance from b to a - think about flight times), and the goal is to find a set of k points such that the distance from these points to all other points is minimized (same idea as k-means). After thinking about it for a while, we proved that such an efficient algorithm is impossible, and in fact, the best you can do is approximate it with a factor of log*k (log* is the number of times that a log should be applied to a number until you get a constant). That was the first time that log* was shown to be an approximation threshold for a natural problem.
    • Julia Chuzhoy, Sudipto Guha, Eran Halperin, Sanjeev Khanna, Guy Kortsarz, Robert Krauthgamer and Seffi Naor
      Tight lower bounds for the asymmetric k-center problem, Journal of the ACM, 52(4):538-551, 2005. Also, in STOC , 2004. [ps].