hapConstructor details

  • HapConstructor Method Details

It has been suggested that analyzing multiple loci may have more power to identify an association, which is especially true for rare disease risk variants (minor allele frequency, MAF < 0.05). Hence, it is important to consider haplotype analyses; however, it is usually impractical to perform all possible analyses for multi-locus SNP sets. For each set of n SNPs there are 2^n full-length haplotypes, and many more sub-haplotypes derived from SNP subsets. Additionally, these haplotypes can be modeled and tested in a variety of ways. Usually, there is little or no a priori information on how to select loci or how to model them for powerful haplotype analyses. HapConstructor is a data-mining technique that provides a method to explore possible loci sets and models, with the objective of identifying the multi-locus SNP sets and models that best extract association evidence from the data. The application uses a set of heuristics to achieve this objective.

This haplotype-mining approach is a novel application and provides an automated utility that previously did not exist. There are two main uses for hapConstructor. First, it can be used to identify de novo association evidence. In this case, the approach can either be considered as a screening phase in a two-step study design, where positive findings are confirmed at a second phase, or, as long as the multiple testing inherent in the data-mining approach is accounted for, as a single-step technique. Second, it can be used to further refine already established associations. The aim in this second type of application is to identify the “risk haplotype” that best differentiates the cases from controls in regions already shown to harbor significant associations. This latter use has value for selecting individuals for sequencing panels to search for underlying causal variants on the risk haplotype.

  • Incorporation of hapConstructor into Genie

HapConstructor has been incorporated in a general association analysis software package, Genie. Genie requires users to specify parameters for analyses, which indicate the loci, genetic models and statistics to test in addition to the data sets to access, and the type of output. Although initially designed for single marker association analyses, Genie is composed of a set of modules that have been extended to other types of analyses. HapConstructor further extends these modules. The three core modules that were amended in the development of hapConstructor were the analysis definition, statistics and the simulation modules.

Analysis definition: Association analyses in Genie are defined to test the relationship between a dichotomous or continuous trait variable and a categorical genetic variable. The analysis definition module is used to define how to construct the contingency table. In Genie, contingency tables are populated with counts from the input data set. For single marker case-control association analyses, 2xn contingency tables are created. The two rows represent disease status (case/control) and the n columns represent each user-specified genetic variable category. The genetic variable categories can be defined in a number of ways based on different hypotheses regarding the genetic model for the disease locus, and whether alleles or genotypes are counted. There are five standard models that are considered: allele, global, dominant, recessive, and additive. A test can be based on counting alleles, each individual contributing two counts. With an allele test and biallelic marker (SNP) there are two categories which correspond to the two possible allele values for the SNP (say, allele 1, common, and allele 2, rare). A test can also be based on counting genotypes, since each individual has a pair of alleles, or genotype, for each SNP. For a SNP, there exist 3 genotypes (11, 12, 22). A global genotype association test considers the three genotypes as three separate genetic variable categories. A dominant model groups genotypes 12 and 22 together and a recessive model groups genotypes 11 and 12 together. An additive model considers the three genotype categories as ordinal variables weighted by user-defined values (usually the number of rare alleles in the genotype).
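As an illustration of the genetic model coding described above, the following Python sketch builds a 2xn case-control contingency table for each model. The function names and the 0/1/2 genotype coding (number of rare alleles) are assumptions for the example, not Genie's actual interface.

```python
# Illustrative sketch of the five genetic models described above.
# Genotypes are coded 0/1/2 = number of copies of the rare allele;
# names and coding are assumptions, not Genie's actual API.

def genotype_columns(genotype, model):
    """Map one genotype (0, 1, or 2 rare alleles) to column index/indices."""
    if model == "allele":            # two counts per person: one per allele
        return [1] * genotype + [0] * (2 - genotype)
    if model == "global":            # three columns: 11, 12, 22
        return [genotype]
    if model == "dominant":          # carriers (12, 22) vs non-carriers (11)
        return [1 if genotype >= 1 else 0]
    if model == "recessive":         # 22 vs grouped (11, 12)
        return [1 if genotype == 2 else 0]
    if model == "additive":          # ordinal columns weighted 0, 1, 2
        return [genotype]
    raise ValueError(model)

def contingency_table(cases, controls, model):
    ncols = {"allele": 2, "global": 3, "dominant": 2,
             "recessive": 2, "additive": 3}[model]
    table = [[0] * ncols, [0] * ncols]   # row 0 = cases, row 1 = controls
    for row, genotypes in ((0, cases), (1, controls)):
        for g in genotypes:
            for col in genotype_columns(g, model):
                table[row][col] += 1
    return table

cases = [2, 1, 1, 0, 2]
controls = [0, 0, 1, 1, 0]
print(contingency_table(cases, controls, "dominant"))  # [[1, 4], [3, 2]]
```

The dominant table above bins the four case carriers together against the single non-carrier; the same individuals under the allele model would contribute two counts each.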

Statistics: Genie contains a statistics module which is used to specify the statistics that will be carried out on the defined contingency tables. Most statistics offered are applicable to the 2xn contingency tables described above. These include statistics to test for non-independence (e.g. the chi-squared test and chi-squared trend test) and effect size (e.g. the odds ratio). With an odds ratio test, a reference category is defined, which represents the category hypothesized not to confer an increase in disease risk. Some statistics are carried out on contingency tables that count transmissions from parents, such as transmission/disequilibrium tests. Meta-statistics are also available to analyze multiple datasets using chi-squared association statistics and odds ratios.
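The two core statistics for a 2xn table can be sketched as follows; this is a minimal illustration of the chi-squared test of independence and per-column odds ratios against an assumed reference column (column 0), not Genie's implementation.

```python
# Minimal sketches of two statistics on a 2xN case-control table:
# Pearson's chi-squared statistic of independence, and odds ratios of
# each column against a reference column (column 0 by assumption).

def chi_squared(table):
    """Pearson chi-squared statistic for an r x c table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # expected under independence
            stat += (obs - exp) ** 2 / exp
    return stat

def odds_ratios(table, ref=0):
    """OR of each non-reference column vs the reference category."""
    a_ref, b_ref = table[0][ref], table[1][ref]
    return [(table[0][j] * b_ref) / (table[1][j] * a_ref)
            for j in range(len(table[0])) if j != ref]

table = [[10, 30], [20, 20]]          # rows: cases, controls
print(round(chi_squared(table), 3))   # 5.333
print(odds_ratios(table))             # [3.0]
```

The p-value for the chi-squared statistic would come from the empirical null distribution described in the simulation module below, rather than the asymptotic distribution.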

Simulation: Genie was developed specifically to allow for valid association analyses when data include related subjects. This is achieved using a Monte Carlo approach. Rather than accounting for the familial relatedness by amending the statistic and maintaining the standard distribution, the Monte Carlo approach maintains the standard statistic and derives an empirical null distribution. The observed statistics are calculated in the normal way, ignoring familial relationships (i.e. treating subjects as independent). To appropriately assess the statistical significance, Genie provides a simulation module to generate null empirical distributions for each statistic. This Monte Carlo procedure involves the simulation of null genotype configurations (based on the familial structure) and the calculation of each statistic of interest from these null genotype data. This process is repeated n times. The observed statistic is compared to the distribution of null statistics to estimate an empirical p-value. The key to generating a valid null distribution is that the characteristics of the null genotype data must appropriately match the real data, but with the genetic variables based on allele or haplotype frequencies under the null hypothesis of no association. For unrelated individuals, a simple permutation of affection status will produce a valid null data set. For related individuals, permutation leads to mismatches between the real and null data: permuting affection status alters the familial correlation structure between cases and controls, and genotypes cannot be permuted, as this would lead to Mendelian errors. One method that maintains the disease status configuration and avoids Mendelian incompatibilities is a gene drop. A gene drop is initiated by first assigning genotype (or haplotype) data to the founder individuals of a pedigree based on allele (or haplotype) frequencies.
The assigned founder genotypes are segregated through the pedigree to all the descendants based on rules of Mendelian transmission. The affection status for individuals in the pedigree remain the same as in the real data. The simulation module in Genie is used to specify the type of gene-drop that is performed and the number of simulations in the null distribution.
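The gene-drop procedure can be sketched as below, under the simplifying assumptions of a single biallelic marker, a small pedigree encoded as child → (father, mother) pairs, and independently drawn founder alleles; the encoding is illustrative, not Genie's.

```python
import random

# Hedged sketch of a gene drop: founders receive genotypes drawn from
# population allele frequencies, then alleles segregate to descendants
# by Mendelian transmission. Affection status is untouched, so the
# case/control configuration and pedigree structure match the real data.

def gene_drop(founders, offspring, rare_freq, rng=random):
    """Return simulated genotypes (pairs of alleles, coded 1/2)."""
    genos = {}
    for person in founders:                    # draw founder alleles
        genos[person] = tuple(
            2 if rng.random() < rare_freq else 1 for _ in range(2))
    for child, (father, mother) in offspring:  # Mendelian segregation:
        genos[child] = (rng.choice(genos[father]),   # one allele from dad,
                        rng.choice(genos[mother]))   # one from mom
    return genos

rng = random.Random(0)
genos = gene_drop(["dad", "mom"], [("kid", ("dad", "mom"))], 0.2, rng)
print(genos)
```

Because each child's alleles are drawn from its parents' simulated genotypes, Mendelian incompatibilities cannot arise, unlike permutation of genotypes.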

  • Algorithm development

The hapConstructor algorithm is built on the general Genie analysis package. Its heuristics guide the construction of multi-locus SNP sets and the specific analyses to perform for each SNP set. The statistical analysis itself is performed using the Genie components.

General algorithm background: The first step is to establish the maximum likelihood estimate (MLE) haplotypes for the observed data (which will lead to imputed genotype values based on the full data). The second step is to establish the MLE haplotype estimates for all the null data sets. These are matched to the real data on missing-data structure and pedigree structure, as described in detail in Chapter 2. For hapConstructor, however, all null simulation MLEs are stored and used as a reference for determining significance for whichever statistic and SNP set is being considered. The algorithm establishes which SNPs to analyze using a stepwise process, where step-m indicates an analysis with m SNPs, that is, a “SNP set” of size m. The process starts by analyzing all single SNPs independently. As described in Chapter 3, for any SNP surpassing a user-specified step-1 significance threshold, all SNP pairs including that SNP are considered at step-2. Similarly, all 2-SNP analyses that surpass the step-2 threshold will be considered in SNP sets of size 3, and so on. For each SNP set, the allelic values across the SNP set can be examined as haplotypes (phase is important) or composite genotypes (specific combinations of genotypes across loci, phase is unimportant). A SNP set of size m has 2^m possible haplotypes and 3^m composite genotype combinations (or 2^m combinations if a genetic model, dominant/recessive, is imposed on the loci). HapConstructor uses a set of heuristics to consider a more limited set of contingency tables and statistics from all those possible. The models and tests considered are user-defined and are detailed below:
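The forward construction rule, where any SNP set passing its step's threshold seeds all supersets with one additional SNP at the next step, can be sketched as follows (the set encodings are illustrative, not hapConstructor's internals):

```python
# Illustrative sketch of the stepwise SNP-set construction: every
# m-SNP set that passed its step's significance threshold seeds all
# (m+1)-SNP supersets at the next step.

def next_step_sets(passing_sets, all_snps):
    """All (m+1)-SNP sets containing at least one passing m-SNP set."""
    out = set()
    for s in passing_sets:
        for snp in all_snps - s:
            out.add(frozenset(s | {snp}))
    return out

snps = {"S1", "S2", "S3", "S4", "S5"}
# suppose the single-SNP tests left S1 and S4 below the step-1 threshold
passing = [frozenset({"S1"}), frozenset({"S4"})]
pairs = next_step_sets(passing, snps)
print(sorted(sorted(p) for p in pairs))   # 7 pairs, e.g. ['S1', 'S2'] ...
```

Note that {S1,S4} is generated only once even though both seeds produce it, matching the seven-pair example given later in the single-gene illustration.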

  • Analysis options

Composite genotype tests: When considering composite genotypes, the phase across the multiple loci is not important. It is simply a combination of genotypes for each locus in the set. A global composite genotype test is one where all possible composite genotypes for the m-SNP set are considered as separate columns in one contingency table. A test for independence could be performed with a chi-squared test or every column could be compared to the first column (representing homozygous wild type composite genotypes) with an odds ratio. Alternatively, specific lower dimension contingency tables can be constructed by restricting to dominant or recessive models at each locus. For example, there are 4 possible composite genotype combinations with a two-locus set: dominant-dominant, dominant-recessive, recessive-dominant and recessive-recessive. For the dominant-dominant combination, individuals with at least one rare allele at both loci are binned together, and all other genotype combinations are grouped together.
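A minimal sketch of the dominant-dominant binning described above, with genotypes coded as rare-allele counts (the coding and names are illustrative):

```python
# Sketch of composite-genotype binning under a dominant-dominant model
# for a two-locus set. Genotypes are coded 0/1/2 rare-allele counts;
# an individual is in the "risk" bin only if a carrier at both loci.

def dominant_dominant_bin(geno_locus1, geno_locus2):
    """1 if at least one rare allele at both loci, else 0."""
    return 1 if geno_locus1 >= 1 and geno_locus2 >= 1 else 0

# individuals as (locus1, locus2) genotype pairs
individuals = [(0, 2), (1, 1), (2, 0), (1, 2), (0, 0)]
bins = [dominant_dominant_bin(g1, g2) for g1, g2 in individuals]
print(bins)   # [0, 1, 0, 1, 0]
```

The other three combinations (dominant-recessive, recessive-dominant, recessive-recessive) would differ only in the per-locus condition.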

Haplotype tests: Diplotype models: A diplotype refers to a pair of haplotypes, in an analogous way that genotype refers to a pair of alleles. A global association test for diplotypes considers all the possible haplotype pairs (each having its own column) in one contingency table. A test of independence can be performed with a chi-squared test or each column can be compared to the first column using an odds ratio. Alternatively, two-dimensional specific haplotype tests can be examined using diplotype dominant, recessive, and additive models. For these specific haplotype tests, the haplotype of interest is treated as the “risk haplotype” and all other haplotypes are considered non-risk, thus creating two categories. The risk haplotype can then be tested in a dominant, recessive or additive test (2x2 or 2x3 contingency tables). For example, a diplotype dominant model for haplotypeA would group together individuals with at least one copy of haplotypeA, and compare to a second group of all individuals without haplotypeA.
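The specific-haplotype diplotype models can be sketched as follows, counting copies of an assumed risk haplotype per individual (haplotype labels and function names are illustrative):

```python
from collections import Counter

# Sketch of diplotype counting for a specific "risk haplotype": under
# a dominant model individuals are split by whether they carry at
# least one copy; under an additive model, by the copy number (0/1/2).

def diplotype_counts(diplotypes, risk_hap, model):
    copies = [d.count(risk_hap) for d in diplotypes]   # 0, 1, or 2 copies
    if model == "dominant":
        return Counter(1 if c >= 1 else 0 for c in copies)
    if model == "additive":
        return Counter(copies)
    raise ValueError(model)

# diplotypes as pairs of haplotype labels; "12" plays the risk haplotype
people = [("12", "12"), ("12", "21"), ("11", "22"), ("21", "12")]
print(diplotype_counts(people, "12", "additive"))   # copies 0:1, 1:2, 2:1
```

Tabulating these counts separately for cases and controls yields the 2x2 (dominant) or 2x3 (additive) tables described above.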

Monotype models: A monotype test considers single haplotypes (i.e. two observations per individual), in an analogous way that an allele test counts both alleles from a genotype. That is, a monotype test considers the chromosome as the unit of study rather than the individual. The global monotype test considers all haplotypes (one column for each) in one contingency table, and tests independence with a chi-squared test; alternatively, each monotype can be compared to the first column using an odds ratio. The global table can also be reduced to 2x2 contingency tables for specific haplotypes. These specific monotype tests compare a specific haplotype to all other haplotypes grouped together.

  • Construction-wide significance

The hapConstructor process of construction of contingency tables and testing is a data mining procedure and involves multiple testing. The empirical p-values calculated for each statistic do not account for the multiple testing inherent in the data mining procedure. It may be important in the interpretation of findings to consider the multiple testing. A critical value, α, of 0.05 would not control for a 5% family-wise error rate (FWER), or probability of making one or more false discoveries from all the hypotheses tested. A Bonferroni correction is a simple and conservative approach to maintain the proper FWER from n tests by using α/n as the critical value to assess statistical significance. The Bonferroni correction is most appropriate for multiple independent tests, and can be overly conservative for correlated tests. The tests conducted in hapConstructor are correlated and should be assessed with a more appropriate multiple testing procedure.

To assess the “construction-wide” significance of the most significant result from the hapConstructor procedure that accounts for the entire construction process, the program generates a null distribution of minimum p-values by performing a matched construction based on null data. The null constructions are easily generated using the stored null simulations. A null simulation is selected and considered to be the “real” data. The construction process is performed on these data and the most significant result identified. This is repeated 1,000 times to produce a distribution of minimum p-values. The real data can then be compared to this distribution to estimate an empirical construction-wide p-value.
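The construction-wide empirical p-value amounts to the fraction of null-construction minimum p-values at least as extreme as the observed minimum; a sketch follows (all numeric values are invented for illustration):

```python
# Sketch of the construction-wide empirical p-value: the observed
# minimum p-value is compared against minimum p-values from matched
# constructions run on null (gene-drop) data.

def construction_wide_p(observed_min_p, null_min_ps):
    """Fraction of null constructions at least as extreme as observed."""
    hits = sum(1 for p in null_min_ps if p <= observed_min_p)
    return hits / len(null_min_ps)

# minimum p-values from 8 null constructions (in practice, 1,000)
null_min_ps = [0.004, 0.020, 0.051, 0.008, 0.130, 0.002, 0.075, 0.033]
print(construction_wide_p(0.005, null_min_ps))   # 0.25
```

Because each null construction repeats the entire stepwise search, this p-value accounts for all the tests performed along the way.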

While the construction-wide p-value provides a multiple testing correction for the most significant result, it may be more valuable to assess whether there is a group of findings that is unexpected under the null. This is the purpose of the false discovery rate (FDR) approach. Rather than controlling the probability of a single false positive out of all tests performed with an FWER, an FDR approach controls the proportion of false positives among a group of findings. In an analogous way to the construction-wide significance, the null construction distribution can be used to estimate an empirical FDR q-value. This is performed by ranking all the null construction p-values in decreasing order of significance. The p-values for the observed data are similarly ranked. For the ith ranked p-value in the observed data (Pi), the q-value for the group of p-values ≤ Pi is estimated as Si/i, where Si is the average number of null-construction p-values ≤ Pi per construction.
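One plausible reading of this empirical q-value estimator is sketched below; the formula used here (average count of null p-values at or below Pi, divided by the observed rank i) is a reconstruction and may differ in detail from the estimator actually used:

```python
# Reconstruction (assumed, not verified against the original estimator)
# of an empirical FDR q-value: for the i-th ranked observed p-value
# P_i, the expected number of null findings at or below P_i, averaged
# per null construction, is divided by i, the number of observed
# findings at least that significant.

def empirical_q(observed_ps, null_constructions):
    obs = sorted(observed_ps)
    qs = []
    for i, p in enumerate(obs, start=1):
        false = sum(sum(1 for null_p in null if null_p <= p)
                    for null in null_constructions) / len(null_constructions)
        qs.append(min(false / i, 1.0))   # cap the proportion at 1
    return qs

observed = [0.001, 0.010, 0.040]
nulls = [[0.005, 0.200, 0.600],   # p-values from null construction 1
         [0.030, 0.090, 0.500]]   # p-values from null construction 2
print(empirical_q(observed, nulls))   # [0.0, 0.25, 0.333...]
```

In practice the null constructions would be the 1,000 matched constructions described above, rather than two invented lists.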

  • Gene-gene testing

HapConstructor has been extended with the addition of interaction odds ratios (IOR) and correlation statistics, along with modifications to the framework to support analyses of two genes. The IOR and correlation statistics are straightforward to calculate and are well suited for our method. Each statistic naturally allows for pairwise examination of multi-marker sets, such as haplotype-haplotype interactions, when specific haplotypes are defined at each gene. Existing methods to perform such an analysis are very limited; only two approaches have attempted haplotype interaction testing. Zhang et al. (2003) described interaction of unlinked haplotype blocks associated with disease risk using entropy as a metric for interaction between the two regions. Becker et al. (2005) developed a general method to test a global hypothesis of disease association with any possible combination of interactions between haplotypes of unlinked regions. A limiting factor in these approaches is that the specific haplotypes to test must be known, and usually it is unclear which haplotypes to consider in the interaction test. Our approach directly handles this with the stepwise process to search the interaction space, and provides the capability to handle datasets that include pedigree structure, which neither previous method allows for.

As with the single-gene hapConstructor described above, the implementation of this bioinformatics tool made use of core components of Genie, in addition to the modifications to these components discussed previously for hapConstructor. Further extensions were also required to track both genes, for example, when SNPs are added or removed in each gene in the forward and backward steps of the stepwise process. Consideration also had to be given to which models and statistical tests to perform, based on the gene in which the selected SNPs reside. For example, if the first three SNPs selected for inclusion reside in the same gene, then only multi-locus single-gene tests are performed; if the SNP set includes SNPs in both genes, then interaction effects are also tested.

  • Single gene illustration

To illustrate the procedure, consider an example with five SNPs, S1,…,S5. To start, each marker is tested for association. Each SNP’s p-value is compared to the critical significance threshold, T1. If S1 and S4 each have p-values <T1, then all paired SNP sets that include S1 or S4 are created and considered at step-2: SS2={{S1,S2},{S1,S3},{S1,S4},{S1,S5},{S2,S4},{S3,S4},{S4,S5}}. Each SNP set is tested based on the models and statistics defined by the user. For example, if a diplotype additive model is specified, then four analyses will be tested at step-2 based on the four possible haplotypes H1-H4 as risk haplotypes. Each analysis will consist of a 2x3 contingency table where the columns will be defined by diplotypes with 0 copies of Hi, 1 copy of Hi, and 2 copies of Hi (1 ≤ i ≤ 4).

Consider that a diplotype additive model considering haplotype H2 for SNP set {S1,S3} had a p-value <T2. Also consider that a composite genotype combination dominant-recessive across SNP set {S2,S4} achieved a p-value <T2. Then, the three-locus sets involving these two SNP sets are constructed: SS3={{S1,S2,S4},{S2,S3,S4},{S2,S4,S5},{S1,S2,S3},{S1,S3,S4},{S1,S3,S5}}.

The three-locus sets containing S1 and S3 will only be considered using haplotype analyses, because a haplotype-based test was used to create these SNP sets. Furthermore, the 3-locus specific haplotypes considered must be extensions of the alleles specific to haplotype H2. For example, if H2 was haplotype 1-2, then for SNP set {S1,S3,S4} only haplotypes 1-2-1 and 1-2-2 would be considered specific haplotypes to test. A similar rule is applied for composite genotypes. This extension of the specific haplotype or composite genotype model from the prior step significantly reduces the number of possible analyses to create and test at the next step, which can be an important consideration for data sets with increasing numbers of SNPs.
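The extension rule can be sketched as follows: a significant specific haplotype at step m seeds only its one-allele extensions at step m+1, rather than all haplotypes of the enlarged SNP set (the string encoding of haplotypes is illustrative):

```python
# Sketch of the haplotype extension rule: only one-allele extensions
# of the significant haplotype from the prior step are considered,
# pruning the 2^(m+1) haplotypes of the enlarged set down to 2.

def extend_haplotype(hap, alleles=("1", "2")):
    """All haplotypes that extend `hap` by one SNP."""
    return [hap + "-" + a for a in alleles]

print(extend_haplotype("1-2"))   # the H2 = 1-2 example from the text
```

For H2 = 1-2 this yields exactly the two candidates 1-2-1 and 1-2-2 named above.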

Similar rules are applied at step-n to determine which SNP sets, models and statistics to consider and test at step (n+1). An optional backward step can be performed at steps n ≥ 3. Using SNP sets in SSn, the backward step considers (n-1)-locus subsets of sets in SSn that had not been analyzed at step (n-1). From the previous example, if SNP set {S2,S3,S4} had passed the third critical threshold, then the set {S2,S3} would be considered in the backward step, since it had not been previously analyzed. Once the analyses for the backward step have been completed, they are assessed against the appropriate critical threshold value for step (n-1); those that pass are added to SSn and advanced to the next step.
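The backward step's candidate generation can be sketched as below, mirroring the {S2,S3,S4} example (the set encodings are illustrative):

```python
from itertools import combinations

# Sketch of the optional backward step: for each passing n-SNP set,
# its (n-1)-locus subsets that were never analyzed at the earlier
# step become new candidates.

def backward_candidates(passing_set, already_analyzed):
    n = len(passing_set)
    return [frozenset(s)
            for s in combinations(sorted(passing_set), n - 1)
            if frozenset(s) not in already_analyzed]

# pairs analyzed at step 2 in the example: {S2,S4} and {S3,S4} were in SS2
analyzed = {frozenset({"S2", "S4"}), frozenset({"S3", "S4"})}
print(backward_candidates(frozenset({"S2", "S3", "S4"}), analyzed))
```

As in the text, only {S2,S3} emerges as a backward-step candidate, since the other two pairs were already tested on the forward pass.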