Abstract

Motivation: Methods for detecting positive selection rely on finding evidence of long haplotypes to identify candidate regions under selection. However, these methods generally do not identify the length and form of the selected haplotype.

Results: We present HapFinder, a method that can find the longest common haplotype under three different settings from a database. This is relevant in the analysis of positive selection in population genetics, and also in medical genetics for finding the likely haplotype form carrying the causal allele at the functional polymorphism.

Availability: A Java program implementing the methods described in HapFinder, together with R scripts and datasets for producing the figures presented in this article, is publicly available at http://www.nus-cme.org.sg/sgvp/software/hapfinder.html. The site also hosts an online browser for finding haplotypes from the International HapMap Project and the Singapore Genome Variation Project.

Contact: g0801900@nus.edu.sg; statyy@nus.edu.sg

1 INTRODUCTION

Haplotypes refer to the specific combinations of alleles at different locations on a chromosome. Diploid organisms such as humans carry two copies of each chromosome, and thus two haplotypes are present for each individual when considering the chromosomal arrangement of the alleles at several variant sites, such as single nucleotide polymorphisms (SNPs). At the fundamental level, the haplotype carries most of the genetic information, particularly for the assessment of allelic correlation in the genome and in situations where the exact sequence arrangement on the chromosome is important. However, the technical ease of assaying a single base position in the genome means it is more common to obtain the aggregate information across the two chromosomes of an individual, also known as the genotype. The advent of affordable large-scale genotyping technologies has permitted the genotypes of up to a million positions across the human genome to be assayed simultaneously, facilitating studies on the genetic etiology of common diseases and complex traits across thousands of samples (Donnelly, 2008; McCarthy et al., 2008). When only genotype information is available, resolving the exact arrangements of the alleles on the two haplotypes for an individual requires sophisticated statistical machinery in a process known as haplotype phasing. A number of statistical procedures have been formulated for inferring the haplotype phases from genotype data, such as PHASE (Stephens and Scheet, 2005), fastPHASE (Scheet and Stephens, 2006) and Beagle (Browning and Browning, 2007). Public databases like the International HapMap Project (Consortium, 2007, 2010), Human Genome Diversity Project (Jakobsson et al., 2008) and Singapore Genome Variation Project (Teo et al., 2009) have generated reference genotypes for a considerable number of populations globally, with statistically inferred haplotypes also available for a number of these populations.

Generally, the lengths of common shared ancestral segments of chromosomes in a population are short, since recombination acts over time to break down long haplotypes. An exception is in genomic regions experiencing positive evolutionary pressure of natural selection (Sabeti et al., 2002), where greater fitness in survival and procreation results in a higher propensity that offspring in subsequent generations will increasingly carry the advantageous mutations. This can increase the frequency of an advantageous allele and, due to the hitch-hiking effects of neighbouring alleles and insufficient time for recombination to occur, result in haplotypes that are uncharacteristically long for a given haplotype frequency. A number of sophisticated statistical methods, for example, XP-EHH (Sabeti et al., 2007) and iHS (Voight et al., 2006), have relied on finding such genomic signatures of long haplotypes for identifying candidate regions that are experiencing positive selection. However, these methods generally do not identify the length and form of the selected haplotype.

The presence of linkage disequilibrium (LD) in the human genome implies that there will be numerous SNPs that are correlated with each other. In large-scale genotype–phenotype association studies, the discovery of a trait-associated region is often accompanied by a list of neighbouring SNPs displaying similar degrees of statistical evidence. Due to the nature of LD, most, if not all, of the identified alleles carried on these SNPs are likely to be found on a haplotype that also carries the functional allele at the unknown causal variant. Identifying this haplotype would be useful for narrowing the probable candidate region where the functional variant may be located, particularly when evidence from multiple genetically diverse populations is available. We can then localize the candidate region to genomic segments where these population-specific implicated haplotypes are consistent across the diverse populations.

Here, we introduce a novel method for finding haplotypes in three scenarios from a haplotype database: (i) identifying the longest haplotype for a user-defined core haplotype frequency that is carrying a specific allele at a focal SNP; (ii) identifying the longest haplotype around a focal position with a given core haplotype frequency; (iii) identifying haplotypes matching a specific combination of alleles from a set of user-defined SNPs. Type 1 is especially useful when the functional allele at a causal or positively selected SNP is known a priori. Type 2 is relevant when only the approximate genomic region of the functional variant is known, such as the situation where there is preliminary evidence of positive natural selection in the region from iHS or XP-EHH without explicit knowledge of the exact focal SNP and causal allele. Type 3 can be used to find the haplotype form that carries most, if not all, of the implicated alleles associated with disease onset or increased severity at SNPs that are identified from genome-wide association studies (GWAS).

In Section 2, we describe in detail how the method works in the three settings. Section 3 demonstrates the utility of HapFinder on known positively selected genomic regions in various population groups from the HapMap, and also via simulated case–control studies. Finally, in Section 4, we discuss how the method will be useful in both population and medical genetics studies.

2 METHODS

All three applications of HapFinder require phased haplotype information in the format of an N × L matrix which we denote as H, where each row represents a phased haplotype chromosome of an individual and each column represents a unique biallelic SNP. Thus, N = 2n with n denoting the number of individuals in the dataset, since humans are diploid and each individual possesses two chromosomes. Let h_il denote the (i, l) entry of the matrix H, where h_il ∈ {0, 1}, representing the two possible alleles for each SNP. Note that we assume there is no missing allele information for any haplotype after the phasing procedure. As HapFinder searches for haplotypes carrying specific alleles, it is important to define accurately what the '0' and '1' alleles for each h_il map to. In all our example applications, we have assumed the alleles are mapped to the positive strand while following the definitions of the '0' and '1' alleles according to Phase 2 of the HapMap as encoded in the legend files. A schematic overview of the three applications in HapFinder can be seen in Figure 1.

Fig. 1.

Schematic overview of the algorithm behind the three applications of HapFinder.


2.1 Algorithm for Type 1

In searching for the longest haplotype that is at a user-specified core haplotype frequency f in the haplotype database and is specifically carrying a particular target allele at the focal SNP, we first determine the critical number of chromosomes c = floor[f × N]. The algorithm first assesses whether the allele frequency of the target allele is at least f, and returns an error message when the target allele frequency is less than f. When the number of chromosomes carrying the target allele is at least c, the SNP on the immediate left of the focal SNP is appended to the haplotype form. This means there are at most two possible haplotypes for these two SNPs that carry the target allele at the focal SNP, if we assume the neighbouring SNP is either monomorphic or biallelic. When the number of chromosomes carrying the more common haplotype is at least c, the next SNP on the left is appended. The algorithm iterates between adding another SNP on the left and checking whether the number of chromosomes carrying the most common haplotype is at least c. When the number of chromosomes carrying the most common haplotype falls below c, the most recently appended SNP is removed from the haplotype, and the SNP to the immediate right of the focal SNP is appended. The algorithm now proceeds to append SNPs on the right until the number of chromosomes carrying the most common haplotype is less than c, at which point the most recently appended SNP on the right is removed. The longest haplotype spanned is then returned as the output.
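The left-then-right extension described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the HapFinder implementation: the function names, the list-of-lists representation of H and the returned tuple are assumptions made for the example, and exact matching (no fuzzy matching) is used.

```python
from collections import Counter
from math import floor


def most_common_form(H, start, end, focal, target):
    """Most frequent haplotype over columns start..end (inclusive) among
    chromosomes carrying `target` at column `focal`, with its count."""
    forms = Counter(tuple(row[start:end + 1]) for row in H if row[focal] == target)
    return forms.most_common(1)[0] if forms else (None, 0)


def longest_haplotype_type1(H, focal, target, f):
    """Hypothetical helper: extend left, then right, from the focal SNP while
    the most common haplotype is carried by at least c = floor(f * N) rows."""
    n_chrom, n_snps = len(H), len(H[0])
    c = floor(f * n_chrom)  # critical number of chromosomes

    carriers = sum(1 for row in H if row[focal] == target)
    if carriers / n_chrom < f:
        raise ValueError("target allele frequency is below f")

    left = right = focal
    # Append SNPs on the left while the most common form stays at or above c.
    while left > 0 and most_common_form(H, left - 1, right, focal, target)[1] >= c:
        left -= 1
    # Then append SNPs on the right under the same condition.
    while right < n_snps - 1 and most_common_form(H, left, right + 1, focal, target)[1] >= c:
        right += 1

    form, count = most_common_form(H, left, right, focal, target)
    return left, right, form, count
```

Testing the candidate window before moving the boundary is equivalent to appending a SNP and removing it again when the count falls below c, as in the description.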

2.2 Algorithm for Type 2

The algorithm for Type 2 is similar to that for Type 1, except that a focal position is specified instead of a focal SNP and there is no pre-determined target allele. In this case, the SNP on the chromosome that is closest to the specified focal position is chosen as the focal SNP, while either of the two alleles is allowed to be the target allele. The algorithm proceeds according to the procedure for Type 1, effectively running two separate operations, one for each of the two target alleles, and identifying the longest haplotype form out of the two analyses. In this case, an error message is produced when the user-specified core haplotype frequency f is larger than the major allele frequency.
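A minimal sketch of the Type 2 wrapper is given below. It assumes a caller-supplied Type 1 routine, here the hypothetical callable `type1_search(H, focal, target, f)` returning `(left, right, form)`, and it assumes that haplotype length is measured in physical coordinates taken from a `positions` list; neither detail is specified in the text.

```python
def closest_snp(positions, focal_pos):
    """Index of the SNP whose coordinate is nearest to focal_pos
    (ties resolve to the leftmost SNP)."""
    return min(range(len(positions)), key=lambda j: abs(positions[j] - focal_pos))


def longest_haplotype_type2(H, positions, focal_pos, f, type1_search):
    """Run a Type 1 search for each allele at the SNP closest to focal_pos
    and keep the longer result. `type1_search` is a hypothetical callable."""
    focal = closest_snp(positions, focal_pos)
    freq1 = sum(row[focal] for row in H) / len(H)  # frequency of the '1' allele
    allele_freq = {0: 1 - freq1, 1: freq1}

    if f > max(allele_freq.values()):
        raise ValueError("f exceeds the major allele frequency at the focal SNP")

    best = None
    for target in (0, 1):
        if allele_freq[target] < f:
            continue  # this allele cannot reach the core haplotype frequency
        left, right, form = type1_search(H, focal, target, f)
        length = positions[right] - positions[left]  # assumed length measure
        if best is None or length > best[0]:
            best = (length, target, left, right, form)
    return best
```

Passing the Type 1 routine as an argument keeps the sketch independent of any particular implementation of the extension step.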

2.3 Fuzzy matching in Type 1 and Type 2

Large-scale genotyping inevitably introduces errors in the called genotypes that propagate downstream to the haplotype phasing, thus generating phased haplotypes that are more likely to be dissimilar at genomic sites affected by genotyping errors. To allow for such spurious errors in the phased haplotypes, we permit a small proportion of mismatches when counting the number of chromosomes carrying the most common haplotype. As before, the most common haplotype form is first identified, and the similarity score between this haplotype form and each of the N chromosomes is calculated. The similarity score between two haplotypes is calculated as the proportion of SNPs where the alleles are identical across the two haplotypes. When the similarity score between the most common haplotype form and a sample chromosome is greater than the user-specified threshold s*, the sample chromosome is considered to be sufficiently similar to the most common haplotype. Thus, instead of counting the number of chromosomes carrying the most common haplotype, fuzzy matching with s* < 1 counts the number of chromosomes that are similar to the most common haplotype.
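The fuzzy count can be illustrated with the short sketch below. The function names and the representation of each chromosome as a sequence of alleles over the current window are assumptions for the example; the strict inequality follows the wording above.

```python
from collections import Counter


def similarity(a, b):
    """Proportion of SNPs at which two equal-length haplotypes carry the same allele."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_similar(windows, s_star):
    """Identify the most common haplotype form among `windows` (one allele
    sequence per chromosome) and count the chromosomes whose similarity to
    that form is strictly greater than s_star."""
    reference, _ = Counter(map(tuple, windows)).most_common(1)[0]
    n_similar = sum(1 for w in windows if similarity(reference, w) > s_star)
    return reference, n_similar
```

In an extension loop of the kind used for Types 1 and 2, this count would take the place of the exact count when it is compared against c.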

2.4 Algorithm for Type 3

Type 3 of HapFinder allows multiple focal SNPs with corresponding target alleles to be specified, and the algorithm aims to identify the haplotype forms that carry most, if not all, of the target alleles. The importance of matching the allele at each of the K target SNPs is defined by either the SNP weightings or the genetic distances between the SNPs, or as a composite function of both. When attempting to identify the haplotype that is carrying the high-risk alleles at implicated SNPs, the weightings (q_1, q_2, …, q_K) can be the statistical evidence of phenotypic association (e.g. −log10 P-values or log Bayes factors). This effectively prioritizes the matching of the target alleles on SNPs with strong evidence of phenotypic association, and the SNP possessing the strongest weighting is defined as the central SNP. When SNP weightings are not provided, the genetic distances of the SNPs (d_1, d_2, …, d_K) must then be provided, and the focal position d_center is defined as the average of the genetic distances of the first and last focal SNPs. The contribution W_k of each SNP is thus defined as

with the summation notations in the expressions performed over the set of K SNPs. This allows a match score (between zero and one inclusive) to be calculated for each chromosome in the database, with the match score for the i-th chromosome defined as the sum of W_k over the set of focal SNPs where the chromosome carries the exact target alleles. HapFinder locates the chromosomes with scores above a user-specified threshold: when the threshold is 1, only the haplotype forms which carry all the target alleles at the focal SNPs are identified; when the threshold is less than 1, mismatches with the target alleles for some focal SNPs may be permissible as long as the match score for the chromosome is greater than the threshold.
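The match score can be sketched as follows. The expression for W_k is not reproduced here, so the sketch assumes the simplest weighting consistent with the description, W_k = q_k / Σ_j q_j, which makes the scores lie between zero and one; this normalisation, the function names and the small tolerance `eps` are assumptions for illustration only. The comparison is written as "at least the threshold" so that a threshold of 1 selects chromosomes carrying every target allele.

```python
def match_scores(H, focal_cols, targets, q):
    """Match score per chromosome: sum of W_k over focal SNPs where the
    chromosome carries the target allele. Assumes W_k = q_k / sum(q)."""
    total = sum(q)
    W = [qk / total for qk in q]  # assumed normalisation of the weightings
    return [
        sum(w for col, t, w in zip(focal_cols, targets, W) if row[col] == t)
        for row in H
    ]


def select_chromosomes(H, focal_cols, targets, q, threshold, eps=1e-12):
    """Indices of chromosomes whose match score reaches the threshold
    (eps guards against floating-point rounding when the threshold is 1)."""
    scores = match_scores(H, focal_cols, targets, q)
    return [i for i, s in enumerate(scores) if s >= threshold - eps]
```

With weightings such as −log10 P-values, a mismatch at a weakly associated SNP lowers the score only slightly, whereas a mismatch at the most strongly associated SNP is penalised most heavily.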

2.5 Software availability and output

A Java program for HapFinder is freely available for download from http://www.nus-cme.org.sg/sgvp/software/hapfinder.html, along with scripts for producing graphical displays in R. An online web application is also available for finding haplotypes in the populations in Phase 2 of the International HapMap Project (Consortium, 2007) and in the Singapore Genome Variation Project (Teo et al., 2009). Each analysis generates four files: (i) a .haps file that contains the identified haplotypes in 0/1 format; (ii) a .legend file containing the rs identifiers, coordinates and the 0/1 allele maps for the SNPs on the identified haplotypes; (iii) a .sample file that indicates which samples, and which haplotype of each sample (suffixed by −1 and −2 to the sample ids), the identified haplotypes correspond to; (iv) a .log file containing all the haplotype forms that are identified for that analysis (which is not restricted to only the longest haplotypes for Types 1 and 2). An additional fifth file is also output for Type 3: (v) a .snp file containing the rs identifier, coordinate, the allele that is tagging the identified haplotype and the LD measured in r² between each SNP and the identified haplotype.

3 APPLICATIONS AND RESULTS

To illustrate the utility of HapFinder, we applied the method for the three settings described on publicly available haplotype data from the International HapMap Project (Consortium, 2007). All genomic coordinates quoted are on NCBI build 36.

The sickle cell allele (adenine allele) at rs334 on the p15.5 arm of chromosome 11 has been well established to be under balancing selection in the Yoruba population of HapMap (YRI) (Feng et al., 2004; Hedrick, 2004), conferring up to 10-fold protection against malaria (Ackerman et al., 2005; Hill et al., 1991; Jallow et al., 2009) while providing a recessive Mendelian risk of sickle cell anaemia. Due to positive selection, the sickle cell allele is expected to reside on an uncharacteristically long haplotype compared with other alleles in the genome at the same frequency. We test this hypothesis by running HapFinder Type 1 on phased haplotypes of HapMap Phase 2 YRI data, which included a direct assay of rs334 (Fig. 2). By specifying rs334 as the focal SNP and performing two Type 1 analyses on the wild-type allele (thymine, allele T) and the sickle allele (allele A) at a core haplotype frequency of 10%, HapFinder identified a haplotype form carrying the sickle allele that spans almost 0.4 Mb, while the haplotype form carrying the wild-type T allele spans <40 kb (Fig. 2A). By comparison, there were only three SNPs out of 1000 randomly selected SNPs [each with minor allele frequency (MAF) of 12.5% in YRI] that displayed a similar extent of difference between the lengths of the shorter and longer haplotypes (Fig. 2B).

Fig. 2.

(A) An example of the Type 1 application of HapFinder for finding the longest haplotypes carrying the two alleles at the sickle cell locus (rs334) in YRI, which has a MAF of 12.5%. The alleles on each haplotype have been coloured accordingly for each of the four possible bases (A, green; C, blue; G, yellow; T, red). The vertical black lines on the horizontal grey bar represent where each of the SNPs is located, with the vertical red line indicating the focal SNP. (B) Comparing the lengths of the two haplotypes identified by Type 1 HapFinder for each of the 1000 randomly chosen SNPs with MAFs of 12.5% in YRI, where the red triangle represents rs334. The three random SNPs that also display the largest extent of differences between lengths of haplotypes are shaded in orange. The grey dashed diagonal lines represent the boundaries for the various sizes of the ratio of the haplotype lengths. All examples were run with parameters core haplotype frequency, f = 0.10, and similarity score, s* = 0.98.


The sickle cell locus provides a convenient example where the functional polymorphism that is experiencing evolutionary pressure of positive selection is actually known. In practice, imperfect SNP coverage in genetic databases like the HapMap means recent discoveries on candidate regions undergoing positive selection generally do not identify the selected functional polymorphisms. Instead, discoveries using selection metrics like iHS and XP-EHH tend to highlight candidate regions that are displaying putative evidence of selection. In such situations where only the broad region is known, without specific knowledge of the focal SNP or the selected allele, we can rely on Type 2 of HapFinder to identify the likely haplotype on which the unknown functional allele sits. We use the sickle cell example to show that similar haplotype forms are identified whether the analysis is performed at the functional polymorphism with the known selected allele (at position 5 204 808 on chromosome 11), or when the analysis is performed at neighbouring focal positions of 5 200 000 and 5 210 000 (Fig. 3). This is directly relevant for extracting the selected haplotype underpinning a candidate signature of positive selection identified by iHS or XP-EHH.

Fig. 3.

Visual representations of the haplotypes identified by the Type 2 application in HapFinder, searching at three separate locations around the sickle cell locus (rs334, represented as the vertical red line in the horizontal bars indicating the SNP location) at a core haplotype frequency of 5%. The alleles on each haplotype have been coloured accordingly for each of the four possible bases (A, green; C, blue; G, yellow; T, red). The outcome from the analysis with the focal position of 5 204 808 corresponds to searching for the longest haplotype with rs334 as the focal SNP. The bottom panel shows that the three identified haplotypes are almost identical regardless of the focal position used in the HapFinder analysis, while the top panel zooms into a 40 kb window around rs334 and illustrates that all three haplotypes carry the A allele at rs334 that is experiencing positive selection.


By performing multiple iterations of the Type 2 analysis at the same genomic site but across different core haplotype frequencies, it may be possible to identify a range within which the frequency of the selected allele lies. As the specified core haplotype frequency changes, the identified haplotypes can switch from carrying the non-selected allele to carrying the selected allele, resulting in a significant increase in the length of the identified haplotype (Fig. 4). For example, we performed three separate sets of Type 2 analyses in the HBB gene region in YRI, with focal positions specified as the position of rs334 (5 204 808 bp) and two neighbouring positions (5 200 000 bp and 5 210 000 bp), respectively. We iterate through core haplotype frequencies from 20% down to 5% in steps of 1%, observing significant increases in the haplotype lengths at core haplotype frequencies of 12, 11 and 9% (Fig. 4A, top panel). By measuring the proportion of discordant sites at the unambiguous overlapping regions between the haplotypes identified at two consecutive frequencies, we observed that the discordance was particularly large at the frequency where there was a substantial increase in the spanned distance (Fig. 4A, bottom panel). A large value for discordance typically indicates that different haplotypes have been identified between the two consecutive analyses, and this likely reflects the switch to the selected haplotype that is carrying the advantageous allele. We also performed similar analyses at the LARGE gene (Fig. 4B) that has been previously reported to be positively selected for protection against Lassa fever in North Central Africa (Sabeti et al., 2007). The putatively selected allele at the SNP (chr22: rs481698) with strong evidence of positive selection (via iHS) has a frequency of 33.3% in YRI, and our analyses observed a large haplotype discordance and significant increases in the length of the identified haplotypes when the frequency decreases from 34 to 33%.
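The discordance between the haplotypes found at two consecutive core haplotype frequencies can be computed along the following lines. The sketch assumes each identified haplotype is stored as a mapping from SNP coordinate to allele, with `None` marking an ambiguous site; this representation and the function name are assumptions for illustration.

```python
def discordance(hap_a, hap_b):
    """Proportion of discordant sites over the unambiguous SNPs shared by two
    haplotypes, each given as {coordinate: allele}; None if there is no overlap."""
    shared = [
        pos for pos in hap_a.keys() & hap_b.keys()
        if hap_a[pos] is not None and hap_b[pos] is not None
    ]
    if not shared:
        return None
    return sum(hap_a[pos] != hap_b[pos] for pos in shared) / len(shared)
```

Scanning the core haplotype frequency downwards and recording this value together with the spanned distance would give the two quantities plotted for each step.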

Fig. 4.

Two example applications of Type 2 in HapFinder for identifying the longest haplotypes by specifying a focal position, rather than a focal SNP. For each of the two genomic regions, we selected the physical position of the known/putative SNP undergoing positive natural selection (lines and circles in red in top panels) and two other physical positions in the neighbourhood (in grey) as focal positions, and recorded the length of the longest haplotype across a range of core haplotype frequencies. The bottom panel of each region shows the change in haplotype distances (red lines and circles) and the degree of haplotype discordance (grey lines and circles) with a 1% reduction in allele frequency for the known/putative SNP. (A) Three positions in the locality of the HBB gene in YRI, where the frequency of the selected allele (at rs334) is known to be 12.5%; (B) three positions in the locality of the LARGE gene in YRI, where the frequency of the non-synonymous substitution (rs4481698) is known to be 33.3%.


To illustrate the application of Type 3 in HapFinder, we simulated 2000 cases and 2000 controls in each of the three HapMap panels from Phase 2 using HAPGEN (Spencer et al., 2009) at an artificial causal SNP (rs2206734) located in the CDKAL1 gene on chromosome 6, previously established to display significant variation in patterns of LD (Teo et al., 2009). In each of the three simulations, we introduced a multiplicative effect size equivalent to an allelic relative risk of 1.5 at the T allele and limited the association analysis to just SNPs present on the Illumina 1M array while masking the causal SNP (Fig. 5). Genetic markers displaying association P < 10^−6 are extracted as input SNPs to HapFinder, along with the corresponding alleles that are associated with higher risks and the respective observed P-values as SNP weightings (refer to Fig. 1). We observed that different implicated SNPs emerged from each of the three simulations. In each population panel, we ran Type 3 of HapFinder to identify the haplotype forms that are carrying most of the high-risk alleles at the associated SNPs, subject to matching scores of at least 98%. While the association analyses discovered different implicated SNPs in the three population panels, the haplotype forms that are identified by HapFinder across all three populations correctly carried the high-risk allele T at the simulated causal SNP (Fig. 5). We also performed the same simulation experiment at 1000 randomly chosen SNPs that are present in all three HapMap population panels, in order to assess how often the haplotype identified by HapFinder carries the risk allele at the simulated causal variant. Out of these 1000 simulations, we observed that 84.4% of the haplotype forms identified by HapFinder across all three populations carried the causal allele that is associated with a higher disease risk.

Fig. 5.

Example application of Type 3 in HapFinder in identifying the haplotype form that carries the implicated risk alleles from the associated SNPs. The vertical dashed line in each of the three panels on the left represents the position of the 'causal' SNP (Chr6:rs2206734), where 2000 cases and 2000 controls are simulated with HAPGEN in the three HapMap populations with a multiplicative effect of RR = 1.5 at allele T. Only SNPs on the Illumina 1M array are shown in the region plots, and SNPs with P < 10^-6 in each population are extracted as input SNPs for HapFinder. The haplotype that spans a pre-specified start and end position and also carries the implicated alleles at the input SNPs is shown on the right for each population. The alleles on each haplotype have been coloured according to the four possible bases (A, green; C, blue; G, yellow; T, red). As there may be multiple haplotypes that carry the implicated alleles, SNPs with non-unique alleles on these haplotypes are represented as vertical grey lines. While different SNPs are identified in the association analyses for the three populations, all three identified haplotypes correctly carry the high-risk allele (allele T) at the simulated SNP.

4 DISCUSSION

We have introduced a strategy for finding haplotypes under three scenarios that are relevant to the analysis of positive natural selection and GWAS. The classical example of the sickle cell locus was used to highlight the utility of the method, both in finding the selected haplotype and in estimating the likely frequency of the selected allele. By simulating case–control data around an artificial causal SNP for each of the three HapMap panels, we showed that the method is able to identify the haplotype forms that carry the detrimental alleles at the associated SNPs that emerged from the association analysis in each population separately. All three haplotype forms identified from the corresponding HapMap populations correctly carried the functional allele at the simulated causal variant. Further simulations indicate at least 80% power (84.4% across 1000 simulated causal SNPs) in identifying the haplotype form on which the functional allele resides.

Our method is able to identify the longest haplotype at a specified core frequency around a focal position. Theoretically, this can be applied across the genome at every available SNP and across a range of core haplotype frequencies. For a particular core frequency, one may expect the lengths of the identified haplotypes to be distributed within a specific range, and haplotypes that are uncharacteristically long for a particular core frequency could be the result of positive natural selection acting within those genomic regions. This may provide another strategy for searching for candidate signals of positive selection, although there is a need to account for local effects of LD that may potentially bias the analysis (work in progress).
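As an illustration of the search described in this paragraph, the sketch below finds the longest window of consecutive SNPs containing a focal SNP in which a single haplotype form reaches a given core frequency. It is an exhaustive search written for clarity and is not the algorithm implemented in HapFinder; the function name `longest_haplotype`, the representation of phased haplotypes as equal-length strings and the use of physical span as the length measure are assumptions made for this example.

```python
from collections import Counter


def longest_haplotype(haplotypes, positions, focal_index, core_frequency):
    """Longest window around a focal SNP whose commonest form reaches core_frequency.

    haplotypes     : list of equal-length strings, one allele per SNP (phased data)
    positions      : sorted physical positions (bp) of the SNPs
    focal_index    : index of the focal SNP within `positions`
    core_frequency : minimum fraction of haplotypes that must share the same form

    Returns (span_bp, left, right, form, frequency), or None if no window qualifies.
    """
    n_hap = len(haplotypes)
    best = None
    # Enumerate every window [left, right] that contains the focal SNP.
    for left in range(focal_index + 1):
        for right in range(focal_index, len(positions)):
            counts = Counter(h[left:right + 1] for h in haplotypes)
            form, count = counts.most_common(1)[0]
            frequency = count / n_hap
            if frequency < core_frequency:
                continue
            span = positions[right] - positions[left]
            # Keep the first window found with the largest physical span.
            if best is None or span > best[0]:
                best = (span, left, right, form, frequency)
    return best
```

Running this at every SNP for a fixed core frequency would give the empirical distribution of haplotype lengths mentioned above, against which unusually long haplotypes could be flagged. The cost grows quadratically with the number of SNPs in the region, so a genome-wide scan would need a more efficient search.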

The field of medical genetics is now focusing its attention on the fine-mapping of the functional polymorphisms that explain the biological mechanisms and underpin the genotype–phenotype association signals emerging from GWAS. However, while LD has benefitted the initial search for implicated regions in the genome, long stretches of high LD have paradoxically confounded the fine-mapping process by yielding numerous near-perfect surrogates of the unknown causal variant. This complicates the process of distinguishing the causal variant from neighbouring correlated markers. When GWAS data and high-resolution haplotypes (e.g. those from the 1000 Genomes Project, http://www.1000genomes.org) for multiple populations are available, our method can identify the haplotype forms that carry most of the implicated detrimental alleles in the different populations. By framing such trans-population analysis within a rigorous statistical framework, it may be possible to identify the genomic regions that are consistent with both the association findings and the population-specific reference haplotype structure. This could subsequently be developed to localize the candidate positions of the functional polymorphism (Teo et al., 2010) (work in progress).

The popularity of genome-wide studies, coupled with fast haplotype phasing software like fastPHASE (Scheet and Stephens, 2006) and Beagle (Browning and Browning, 2007), means it is realistically possible to statistically construct the haplotypes from the genotype data of thousands of samples. This is likely to be extremely useful in both population and medical genetics, as second-order information involving the arrangement of alleles on a chromosome can be more informative than first-order information like the allele frequencies from genotypes of individual SNPs, especially in understanding genomic diversity across multiple populations. In fact, several major population genetics studies have relied on haplotype diversity plots as empirical evidence of the extent of inter-population dissimilarities (Conrad et al., 2006; Jakobsson et al., 2008). We thus introduce HapFinder, a novel methodological development specifically designed to find haplotypes within a population setting, which complements statistical tools for detecting positive natural selection and facilitates progress in medical genetics by locating the likely haplotype structure on which the functional allele sits. This method has been implemented in a Java program which is packaged together with scripts for producing graphical displays in R and is freely available from http://www.nus-cme.org.sg/sgvp/software/hapfinder.html. This URL also hosts an interactive application for submitting online queries to find haplotypes from populations in Phase 2 of the HapMap and the Singapore Genome Variation Project.

ACKNOWLEDGEMENTS

We thank two anonymous reviewers for their insightful comments and suggestions, which greatly improved the manuscript and the software.

Funding: NUS Graduate School for Integrative Science and Technology (to R.T.-H.O. and X.L.); the Yong Loo Lin School of Medicine from the National University of Singapore (to X.S. and K.-S.C.); the Singapore National Research Foundation (NRF-RF-2010–05 to Y.-Y.T. and W.-T.P.).

Conflict of Interest: none declared.

REFERENCES

Ackerman et al. A comparison of case-control and family-based association methods: the example of sickle-cell and malaria, Ann. Hum. Genet., 2005, vol. 69 (pg. 559-565)

Browning and Browning. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering, Am. J. Hum. Genet., 2007, vol. 81 (pg. 1084-1097)

Conrad et al. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome, Nat. Genet., 2006, vol. 38 (pg. 1251-1260)

Donnelly. Progress and challenges in genome-wide association studies in humans, Nature, 2008, vol. 456 (pg. 728-731)

Feng et al. Coupling ecology and evolution: malaria and the S-gene across time scales, Math. Biosci., 2004, vol. 189 (pg. 1-19)

Hedrick. Estimation of relative fitnesses from relative risk data and the predicted future of haemoglobin alleles S and C, J. Evol. Biol., 2004, vol. 17 (pg. 221-224)

Hill et al. Common West African HLA antigens are associated with protection from severe malaria, Nature, 1991, vol. 352 (pg. 595-600)

Jakobsson et al. Genotype, haplotype and copy-number variation in worldwide human populations, Nature, 2008, vol. 451 (pg. 998-1003)

Jallow et al. Genome-wide and fine-resolution association analysis of malaria in West Africa, Nat. Genet., 2009, vol. 41 (pg. 657-665)

McCarthy et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges, Nat. Rev. Genet., 2008, vol. 9 (pg. 356-369)

Sabeti et al. Detecting recent positive selection in the human genome from haplotype structure, Nature, 2002, vol. 419 (pg. 832-837)

Sabeti et al. Genome-wide detection and characterization of positive selection in human populations, Nature, 2007, vol. 449 (pg. 913-918)

Scheet and Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase, Am. J. Hum. Genet., 2006, vol. 78 (pg. 629-644)

Spencer et al. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip, PLoS Genet., 2009, vol. 5 pg. e1000477

Stephens and Scheet. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation, Am. J. Hum. Genet., 2005, vol. 76 (pg. 449-462)

Teo et al. Singapore Genome Variation Project: a haplotype map of three Southeast Asian populations, Genome Res., 2009, vol. 19 (pg. 2154-2162)

Teo et al. Identifying candidate causal variants via trans-population fine-mapping, Genet. Epidemiol., 2010, vol. 34 (pg. 653-664)

The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs, Nature, 2007, vol. 449 (pg. 851-861)

The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations, Nature, 2010, vol. 467 (pg. 52-58)

Voight et al. A map of recent positive selection in the human genome, PLoS Biol., 2006, vol. 4 pg. e72

Author notes

Associate Editor: Jeffrey Barrett