MIT CompBio Lecture 13 - Population Genomics

All right, welcome, welcome. Today's lecture is the first one of the next module, on population genetics and disease genomics. Today we're talking about population genetics, on Thursday we're going to talk about disease associations, next Tuesday about quantitative trait mapping, and next Thursday about missing heritability and polygenicity. The goal for this module is to start understanding the fundamentals of human variation, and how we can then exploit that to understand something about our ancestry as a species, about what matters in your genome, and about the regions that are associated with different types of disease.

For today, we're going to start with the basis of genetic variation in the human population: what are common polymorphisms, what are the different types of polymorphisms, how do we detect them, and how do we call them from signals and reads. Then we're going to talk about haplotypes, recombination, linkage disequilibrium, and phasing, going beyond individual variants to how these variants are inherited. We're going to talk about human relatedness and ancestry, something that's been in the news lately, and then about human demographic history: how did we get out of Africa and populate the world. And then, how do we use population genomics to start measuring human selection at multiple scales.

Let's start with identifying and measuring genetic variation. As you know, any two people in this room, just like any two people on this planet, are 99.9% identical. It's incredible just how closely related humans are to each other. But there are two ways of seeing this: we still have three billion nucleotides, so if we are 99.9% identical, that means there are still millions of differences between us. So what are these differences? The vast majority of these differences are benign. They don't do anything; they
are simply passengers, and they don't have any phenotypic consequences. The vast majority of our genome is actually not functional; it doesn't matter if you change it. The remaining differences are in fact the ones that explain a lot of our predisposition to diseases and our different phenotypic signatures. And then of course there's a lot of remaining phenotypic variation between us. Not all of it is genetic; a lot of it is environmental. I would look very different if I stayed in the sun for hours, or if I ate ten times more than I'm currently eating. So there are a lot of phenotypic differences that arise from the environment. If I slept next to a power plant or a nuclear plant every night, you would start seeing differences in all kinds of ways.

So anyway, what are the genetically driven variants that explain part of that phenotypic difference? The most common type of variant is the single nucleotide polymorphism. Why is this the most common? Because when a DNA polymerase goes along happily copying DNA from one cell to the next, in the germline that gave rise to modern humans, it makes a mistake every now and then, and it is very common to make a mistake that replaces one nucleotide with another, just because it misread one of those nucleotides. There is about one such single nucleotide polymorphism every thousand bases in the human genome. Is everybody comfortable with SNPs? You've heard a lot about them.

Another type of variation is indels. That's when the polymerase, instead of just making a mistake in one nucleotide, inserts a bunch of nucleotides. In this particular case you might insert a C first and then ATGG; that's an indel. Some individuals have five extra bases here, some individuals don't. Here I'm repeating the C to make it look like an
indel rather than nothing. There's about one such indel every 10,000 bases.

Another very common mistake that polymerase makes when it copies happens when there is a stretch of repeating k-mers, very often of length 2 or 3 or 4: it will end up making more copies than it should. This is known as slipping. As it's copying along, it lines itself up on the wrong triplet and inserts or deletes some of those trimers. These are known as short tandem repeats, and their frequency is roughly the same as that of indels.

And then there are structural variants and copy number variants, and these are dramatically more impactful. These are much larger deletions, duplications, or inversions, whose median length is about 5,000 nucleotides, and there's one of those every... Is everybody comfortable with these so far?

So let's turn a little bit to single nucleotide polymorphisms. Here's one such example, where this happens in the middle of a protein-coding region. The second nucleotide of the codon GAG, which codes for glutamic acid, changes into a T, changing the codon into GTG. That is not a synonymous change: it results in a valine being inserted instead of glutamic acid. Now, this might have no impact if that region of the protein is not important, or, as in the case of sickle cell anemia, it can have a dramatic impact: your red blood cells very often take on this sickle shape.

This mutation is actually remarkably common in the human population. Do you guys know where it is more common than in other places? Exactly, exactly where you find malaria. So West Africa and Central Africa have a lot of this, and in fact the range of sickle cell anemia exactly overlaps the range where malaria has been, indicating that this seemingly bad mutation was in fact extremely good in that
environment. Again, this example goes to show that risk versus non-risk is always phenotype dependent. If you're talking about malaria, the protective allele is GTG; if you're talking about sickle cell anemia, then the protective allele is actually GAG. The same holds at different points in time: something that might be very beneficial, for example storing a lot of calories from your daily intake, might become very detrimental, because what used to make you survive famines is now making you chubby and diabetic.

So these are single nucleotide polymorphisms. There's a huge focus on SNPs because they're very easy to measure, and they have been the workhorse of genome-wide association studies. Very frequently we will assume that every SNP has exactly two alleles. If you have a GAG and a GTG, chances are you don't also have a GCG or a GGG. The reason is that these polymorphisms arose very early in human ancestry, and because the genome is so big and there are so few of these mutations, it's very uncommon to find multi-allelic SNPs, or multi-allelic variants generally, at the same locus. Does that make sense? Most of the time we're going to assume that there's only one alternative, so there are only two versions.

They're very common, so they've been cataloged. Instead of saying "this polymorphism at position so-and-so of this chromosome", we're going to say that this particular variant has an rsID. There are databases that have cataloged all these variants, with more than a hundred million known variants already. They of course range greatly in frequency. We're going to talk about common variants, at more than 5% frequency; low-frequency variants, between 0.5% and 5% frequency; and rare variants, at less than 0.5% frequency. Some of those will only be found in one person, and those we will define as
either private or de novo variants. You can't quite calculate a frequency for them: you know it's very small, but you can't quite estimate it, because when you get to these numbers the denominator is not very stable.

And then it gets worse than that. We've now started to study somatic variants. This is not even one person: this is a variant that arose during the cell divisions that gave rise to one person. Right now in my brain there are mutations that happened as my cells were dividing to give rise to the full organism, and some of those mutations are in fact predisposing me to disease, or sometimes leading to different outcomes. Everybody with me on this?

So now, how do we distinguish the two alleles? How do we refer to the T versus the A? We talked about risk and non-risk, but that's again phenotype dependent, so you can't always just say "the risk allele". In fact there are many ways of distinguishing them.

One of them is to talk about the reference allele versus the alternate allele. Why? Because there is a reference human genome out there. Some dude from Buffalo became the reference human genome; he is no different from you and me, or from any of us. But if you happen to have the same allele as that random person, you have the reference allele; if you happen to have a different allele than that person, you have the alternate allele. Very often the reference allele will be the common allele, just because it's more likely that that person had the common allele, but sometimes the allele in the human genome reference will actually be the rarer one. So that's one way of distinguishing the two alleles, reference versus alternate, and it's completely unambiguous because there's only one reference human genome.

Another way to do it is to discuss the major and minor allele. That's not as biased, because with this reference genome, I mean, so what if you're the same as that reference at that position? That
doesn't say anything about your phenotype, its history, or any association. So another way to do it is to talk about the major and minor allele. If one allele is at, I don't know, 80% of the population and the other allele is at 20% of the population, we're going to call the first one the major allele and the second the minor allele. There's a problem with that, which is that in different populations the major allele will differ: the major allele in Africa might actually be the minor allele in Europe.

Yeah, yeah, we have stored a few, and that's why having rsIDs is really important. If you realize that you have to include a few more nucleotides, you don't have to change all of the reference SNPs; you simply update your database of rsIDs.

The other way to distinguish the two alleles is to talk about the ancestral versus the derived allele. If you go back to the most recent common ancestor of different humans, you look at chimp and ask: is one of the two versions matching the chimp? This won't always work. In some places the chimp simply has no ortholog. In other places both of them are actually different from chimp. In other cases we may not have access to that. It could also be heterozygous in chimp as well, although this is very unlikely, because most human genetic variation arose very, very recently and therefore these SNPs will not be shared.

As I mentioned on the previous slide, you can also distinguish the two alleles based on their disease association, but that is very disease dependent, very phenotype dependent, and very environment dependent. Okay, everybody with me so far?

Right, so here is one example: here's the SNP, here's the reference allele, here's the minor allele. The other thing to realize is that common alleles generally have small effects. If you look at the 2D scatterplot of allele frequency, whether something is common,
rare, or private, versus effect size, whether something will have a strong impact on a phenotype, a weak impact on the phenotype, or sometimes no impact, then what you're going to realize is that there's an inverse relationship. Most common variants that have typically been discovered in genome-wide association studies have generally very, very small effects. By contrast, variants that have strong effects, such as twofold higher risk, or threefold higher risk or more, are actually very, very rare. But why is that? Do you want to take a guess?

That's exactly right. Does anyone want to phrase it in evolutionary terms? Yeah? Sorry? Lower fitness, yeah. Selection, yeah, exactly, exactly. Mutations happen completely by chance along this effect size spectrum. Those mutations that are quite benign, evolution will tolerate, so they can rise to high frequency. Those mutations that have strong detrimental effects, evolution will notice quite rapidly: these individuals will either die directly, or have fewer children, or generally have lower fitness. That leads to this inverse relationship, where strong-effect detrimental variants are simply not allowed to rise to high frequency.

There are very few examples of high-effect common variants that influence common disease. One such example is sickle cell anemia, and the reason is that those variants were actually selected for and rose to high frequency in a different environment; now you put them in a new environment. On the other side, rare variants of small effect are simply impossible to recognize. There are probably plenty of those that have very small effects and are simply rare. We just don't know about them, because they don't rise above our detection threshold.

That's good. Yeah? So the way that you define the effect is by looking at individuals that have one version or the other version of a particular SNP, and
then you ask: for those people that have, say, the risk version of the SNP, how much sicker are they? How much more frequently do they show the disease? For example, they might have three times the frequency of the disease.

Yeah? So there are very few variants that influence height by a lot. But how do you define the effect on a quantitative trait like height? You measure a bunch of individuals that have one version and a bunch that have the other version. You may see a broad distribution in each group, but their means might be offset by one millimeter, and then you can say that this variant has a one-millimeter effect on the phenotype. It's very difficult to quantify in any one person: simply having that variant doesn't necessarily mean that that person will have one millimeter more or less height. But the effects are actually that small. What's crazy is not that they have small effects, but that we can detect such small things.

You can now ask what kinds of methods we use to discover variants in these different ranges. We're going to talk about both linkage and association in the next lecture, for discovering common variants or rare variants, and which one you use is highly dependent on their effect. We're focusing mostly on SNPs, much more than on other variant types.

So, beyond SNPs, we mentioned tandem repeats and also indels. Here's one example of a variable tandem repeat, in the huntingtin gene. You have CAG, CAG, CAG, CAG, a bunch of times. Individuals who have nine or ten or twelve copies of this triplet repeat are perfectly fine. Those who have more than 30 have Huntington's disease. That leads to an abnormal protein, which damages neurons and leads to brain cell death, with changes initially in mood and coordination and ultimately dementia. So again, these look like they're tolerated, but as soon as you go beyond a certain level you have a huge consequence.

There's another example, cystic fibrosis: basically
a deletion of three nucleotides, an in-frame deletion in the CFTR gene, changes that protein so that it can no longer act as the transmembrane conductance regulator that it normally is. That ultimately leads to much more secretion of mucus, to infection, and to huge respiratory problems. Again, these are hugely important variants, but SNPs are the vast, vast majority of genetic variants in the human genome.

So how do we represent and store these genetic variants? As you know, every person has two copies of the genome: one you inherited from one of your mom's two copies, the other you inherited from one of your dad's two copies. Every one of us is diploid, and we'll pass on only one copy to our children. Each individual carries two homologous copies of each chromosome, and therefore we carry two copies of each variant. So when we talk about risk and non-risk individuals, well, that's for one of their alleles; you have to talk about both of them. Very often we're going to be talking about the genotype rather than the allele, and we're going to be calling these either the maternal or the paternal allele.

The variants co-occur in haplotypes, which are then inherited as a unit. A haplotype is the series of nucleotides on the same chromosome that you inherited from the same parent, or at least that has now ended up on the same one of your two chromosomes. For example, if your maternal haplotype is 0 0 1 0 1 1 0 and your paternal haplotype is 0 1 1 0 0 1 0, then your genotype will be 0 1 2 0 1 2 0. Notice here that I'm no longer using ACGT. I'm assuming that every polymorphism has either the 0 version or the 1 version, and we're asking how many copies of the alternate allele I inherited. There's the reference human genome, so that means I'm reference, reference, alternate, reference, alternate, alternate, reference for the maternal haplotype, and similarly for the paternal haplotype. And then the genotype says I'm homozygous reference, heterozygous, homozygous non-reference, I'm homozygous
reference, heterozygous, homozygous alternate, homozygous reference. Everybody with me so far?

Yeah, it's one of the common ways to measure that. Yeah, it's the most common way.

Okay, so now we can start talking about genotypes. Of course, when you run a SNP array for a person, you don't get haplotypes. What you get is genotypes: you don't know whether you inherited one allele from mom or from dad. All you know is whether you've inherited zero, one, or two copies of the alternate allele at each position. Any questions so far?

It is experimentally possible, but currently not very practical, because it's too expensive, to directly measure haplotypes over the whole genome. You should also recognize that here I'm calling these paternal and maternal, but in fact you may have had a recombination event here. So this is not exactly one of your mom's chromosomes, because the chromosome your mom passed on may have received part from grandpa and part from grandma, with a recombination event in between. This will not necessarily match one of your maternal haplotypes, but it will be a consequence of your maternal genotype. Okay, everybody with me here?

It is cheaper and much more efficient to measure genotypes, to simply count minor alleles, using genotyping arrays. You have microarrays that carry two versions of every SNP, one matching the reference and one matching the alternate. You hybridize your DNA to them and you see whether you have one, the other, or both. So the genotype loses information, and you need algorithms to statistically recover that information. That's what both phasing and imputation do.

There's been a lot of effort to systematically catalog human variation, sequencing a lot of individuals to discover genetic variants, and then cataloging common variants and also haplotype blocks.
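To make the 0/1/2 genotype encoding described above concrete, here is a minimal Python sketch. It uses the maternal and paternal haplotypes from the lecture example; the function names `genotype` and `compatible_haplotype_pairs` are hypothetical, introduced only for illustration. It also enumerates the haplotype pairs consistent with a genotype, which is one way to see what information a SNP array loses and phasing has to recover.

```python
from itertools import product

# Haplotypes from the lecture example: 0 = reference allele, 1 = alternate allele.
maternal = [0, 0, 1, 0, 1, 1, 0]
paternal = [0, 1, 1, 0, 0, 1, 0]


def genotype(hap_a, hap_b):
    """Count alternate alleles per site (0, 1, or 2) across the two haplotypes."""
    return [a + b for a, b in zip(hap_a, hap_b)]


def compatible_haplotype_pairs(geno):
    """Enumerate the unordered haplotype pairs that could produce this genotype."""
    het_sites = [i for i, g in enumerate(geno) if g == 1]
    pairs = set()
    for choice in product([0, 1], repeat=len(het_sites)):
        # Homozygous sites are fixed: genotype 0 -> allele 0, genotype 2 -> allele 1.
        hap_a = [g // 2 for g in geno]
        hap_b = [g // 2 for g in geno]
        # Heterozygous sites: one haplotype gets the allele, the other its complement.
        for site, allele in zip(het_sites, choice):
            hap_a[site] = allele
            hap_b[site] = 1 - allele
        pairs.add(frozenset([tuple(hap_a), tuple(hap_b)]))
    return pairs


geno = genotype(maternal, paternal)
print(geno)                                    # [0, 1, 2, 0, 1, 2, 0]
print(len(compatible_haplotype_pairs(geno)))   # 2
```

With two heterozygous sites there are two possible phasings, and the genotype alone cannot tell them apart; with k heterozygous sites the count grows as 2^(k-1), which is why phasing relies on population-level statistics rather than on a single individual's genotype.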
We'll go much more into haplotype blocks in the next section. Once you have cataloged these common variants, you can simply genotype many, many more individuals by re-measuring the known variants, and then you can use that to estimate population-specific properties and maybe refine your genotyping array for the specific population at hand. There have been a lot of projects to do that. The two most important ones are HapMap, the haplotype mapping project, and then the 1000 Genomes Project.

So how do you discover these genetic variants in the first place? By sequencing. You sequence, initially using traditional Sanger sequencing and more recently next-generation sequencing, a large, large number of reads. You map them to the genome, then you recognize places where these reads differ from each other, and you can call a variant there. High-throughput sequencing is very commonly used to measure molecular phenotypes such as gene expression. Previously we ignored mismatches and simply mapped reads that were very similar, but these mismatches might actually represent true sequence variants. So you can statistically distinguish true variants from mere errors in the sequencing: that's variant calling.

There are a lot of variant calling pipelines and a lot of methods for that. One very commonly used one is the GATK, the Genome Analysis Toolkit, HaplotypeCaller. It uses heuristics to find mismatches that are not simply explained by noise, then uses an assembly graph to identify possible haplotypes. Then, for each haplotype, it estimates the probability of obtaining a particular read given that haplotype, using a probabilistic sequence alignment: a model based on a hidden Markov model whose states are insertion, deletion, and substitution, whose emissions are pairs of aligned nucleotides or gaps, and whose transitions are equivalent to the insertion, deletion, or gap penalties from the dynamic programming alignment algorithms that you have seen, and then
what you end up with is the probability of observing a particular read given a haplotype. Then you can use expectation maximization to estimate these haplotypes from the observed data, and you can use Bayes' rule to reverse the directionality: given that I know how to produce reads from a haplotype, I can now infer which haplotypes are likely to have given rise to specific reads. And then I can assign genotypes to each sample based on the maximum a posteriori haplotypes. So it builds a haplotype graph, aligns the specific reads to these haplotypes, uses a pair HMM to determine the per-read likelihoods, and then ends up with haplotypes.

So that's for common variants. For exomes, there's also been a lot of work on being able to resequence exomes at high capacity. This is one way of identifying those rare variants that we talked about, some of which very often have stronger effects. The motivation is that the exome has different sequence properties from the rest of the genome: its substitution rates are different, and its GC content is different. So the approach there has been to train a logistic regression classifier to look at mismatches and classify them as errors or not, deciding which of these mismatches are truly variants.

These classifiers have been trained on a lot of data, using true positives, where the mismatch has been discovered in another independent project, and true negatives. The features are: what is the quality score from the sequencing machine at the position of the mismatch; what is the quality score of the flanking bases; whether any of the neighboring nucleotides were in fact in the incorrect order; and what is the distance to the three prime end of the read. Because sequencing reactions proceed from the five prime end of the read to the three prime end, the quality at the five prime end is higher, and as we get closer to the three prime end the quality decreases. These methods have been
much faster than the modeling approach we just saw, and have lower false positive rates.

This has been done systematically over thousands of genomes. The 1000 Genomes Project, for example, sequenced 2,500 whole genomes at low depth, 4x, across 26 different populations spanning the globe, in order to catalog human genetic variation. 4x means that if I expect to get three billion letters for the human genome, I'm going to sequence 12 billion letters from that person. How am I going to sequence that? If every read is about a hundred bases, I have to sequence 120 million reads.

The idea there is that if I wanted to get the sequence of one person at extremely high quality, 4x would not be sufficient, because then at any one position I would only have four reads on average, and fewer at some positions. So you could say that I want super-high-quality single genomes. But the goal of that project was to identify and catalog human variation, and any one person contains only a small fraction of the total variation. So instead of spending 30x of capacity on one person, you are better off spreading that capacity over ten people, and then you will discover many more variants. And if you see a variant in multiple individuals, you are much more certain that it is a true variant rather than a somatic mutation, a systematic sequencing bias, or a PCR artifact. Any other questions?

All right. That project developed not only a reference dataset, but it also brought together huge teams of statisticians to develop sophisticated tools for phasing and imputation, and methods to account for noise and known patterns of variation. Once you've cataloged common variation by sequencing, most genetic variants in an individual will be recurrent in the population, and once they've been discovered and cataloged you can just build a common SNP array for measuring them systematically.
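The coverage arithmetic above (three billion bases, 4x depth, reads of about a hundred bases) can be checked with a short Python sketch. The function names are hypothetical, and the second function assumes that the read depth at a position follows a Poisson distribution; that modeling assumption is not stated in the lecture and is used here only to illustrate why 4x leaves some positions with very few reads.

```python
import math

GENOME_SIZE = 3_000_000_000  # ~3 billion bases, as quoted in the lecture
READ_LENGTH = 100            # ~100 bases per read, as quoted in the lecture


def reads_needed(genome_size, depth, read_length):
    """Number of reads required to reach a target average depth of coverage."""
    total_bases = genome_size * depth
    return total_bases // read_length


def prob_depth_at_most(k, mean_depth):
    """P(depth <= k) at one position, ASSUMING Poisson-distributed depth."""
    return sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k + 1)
    )


print(reads_needed(GENOME_SIZE, 4, READ_LENGTH))   # 120000000
print(reads_needed(GENOME_SIZE, 30, READ_LENGTH))  # 900000000
print(round(prob_depth_at_most(1, 4), 3))          # 0.092
```

Under that assumption, roughly 9% of positions in a single 4x genome would be covered by at most one read, which is consistent with the point that 4x is not enough for one high-quality genome but is enough to discover variants that recur across many individuals.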
DNA microarrays, of course, were the key technological advance of the 90s. They were initially used for measuring gene expression levels, but now they are primarily used for measuring genetic variation. Fragments of DNA that carry a particular variant will hybridize either to one version or to the other version, giving rise to a call: the risk or the non-risk, the major or the minor, the reference or the alternate. This is still the main technology being used across most GWAS studies, and also for direct-to-consumer services. The next goal is to study associations across populations, and sometimes this will require new array designs that are specific to that population.

Okay, so that was just introducing the variability in your genome: SNPs, indels, copy number variants, structural variants, and short tandem repeats, how we detect them, and how we call them from sequencing reads. Now let's talk about some properties beyond individual variants. We're going to talk about haplotype blocks and recombination.

The first thing to develop is a method for measuring whether two locations are independently inherited. Mendel, for all the wonderful things he did, made a major assumption, which was independent assortment. He had his peas, and he was measuring different phenotypes on those peas as to how they were being inherited. Every now and then the data wouldn't quite match, and he sort of fudged the data to truly show independence. It turns out that he had measured things more correctly than he realized: the measurements that he had were showing deviations from random assortment that were exactly indicative of co-inheritance of alleles sitting on the same chromosome, which of course was about a hundred years ahead of his time. So you can actually quantify
that deviation from independent random assortment, and therefore the co-inheritance patterns of two alleles, by calculating a coefficient of linkage disequilibrium between alleles A and B. If you have alleles at two different loci and you would like to know if they are co-inherited, what you can ask is: how often do I see one combination versus another combination? The coefficient of linkage disequilibrium, D, tells you how often I see the haplotypes 11 and 00, minus how often I see the haplotypes 10 and 01: D = P11 × P00 − P10 × P01.

If there was any kind of bias, I would be able to capture it using this formula. The formula is saying that if they're independent, then P11 should be no more than P1* times P*1. If inheriting version 1 or version 0 at this locus is independent of what happens at the other location, I can compute each haplotype frequency as the product of the independent allele probabilities, and if all of these are in fact products of independent probabilities, then D will become zero. Okay, raise your hands if you're with me on this one. Awesome. Right, this is just a trick to ask: am I deviating from zero?

This is actually computed for specific alleles, and different alleles may actually have different values. If they are truly independent, then the disequilibrium between loci A and B will actually be zero. Linkage disequilibrium measures the degree of departure from Mendel's law of independent assortment. So if you find a nonzero value, that means that these two alleles are inherited in a biased way. If you expect these particular haplotypes at this particular frequency, and you observe them at that particular
frequency, then you can say: aha, there's linkage disequilibrium between these alleles.

The problem is this: great, if it's zero, everybody knows how to interpret it, but if it's nonzero, how do I interpret a 0.07 versus a 0.3? Is that a lot or a little? It's hard to say, because these numbers actually depend on the allele frequencies, on how common these alleles are. So what we can do is calculate a number, D prime, which is normalized. How do you normalize it? By asking what is the maximum linkage disequilibrium that I could possibly expect based on the allele frequencies of these individual SNPs. In this particular case, D max is simply the product of P0* and P*1, the marginal frequencies for these haplotypes, and then I can just normalize D by D max. I obtain 0.51, and that is directly interpretable: it says that I have 51% of the maximum possible disequilibrium, independent now of their frequencies. Any questions? Raise your hands if you're with me.

Ah, sorry. This is one location, and this is another location. This is one SNP, and this is some other SNP nearby. Having 0* means that I've observed zero here but I don't know what the other one is; I don't care. And P11 means that I inherited the 11 haplotype, which contains both. Does that make sense? Thank you. Other questions?

All right, that's one way of measuring it, by asking for this very simple metric of deviation from zero. But that's not really a very intuitive metric. An alternative is to measure the correlation between these two positions: how correlated are they? You can define this correlation, or rather the square of that correlation, r squared, as the square of the D that we saw previously, divided by the product of the probabilities of each of the alleles, and that
gives you 0.37. That basically says that the squared Pearson correlation of the two SNPs is 0.37. In practice, the Pearson correlation is very efficiently computed for all SNPs, and it is also a very fundamental quantity for modeling GWAS. In fact, the r² between individual SNPs is exactly the r² of the corresponding GWAS association summary statistics. That basically means that when, in the next lecture, we talk about GWAS and having an association summary statistic for one particular SNP, if I know the r² of nearby SNPs I can just extrapolate and calculate that for those as well, just because of the way they are correlated. So that's another way of measuring linkage disequilibrium.

So here's visualizing r², and visualizing recombination events, in a particular region across populations. Here's one such region, here's another region, and you can see that the same region in different populations will actually have different structures. So what am I visualizing here? If you look across the horizontal line, this is genomic position, and the red values above that line are telling you the r² of the pair of positions at the ends of each triangle. Okay, raise your hand if you follow this representation. Awesome. Any questions?

So that basically means that this particular region is very heavily co-inherited, that region is very heavily co-inherited, and so on and so forth. And that can actually vary across populations. For example, here the haplotype block, the block of co-inheritance, appears to be larger in Chinese and Japanese individuals, and here this particular region appears to be much less pronounced in Yoruban individuals. So what causes these patterns? This region here is co-inherited in the European population, but it stops being co-inherited after this particular segment. Okay, so the
boundaries between these segments basically tell you where there are recurrent recombination events occurring in the human genome across generations. And what you notice when you look at these patterns is that these recombination events are in fact happening in very specific hotspots, and that these hotspots are sometimes different from population to population.

The other thing to note is that these red dots here don't simply represent the physical order or the linkage of SNPs on a chromosome; what they reflect is the historical order in which mutations arose. That's why it's not just a continuous measure, but there are gaps. For example, here these regions are less correlated; that's because that particular SNP arose later and broke the nice block that you had. So there can be SNPs and variations that arise after a haplotype block has been around for a while, and they will actually break the correlation pattern. This is what leads to these sort of not-fully-filled triangles.

Cool. So what causes these recombination hotspots? What causes them is that, across generations, over and over again, recombinations appear to be in the same locations. A little bit of biology: this is very closely linked to the very process of meiosis. In order to line up the chromosomes that I will then send to the next generation, I have to make sure that I send a complete human genome, and not just some quickly wrapped-up thing that might contain, you know, multiple copies of chromosome 1, for example. So what I'm going to do is line up the chromosomes during meiosis, or gamete formation. Recombination basically starts with double-strand breaks that are then repaired by strand invasion of the homologous chromosome. Basically, I purposefully make breaks in my chromosomes, which then lead to repair by homology, and that is what lines up all these chromosomes. Then half the time these will be resolved in a fashion where there's no crossover, and I will end up with a blue chromosome and a red chromosome
end to end. The other half of the time I'm going to get a crossover, and therefore I'm going to have a blue chromosome here and a red chromosome there, or a red chromosome here and a blue chromosome there. What happens in the middle is also a little bit problematic. Basically, in this particular case, I will repair this blue chromosome using the copy from the red chromosome, and in the other cases it's a little more complicated: we'll end up with some version of copying blue onto red or red onto blue. So these can lead to gene conversion, where I now have two red copies instead of one blue and one red copy, and also to recombination, with blue leading into red and red leading into blue. This is actually thought to be the fundamental selective advantage of sexual reproduction: you can repair places in the genome that are non-functional, because you always have a functional copy sitting around. Okay, everybody with me so far?

So now, where do these breakpoints happen? Recombination does not happen uniformly over each chromosome. There are recombination hotspots, and these occur roughly every hundred thousand nucleotides, and recombinations occur hundreds of times more frequently in hotspots than elsewhere. And mouse studies have revealed that this protein, PRDM9, is in fact instrumental in demarcating these hotspots.

Let's talk a little bit about how PRDM9 finds these hotspots. It basically has a motif that it recognizes: it has a very long protein domain with a zinc-finger array that recognizes this motif, and then it has all the right machinery for actually recruiting the double-strand breaks. And it's a tragic love story. PRDM9 basically knows and loves a particular motif that it likes to bind. But the problem is that every time it finds that motif, it is destined, by some horrendous Greek god I guess, to cut that motif. And the motif is then repaired by the other chromosome. Remember, the blue was cut and it was repaired by the red one. If
blue has a motif that PRDM9 knows and loves, it will be repaired by the red one, and the motif will then be lost. So PRDM9 will basically start cutting the genome every time it finds the motif, and that motif will disappear every time it cuts it, because it will be repaired by homology from a chromosome that doesn't contain the motif. That basically makes PRDM9 one of the fastest-evolving proteins in the genome, because if you start losing all your motifs, recombination doesn't work as well, the lining up of the chromosomes doesn't work as well, your double-strand breaks don't work as well, and therefore you end up not being very able to reproduce. Which basically means that, as the motif is getting lost, PRDM9 will now start recognizing a different motif. It doesn't matter which one; it just has to occur a lot in the genome. But as soon as it falls in love with a different motif, that motif will start disappearing. It's a tragic love story. Everybody with me on this one?

So now we can start talking about, first of all, how mutations are occurring, how recombinations are occurring, and how mutations are passed on along these haplotype blocks. Let's now look at one such region of the genome. This is a paper by Mark Daly et al., who basically said: hey, we're going to go and study this particular region, which is involved in Crohn's disease, and we're going to see what are all of the potential haplotypes associated with it. And when Mark and his colleagues did that, they realized that, even though they had genotyped 258 individuals, instead of finding 258 possible haplotypes they found only a handful in any one region. They basically realized that the entire diversity pattern of that very, very long region, spanning almost a megabase and many of these genes, all of the genetic variants in that region, could be explained by simply having a few haplotype blocks which are interwoven together in some kind of way. So that basically
implies a very high level of haplotype sharing, even for unrelated individuals. That means that the human population is still a small little sisterhood and brotherhood. Rather than these massive billions of people on the planet, we are still just a small gang of 10,000 people that basically left Africa. And these haplotype blocks are in fact carrying the same polymorphisms in just a small number of versions. The only ways that these blocks are broken are through recombination events at these hotspots, and through the arising of new mutations that then come in the context of ancient blocks. You basically have these ancient blocks that are still being passed on; some of them are broken up progressively over time, and then within them new mutations arise. So you can trace the history from the old mutations to the new mutations coexisting in the same space.

So here's one example of how you can actually understand this region. You basically have a phylogeny, or more like a demography, of individuals, perhaps coming out of Africa and then populating Asia and Europe, and maybe here staying in Africa. Then you have mutations that arise over time, and these mutations are occurring in the haplotype blocks that are associated with the different populations. You have these ancient mutations here in orange, and then you have younger mutations in different colors that are happening in the context of these older blocks. Everybody with me so far? Raise your hands. Great.

So the HapMap project, realizing this very unique structure, then set out to systematically catalog all haplotype blocks in the human genome. This led to the fundamental knowledge that ultimately enabled genome-wide association studies. It consisted of systematically cataloguing millions of SNPs, then studying multiple subpopulations, inferring haplotypes based on their co-inheritance patterns, and then genotyping additional
individuals further. So, who feels that they've learned something today? Good. This is very fundamental knowledge, and what's really remarkable is that this knowledge has only been around for the last few years. Anyone who studied genomics or genetics a decade ago didn't even know about these patterns of haplotype blocks and their crossing-overs, or even the existence of PRDM9 and these double-strand breaks and all that. I mean, the folks who established the principles of genetics and genetic inheritance, like the seminal work by Fisher in 1918, preceded even knowing what the structure of DNA was, and what the genetic material was that actually carried these genes, which were purely theoretical. So the fact that I can teach a class in 2018 and tell you the very molecular basis that leads to all of these subtleties about human inheritance is a very privileged moment in time, where we actually have a lot of these building blocks that did not exist even when these classes started being taught. So anyway, it's funny to have to adjust the lectures year by year based on our completely renewed understanding, but it's kind of cool to now understand all these very fundamental processes.

There are challenges that arise with all of that: how do we phase haplotypes? We now understand a lot of what this is about, but there's still the challenge of: hey, I just genotyped another individual, and I find 0, 1, 2, 0, 1, 2, 0. Well, what's on one chromosome and what's on the other chromosome? First of all, why does it matter? It matters because of compound heterozygosity. That means that if I have a good copy and a bad copy of a gene, I'm probably fine, because one working copy is usually okay. If my wife has a good copy and a bad copy of the same gene, she's also fine, because one copy is usually okay. But when we have children, our children have a one-fourth probability of inheriting two good copies, which is fine, a 50%
probability of inheriting only one good and one bad copy, alternating between each of us, which is also fine, and a 25% chance of inheriting two bad copies. Now, if my wife and I were first cousins or something like that, that would likely be the same bad copy, and then we would be able to see it in the genotype: we would basically say, okay, that person has two copies of that bad allele. But if we're not first cousins, if we're not sharing that same bad copy, that basically means that our children may inherit a different good copy and bad copy, or two different bad copies. And looking at the genotype of our child, you might say: oh, that child has one version of the gene that has two really bad mutations; or, that child actually has two different bad versions of that gene. Basically, if this site includes a bad variant and this site includes another bad variant, I would like to know if these two bad variants are sitting on the same chromosome or on different chromosomes. If the two bad variants are sitting on the same chromosome, that's fine, because that child still has a working copy. But if the two are sitting on different chromosomes, that's not fine, because that means that both copies are broken. Who's with me on this one? Raise your hands. Awesome.

So that's why phasing is so incredibly important, because it allows you to ask: hey, is this non-coding variant, for example, that affects the expression of this gene, inherited in the same haplotype as that other variant in the coding region, for example, that causes the protein to not function correctly? So the goal here is to resolve the genotypes into the underlying haplotypes, basically to figure out what is on one chromosome versus the other. And that is a problem that actually requires auxiliary information, namely parent genotypes. So this is the most typical approach for phasing, using a particular set of parents, which is... gosh, I lost my pointer. Sorry, guys. In fact, trio
phasing is how you typically phase. You genotype the parents (you don't haplotype the parents, because that's hard), and then you genotype the child, and what you'd like to know is what the child's haplotypes look like. Homozygous sites can be trivially phased. If I'm homozygous 0 or homozygous 2, that's kind of trivial, right? If I am homozygous 0 here, I inherited the 0 from both parents, and if I'm homozygous 2, then I must have inherited a 1 from each parent. If I inherited two copies of the alternate allele of this particular SNP, that means that both parents gave me their alternate allele. It is the heterozygous sites that are difficult; that's where it becomes ambiguous. Then, if at least one parent is homozygous, there's simply no ambiguity left. So if in this particular location dad had no copies of the alternate allele and mom had one, it's clear that I got mom's 1. Okay, and hopefully the slide... ah, it's on my left.

If both parents are heterozygous, well, that's when we actually need some more information, and that's where linkage disequilibrium comes in. So in this particular location dad had a 1 and mom had a 1. That basically means that dad is 0 on one chromosome and 1 on the other chromosome, mom is 0 on one chromosome and 1 on the other chromosome, and it's unclear who gave me the 1 that I have. Raise your hands if you're with me. Yeah.

So then we're going to use LD information, linkage disequilibrium information, to look nearby and say: well, what is more likely? So this is for phasing related individuals. Basically here, by looking at this 0, 0, 1, 0, etc., I can figure out what mom gave me, what mom had at that location, and which haplotype mom is more likely to have. Is this particular haplotype that I got from mom more likely to be a common haplotype with a 1 here, or with a 0? Basically, knowing the reference panel of
haplotype blocks allows me to then infer what I got from mom. And at this point I'm almost at the level of the haplotype, which then allows me to also go and haplotype the parents. So by phasing the child I then have additional information about the parents. I can infer most of it simply by these very simple rules, and the remainder I can fill in by matching the resulting almost-complete haplotypes with the reference haplotype blocks from the 1000 Genomes Project. Okay, everybody with me? All right, so that's for phasing related individuals: most of the work I can do just by reasoning about the parents' genotypes.

For phasing unrelated individuals, I need to probabilistically infer most of the information. Modern analyses very often consider collections of unrelated individuals, where we don't have pedigree information and can only use the patterns of linkage disequilibrium. The input for these problems is haplotypes in a reference panel, and then observed genotypes in a population. The unobserved haplotypes underlying the observed genotypes can be traced back to a common ancestor with the reference panel. We're going to talk about ancestral recombination graphs; we can directly fit these ancestral recombination graphs for smaller samples, but for large samples we have to approximate. The idea is that we're going to generate each of the unobserved haplotypes by copying segments from the reference haplotypes, such that the resulting genotypes match the observations. Basically, I have an underlying hidden set of haplotypes that I can sample from. This is a finite set: as we saw in the previous slide, there's only a small number of ancestral haplotypes at any one location. Then, as I'm going through, explaining my genotypes away based on these haplotypes, I can basically ask what is the transition probability, i.e. the recombination breakpoint, between one haplotype and
the other haplotype: a recombination event. Those are the transitions between these ancestral states, and the ancestral states themselves are simply the hidden states that I'm in. So I can use a hidden Markov model to infer the most likely ancestral haplotype that I'm in at every position, and then infer the locations where I have to switch between one ancestral haplotype block and another. Everybody with me on this? Raise your hands. Awesome.

So that's for phasing unrelated individuals. And then I can impute genotypes. On one hand I can phase the information, and once I have phased it I now have this and this and this and that known already on my haplotype. But then I can simply complete the rest of the information in trivial fashion, because I know which haplotype I used during phasing. I already know what haplotype I have at every location, so I can fill in the missing information based on that haplotype. The hard part is inferring the haplotype that I'm in; once I have the haplotype, imputation is actually quite easy. Okay?

So a huge advantage of recognizing this haplotype structure was that we could then use only a few marker SNPs for every haplotype, and once I had those markers I could infer the rest. By having two or three genotyped SNPs within a particular haplotype block, I can infer which ancestral haplotype I have, and then fill in all of the other SNPs that I don't even need to go and observe. That's imputation. The key intuition is that the same haplotype copying model as for phasing applies: you can phase alleles for the genotyped SNPs, then also copy the alleles for the unobserved variants, and recover genotypes by summing up the inferred haplotypes. I'm basically going from genotype to genotype, and I'm going to do that through the intermediate of the specific haplotypes. Great.

So we talked about genetic variation at the level of individual SNPs, and then we
talked about genetic variation at the level of haplotype blocks, based on this recognition of the inheritance patterns. These blocks are very closely intertwined with recombination events that happen during meiosis, at hotspots of meiotic recombination, and with the tragic love story of the PRDM9 protein, which initiates these double-strand breaks, and the resulting motif losses. And then measures of linkage disequilibrium, definitions of haplotype blocks, phasing. Everybody with me so far? Great.

So now let's talk about what we can do with this arsenal. We can start studying identity by descent in related individuals. As you know, every one of your parents had two copies of each chromosome, one from their mom and one from their dad, and then passed on one of them to you; so did mom, and so did dad. So parents share 50% of their DNA with their children, and siblings share 50% of their DNA with each other. But these are very, very different 50%s. Every child shares exactly 50% with their dad at every location. Siblings, with each other, share either 0% or 50% or 100% at any given location.

So this is a comparison between a parent and a child; this particular one is me and my son. When my son came out blond, people would say, oh, where did he get his hair color? And I would joke about who the dad was, teasing my wife, until she sent me this picture, basically with no comments. So anyway, needless to say, across 100% of Jonathan's genome we are 50% identical. And this is an actual comparison between my genome and my brother's genome. What you can see is that in about 50% of the genome we share exactly 50% of our chromosomes: those are the places where we both inherited the same copy from either mom or dad, but a different copy from the other parent. In some places we both inherited the same copy both from mom and from dad, and in other places we just
inherited different copies: I got, basically, mom's mom's copy and he got mom's dad's copy, and so forth. Okay, so they both average out to about 50%, but in fact these 50%s are very, very different. So that's identity by descent, for related individuals.

But then, as you start going further back in time, you can start asking: how many variants are shared by any two random people, not necessarily members of the same family? And what you can do is start asking how many genetic variants there are in every population of the 1000 Genomes Project. What you find is that, as you go from population to population, you have dramatic differences in the number of variable sites. African populations here have on average 5 million variable sites, whereas European populations have on average only 4.1 million variable sites. Why is that? Because African populations have basically stayed put, and they have accumulated genetic variation over the entire history of the human population, whereas non-African populations went through a bottleneck. A very small number of individuals left Africa, capturing only a very small fraction of the diversity of the total population of Africa. These individuals of course continued accumulating variation, but then they also went through further bottlenecks, and diversity was further reduced.

So every one of us carries about four to five million positions that are different, and two to three thousand structural variants covering about twenty megabases, and then hundreds of protein-truncating variants. Being diploid basically means that we can tolerate hundreds of broken proteins in any one of us. Hint: don't marry your first cousins, because chances are some of those will actually show up in both copies. And then we have tens of thousands of nonsynonymous mutations. So African populations have much more variation, again, as I mentioned. The other thing to note is the estimated population size
at different points in time. You can see here the very recent expansion of population size, and how, as you go back through time, you have bottlenecks and so forth. The present is here, the past is there. You can see here the expansion of the population initially, this is in Africa, and then going through some kind of population bottleneck, which is shared by all, and then different populations expanding.

So you can actually start recognizing segments that are shared with different populations, and you can start painting the chromosomes of any one person based on the population that each segment most closely resembles. You can see here an individual that is 80% sub-Saharan African, but 18% European, and about 2% East Asian and Native American. And you can do that using latent Dirichlet allocation processes, which were initially proposed for document classification, whereby every document is associated with topics, and some documents are associated with multiple topics, depending on which words they contain. Here the individuals are documents, variants are words, and populations are topics, and you assign to each topic the words that you find most frequently in it. So if your topic is Native American, the words that are associated with that particular topic are in orange here, and if your topic is European, the words in blue are associated with that. Then you can classify every region as belonging to one of the topics, depending on where you are in the genome, and you can infer that using classifiers.

If you do that on the thousand genomes that were sequenced by the 1000 Genomes Project, you can classify regions as being from different places: you have the black or the brown haplotypes, the yellow haplotypes, and so forth, from each of the different continents. And what you realize is that, for example, Japanese
individuals are very clearly East Asian, but then Chinese individuals are already admixed, so they are about half and half between East Asian and Central Asian, and so forth. So what you see is that almost every person in each of these canonical "pure" populations was an admixture. Even for Native Americans, you can see that there were simply no individuals who were 100% Native American: you could clearly see the Native American chromosomes, but they were always mixed in with European ancestry.

So you can actually recognize these very broad patterns of variation by doing what we learned a few lectures ago, which was principal component analysis. You can do a singular value decomposition of your data and recognize the major patterns of variation. When you do that in a collection of European samples, and you look at the first principal component versus the second principal component, you end up with something dramatic: the arrangement of these samples on these two axes of genetic variation very closely matches a map of Europe, where the colors of the samples correspond to the colors of the corresponding countries. Somehow, genetics mirrors geography. And that actually makes a lot of sense. The axes are slightly tilted, which actually corresponds to the east-west and north-south migration patterns in the ancient colonization of the European continent. Quite remarkable: it means that even to this day we can recognize these patterns of ancient settlement going through Europe, and they actually make up a big part of our genome.

The other thing to recognize is that phenotypically we differ greatly from one region of Europe to another, and part of that is actually environmental. I don't know, it rains more here than it rains there, for example, and you have more snow here than you have there, and so forth, and that could actually
impact phenotypic differences in, say, depression, or, I don't know, tolerance of sun. Not all of that is genetic: some of that is genetic, but not all of it. So, in order to account for socioeconomic factors, climate factors, and other factors that can impact your genome-wide association studies, correcting for these principal components of variation is usually the first step when you do a genome-wide association study. You're basically recognizing the major forces driving these genotypes that are not functional but are in fact demographic.

And you can start measuring differences and divergence between populations, and you can start studying the patterns of change that different populations went through: the path that we took in migrating out of Africa, through the Bering Strait and down to North and South America, recent migrations into different continents and subcontinents, present-day admixtures. You can even start studying ancient DNA, and recognize migrations of Denisovans and Neanderthals. You can actually start rewriting human history based on genetics.

The other thing to recognize is that embedded within our genome are signals of ancient selection. By recognizing the diversity in different parts of the genome, you can actually recognize evidence for selective events that impacted huge fractions of the human population. For example, you can look at the proportion of functional changes, or the frequency of rare alleles, to recognize selective events that happened very anciently. You can look at the frequency of derived alleles in different populations to recognize more recent events. You can look at the difference in allele frequencies across different subpopulations, basically studying population differentiation, which is evidence of adaptation of one population or both to different environments. You can also study the length of these haplotype blocks to infer very recent selection. The idea
is the following: haplotype blocks were initially very, very large, and as you go through human history these blocks are progressively broken up by recombination events. If you find a very large haplotype block, that basically means that it's very recent, and if you find short haplotype blocks, that means that they're more ancient. But new mutations within a haplotype block arise, as I mentioned, by chance within the history of that block. If you find a very large block with few mutations, that's normal, because that's a recent block and it has accumulated few mutations. But if you find a very large block with many mutations, that basically means that some kind of selective pressure has been maintaining this block to be large, even though it has clearly been around for a while based on the number of mutations. So you can look at the discrepancy between the length of a haplotype block and the number of mutations in that block to recognize signs of selection.

So here's one example, the lactase gene, which is sitting in a very, very long haplotype. This haplotype rose to very high frequency even though it's still very, very long. What that suggests is that, even though it's a young haplotype based on its length, its frequency suggests that it was positively selected. So here it's not the number of mutations but the frequency of the haplotype: you would expect long haplotypes to be relatively rare. In this particular case, what you can see from the relationship between the age and the length of the haplotype is that there was in fact a positive selection event. That came with agriculture and the domestication of animals: you have milk production providing a separate source of food, and therefore a mutation that causes the lactase gene to be expressed into adulthood, so that the carrier is not lactose intolerant, which was not the normal thing. You basically have that mutation now
carried through the human population. There have now been hundreds of these regions of recent selection detected, and they have all kinds of very, very interesting functions.

That's where I'll stop today. We basically talked about the basis of population genomics and human variation, where it comes from; haplotypes and how they evolve across history; then human relatedness, ancestry painting, and demographic history; and lastly, using different measures of population genetics to infer selection at different time scales. So we'll meet again on Thursday to talk about GWAS.
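
As a companion to the linkage disequilibrium part of the lecture, here is a minimal Python sketch of the three measures that were discussed: D, the normalized D', and the squared correlation r². It is an illustration under stated assumptions, not the lecture's own code. The function name `ld_stats`, the argument names, and the haplotype frequencies in the usage note are all hypothetical, and D_max uses the standard sign-dependent textbook definition, which is more general than the single product quoted for the slide's example.

```python
def ld_stats(p11, p10, p01, p00):
    """Return (D, D_prime, r2) for two biallelic SNPs.

    pXY is the population frequency of the haplotype carrying allele X
    at the first SNP and allele Y at the second (hypothetical inputs).
    """
    total = p11 + p10 + p01 + p00
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    p_a = p11 + p10  # frequency of allele 1 at the first SNP
    p_b = p11 + p01  # frequency of allele 1 at the second SNP

    # D: deviation of the observed 11 haplotype frequency from independence.
    d = p11 - p_a * p_b

    # D_max: the largest |D| these allele frequencies allow
    # (standard definition; which bound applies depends on the sign of D).
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0  # signed; take abs() if preferred

    # r^2: squared Pearson correlation between the two SNPs.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Made-up frequencies, chosen only to make the arithmetic easy to follow.
    print(ld_stats(0.4, 0.1, 0.1, 0.4))
```

With the made-up frequencies above, both SNPs have allele frequency 0.5, so D is about 0.15, D' is about 0.6, and r² is about 0.36 (up to floating-point rounding). These are not the numbers from the slide, whose underlying frequencies are not given in the transcript.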
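
The "very simple rules" for trio phasing described in the lecture can also be sketched in a few lines of Python. This is a hedged illustration: the names `transmissible`, `phase_trio_site`, and `phase_trio` are hypothetical, genotypes are assumed to be coded as 0, 1, or 2 copies of the alternate allele, and sites where the child and both parents are heterozygous are simply returned as `None`, standing in for the cases the lecture resolves with linkage disequilibrium and a reference panel.

```python
def transmissible(genotype):
    """Alleles (0 = reference, 1 = alternate) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def phase_trio_site(child, father, mother):
    """Phase one site of a child from trio genotypes.

    Returns (paternal_allele, maternal_allele), or None when the site
    cannot be resolved from the trio alone (all three heterozygous).
    Raises ValueError if the genotypes violate Mendelian inheritance.
    """
    options = [
        (a, b)
        for a in sorted(transmissible(father))
        for b in sorted(transmissible(mother))
        if a + b == child
    ]
    if not options:
        raise ValueError("Mendelian inconsistency at this site")
    if len(options) == 1:
        return options[0]
    return None  # ambiguous: would need LD / reference haplotypes


def phase_trio(child_gts, father_gts, mother_gts):
    """Apply the per-site rule along a region; None marks unresolved sites."""
    return [
        phase_trio_site(c, f, m)
        for c, f, m in zip(child_gts, father_gts, mother_gts)
    ]


if __name__ == "__main__":
    # Made-up genotypes for five sites.
    child = [0, 2, 1, 1, 1]
    father = [1, 1, 0, 2, 1]
    mother = [0, 2, 1, 1, 1]
    print(phase_trio(child, father, mother))
```

In this made-up example the first four sites are resolved by the rules alone (homozygous child, or at least one homozygous parent), and only the last site, where everyone is heterozygous, is left as `None`.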
