– [Lars] Our next speaker is Anita Pandit, and she is the Precision

Health Data Analyst in the Center for Statistical Genetics. And she’s going to present

on how to use GWASs on thousands of phenotypes from MGI to determine a better

genome-wide significance threshold for GWASs. – Thanks, Lars, for that introduction. Thanks, everyone, for sticking around. Alright, let’s jump right in. So in recent years, the

scientific community has seen rapid growth in human biobanks, so it’s a very exciting time. All the biobanks being shown here combine genomic data with health information, usually derived from patient medical records, although there are some that utilize self-reported information as well. So these biobanks have amassed

sample sizes of over 500,000. And most importantly for this talk, a lot of them are now running analyses on thousands of traits, including us. So today I will of course

be talking about our work on the Michigan Genomics Initiative for which we’ve run GWASs

on almost 1800 traits. So for this analysis, we took all 42,000 European-ancestry participants that were available to us in MGI. And we analyzed 23 million variants, which were imputed from HRC. For the phenotype data, we started with the electronic health records and from there parsed out the

ICD-9 and ICD-10 codes, which we then collapse

down into 1766 phecodes, binary traits representing

various traits and conditions. And from there, we ran our genome-wide association studies, our GWASs. So one thing that I wanna note here is that while a typical

GWAS analysis might look at one trait or a handful of traits, here we have almost 1800. And so here’s an example

of the Manhattan plot for type two diabetes in MGI. And here you can see we

have four clear peaks that have reached

genome-wide significance. The two in the middle

there, TCF7L2 and FTO, these two are both known associations. As for the ones on the outside, these are not previously reported in the literature, so potentially novel findings. And as we go through these

1800 Manhattan plots, we wanna have some sort of metric that allows us to place a certain level of confidence in each finding. Because while it’s

possible, certainly possible, that these two are in fact novel associations, it’s also very possible

that they’ve popped up simply by random chance. And the reason for that is because we have a multiple testing problem. So as many of you know, the

currently accepted genome-wide significance threshold, the

minimum P value that’s needed to determine whether or

not something’s interesting in a GWAS is five times

10 to the negative eight. However, this has been

used for many years, and it was calibrated for 1 million independent SNPs across one trait. So now in our situation,

we have 23 million SNPs and 1766 traits. So with more tests being conducted, we expect to see more false positives. And so as biobanks and imputation

panels continue to grow, it’s time for us to take a

step back and sort of evaluate our current methods to see

whether they’re still relevant and appropriate, or if they can be improved. And so with so many

SNPs and so many traits, why don’t we do a simple

Bonferroni correction. So doing so would give us

the genome-wide significance threshold of 1.2 times 10 to the negative 12. However, this seems a little bit strict, and we don’t wanna be too stringent because then we run the

risk of potentially losing out on true novel associations. Another issue is that

Bonferroni assumes independence of traits and genotypes, but we know that the MGI data are highly correlated. So for instance, we’re

analyzing skin cancer and also the subtypes of skin cancer. So we’ve got the correlation

structure among the phenotypes. And also Bonferroni treats all

traits and variants the same. So you end up with one threshold that you apply across all of your analyses,

which may not be optimal. That’s one of the things that we’re trying to discover in this analysis. And so we decided to

do a little experiment through Permutation Analysis. So while Bonferroni controls

the Family-Wise Error Rate, permutation analysis allows us to obtain an expected false discovery

rate for our data. And because we’re running

it on our actual data set with our traits and our variants, it automatically accounts

for any correlation structure and software artifacts

that we might observe in our analysis. And so in a traditional

permutation analysis setting you will be running

thousands of permutations. That seemed like a lot of work. So we decided to just run one. And this is how we did it. So here we have a data set for MGI, where each row represents an individual. And the first thing that we do is we split the data set

into males and females. We do this because we

want to permute separately within sex to accommodate sex-specific traits. And from there, we simply

shuffle the phenotype vectors. So we take each row of phenotypes and we assign that to a new individual. And we do that throughout

the entire data set. And what we’ve essentially done

is we’ve broken that connection between the genotypes and the phenotypes. And so when we go and run our analysis, we expect that all

associations that cross the genome-wide

significance threshold line are false positives. And so this allows us to obtain the false discovery rate for our data. Here we have the results for

our permutation analysis. So on the x axis, we have

minor allele frequency. On the y axis, we have the number of independent false positives observed. And I’d like everyone to

focus on the dark blue bars in the middle there. So we can see that for

the more common SNPs, we’ve got about 80 false positives in either category that have popped up just by chance, 240 for the low-frequency variants and 584 for the rare variants. And so summing across all these, we’ve got about 1000 false positives that have showed up in

our data just by chance across the 1766 phecodes. So that’s a lot. And that’s telling us

that in our original GWASs that we ran on the non-permuted data, we probably expect about 1000

of those to be false as well. So going back to those original GWASs, just to recap: we ran GWASs on 1766 traits and 23 million variants. From there, we took out everything

that met that genome-wide significance threshold of five times 10 to the negative eight. We parsed out the independent hits. Here we’re defining independence

simply as the lowest P-value within a one megabase region. And this is what we found. And so now what we’re gonna do, is we’re gonna take

our permutation results and compare those to these results, to calculate the proportion of these hits that we expect to be false positives. And this is what we find: for the high-frequency variants, we can expect about a fifth

of them to be false positives, about 50% for the common variants, 91% for the low-frequency variants and finally 75% for the rare variants. So these numbers are quite alarming, especially as the

scientific community begins to become more interested in rare variant association analysis. And these results are

telling us two things. First, it’s telling us that

the currently used genome-wide significance threshold of five

times 10 to the negative eight does not appear to be appropriate when you’re running biobank-level analyses, that is, when you’re analyzing thousands of traits. And secondly, it’s telling us that the optimal genome-wide

significance threshold may be different depending on the characteristics of the variants. So in this case, we’ve noticed differences across minor allele frequency. Here we have a heat map

showing the proportion of associations expected to be true at various P-value

categories on the x axis and minor allele frequency

categories on the y axis. And so I’d like everyone

to look over to the right and we can see that at the more

stringent P-value threshold of five times 10 to the negative 12, the proportion of associations expected to be true is stable at 100%. So if you see an association that meets that P-value threshold, you can be pretty sure that that’s a true finding and that the variant is, in fact, associated with the disease. Most interesting is that the rate at which these variants reach 100% is not the same throughout

minor allele frequency bins. So for instance, we can see that for associations reaching a threshold of five times 10 the negative eight, if it’s a common variant, meaning it has a minor allele

frequency of 5% or more, we can be pretty sure that it is, in fact, a true finding. That confidence goes down to about 50% for things with minor allele frequency between 1% and 5%, and even lower for anything rarer than that. Meanwhile, if your

association reaches a P-value of five times 10 to the negative 10, as long as it has a minor allele frequency of 1% or more, we can be pretty sure that that’s a true finding. But again, that confidence goes down for anything rarer than that. So going back to that Manhattan

plot for type two diabetes. We have these two hits in the middle which we know are true associations, TCF7L2 and FTO. And the top SNPs for these two loci, they’re common variants, with P-values well above that genome-wide

significance threshold line, putting both of these variants

in the category, based on that heat map from before,

for which we expect 100% true discovery. And sure enough, these are

known; they’ve been replicated and reported in the literature. Meanwhile, for these two on the outside here, we’ve got one on chromosome one, which has a minor allele

frequency of 0.1% and a P-value that’s just above that genome-wide significance threshold line, putting it in a category for which we expect a 12% true discovery rate. Meanwhile, the one on chromosome X, even though it’s in a

similar P-value category as the one on chromosome

one, it’s a common variant. So we expect that to have

80% true discovery rate. And so we place a lot

more confidence in the hit on chromosome X as opposed

to the one on chromosome one. And in practice, this might also dictate how many resources we spend trying to replicate

one hit over the other. And so we wanted to

have some sort of metric, as we go through all of our MGI hits, of how many of these hits were actually true findings. And so we decided to do

that through replication. So we replicated using UK Biobank because of course UK

Biobank is widely available. They also use EHR-derived phecodes, which makes for a

straightforward comparison between the two data sets. And we can see here that

until we get to a P-value of 1.6 times 10 to the negative nine, we don’t see very good replication. And so this is again

driving home the point that the currently used genome-wide significance threshold of five times 10 to the negative eight does not seem to be appropriate when you’re running biobank-level analyses, so hundreds or thousands of phenotypes. So we wanted to investigate next whether a single iteration of permutation was actually sufficient before just saying that it was. And so we actually ran four permutations, which are represented

here by the blue lines. The red line represents the average across all four permutations. We have minor allele frequency

category on the x axis and the number of independent false positives observed on the y axis. And the takeaway here is simply that all four of these individual

permutations tracked quite closely with the average

of the four permutations, and there’s not much variation as you go from permutation to permutation. Similarly here we have

across various P-values, the average false discovery rate for all four permutations

combined, as well as the range across the individual permutations. And we can see that, again,

there’s not much variability as you go from permutation to permutation; it varies by almost 4% at most. And this variation goes down as you get to the stricter P-value thresholds. And so this is telling us

that a single permutation for our purposes, is sufficient

to get you an estimate of the false discovery rate. Next, let’s talk a little

bit about Computation Costs. So for our 1766 traits, we spent about 51,000 CPU hours. This translates to a cost

between $1,000 and $9,000. It was on the much lower end for us because we were using a local cluster with brand-new hardware; $9,000 on the upper end would be more like what you’d pay if you were to use a cloud-based platform like Google Cloud or Amazon Web Services. But the important thing

to note here is that the computation time and cost

for running the permutations is of course gonna be the same as the computation time and cost for running the GWASs. So before looking at 1800 Manhattan plots, we wanted to have some sort of metric. So for our purposes, we

found that it was useful to do this experiment. And that it was worth the

extra computation time. So in summary, we ran GWASs as usual, and we also ran a set of GWASs using permuted data, so the shuffled data. And from there we found our associations and our false discovery rate. And then from there, we

calculated the proportion of hits in our original GWASs that

we expected to be true at given parameter values. In this case, we looked

at minor allele frequency, but you could look at

other characteristics of the variant as well. And we plan to publish

the GWAS results alongside the permutation results, to give readers a little bit of context and the level of confidence to place in each finding. So in summary, GWASs using biobank data with EHR-derived phenotypes seem to require a more stringent genome-wide significance threshold than five times 10 to the negative eight. That applies when you’re running hundreds or thousands of phenotypes. And the optimal genome-wide

significance threshold may be different depending on the characteristics of the variants. Permutation analysis can

account for technical issues not accounted for by association tools. So I didn’t get to talk too

much about this in the talk, but, for instance, if we were using a method that was not well calibrated for our data, we would end up with some false positives, and our permutation analysis would pick up on that as well. And further, it shows the

expected number of false positives for our specific data set, because it’s tailored to our data. And going forward, we

would recommend reporting these expected false discovery rates along with the GWAS results in the context of biobank-level analyses. Thank you. I’d like to thank everyone who made this (mumbles)
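(The core of the procedure described above — the Bonferroni arithmetic, the within-sex phenotype shuffle, and the per-bin expected false discovery proportion — can be sketched as follows. This is an illustrative reconstruction only; the data layout and function names are hypothetical, not the speaker’s actual pipeline.)

```python
# Illustrative sketch of the talk's ideas; data layout and function
# names are hypothetical, not the speaker's actual pipeline.
import random

# Bonferroni across 23 million variants x 1766 phecodes, as discussed:
n_variants, n_traits = 23_000_000, 1_766
bonferroni = 0.05 / (n_variants * n_traits)  # roughly 1.2e-12

def permute_phenotypes(samples, seed=0):
    """Shuffle whole phenotype vectors among samples of the same sex,
    breaking the genotype-phenotype link while preserving sex-specific
    traits. `samples` is a list of dicts like
    {"sex": "F", "phenotypes": {...}} (hypothetical layout)."""
    rng = random.Random(seed)
    out = [dict(s) for s in samples]
    for sex in ("F", "M"):
        idx = [i for i, s in enumerate(samples) if s["sex"] == sex]
        shuffled = idx[:]
        rng.shuffle(shuffled)
        for src, dst in zip(idx, shuffled):
            out[dst]["phenotypes"] = samples[src]["phenotypes"]
    return out

def expected_fdr(null_hits, observed_hits):
    """Expected false discovery proportion per MAF bin: independent hits
    crossing the threshold in the permuted (null) GWAS, divided by the
    hits crossing it in the real GWAS."""
    return {maf_bin: null_hits[maf_bin] / observed_hits[maf_bin]
            for maf_bin in observed_hits}
```

For example, `expected_fdr({"rare": 3}, {"rare": 4})` gives `{"rare": 0.75}`, the same kind of per-bin proportion the talk reads as "75% of rare-variant hits expected to be false."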

(students clapping) – [Student] Thanks for the talk. I have a question about the distribution of the phecodes. Did you find that the distribution of phecodes has any effect on the

final expected P-value? – That’s a really great question. So we didn’t look at it depending on the characteristics of the trait, but we did look at case count. And we found that case count didn’t really have an effect on the number of false positives that we observed. But that is definitely something

that you could look into. – [Student] It’s like

a positive correlation, like linear regression, or just like an order-of-magnitude correlation? – Oh, I see, though we

didn’t look into that. – [Student] Okay, thank you. – But that’s a great suggestion. – How much would it cost

to do the same analysis on UK Biobank, do you have an idea? Does it scale with size? – So we used SAIGE, and so I

know about the first step… So step one of SAIGE would

take considerably longer. I’m not sure how much longer. But as for step two, I think

that it would take around the same amount of time

because it’s just streaming the data, right? And so it’s based on the number of variants. (mumbles) But step one of SAIGE will

take considerably longer, but I’m not sure by how much. We are running the permutation on UK Biobank. So we’ll have the answer, or not. (mumbles) (laughs)

not (mumbles) (laupghs) – If you have the resources to ran this one’s for you,(mumbles)

you can do probably twice. – Yeah, that’s why we figure. – Okay. – Thank you. – [Lar] If not, let’s

applaud the speaker again. – [Anita] Thank you.
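(One more illustrative sketch: the talk defines independence simply as the lowest P-value within a one-megabase region. A minimal, hypothetical helper implementing one simple reading of that definition — not the speaker’s code — might look like this.)

```python
# Hypothetical helper, not the speaker's code: keep the lowest P-value
# within any one-megabase region, per chromosome.

def independent_hits(hits, window=1_000_000):
    """`hits` is a list of (chrom, pos, pvalue) tuples that already
    cross the significance threshold. Greedily keep the strongest
    remaining hit, then drop everything within `window` bp of it on
    the same chromosome."""
    kept = []
    for chrom, pos, p in sorted(hits, key=lambda h: h[2]):
        if all(c != chrom or abs(pos - q) >= window for c, q, _ in kept):
            kept.append((chrom, pos, p))
    return kept
```

For example, two hits 500 kb apart on chromosome 10 plus one hit on chromosome 16 collapse to two independent hits: the stronger chromosome-10 signal and the chromosome-16 signal.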