2019 MGI Workshop – Anita Pandit

– [Lars] Our next speaker is Anita Pandit. She is the Precision Health Data Analyst in the Center for Statistical Genetics, and she's going to present on how to use GWAS on thousands of phenotypes from MGI to determine a better genome-wide significance threshold for GWAS.

– [Anita] Thanks, Lars, for that introduction, and thanks, everyone, for sticking around. Alright, let's jump right in. So in recent years, the scientific community has seen rapid growth in human biobanks, so it's a very exciting time. All the biobanks being shown here combine genomic data with health information, usually derived from patient medical records, although there are some that utilize self-reported information as well. These biobanks have amassed sample sizes of over 500,000, and most importantly for this talk, a lot of them are now running analyses on thousands of traits, including us. So today I'll of course be talking about our work on the Michigan Genomics Initiative, for which we've run GWAS on almost 1,800 traits.

So for this analysis, we took all 42,000 European-ancestry participants that were available to us in MGI, and we analyzed 23 million variants imputed from the HRC panel. For the phenotype data, we started with the electronic health records and from there parsed out the ICD-9 and ICD-10 codes, which we then collapsed down into 1,766 phecodes: binary traits representing various diseases and conditions. And from there, we ran our genome-wide association studies.
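As a rough sketch of that collapsing step, the logic looks something like the following (the mapping entries, column names, and function are hypothetical; the real analysis used the published ICD-to-phecode mappings):

```python
# Minimal sketch of collapsing ICD diagnosis codes into binary phecodes.
# The two-entry mapping below is a toy stand-in for the published
# ICD-9/ICD-10-to-phecode mappings used in practice.
import pandas as pd

icd_to_phecode = {"E11.9": "250.2", "C43.9": "172.11"}  # hypothetical subset

def build_phecode_matrix(dx: pd.DataFrame) -> pd.DataFrame:
    """dx has columns ['person_id', 'icd_code']; returns a 0/1
    person-by-phecode matrix (1 = code ever observed in the EHR)."""
    dx = dx.assign(phecode=dx["icd_code"].map(icd_to_phecode)).dropna()
    return pd.crosstab(dx["person_id"], dx["phecode"]).clip(upper=1)
```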
So one thing that I want to note here is that while a typical GWAS analysis might look at one trait or a handful of traits, here we have almost 1,800. And here's an example: the Manhattan plot for type 2 diabetes in MGI. You can see we have four clear peaks that have reached genome-wide significance. The two in the middle, TCF7L2 and FTO, are both known associations, while the two on the outside have not previously been reported in the literature, so they're potentially novel findings.
And as we go through these 1,800 Manhattan plots, we want to have some sort of metric that allows us to place a certain level of confidence in each finding. Because while it's certainly possible that those two are, in fact, novel associations, it's also very possible that they've popped up simply by random chance. And the reason for that is that we have a multiple testing problem.
As many of you know, the currently accepted genome-wide significance threshold, the minimum P-value needed to determine whether or not something is interesting in a GWAS, is 5 × 10⁻⁸. However, this threshold has been in use for many years, and it was calibrated for 1 million independent SNPs across one trait. Now, in our situation, we have 23 million SNPs and 1,766 traits. So with more tests being conducted, we expect to see more false positives.
And so as biobanks and imputation panels continue to grow, it's time for us to take a step back and evaluate our current methods, to see whether they're still appropriate or whether they can be improved. So with so many SNPs and so many traits, why don't we just do a simple Bonferroni correction? Doing so would give us a genome-wide significance threshold of 1.2 × 10⁻¹². However, this seems a little strict, and we don't want to be too stringent, because then we run the risk of losing out on true novel associations.
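The arithmetic behind that number is straightforward; a minimal sketch using the SNP and trait counts quoted above:

```python
# Naive Bonferroni correction across every SNP-trait test in the analysis.
n_snps = 23_000_000
n_traits = 1_766
alpha = 0.05                               # target family-wise error rate
threshold = alpha / (n_snps * n_traits)    # ~4.06e10 tests in total
print(f"Bonferroni threshold: {threshold:.1e}")   # ~1.2e-12
```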
Another issue is that Bonferroni assumes independence of traits and genotypes, but we know that the MGI data are highly correlated. For instance, we're analyzing skin cancer and also the subtypes of skin cancer, so we've got correlation structure among the phenotypes. And Bonferroni treats all traits and variants the same, so you end up with a single threshold that you apply across all of your analyses, which may not be optimal. That's one of the things we're trying to explore in this analysis.

And so we decided to do a little experiment through permutation analysis. While Bonferroni controls the family-wise error rate, permutation analysis allows us to obtain an expected false discovery rate for our data. And because we're running it on our actual data set, with our traits and our variants, it automatically accounts for any correlation structure and software artifacts that we might observe in our analysis. Now, in a traditional permutation analysis setting, you would be running thousands of permutations. That seemed like a lot of work, so we decided to just run one. And this is how we did it. Here we have a data set for MGI, where each row represents an individual. The first thing we do is split the data set into males and females; we do this because we want to permute separately within sex, to accommodate sex-specific traits. From there, we simply shuffle the phenotype vectors: we take each row of phenotypes and assign it to a new individual, and we do that throughout the entire data set.
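A minimal sketch of that shuffling step, assuming the data sit in a pandas DataFrame with a sex column alongside the phenotype columns (column names are hypothetical):

```python
# Minimal sketch of the within-sex phenotype shuffle described above.
import pandas as pd

def permute_phenotypes(df: pd.DataFrame, pheno_cols: list,
                       seed: int = 1) -> pd.DataFrame:
    """Shuffle whole phenotype rows within each sex stratum. Moving each
    row as a unit preserves trait-trait correlations, and shuffling within
    sex preserves sex-specific traits; only the genotype-phenotype link
    is broken."""
    out = df.copy()
    for _, idx in df.groupby("sex").groups.items():
        shuffled = df.loc[idx, pheno_cols].sample(frac=1, random_state=seed)
        out.loc[idx, pheno_cols] = shuffled.to_numpy()
    return out
```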
And what we've essentially done is break the connection between the genotypes and the phenotypes. So when we go and run our analysis, any associations that cross the genome-wide significance threshold line are now expected to be false positives. And this allows us to obtain a false discovery rate for our data.
Here we have the results of our permutation analysis. On the x-axis we have minor allele frequency, and on the y-axis we have the number of independent false positives observed. I'd like everyone to focus on the dark blue bars in the middle there. We can see that for the more common SNPs, we've got about 80 false positives in either category that have popped up just by chance, 240 for the low-frequency variants, and 584 for the rare variants. So in total across all of these, we've got about 1,000 false positives that showed up in our data just by chance, across the 1,766 phecodes. That's a lot. And it's telling us that in our original GWAS, which we ran on the non-permuted data, we should probably expect about 1,000 of the hits to be false as well.
So going back to those original GWAS, just to recap: we ran GWAS on 1,766 traits and 23 million variants. From there, we took everything that met the genome-wide significance threshold of 5 × 10⁻⁸ and parsed out the independent hits, defining independence simply as the lowest P-value within a one-megabase region. And this is what we found.
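A minimal sketch of that independent-hit definition, as a greedy 1 Mb pruning over a table of genome-wide-significant results (column names are hypothetical):

```python
# Greedy selection of independent hits: repeatedly keep the remaining
# lowest P-value and drop everything within 1 Mb of it on that chromosome.
import pandas as pd

def independent_hits(hits: pd.DataFrame,
                     window: int = 1_000_000) -> pd.DataFrame:
    """hits has columns ['chrom', 'pos', 'pval'] (names hypothetical)."""
    hits = hits.sort_values("pval")
    kept = []
    for _, row in hits.iterrows():
        if all(row["chrom"] != k["chrom"]
               or abs(row["pos"] - k["pos"]) >= window
               for k in kept):
            kept.append(row)
    return pd.DataFrame(kept)
```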
And so now what we're going to do is take our permutation results and compare them to these results, to calculate the proportion of these hits that we expect to be false positives. And this is what we find: for the high-frequency variants, we can expect about a fifth of them to be false positives; about 50% for the common variants; 91% for the low-frequency variants; and finally, 75% for the rare variants.
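Those proportions are just the ratio of null (permutation) hits to observed hits within each minor-allele-frequency bin; a sketch, with placeholder counts standing in for the observed hits:

```python
# Expected false-discovery proportion per MAF bin: independent hits from
# the permuted (null) GWAS divided by independent hits from the real GWAS.
# The real-hit counts below are hypothetical placeholders, not the results.
null_hits = {"common": 80, "low_frequency": 240, "rare": 584}
real_hits = {"common": 400, "low_frequency": 264, "rare": 779}  # placeholders

for maf_bin in null_hits:
    fdr = min(1.0, null_hits[maf_bin] / real_hits[maf_bin])
    print(f"{maf_bin}: expected proportion false ~ {fdr:.0%}")
```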
These numbers are quite alarming, especially as the scientific community becomes more and more interested in rare-variant association analysis. And these results are telling us two things. First, the currently used genome-wide significance threshold of 5 × 10⁻⁸ does not appear to be appropriate when you're running biobank-level analyses, that is, when you're analyzing thousands of traits. And secondly, the optimal genome-wide significance threshold may be different depending on the characteristics of the variants; in this case, we've noticed differences across minor allele frequency.
Here we have a heat map showing the proportion of associations expected to be true, with P-value categories on the x-axis and minor allele frequency categories on the y-axis. I'd like everyone to look over to the right: we can see that at the more stringent P-value threshold of 5 × 10⁻¹², the proportion of associations expected to be true is stable at 100%. So if you see an association that meets that P-value threshold, you can be pretty sure that it's a true finding, that the variant is, in fact, associated with the disease. What's most interesting is that the rate at which these variants reach 100% is not the same across minor allele frequency bins. For instance, for associations reaching a threshold of 5 × 10⁻⁸, if it's a common variant, meaning it has a minor allele frequency of 5% or more, we can be pretty sure that it is, in fact, a true finding. That confidence goes down to about 50% for things with a minor allele frequency between 1% and 5%, and it's even lower for anything rarer than that. Meanwhile, if your association reaches a P-value of 5 × 10⁻¹⁰, then as long as it has a minor allele frequency of 1% or more, we can be pretty sure that it's a true finding. But again, that confidence goes down for anything rarer than that.
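In code, the heat map amounts to a two-way lookup from a P-value threshold and a MAF bin to an expected true-discovery proportion; a sketch with illustrative placeholder values:

```python
# Hypothetical lookup table mirroring the heat map: keys are MAF bins and
# P-value thresholds, values are the proportion of hits expected to be true.
# All cell values here are illustrative placeholders, not the real results.
tdr = {
    "maf>=5%": {5e-8: 0.80, 5e-10: 0.95, 5e-12: 1.00},
    "1%-5%":   {5e-8: 0.50, 5e-10: 0.95, 5e-12: 1.00},
    "maf<1%":  {5e-8: 0.12, 5e-10: 0.60, 5e-12: 1.00},
}

def confidence(maf_bin: str, pval: float) -> float:
    """Expected true-discovery proportion at the most stringent
    threshold that this P-value still satisfies."""
    passed = [t for t in sorted(tdr[maf_bin]) if pval <= t]
    return tdr[maf_bin][passed[0]] if passed else 0.0
```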
So going back to that Manhattan plot for type 2 diabetes: we have those two hits in the middle, TCF7L2 and FTO, which we know are true associations. The top SNPs at these two loci are common variants with P-values well above the genome-wide significance threshold line, putting both of these variants in the category from that heat map for which we expect 100% true discovery. And sure enough, these are known hits; they've been replicated and reported in the literature. Meanwhile, of the two on the outside, the one on chromosome 1 has a minor allele frequency of 0.1% and a P-value that's just above the genome-wide significance threshold line, putting it in a category for which we expect a 12% true discovery rate. The one on chromosome X, even though it's in a similar P-value category as the one on chromosome 1, is a common variant, so we expect it to have an 80% true discovery rate. And so we place a lot more confidence in the hit on chromosome X than in the one on chromosome 1. In practice, this might also dictate how many resources we spend trying to replicate one hit over the other.
And so we wanted to have some sort of metric, going through all of our MGI hits, for how many of these hits were actually true findings. And we decided to do that through replication. We replicated using UK Biobank because, of course, UK Biobank is widely available, and they also use EHR-derived phecodes, which makes for a straightforward comparison between the two data sets. And we can see here that until we get down to a P-value of 1.6 × 10⁻⁹, we don't see very good replication. So this is again driving home the point that the currently used genome-wide significance threshold of 5 × 10⁻⁸ does not seem to be appropriate when you're running biobank-level analyses, meaning hundreds or thousands of phenotypes.
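A sketch of how such a replication check can be tabulated (column names and the outermost bin edges are hypothetical; "replicated" here means the same phecode-variant association is significant in UK Biobank):

```python
# Tabulate the UK Biobank replication rate of MGI hits by discovery P-value.
import pandas as pd

def replication_by_pval(hits: pd.DataFrame) -> pd.Series:
    """hits has columns ['pval_mgi', 'replicated'], where 'replicated'
    flags whether the matching phecode-variant pair is significant in UKB."""
    bins = [0.0, 5e-12, 1.6e-9, 5e-8]
    labels = ["p<=5e-12", "5e-12<p<=1.6e-9", "1.6e-9<p<=5e-8"]
    binned = pd.cut(hits["pval_mgi"], bins=bins, labels=labels)
    return hits.groupby(binned, observed=False)["replicated"].mean()
```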
Next, we wanted to investigate whether a single iteration of permutation was actually sufficient before just claiming that it was. And so we actually ran four permutations.
These are represented here by the blue lines, and the red line represents the average across all four permutations. We have minor allele frequency category on the x-axis and the number of independent false positives observed on the y-axis. The takeaway here is simply that all four individual permutations track quite closely with the average of the four, and there's not much variation as you go from permutation to permutation. Similarly, here we have, across various P-values, the average false discovery rate for all four permutations combined, as well as the range across the individual permutations. And we can see that, again, there's not much variability from permutation to permutation; it varies by almost 4% at most, and that variation goes down at the stricter P-value thresholds. So this is telling us that a single permutation is, for our purposes, sufficient to get an estimate of the false discovery rate.
Next, let's talk a little bit about computation costs. For our 1,766 traits, we spent about 51,000 CPU hours. This translates to a cost of between $1,000 and $9,000. It was on the much lower end for us, because we were using a local cluster with brand-new hardware; the $9,000 on the upper end is more what you'd expect if you were to use a cloud-based platform like Google Cloud or Amazon Web Services. But the important thing to note here is that the computation time and cost for running the permutations is, of course, going to be about the same as the computation time and cost for running the GWAS.
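For a rough sense of scale, the per-CPU-hour rates implied by those figures are easy to back out:

```python
# Back-of-the-envelope per-CPU-hour rates implied by the quoted range.
cpu_hours = 51_000   # total compute for the 1,766-trait permutation GWAS
for platform, total_cost in [("local cluster", 1_000), ("cloud", 9_000)]:
    print(f"{platform}: ~${total_cost / cpu_hours:.2f} per CPU hour")
```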
So before looking at 1,800 Manhattan plots, we wanted to have some sort of metric, and for our purposes we found that it was useful to do this experiment, and that it was worth the extra computation time. So to recap: we ran the GWAS as usual, and we also ran a set of GWAS using the permuted, shuffled data, and from there we found the null associations and our false discovery rate. From there, we calculated the proportion of hits in our original GWAS that we expected to be true at given parameter values. In this case we looked at minor allele frequency, but you could look at other characteristics of the variants as well. And we plan to publish the GWAS alongside the permutation results, to give readers a little bit of context about how much confidence to place in each finding.
So in summary: GWAS using biobank data with EHR-derived phenotypes seem to require a genome-wide significance threshold more stringent than 5 × 10⁻⁸. That applies when you're running hundreds or thousands of phenotypes. And the optimal genome-wide significance threshold may be different depending on the characteristics of the variants. Permutation analysis can also account for technical issues not accounted for by association tools. I didn't get to talk too much about this in the talk, but if, for instance, we were using a method that was not well calibrated for our data, and we ended up with some false positives because of that, our permutation analysis would pick up on that as well. And further, it gives the expected number of false positives for our specific data set, because it's tailored to our data. So going forward, we would recommend reporting these expected false discovery rates along with the GWAS results in the context of biobank-level analyses. Thank you, and I'd like to thank everyone who made this (mumbles)
(students clapping)

– [Student] Thanks for the talk. I have a question about the distribution of the phecodes. Did you find that the distribution of phecodes had any effect on the final expected P-value?

– [Anita] That's a really great question. So we didn't look at it by the characteristics of the trait, but we did look at case count, and we found that case count didn't really have an effect on the number of false positives that we observed. But that is definitely something that you could look into.

– [Student] Is it like a positive correlation, like a linear regression, or just an order-of-magnitude correlation?

– [Anita] Oh, I see. We didn't look into that.

– [Student] Okay, thank you.

– [Anita] But that's a great suggestion.
– [Student] How much would it cost to do the same analysis on UK Biobank? Do you have an idea? Does it scale with size?

– [Anita] So we use SAIGE, and I know that step one of SAIGE would take considerably longer; I'm not sure how much longer. But as for step two, I think it would take around the same amount of time, because it's just streaming the data, right? So it's based on the number of variants. (mumbles) But step one of SAIGE will take considerably longer, though I'm not sure by how much. We are running the permutation on UK Biobank, so we'll have the answer soon, or not. (mumbles) (laughs)

– [Student] If you have the resources to run this once for you, (mumbles) you can probably do it twice.

– [Anita] Yeah, that's what we figured.

– [Student] Okay. Thank you.

– [Lars] With that, let's thank the speaker again.

– [Anita] Thank you.
