How many folds is too many? Efficient simulation for everyday data decisions - Julia Silge
Transcript generated with OpenAI Whisper (large-v2).
So my academic background is physics and astronomy.
I moved into data science about eight or nine years ago.
I've worked as a data science practitioner.
And then about three years ago,
I transitioned to working as a data science tool builder.
So I work now at Posit, formerly RStudio,
where I work on open source software
for modeling, machine learning,
and now MLOps for Python and R.
And one of the things that I think is a theme
through my career, like a kind of connecting idea,
is that I'm very interested in people's practical workflows
and how people do their real work, what makes it hard.
I'm thinking about systems,
like how do people use systems to get their work done?
And that's been true whether I've been working
as a data science practitioner in an organization,
working on text analysis tools,
or more recently as I've been focusing
on machine learning tools.
So given that, when I saw this tweet from Vicki,
which has been in a bunch of talks already,
with the phrase "how many k-folds is too many?" in it,
my immediate reaction was,
oh, I actually wanna give a talk about that.
Like, I actually want to talk to people about that,
because one of the tools that I think
we don't hear enough about,
that can be super helpful in many situations,
is the tool to answer this kind of question.
And it's in my title, so it's not a spoiler
to say that the tool is simulation: building simulations.
I feel like in data work,
we don't see blog posts about it,
you don't hear about it as a tool to use.
And honestly, what is more normy,
what is more norm-conf, than just doing the same thing
thousands of times, over and over,
to figure something out?
So I specifically wanna talk about
how simulation is a powerful tool.
Why is it powerful?
It's powerful because it helps us
make our assumptions concrete.
Assumptions I have in my brain,
assumptions that my collaborators have in their brains,
how can we make it concrete?
How can we get on the same page
with people that we're working with about trade-offs?
You know, some of us may be thinking,
oh, it will be better if we do it like this,
and others, it will be better if we do it like that.
Often there are trade-offs between those decisions,
and simulations help us understand what they are
and get on the same page with each other.
And ultimately, simulation helps us
make better decisions through all of these things.
One of the big projects I have been working on
in recent years is called tidymodels.
tidymodels is a framework in R
for modeling and machine learning
using tidyverse principles.
There is gonna be code on the screen here,
and it's gonna be mostly tidymodels code.
I'm excited for you to get to see it,
but really this is not a talk about tidymodels,
because what we're talking about
is how to use simulation.
And hopefully the way we talk
through some of these ideas
will be applicable even if you usually use
a different language or framework
for machine learning or your data work.
So let's jump in, let's jump into this question.
This first question, how many folds is too many?
This is a question about predictive modeling,
supervised machine learning,
kind of the sort of like classical,
I have inputs, I wanna predict outputs.
And the purpose of folds,
of making cross-validation folds
is to estimate the performance of a model,
to be able to say how well a model,
or a model configuration or hyperparameter configuration,
is performing.
So to do this, we need some data.
How are you gonna get this data?
Through this talk, I'm gonna cover
a couple of different ways that you can approach this,
but one way is to use an existing function
that is meant for simulations.
These are out there, go looking for them in the frameworks
and languages that you're comfortable with.
Here's a function that simulates data
for our regression model.
It's got, down here we can see it's got 20 predictors,
one outcome, like the idea is we predict the outcome
from these predictors,
just like with whatever kind of regression we wanna use.
We can look up in the docs
how the outcome is related to the predictors.
This is one way you can get started
with simulations in a straightforward way:
especially if the simulation is about a general
kind of modeling problem or machine learning
practice you wanna get to,
look for one of these functions.
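To make that concrete, here is a minimal sketch, assuming the helper is sim_regression() from the modeldata package, which matches the description here of one outcome and 20 predictors:

```r
# Simulate 1,000 rows for a regression problem: one `outcome` column
# plus 20 numeric predictors.
library(modeldata)

set.seed(123)
sim_data <- sim_regression(num_samples = 1000)
dim(sim_data)  # 1000 rows, 21 columns
```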
So once we have that,
since we're interested in how many folds,
we can create folds.
So as a very brief memory jog or reminder,
V-fold cross-validation, also called K-fold cross-validation,
is where you need to take your data.
Remember, we have a thousand observations
because I said, give me a thousand.
Let's divide this into K-folds.
The default in tidy models is to use 10 folds.
So let's divide it into 10 folds.
And then let's create a whole set of resamples
where we hold out one of those folds
and all the rest of them go together.
We train on the nine folds and estimate performance
on the one fold that's held out,
the 100 observations.
Then we slide it down.
Now this one's held out.
We train on these nine folds,
estimate performance using this one held out fold.
And we slide on down here.
So that was a little reminder of what V-fold
or K-fold cross-validation is.
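In tidymodels, making those folds is a single call to rsample; a minimal sketch, with the default of 10 written out explicitly:

```r
# Divide the 1,000 simulated observations into 10 folds; each resample
# trains on 900 rows and assesses on the 100 held out.
library(rsample)

set.seed(234)
sim_folds <- vfold_cv(sim_data, v = 10)
sim_folds
```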
So the default is 10.
You'll see 10 as a default a lot.
How do I know that's the right number?
So let's go through some steps for how we can find that.
I've made folds.
What I'm gonna do now is I'm gonna fit a model
to these folds.
So I am going to fit a basic random forest
using all those predictors.
I'm gonna fit it to the folds
and then I'm gonna extract out some metrics.
Let's just use some default metrics
that are good for regression, RMSE and R-squared here.
So this is me fitting one model
to each of the 10 folds
and getting results out
that tell me the performance of this model
with the default of 10.
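Here is a sketch of that step, using a parsnip random forest (the default ranger engine) and tune::fit_resamples():

```r
# Fit a basic random forest to each of the 10 resamples and collect
# the default regression metrics, RMSE and R-squared.
library(parsnip)
library(tune)

rf_spec <- rand_forest(mode = "regression")

rf_res <- fit_resamples(rf_spec, outcome ~ ., resamples = sim_folds)
collect_metrics(rf_res)  # performance averaged across the 10 folds
```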
But I gotta do this a bunch of times.
I'm gonna use simulations to do this.
So almost always when I'm writing simulations,
I start with like the small problem and then build up.
So let's start building up and write a function
that will let me do this a whole bunch of times.
So what this function does is first
it generates a new simulated data set,
divides that new simulated dataset
into cross-validation folds,
fits this basic random forest model to it
and then collects the metrics.
And so here, what you're seeing is that I have said,
okay, give me a thousand rows of data;
what do I get with V equals three?
Before, I showed you V equals 10;
here's V equals three.
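A sketch of that wrapper and the call shown here; the function name simulate_resampling() is mine, not from the talk:

```r
# Simulate fresh data, make v folds, fit the random forest, and return
# the per-fold metrics (summarize = FALSE keeps each fold's estimate).
simulate_resampling <- function(num_samples = 1000, v = 10) {
  sim_data <- sim_regression(num_samples = num_samples)
  sim_folds <- vfold_cv(sim_data, v = v)
  rf_res <- fit_resamples(rf_spec, outcome ~ ., resamples = sim_folds)
  collect_metrics(rf_res, summarize = FALSE)
}

simulate_resampling(1000, v = 3)  # a thousand rows, three folds
```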
What I wanna do now is scale this up
and test lots of different values.
So here is where I'm gonna do that.
I'm gonna set up different values of V.
So I'm doing this from four to 24
and I'm stepping forward by two.
And I'm gonna do my simulation 100 times
at each one of these.
So at the four level,
I'm doing four-fold cross-validation 100 times;
at the 24 level, I'll do 24-fold cross-validation
100 times.
And then I'll take these values,
I can map across them, applying my function
and then get out the metrics that I have.
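That scaling-up step might look like this sketch, pairing tidyr::crossing() with purrr::map():

```r
# Run the simulation 100 times at each value of v from 4 to 24,
# stepping by 2; each call simulates new data and resamples it.
library(tidyr)
library(purrr)
library(dplyr)

resampling_results <-
  crossing(v = seq(4, 24, by = 2), sim = 1:100) %>%
  mutate(metrics = map(v, ~ simulate_resampling(1000, v = .x)))
```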
So this is the first thing that I'm putting on the screen
that actually ran long enough
that I might wanna get up and do something else,
go check whatever social media is doing these days.
And I put "efficient" in the title
because when you think about simulation,
this is not usually something that's gonna be deployed
in a production environment and needs really low latency.
We don't need to think about efficiency in that way.
What efficiency means when it comes to simulation
is how well you can use the tools that you have
to get going.
I don't think it's important
that a simulation is over-optimized,
but it should be able to finish in a time
that is useful to you on your analysis timeframe.
All right, so now that we've got this,
let's make a little bit of a visualization to see.
So I'm gonna focus on RMSE.
I could have instead chosen R squared
if I prefer to use that.
And then I'm computing the variance of the RMSE.
What's about to be on the plot is the median RMSE variance
at each of these values of V.
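A sketch of that summary and plot:

```r
# For each simulation, compute the variance of RMSE across its folds,
# then plot the median of those variances at each value of v.
library(ggplot2)

resampling_results %>%
  unnest(metrics) %>%
  filter(.metric == "rmse") %>%
  group_by(v, sim) %>%
  summarize(variance = var(.estimate), .groups = "drop") %>%
  group_by(v) %>%
  summarize(median_variance = median(variance)) %>%
  ggplot(aes(v, median_variance)) +
  geom_line() +
  geom_point()
```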
And here is the plot that this makes.
You can see we start with high variance:
the RMSE we get is jumping up and down a lot.
It goes down very steeply and then it starts tapering off.
So if I look at this plot and ask,
am I gonna look for an elbow here?
I'm gonna say there's an elbow,
maybe around 10, maybe around 12.
And so this is where you start to be able to see
what kind of trade-offs are involved
in any kind of decision that we may make.
We can look at this, talk about it with our collaborators,
and decide what kind of trade-off we wanna make
between how long it takes us
to estimate the performance of our models
and the diminishing returns we get
by bumping V up and up.
So we did it, I answered the question, fantastic.
What if you have another question?
That was variance; what if you want to know
how bias changes as you change
the number of folds?
The answer is you should run a simulation.
Spoiler alert: 10 is about the right number for bias as well;
it gives you a good balance in results.
What if you have more or less data?
So I showed you examples with a thousand data points.
What if you are actually
working with quite small data,
or with something more in the 50,000 or 100,000
or very large range?
Well, you can run a simulation and see how does it change
as you change the size of the dataset.
Spoiler alert, it doesn't really change that much,
that relationship with variance specifically.
What if you're gonna use a different model?
I used random forest here,
but there are tons of different options out there
that you might use in this sort of classic
supervised machine learning environment,
or you could use deep learning here as well.
Just run a simulation and you can find out: does it change?
What is the right answer for you?
Spoiler alert, this actually doesn't really depend
on what kind of model it is;
you will see something pretty flat here.
What if you're interested in doing repeated cross-validation?
This is where I make 10 folds,
go back to my initial data,
make 10 folds again,
go back to my data,
and make 10 folds again.
You could do, say, five repeats
of 10-fold cross-validation.
That gives you 50 folds in total,
but the folds are only shuffled
within each repeat.
Here actually you can get to a bit of a different place
in terms of bias and variance.
When you run this simulation,
you would be able to understand:
okay, if I'm willing to invest five times more
computational time, what can I get out
in terms of bias and variance?
You can get some significant improvements.
What if, say, you wanted to use the bootstrap
instead of V-fold cross-validation?
You can run a simulation that compares them.
And here again, I'll tell you a little bit of the answer.
It turns out that you end up with different trade-offs
in terms of bias and variance.
The bootstrap tends to be low variance, high bias;
cross-validation tends to be low bias, higher variance.
But you can find this out for yourself.
You don't have to take my word for it.
You can run a simulation.
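Both of those variations are one-line changes to the resampling step in the simulation; a sketch:

```r
# Swap either of these into the simulation in place of vfold_cv():
vfold_cv(sim_data, v = 10, repeats = 5)  # 5 repeats of 10-fold: 50 splits
bootstraps(sim_data, times = 25)         # bootstrap resamples instead
```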
Okay, so that was the first question,
plus a walk through how you might ask
other questions with a similar simulation.
Let's move to a different kind of question,
away from how many folds is too many.
And let's ask the question, how many observations do we need?
So let's say you are working on some product
that is just newly launched.
And it is important to you to understand the relationship
between two predictors and an outcome
that's important to your business.
You suspect, though, or maybe know,
that there is an interaction between those two predictors.
Like the value of one predictor changes the relationship
between the other predictor and the outcome.
This is called interaction in statistics.
And you don't know what the effect size is.
This is a new product that you're starting to roll out;
people are starting to use it,
and it is important to your business
to know what the effect size is.
And so you are going to run an experiment
and you're gonna collect data
and build a model and understand this.
You know, a lot of you are probably hearing this
and saying, ah yes, A/B test,
effect size calculator: this is a power calculation,
if I'm asking the question,
how many observations do I need
in order to be able to do something.
All those A/B test calculators
that are out there, though,
typically only work for the most straightforward case,
like you're gonna do a t-test at the end.
If what you wanna do is some more complex kind of model,
and you need to know how many observations
you have to start with
to be able to find the answer you're looking for,
what you need to do there is a simulation.
You might call it a power simulation.
So let's walk through this;
don't worry too much about the details here.
Do notice, though, that I'm writing a function from scratch now
to generate the data.
What I'm doing here
is making my assumptions concrete,
in the way I said at the beginning.
I say, okay, I've got two predictors that are random normal,
and then I make explicit my assumptions
about the relationships between the predictors
and the outcome, and how the predictors
in fact interact with each other.
So when I call this function with a given assumption
about the effect size, I can get out some simulated data.
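A sketch of that from-scratch generator; the coefficients here are hypothetical stand-ins for your own assumptions:

```r
# Two random-normal predictors; the outcome depends on each of them
# plus an interaction whose strength is the effect_size argument.
library(tibble)

simulate_interaction_data <- function(n = 1000, effect_size = 0.5) {
  tibble(
    pred_1 = rnorm(n),
    pred_2 = rnorm(n),
    outcome = 1 +
      0.5 * pred_1 +
      0.7 * pred_2 +
      effect_size * pred_1 * pred_2 +
      rnorm(n, sd = 2)
  )
}
```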
So I take this function and run it one time,
I get a dataset, and I can fit the kind of model
that I might actually use to analyze the data.
Then I can deal with the output.
Notice that the effect size I estimate
for the interaction term is not very big,
and the p-value is fairly large
compared to the estimates for the linear coefficients.
This is really common with interaction terms,
and it's why it is a little more complicated
to be able to detect an interaction between two things.
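Here is a sketch of that single run, with broom::tidy() to clean up the coefficient table:

```r
# Generate one dataset, fit a linear model with an interaction, and
# inspect the estimates; the pred_1:pred_2 row is the interaction term.
library(broom)

sim_once <- simulate_interaction_data(n = 100, effect_size = 0.1)
tidy(lm(outcome ~ pred_1 * pred_2, data = sim_once))
```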
I'm using a straightforward linear model here,
but if you're gonna take an approach
with a more complicated model,
like a hierarchical model or mixed-effects model
or a fully Bayesian model,
you can just put that in here instead.
So let's take these little bits
and wrap them up into a function.
In the function, we make a dataset,
fit the model, and get out the output.
Then, at the bottom,
when I summarize the significance,
p-value less than 0.05,
what this is asking is:
how often am I able to detect the interaction term?
And so if I run this a hundred times
for an effect size of 0.1
and a sample size of 100,
I detect it 40% of the time.
That right there is exactly what power is.
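A sketch of that power function; simulate_power() is my name for it, not the talk's:

```r
# Repeat the generate-fit-test cycle and record how often the
# interaction term comes out significant at the 0.05 level.
simulate_power <- function(n, effect_size, sims = 100) {
  map_dfr(seq_len(sims), function(i) {
    sim_data <- simulate_interaction_data(n = n, effect_size = effect_size)
    tidy(lm(outcome ~ pred_1 * pred_2, data = sim_data)) %>%
      filter(term == "pred_1:pred_2")
  }) %>%
    summarize(power = mean(p.value < 0.05))
}

simulate_power(n = 100, effect_size = 0.1)  # roughly 0.4, per the talk
```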
So I can do it a bunch of times.
I am going to try different values of effect size.
Maybe people have just started using this feature,
and what I want to see is:
how many people will I have to observe
taking this behavior to be able to detect
whether this thing that's important to our business
has happened or not?
And, given our assumptions
about how big the effect size might be,
when will I be able to detect it?
So let's run this a bunch of times.
Let's run it a thousand times
on all these different possible combinations
of numbers of samples and effect size.
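That grid might look like this sketch; the specific ranges are illustrative:

```r
# 1,000 simulations at each combination of sample size and effect size;
# negative and positive effect sizes let us check for symmetry.
power_results <-
  crossing(
    n = seq(100, 1000, by = 100),
    effect_size = seq(-0.2, 0.2, by = 0.05)
  ) %>%
  mutate(power = map2_dbl(
    n, effect_size,
    ~ simulate_power(n = .x, effect_size = .y, sims = 1000)$power
  ))
```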
And so then I get results;
I've done a power simulation here.
Let's make a quick visualization.
And I get a result that looks like this.
So this is symmetric, which is very good;
we would be really surprised if we could detect
the interaction term going one way
but not the other.
What's on the X-axis is the effect size:
how big of an interaction is there?
How much difference
does the value of one predictor make
to the relationship between the other predictor
and the outcome?
On the Y-axis is the power.
So a typical statistical cutoff is 80%,
meaning 80% of the time I would detect a real effect.
And so we can ask: is it important to our business
if the effect size has an absolute value
less than, say, 0.05?
If the answer is no, then great;
we don't have to worry about that region.
But let's say we decide for our business
that an effect size with an absolute value
greater than 0.05 matters.
Then we can come to these lines and say,
okay, I'm gonna need 700, 900, 1,000 samples
to be able to tell you that.
So this is an example of how to answer
a question using simulation.
We can make our assumptions concrete,
and then we are able to make a better decision
than we would otherwise.
All right, in my last bit of time,
I wanna talk about one other kind of question
you can answer with simulation.
So first we had: how many folds are too many?
Then: how many observations do I need?
And now: how important is this relationship?
The code that I've shown you so far
is all really basic tidyverse and tidymodels code;
you could write it out in whatever language you use.
But here is a package that's a little more special,
a really unique idea.
It's called the nullabor package,
and it is for graphical inference,
which means doing statistical inference visually.
So let's simulate one more dataset.
Like before, it's gonna have two predictors and an outcome,
but now predictor one is linearly related to the outcome,
while predictor two is related through a log,
so its rate of change is very different.
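A sketch of that dataset; the coefficients and the log(abs(...)) form are hypothetical, chosen to match the description here:

```r
# Predictor one enters linearly; predictor two enters through
# log(abs(...)), so its rate of change is very different.
set.seed(345)
viz_data <- tibble(
  pred_1 = rnorm(1000),
  pred_2 = rnorm(1000),
  outcome = 2 * pred_1 + log(abs(pred_2)) + rnorm(1000, sd = 2)
)
```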
I feel like in many situations of interest,
you end up with these power law relationships;
power laws are everywhere around us all the time,
so this is actually a fairly realistic setup.
And it can sometimes be hard to communicate to stakeholders,
especially less technical stakeholders,
what it really means when there is a power law in something,
and what that means in terms of our leverage,
of how easy something is to change.
So let's just make some quick visualizations.
Here's the linear one.
Predictor one is on the X-axis
and the outcome is on the Y-axis;
we can see that sort of linear change there.
Here's predictor two.
It looks quite different, right?
There's the absolute value in there.
We kind of see it going up in these different ways,
but at these values,
that one looks a little more like a blob.
So what the nullabor package lets you do
is make what's called a lineup.
The metaphor here is that you've been taken
to the police station
to look at a lineup of possible criminals:
can you identify the one that you saw before?
The kind of lineup I'm gonna use now
is a permutation under the null hypothesis.
The graphical inference we're doing asks:
given the null hypothesis,
how unlikely is the effect that we're seeing?
So yes, I'm talking about p-values,
p-values done via visualization.
We're doing a simulation, like we've talked about before,
and then we're gonna get a visual p-value.
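A sketch of building such a lineup with nullabor, here permuting the first predictor:

```r
# null_permute() shuffles pred_1, breaking any real relationship;
# lineup() hides the true data among 19 permuted decoys, indexed by
# the .sample column.
library(nullabor)

lineup(null_permute("pred_1"), viz_data) %>%
  ggplot(aes(pred_1, outcome)) +
  geom_point(alpha = 0.3) +
  facet_wrap(vars(.sample))
```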
So let's look at this, everyone,
and then I'll drop a question.
I'm putting this in the Ask the Speakers Slack channel,
so if you wanna go in there and maybe thread it:
which of these do you think
is the real relationship?
Not a permuted one that was simulated, but the real one.
Somebody go in and put what they think it is;
I'll just wait for at least one person to go in.
Yes, okay, I think this one's pretty easy to find.
14 is the real one: 14 is the real relationship,
and the rest of them are just random.
If we did this over and over with a lot of people,
I could actually compute a p-value with that,
because there are 20 panels here;
picking the real one by chance happens one time in 20.
Let's now look at the next one.
This is the second predictor.
Let's permute it and see whether we can see it or not.
So here is the second one;
let me put this in here too.
Which one in the lineup do you think
is the real one versus the permuted random ones?
Take a look.
I think this one is harder, but still possible.
And if I actually bumped the standard deviation
up a little bit, you would not be able
to see it at all; literally, there would be no way to tell.
But yes, that's right: this one is still two,
which means it is probably still statistically significant,
that it is different from random,
and we are able to see it.
So this is an example of a way to use simulation
with relationships that you may have in your real data,
to understand how important they are,
in a way that is very accessible
to the people you may need to communicate with about it.
It's a really helpful exercise.
Like the rest of these examples, what that shows us
is how simulation can be this really powerful tool.
We are able to get the assumptions we have
out of our heads and our collaborators' heads
and into code,
we're able to talk about trade-offs
and see what they really are,
and ultimately make better decisions.
I think almost all of us here are data folks,
and usually we try to use data to make decisions.
So again, I think there's almost nothing more
in the norm-conf ethos
than to say, oh, I don't have any data
to make this decision, I will just make some up.
But I think it's a sign of health,
and a sign of using a tool that's available to us,
to be able to generate data
that helps us understand trade-offs
and make things concrete.
So with that, I will say thank you very much
and see if we have any questions we wanna chat about.