Just use one big machine for model training and inference - Josh Wills
Transcript generated with OpenAI Whisper large-v2.
My name is Josh Wills.
I am a DuckDB enthusiast.
I am also a software engineer
at a little company called WeaveGrid.
At least I am for one more day.
Tomorrow is my last day at WeaveGrid.
I'm taking a little break to do nothing and spend some time with
my family and my favorite embedded OLAP database,
which is DuckDB.
But today, I'm not going to talk about DuckDB.
I'm going to talk about using one big machine
for model training and inference.
To that end, I want to tell a story.
It's a story where I have adapted
the midwit meme to be entirely about myself, more or less.
From young Josh to mid-career Josh,
to elderly sage Josh, if you will, roughly speaking.
How my thinking around using one big machine for doing ML,
doing model training, doing model inference
has evolved over time.
To that end, I want to talk about my first job.
I graduated from college back in 2001.
My very first job was working at IBM in Austin,
Texas, at what's called a microprocessor
bring-up facility.
It's a place where engineers develop and test
microprocessors before they fully work,
would be the way to say it.
The logic of the microprocessor is very new,
so we're still testing it, we're still iterating on it,
we're still trying to figure out how well these things work.
I was hired as a data analyst to basically analyze
all of this data they were collecting about
microprocessors to see if I could build
predictive models that would give us
an idea of whether a particular processor
would run at a certain clock speed,
or whether it would work at all.
Since this was 2001, to do this work,
I had a big computer underneath my desk,
and that big computer ran MySQL database,
and it ran a web server,
and it ran R and Perl,
because that's what we did back in 2001, we wrote Perl.
My job was to build dashboards,
again, primarily in Perl,
and write analyses primarily in R,
like way back in the day, to build predictive models.
It was pretty great.
It was about five gigabytes of data,
which I'll be honest with you all,
seemed like a lot of data back in 2001.
That was not a small amount of data back then.
Yeah, that was what I did, that was my job.
I'm doing this for a while,
and I'm building my dashboards,
and I start getting curious,
I think in the way that Juliet was just talking about a second ago,
about how exactly this database software worked,
and how it was configured,
and how I could make it run better,
how I could make my dashboards run faster,
how I could pull more data out of the database,
all that kind of stuff.
And so I started getting pretty good
at administering a MySQL database.
And I was proud of myself,
because everyone thought this was useful,
and people would come and ask me questions,
and stuff like that.
I was feeling very good about myself as an engineer.
And then fortunately, my boss at the time,
this guy named Greg Wettele,
came and took me aside and said,
stop getting good at that.
Don't get any better at that than you are right now,
because if you get too good
at administering a MySQL database,
people will pay you to administer databases
for the rest of your life.
And that's probably not what you want to do.
Like, it's not actually going to lead you to a happy place.
Like, there's other stuff that you should be doing instead.
And so I was like, wow, Greg, thank you.
That's like super great, super useful management advice.
And I think it's good for all the managers out there
to ask their direct reports, when they're getting good
at something: are you sure you want to get good at that?
So I took Greg's advice,
and then I proceeded to ignore it
for the next 20 years or so, roughly speaking.
I left IBM, and I went and worked at a bunch of startups,
and then I went to Google, and then I went to Cloudera,
and then I went to Slack.
And in all this time, I got sort of, unfortunately,
really, really good at building large-scale data pipelines
on distributed systems and stuff like that.
So this picture I'm showing you here
is a little image I grabbed from an AWS blog post
about running Spark on top of Kubernetes using EMR.
And I understand all of the things in this picture.
I know how to use all of the tools you see here.
And that's not a good thing.
That's kind of like a tragedy.
No one should actually understand
how to use all of this stuff.
But sadly, I do, because I have used all of these
technologies at some point in my career.
And so I can look at this thing and be like, yeah, okay,
this makes sense to me.
I sort of see how this all hangs together.
I could use something like this.
So anyway, that's not a great place to be,
but that was what I did.
That's what I've done for the last 20 years or so.
So anyway, four years at Google, four years at Cloudera,
four years at Slack,
I was pretty tired after all of that kind of stuff.
And so I decided at the end of 2019
that I would take a little break.
I was pretty burned out,
didn't really want to be doing stuff with computers anymore.
So I decided November, 2019, that's it, I'm done.
I'm gonna step away.
And I really didn't like touch a computer in anger
for about four months or so.
I really just kind of like traveled and read books
and just kind of like reset myself.
And it was really great.
And I was really like asking myself,
like, what do I want to do?
I was basically having like a midlife crisis, more or less.
I think would be like the technical term for it.
Like, what do I want to do?
Why, like, what is the point of all of this like stuff
I've learned how to do over the course of my career?
Like, do I want to keep doing this?
Is this making me happy?
All that sort of stuff.
And then as it so happened, March, 2020 rolled around
and I got a call from my friend, DJ Patil.
He used to be Chris's boss at Devoted Health
before DJ had the good sense to fire Chris.
And he's done a lot of cool stuff.
He called me up and he's like,
I am going over to Sacramento.
This is like early March, 2020,
to help out with COVID relief stuff.
And I think I could use some software engineers
and some data people and you are unemployed.
And will you like, can you give me a hand basically?
And I'm like, sure, DJ, anything you need, happy to help.
So he introduced me to a team of epidemiologists
at Johns Hopkins and folks
at the Department of Public Health in California.
And he said, we are running these big models,
these gigantic simulations to understand the impact
of social distancing, of shutting down the schools,
of all this kind of stuff on the spread of COVID
in the state of California over the next few weeks.
We've been doing this for a few weeks.
We're all really tired.
No one slept.
Can you help us?
Can you help us run more of these things?
Can you help us run a lot more of these simulations?
And can you do it really, really quickly?
Cause we have to present this stuff to the governor
in like two days.
And so I was like, yes, absolutely.
I would love to help.
Here's basically what we're gonna do.
We're gonna take a whole bunch,
we're gonna take your program and your simulations.
And we are going to spend a bunch of money
and spin up the largest machines we can possibly buy on AWS.
And we're gonna run as many of these simulations
as humanly possible in the next 48 to 72 hours.
And sort of just kind of see how it goes.
Like that's our plan.
And I was very nervous about doing this
because I have been all about horizontal scalability
my entire career.
And here I am in this emergency situation saying,
okay, forget all of that stuff.
We're gonna do vertical scaling just as fast as we can,
as much money as we can.
Doesn't matter, like let's go to it.
And so that's what we did.
We went to AWS, swiped the old credit card
and got ourselves some really, really big machines.
Honestly, the hardest part of the process
was finding someone at AWS to like lift my reservation limit
so I could basically get more machines.
I was like, AWS,
please let me give you more money,
and they wouldn't let me.
That was the hardest part of the problem.
And so I did that, I did that kind of work for a few months
and I absolutely loved it.
And it helped me rediscover and rekindle my joy
in doing data science and doing machine learning
and doing computer stuff in general.
And I wanna kind of talk about why and how great it was
because it's kind of stuck with me.
And after that sort of work was done
and I moved on from doing COVID stuff and I was like,
okay, how else can I help?
Because to be honest with you,
helping is like just the best feeling in the world.
I decided to go work at a little company called WeaveGrid,
which was all of like 10 people at the time
that wanted to build managed charging systems for EVs.
And if you're building managed charging systems for EVs,
there's a couple of different interesting
machine learning problems you have to solve.
You have to get good, first of all,
at identifying households that have electric vehicles
from the patterns in their meter data.
You have to get good at predicting
when those cars are going to get home.
You have to get good at knowing how often
you can wake up a car without running down its battery.
There's all kinds of interesting, interesting ML stuff
that you have to solve to do these things.
And my commitment when I got to WeaveGrid
was to keep things as normcore as I could,
to kind of stick with this model of like one big machine,
solve all the problems with the one big machine
whenever possible.
And I want to tell you about like why I think
you should embrace this as well
and why using one big machine
to solve machine learning problems
is the right decision for almost everybody.
Because it just really gives you
like just an enormous set of benefits.
And that's what I want to walk through here.
The first benefit is it's an incredibly useful heuristic
for identifying important problems.
There's a famous kind of story of the band Van Halen
back in the 80s having this very long, complicated contract
for their concert venues.
And there was a little tiny item in the contract
and it said that there should be M&Ms available
on the craft services stand
and there should be absolutely no brown M&Ms
in the bowl of M&Ms.
And it always seemed like kind of a ridiculous thing to do
but it was actually very clever
because Van Halen's contract was very complex
and very detailed and involved all these
like crazy pyrotechnics and stuff like that
that were fairly dangerous.
And so it was a useful heuristic for them to check
and see if someone had carefully read the contract
by just looking at the craft services table
and seeing if there were any brown M&Ms in the bowl.
And to me, that's the same kind of thing
around using one big machine for machine learning.
If your boss says to you,
hey, I would like you to solve this sort of problem
and I think machine learning could help.
If you say to them, okay, great, I can do it
and I can do it pretty quickly
but I'm gonna need to rent an r6a instance from AWS
and it's gonna cost like $12 an hour, is that okay with you?
And if they say, you know, actually, no,
it's not okay for you to do that,
that's a great signal that the problem
they're asking you to solve
with machine learning is not actually that important.
And machine learning should really only be used
for very, very important problems.
Like this is my opinion.
We should use machine learning for important problems
and we should be careful about how we do it.
We should be thoughtful about how we do it.
Like the cost is a feature in my opinion, not a bug.
I think that if we distract ourselves
by building platforms and tools
and all that kind of stuff,
we're not really solving the real problem.
We're kind of solving around the problem.
Whereas if we're focused on spending a lot
of money to get an answer to this question
very, very quickly, it focuses the mind.
It keeps our eye on the ball:
what is the impact of us solving this problem?
And if the impact is not enough money
to justify the cost,
then really, why are we doing this?
That's not a good use of our time.
We can do other stuff instead.
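To put rough numbers on that, using the $12 an hour figure from a second ago and assuming, just as an illustration, a solid week of around-the-clock work: 24 hours times 7 days times $12 is about $2,000. If the problem isn't obviously worth $2,000, it's probably not worth pointing machine learning at in the first place.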
The other sort of great aspect of using a single big machine
is it lets you save your innovation tokens.
There's a great blog post by a guy named Dan McKinley.
He goes by McFunley on the internet
and he wrote it back in 2015
and it's called Choose Boring Technology.
And in this blog post, he introduced the notion
of innovation tokens.
The idea is that every technology company
has a certain number of innovation tokens.
There's like a small set of things where you are allowed
to like do something super, super cool,
like use some crazy new framework
or some like ridiculous data store that you wrote yourself
or something along those lines, right?
And you get a few of these tokens
but only a very few of them.
And the great thing about eliminating sort of the network
and eliminating distributed systems
from your machine learning workflow
is you basically get one of your innovation tokens back.
So if you wanna use Ray or Dask or PyTorch
and you've never used it before, it's okay
because you're only running things on a single machine.
There's nothing bad it can do to you.
It can't hurt you. And not for nothing,
if you run a cool new framework on a single machine
and you don't like it and it causes you pain and suffering,
that's great, that's fantastic news
because then you can just throw it away
and go back to doing things with plain multiprocessing,
the way you were supposed to in the first place.
Like that's awesome, awesome news.
You should be really happy about that when that happens.
So using one big machine is just a great way
to like let yourself have some fun
with some new cool tools, new cool frameworks
without really incurring the cost
that they would impose on you
if you were trying to combine them with clustering,
distributed systems, sort of all that horrible stuff, okay?
Keep those innovation tokens, keep them fungible.
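Just to make that fallback concrete, here's a minimal sketch of the plain-multiprocessing version on one big machine. The run_simulation function is a hypothetical stand-in for whatever training or simulation work you'd otherwise reach for a fancy framework to do:

```python
# Fan work out across every core of one big machine with the standard library.
from multiprocessing import Pool

def run_simulation(params):
    # Hypothetical stand-in: fit a model or run one scenario, return a result.
    return {"params": params, "score": sum(params) / len(params)}

if __name__ == "__main__":
    scenarios = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # placeholder inputs
    with Pool() as pool:  # defaults to one worker per core
        results = pool.map(run_simulation, scenarios)
    print(results)
```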
When I originally submitted this talk to NormConf,
I tweeted about it thinking it would be funny
and that I was trolling everybody with it.
And I got this reply from Rob Story,
he's an engineer at Stripe.
And it was just great and validating to me
that Stripe still trains machine learning models
using one big machine.
And like basically my opinion is,
machine learning is plenty hard on its own.
And if one big machine is good enough for Stripe,
it's good enough for you too.
All right, one more sort of virtue, one more benefit,
at least this is something I felt very acutely
doing my COVID work.
Make feedback loops fast, make feedback loops fast.
This is Erik Bernhardsson, who used to be at Spotify
and wrote a lot of the recommender logic once upon a time.
And he's now working at a company called Modal,
building tooling to make it easy and fast, once again,
to take advantage of gigantic machines in the cloud
for data processing and machine learning use cases.
And so I love, love, love this talk
because it gets to the heart of what was my very,
very favorite MLOps tool by a wide margin.
And that's htop.
If you've never used htop, I highly recommend it.
htop is a slightly more sophisticated version of top,
which is the Unix utility that shows you
what's running on your machine.
htop is that, it's top,
but it also shows you, across all the cores you're running,
how hard each one is being utilized,
all that kind of stuff.
You can see the memory usage per process.
It's really like great, great visibility
into like what exactly is my data pipeline
or my feature engineering pipeline or whatever.
What is it doing on this machine right now?
And then if you combine htop with logs
and the tail command to tail your logs,
you have an interactive single pane of glass
on a single machine into all the stuff
that your machine learning model is doing.
And this is fantastic.
This is a visceral, joyful way to experience
what your machine learning pipeline is doing,
and it makes it very quick and easy to identify
when things get stuck, when tasks run out of memory,
when you're supposed to be running 64 tasks in parallel
but for some reason you're only running six,
things like that.
It just makes all of this super easy to do,
super easy to check.
So it's just fantastic, fantastic tooling.
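If you want that same kind of visibility from inside your own code, here's a minimal sketch using psutil, which is my own choice of library here rather than something from the talk. It prints roughly the numbers htop shows you: per-core utilization and the memory your process is using.

```python
# A rough, htop-ish status line from inside Python (pip install psutil).
import os
import psutil

per_core = psutil.cpu_percent(interval=1.0, percpu=True)  # % busy per core
busy = sum(1 for pct in per_core if pct > 50)
mem_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9  # resident memory

print(f"{busy}/{len(per_core)} cores busy, this process is using {mem_gb:.1f} GB RAM")
```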
How should you configure your machine?
Let's talk about the three sort of rules here.
First, you need to choose an instance type.
And if you've never done this before,
it can be very confusing.
I was just looking at Amazon's page
and there's like 576 different instance types.
So how do I know which kind of instance I should run
to do my machine learning problem?
And the answer is RAM.
RAM, as much RAM as possible,
ideally enough so that you can fit all of your data
in RAM itself.
That's kind of your goal.
You want to take the minimum of
the amount of money you're allowed to spend
and how much RAM you need to fit
all of your data, and pick that:
as much RAM as possible.
Compute-optimized instances are a sucker's game.
They're fine for video encoding and stuff,
but they're not right
for machine learning applications.
Storage, NVMe, you don't care about that stuff.
All you really want is RAM.
Just focus on RAM, get as much RAM as you can.
That's sort of rule one.
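As a concrete sketch, spinning one of these up with boto3 might look roughly like this. The AMI, the key pair, and the specific r6a size are placeholder assumptions on my part, not recommendations from the talk:

```python
# Launch one big memory-optimized EC2 instance.
# Values marked YOUR-... are placeholders for your own account and region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-YOUR-LINUX-AMI",      # placeholder AMI ID
    InstanceType="r6a.48xlarge",       # 192 vCPUs, 1.5 TiB RAM, roughly the $12-an-hour box
    KeyName="YOUR-KEY-PAIR",           # placeholder SSH key pair name
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```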
Once you have your RAM,
you're gonna need some more storage.
So here's one of the funny comedy-of-errors things
that happens to a lot of people
when they're using one big machine.
When you provision a new instance on EC2,
it only has 20 gigabytes of disk,
which, while that's four times the size
of the MySQL database I ran at IBM,
is a laughably small amount of disk in the modern day.
So what you need to do is go to EBS,
Elastic Block Store,
and get lots and lots of disk,
a terabyte of disk, however much you need,
and then configure it and attach it
to your EC2 instance.
So you have a really, really big disk
that you can keep all your data in
and do all your work on and all that kind of stuff.
It's a super common rookie mistake to skip that process,
inadvertently use up all the disk on the machine,
and end up killing it, stuff like that.
So get your big machine with lots of RAM,
go get a lot more storage.
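Here's a rough sketch of that step with boto3. The availability zone, instance ID, and device name are placeholders, and you'd still need to format and mount the volume from the instance itself:

```python
# Create a big EBS volume and attach it to your running instance.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",     # must match the instance's AZ
    Size=1000,                         # 1 TB, in GiB
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-YOUR-INSTANCE-ID",   # placeholder instance ID
    Device="/dev/sdf",
)
# On the instance, you'd then mkfs and mount the new device before using it.
```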
Next, clean up after yourself.
Like when you're not using the machine, just turn it off.
And when you're done with the EBS volume,
this is another sneaky thing that Amazon does:
when you stop the EC2 instance,
the EBS volume that's attached to it,
that's your big disk, is still running.
You're still paying for it.
So be sure to snapshot that EBS volume
and save it away somewhere, or just copy your data out to S3,
and then delete the EBS volume.
Get rid of it so you're not setting money on fire,
leaving that thing running for weeks on end.
Be good, be like daddy robot, clean up after yourself.
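A minimal sketch of that teardown with boto3; the instance and volume IDs are placeholders, and whether you snapshot or copy out to S3 first is up to you:

```python
# Snapshot the big EBS volume, stop the instance, then delete the volume
# so you stop paying for it. IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

snapshot = ec2.create_snapshot(
    VolumeId="vol-YOUR-VOLUME-ID",
    Description="save the data before tearing the big machine down",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

ec2.stop_instances(InstanceIds=["i-YOUR-INSTANCE-ID"])
ec2.get_waiter("instance_stopped").wait(InstanceIds=["i-YOUR-INSTANCE-ID"])

ec2.detach_volume(VolumeId="vol-YOUR-VOLUME-ID")
ec2.get_waiter("volume_available").wait(VolumeIds=["vol-YOUR-VOLUME-ID"])
ec2.delete_volume(VolumeId="vol-YOUR-VOLUME-ID")
```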
And then finally, do the minimum viable
amount of automation that you need
to not drive yourself crazy.
Refer to this xkcd chart, and be good about making sure
you're not pushing things too hard,
that you don't automate for automation's sake.
Machine learning is always kind of zero to one.
It's always experimental.
You're always gonna be changing things.
So don't over-automate things too early.
Premature automation is bad,
is kind of my long-winded point here.
And with that, go forth and prosper.
Brag to your friends about running, you know,
192 cores with one and a half terabytes of RAM.
For what it's worth, your kids will think
that's a laughably small amount of RAM
and a laughably small number of cores.
Like their phones are gonna have way more storage
than that, right?
But for now, that stuff's pretty cool.
Just as cool as like a five gigabyte database
was way back in 2001.
And with that, thank you very much.
And I am happy to take questions.