Just use one big machine for model training and inference - Josh Wills
Transcript generated with OpenAI Whisper large-v2.
My name is Josh Wills.
I am a DuckDB enthusiast.
I am also a software engineer
at a little company called WeaveGrid.
At least I am for one more day.
Tomorrow is my last day at WeaveGrid.
I'm taking a little break to do nothing and spend some time with
my family and my favorite embedded OLAP database,
which is DuckDB.
But today, I'm not going to talk about DuckDB.
I'm going to talk about using one big machine
for model training and inference.
To that end, I want to tell a story.
It's a story where I have adapted
the midwit meme to be entirely about myself, more or less.
From young Josh to mid-career Josh,
to elderly sage Josh, if you will, roughly speaking.
How my thinking around using one big machine for doing ML,
doing model training, doing model inference
has evolved over time.
To that end, I want to talk about my first job.
I graduated from college back in 2001.
My very first job was working at IBM in Austin,
Texas, at what's called a microprocessor
bring-up facility.
It's a place where engineers develop and test
microprocessors before they fully work,
would be the way to say it.
The logic of the microprocessor is very new,
so we're still testing it, we're still iterating on it,
we're still trying to figure out how well these things work.
I was hired as a data analyst to basically analyze
all of this data they were collecting about
microprocessors to see if I could build
predictive models that would give us
an idea of whether a particular processor
would run at a certain clock speed,
or whether it would work at all.
Since this was 2001, to do this work,
I had a big computer underneath my desk,
and that big computer ran MySQL database,
and it ran a web server,
and it ran R and Perl,
because that's what we did back in 2001, we wrote Perl.
My job was to build dashboards,
again, primarily in Perl,
and write analyses primarily in R,
like way back in the day, to build predictive models.
It was pretty great.
It was about five gigabytes of data,
which I'll be honest with you all,
seemed like a lot of data back in 2001.
That was not a small amount of data back then.
Yeah, that was what I did, that was my job.
I'm doing this for a while,
and I'm building my dashboards,
and I start getting curious,
I think in the way that Juliet was just talking about a second ago,
about how exactly this database software worked,
and how it was configured,
and how I could make it run better,
how I could make my dashboards run faster,
how I could pull more data out of the database,
all that kind of stuff.
And so I started getting pretty good
at administering a MySQL database.
And I was proud of myself,
because everyone thought this was useful,
and people would come and ask me questions,
and stuff like that.
I was feeling very good about myself as an engineer.
And then fortunately, my boss at the time,
this guy named Greg Wettele,
came and took me aside and said,
stop getting good at that.
Don't get any better at that than you are right now,
because if you get too good
at administering a MySQL database,
people will pay you to administer databases
for the rest of your life.
And that's probably not what you want to do.
Like, it's not actually going to lead you to a happy place.
Like, there's other stuff that you should be doing instead.
And so I was like, wow, Greg, thank you.
That's like super great, super useful management advice.
And I think it's good for all the managers out there
to ask their direct reports, when they're getting good
at something: are you sure you want to get good at that?
So I took Greg's advice,
and then I proceeded to ignore it
for the next 20 years or so, roughly speaking.
I left IBM, and I went and worked at a bunch of startups,
and then I went to Google, and then I went to Cloudera,
and then I went to Slack.
And in all this time, I got sort of, unfortunately,
really, really good at building large-scale data pipelines
on distributed systems and stuff like that.
So this picture I'm showing you here
is a little image I grabbed from an AWS blog post
about running Spark on top of Kubernetes using EMR.
And I understand all of the things in this picture.
I know how to use all of the tools you see here.
And that's not a good thing.
That's kind of like a tragedy.
No one should actually understand
how to use all of this stuff.
But sadly, I do, because I have used all of these
technologies at some point in my career.
And so I can look at this thing and be like, yeah, okay,
this makes sense to me.
I sort of see how this all hangs together.
I could use something like this.
So anyway, that's not a great place to be,
but that was what I did.
That's what I've done for the last 20 years or so.
So anyway, four years at Google, four years at Cloudera,
four years at Slack,
I was pretty tired after all of that kind of stuff.
And so I decided at the end of 2019
that I would take a little break.
I was pretty burned out,
didn't really want to be doing stuff with computers anymore.
So I decided November, 2019, that's it, I'm done.
I'm gonna step away.
And I really didn't like touch a computer in anger
for about four months or so.
I really just kind of like traveled and read books
and just kind of like reset myself.
And it was really great.
And I was really like asking myself,
like, what do I want to do?
I was basically having like a midlife crisis, more or less.
I think would be like the technical term for it.
Like, what do I want to do?
Why, like, what is the point of all of this like stuff
I've learned how to do over the course of my career?
Like, do I want to keep doing this?
Is this making me happy?
All that sort of stuff.
And then as it so happened, March, 2020 rolled around
and I got a call from my friend, DJ Patil.
He used to be Chris's boss at Devoted Health
before DJ had the good sense to fire Chris.
And he's done a lot of cool stuff.
He called me up and he's like,
I am going over to Sacramento.
This is like early March, 2020,
to help out with COVID relief stuff.
And I think I could use some software engineers
and some data people and you are unemployed.
And will you like, can you give me a hand basically?
And I'm like, sure, DJ, anything you need, happy to help.
So he introduced me to a team of epidemiologists
at Johns Hopkins and folks
at the Department of Public Health in California.
And he said, we are running these big models,
these gigantic simulations to understand the impact
of social distancing, of shutting down the schools,
of all this kind of stuff on the spread of COVID
in the state of California over the next few weeks.
We've been doing this for a few weeks.
We're all really tired.
No one slept.
Can you help us?
Can you help us run more of these things?
Can you help us run a lot more of these simulations?
And can you do it really, really quickly?
Cause we have to present this stuff to the governor
in like two days.
And so I was like, yes, absolutely.
I would love to help.
Here's basically what we're gonna do.
We're gonna take a whole bunch,
we're gonna take your program and your simulations.
And we are going to spend a bunch of money
and spin up the largest machines we can possibly buy on AWS.
And we're gonna run as many of these simulations
as humanly possible in the next 48 to 72 hours.
And sort of just kind of see how it goes.
Like that's our plan.
And I was very nervous about doing this
because I have been all about horizontal scalability
my entire career.
And here I am in this emergency situation saying,
okay, forget all of that stuff.
We're gonna do vertical scaling just as fast as we can,
as much money as we can.
Doesn't matter, like let's go to it.
And so that's what we did.
We went to AWS, swiped the old credit card
and got ourselves some really, really big machines.
Honestly, the hardest part of the process
was finding someone at AWS to like lift my reservation limit
so I could basically get more machines.
I was like, AWS,
please let me give you more money,
and they wouldn't let me.
That was the hardest part of the problem.
And so I did that, I did that kind of work for a few months
and I absolutely loved it.
And it helped me rediscover and rekindle my joy
in doing data science and doing machine learning
and doing computer stuff in general.
And I wanna kind of talk about why and how great it was
because it's kind of stuck with me.
And after that sort of work was done
and I moved on from doing COVID stuff and I was like,
okay, how else can I help?
Because to be honest with you,
helping is like just the best feeling in the world.
I decided to go work at a little company called WeaveGrid,
which was all of like 10 people at the time
that wanted to build managed charging systems for EVs.
And if you're building managed charging systems for EVs,
there's a couple of different interesting
machine learning problems you have to solve.
You have to get good, first of all,
at identifying households that have electric vehicles
from the patterns in their meter data.
You have to get good at predicting
when those cars are going to get home.
You have to get good at knowing how often
you can wake up a car without running down its battery.
There's all kinds of interesting, interesting ML stuff
that you have to solve to do these things.
And my commitment when I got to WeaveGrid
was to keep things as normcore as I could,
to kind of stick with this model of like one big machine,
solve all the problems with the one big machine
whenever possible.
And I want to tell you about like why I think
you should embrace this as well
and why using one big machine
to solve machine learning problems
is the right decision for almost everybody.
Because it just really gives you
like just an enormous set of benefits.
And that's what I want to walk through here.
The first benefit is it's an incredibly useful heuristic
for identifying important problems.
There's a famous kind of story of the band Van Halen
back in the 80s having this very long, complicated contract
for their concert venues.
And there was a little tiny item in the contract
and it said that there should be M&Ms available
on the craft services stand
and there should be absolutely no brown M&Ms
in the bowl of M&Ms.
And it always seemed like kind of a ridiculous thing to do
but it was actually very clever
because Van Halen's contract was very complex
and very detailed and involved all these
like crazy pyrotechnics and stuff like that
that were fairly dangerous.
And so it was a useful heuristic for them to check
and see if someone had carefully read the contract
by just looking at the craft services table
and seeing if there were any brown M&Ms in the bowl.
And to me, that's the same kind of thing
around using one big machine for machine learning.
If your boss says to you,
hey, I would like you to solve this sort of problem
and I think machine learning could help.
If you say to them, okay, great, I can do it
and I can do it pretty quickly
but I'm gonna need to rent an r6a instance from AWS
and it's gonna cost like $12 an hour, is that okay with you?
And if they say, you know, actually, no,
it's not okay for you to do that,
that's a great signal that the problem
they're asking you to solve
with machine learning is not actually that important.
And machine learning should really only be used
for very, very important problems.
Like this is my opinion.
We should use machine learning for important problems
and we should be careful about how we do it.
We should be thoughtful about how we do it.
Like the cost is a feature in my opinion, not a bug.
I think that if we distract ourselves
by building platforms and tools
and all that kind of stuff,
we're not really solving the real problem.
We're kind of solving around the problem.
Whereas if we're focused on spending a lot
of money to get an answer to this question
very, very quickly, it focuses the mind.
It keeps our eye on the ball:
what is the impact of us solving this problem?
And if the impact is not enough money
to justify the cost,
then really, why are we doing this?
That's not a good use of our time.
We can do other stuff instead.
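To put rough numbers on that, using the $12 an hour figure from a second ago and assuming, just as an illustration, a solid week of around-the-clock work: 24 hours times 7 days times $12 is about $2,000. If the problem isn't obviously worth $2,000, it's probably not worth pointing machine learning at in the first place.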
The other sort of great aspect of using a single big machine
is it lets you save your innovation tokens.
There's a great blog post by a guy named Dan McKinley.
He goes by McFunley on the internet
and he wrote it back in 2015
and it's called Choose Boring Technology.
And in this blog post, he introduced the notion
of innovation tokens.
The idea is that every technology company
has a certain number of innovation tokens.
There's like a small set of things where you are allowed
to like do something super, super cool,
like use some crazy new framework
or some like ridiculous data store that you wrote yourself
or something along those lines, right?
And you get a few of these tokens
but only a very few of them.
And the great thing about eliminating sort of the network
and eliminating distributed systems
from your machine learning workflow
is you basically get one of your innovation tokens back.
So if you wanna use Ray or Dask or PyTorch
and you've never used it before, it's okay
because you're only running things on a single machine.
There's nothing bad it can do to you.
It can't hurt you. And not for nothing,
if you run a cool new framework on a single machine
and you don't like it and it causes you pain and suffering,
that's great, that's fantastic news
because then you can just throw it away
and go back to doing things with plain multiprocessing,
the way you were supposed to in the first place.
Like that's awesome, awesome news.
You should be really happy about that when that happens.
So using one big machine is just a great way
to like let yourself have some fun
with some new cool tools, new cool frameworks
without really incurring the cost
that they would impose on you
if you were trying to combine them with clustering,
distributed systems, sort of all that horrible stuff, okay?
Keep those innovation tokens, keep them fungible.
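Just to make that fallback concrete, here's a minimal sketch of the plain-multiprocessing version on one big machine. The run_simulation function is a hypothetical stand-in for whatever training or simulation work you'd otherwise reach for a fancy framework to do:

```python
# Fan work out across every core of one big machine with the standard library.
from multiprocessing import Pool

def run_simulation(params):
    # Hypothetical stand-in: fit a model or run one scenario, return a result.
    return {"params": params, "score": sum(params) / len(params)}

if __name__ == "__main__":
    scenarios = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # placeholder inputs
    with Pool() as pool:  # defaults to one worker per core
        results = pool.map(run_simulation, scenarios)
    print(results)
```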
When I originally submitted this talk to NormConf,
I tweeted about it thinking it would be funny
and that I was trolling everybody with it.
And I got this reply from Rob Story,
he's an engineer at Stripe.
And it was just great and validating to me
that Stripe still trains machine learning models
using one big machine.
And like basically my opinion is,
machine learning is plenty hard on its own.
And if one big machine is good enough for Stripe,
it's good enough for you too.
All right, one more sort of virtue, one more benefit,
at least this is something I felt very acutely
doing my COVID work.
Make feedback loops fast, make feedback loops fast.
This is Erik Bernhardsson, who used to be at Spotify
and wrote a lot of the recommender logic once upon a time.
And he's now working at a company called Modal,
building tooling to make it easy and fast, once again,
to take advantage of gigantic machines in the cloud
for data processing and machine learning use cases.
And so I love, love, love this talk
because it gets to the heart of what was my very,
very favorite MLOps tool by a wide margin.
And that's htop.
If you've never used htop, I highly recommend it.
htop is a slightly more sophisticated version of top,
which is the Unix utility that shows you
what's running on your machine.
htop is that, it's top,
but it also shows you, across all the cores you're running,
how hard each one is being utilized,
all that kind of stuff.
You can see the memory usage per process.
It's really like great, great visibility
into like what exactly is my data pipeline
or my feature engineering pipeline or whatever.
What is it doing on this machine right now?
And then if you combine htop with logs
and the tail command to tail your logs,
you have an interactive single pane of glass
on a single machine into all the stuff
that your machine learning model is doing.
And this is fantastic.
This is a visceral, joyful way to experience
what your machine learning pipeline is doing,
and it makes it very quick and easy to identify
when things get stuck, when tasks run out of memory,
when you're supposed to be running 64 tasks in parallel
but for some reason you're only running six,
things like that.
It just makes all of this super easy to do,
super easy to check.
So it's just fantastic, fantastic tooling.
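If you want that same kind of visibility from inside your own code, here's a minimal sketch using psutil, which is my own choice of library here rather than something from the talk. It prints roughly the numbers htop shows you: per-core utilization and the memory your process is using.

```python
# A rough, htop-ish status line from inside Python (pip install psutil).
import os
import psutil

per_core = psutil.cpu_percent(interval=1.0, percpu=True)  # % busy per core
busy = sum(1 for pct in per_core if pct > 50)
mem_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9  # resident memory

print(f"{busy}/{len(per_core)} cores busy, this process is using {mem_gb:.1f} GB RAM")
```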
How should you configure your machine?
Let's talk about the three sort of rules here.
First, you need to choose an instance type.
And if you've never done this before,
it can be very confusing.
I was just looking at Amazon's page
and there's like 576 different instance types.
So how do I know which kind of instance I should run
to do my machine learning problem?
And the answer is RAM.
RAM, as much RAM as possible,
ideally enough so that you can fit all of your data
in RAM itself.
That's kind of your goal.
You want to take the minimum of
the amount of money you're allowed to spend
and how much RAM you need to fit
all of your data, and pick that:
as much RAM as possible.
Compute-optimized instances are a sucker's game.
They're fine for video encoding and stuff,
but they're not right
for machine learning applications.
Storage, NVMe, you don't care about that stuff.
All you really want is RAM.
Just focus on RAM, get as much RAM as you can.
That's sort of rule one.
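As a concrete sketch, spinning one of these up with boto3 might look roughly like this. The AMI, the key pair, and the specific r6a size are placeholder assumptions on my part, not recommendations from the talk:

```python
# Launch one big memory-optimized EC2 instance.
# Values marked YOUR-... are placeholders for your own account and region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-YOUR-LINUX-AMI",      # placeholder AMI ID
    InstanceType="r6a.48xlarge",       # 192 vCPUs, 1.5 TiB RAM, roughly the $12-an-hour box
    KeyName="YOUR-KEY-PAIR",           # placeholder SSH key pair name
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```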
Once you have your RAM,
you're gonna need some more storage.
So here's one of the funny comedy-of-errors things
that happens to a lot of people
when they're using one big machine.
When you provision a new instance on EC2,
it only has 20 gigabytes of disk,
which, while that's four times the size
of the MySQL database I ran at IBM,
is a laughably small amount of disk in the modern day.
So what you need to do is go to EBS,
Elastic Block Store,
and get lots and lots of disk,
a terabyte of disk, however much you need,
and then configure it and attach it
to your EC2 instance.
So you have a really, really big disk
that you can keep all your data in
and do all your work on and all that kind of stuff.
It's a super common rookie mistake to skip that process,
inadvertently use up all the disk on the machine,
and end up killing it, stuff like that.
So get your big machine with lots of RAM,
go get a lot more storage.
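Here's a rough sketch of that step with boto3. The availability zone, instance ID, and device name are placeholders, and you'd still need to format and mount the volume from the instance itself:

```python
# Create a big EBS volume and attach it to your running instance.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",     # must match the instance's AZ
    Size=1000,                         # 1 TB, in GiB
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-YOUR-INSTANCE-ID",   # placeholder instance ID
    Device="/dev/sdf",
)
# On the instance, you'd then mkfs and mount the new device before using it.
```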
Next, clean up after yourself.
Like when you're not using the machine, just turn it off.
And when you're done with the EBS volume,
this is another sneaky thing that Amazon does:
when you stop the EC2 instance,
the EBS volume that's attached to it,
that's your big disk, is still running.
You're still paying for it.
So be sure to snapshot that EBS volume
and save it away somewhere, or just copy your data out to S3,
and then delete the EBS volume.
Get rid of it so you're not setting money on fire,
leaving that thing running for weeks on end.
Be good, be like daddy robot, clean up after yourself.
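A minimal sketch of that teardown with boto3; the instance and volume IDs are placeholders, and whether you snapshot or copy out to S3 first is up to you:

```python
# Snapshot the big EBS volume, stop the instance, then delete the volume
# so you stop paying for it. IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

snapshot = ec2.create_snapshot(
    VolumeId="vol-YOUR-VOLUME-ID",
    Description="save the data before tearing the big machine down",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

ec2.stop_instances(InstanceIds=["i-YOUR-INSTANCE-ID"])
ec2.get_waiter("instance_stopped").wait(InstanceIds=["i-YOUR-INSTANCE-ID"])

ec2.detach_volume(VolumeId="vol-YOUR-VOLUME-ID")
ec2.get_waiter("volume_available").wait(VolumeIds=["vol-YOUR-VOLUME-ID"])
ec2.delete_volume(VolumeId="vol-YOUR-VOLUME-ID")
```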
And then finally, do the minimum viable
amount of automation that you need
to not drive yourself crazy.
Refer to this xkcd chart, and be good about making sure
you're not pushing things too hard,
that you don't automate for automation's sake.
Machine learning is always kind of zero to one.
It's always experimental.
You're always gonna be changing things.
So don't over-automate things too early.
Premature automation is bad,
is kind of my long-winded point here.
And with that, go forth and prosper.
Brag to your friends about running, you know,
192 cores with one and a half terabytes of RAM.
For what it's worth, your kids will think
that's a laughably small amount of RAM
and a laughably small number of cores.
Like their phones are gonna have way more storage
than that, right?
But for now, that stuff's pretty cool.
Just as cool as like a five gigabyte database
was way back in 2001.
And with that, thank you very much.
And I am happy to take questions.