Ethan Rosenthal and the M1 misadventure

Transcript generated with OpenAI Whisper large-v2.
Hi, everybody. My name is Ethan Rosenthal. I manage a small team of AI engineers at Square.
We do things with large language models and all sorts of hot stuff, but I'm not going to talk about any of that fun stuff today. Like with much of this job, I think the majority of my job is not doing that stuff; it's instead dealing with the Normconf-y things. Today,
I'm going to talk about the absolute terrible time that I had trying to get my team's code
base to build on an Apple M1 laptop. And so this is largely going to be a cathartic rant,
but I hope that it's also, you know, helpful and maybe you all will learn something too
along the way. So why am I talking about this today? I would be surprised if you could forget, but you all remember that our intrepid Normconf leader, Vicki, sent out that tweet that was heard around the world back in July about creating this conference. And I was
like nose deep into trying to get my team's code base to build on this M1. And I, you
know, kind of snarkily made a reply to that tweet. And here I am now actually talking
about that. So kind of fun to bring things full circle. So what is the M1? For those
of you who don't know, back in 2020, Apple released a new line of laptops that use their
M1 chip, which was kind of the first time that Apple had created their own CPUs specifically for their laptops. They followed up a year later with the M1 Pro and the M1 Max, with all these nice charts showing that, you know, you get more performance for lower power. And so a lot of people were excited about that: who doesn't want their code to run faster, and who wants a laptop that burns their thighs because it's drawing too much power? And so this all seemed cool, but I
was not very interested in any of this because, you know, I know the first rule of programming,
which is that you should never change anything. Like if it ain't broke, don't fix it. And
my team's code base was working, and I didn't want to mess with success. Unfortunately, time is the one thing that changes that; none of us can stop it even if we wanted to. And so what happened at work was that in January 2022, it was announced that M1s were now available to new hires at the company, as well as to people who were eligible to upgrade because their current laptop was getting old. And so they made that announcement. And one of the people on my team was kind
of excited by these new laptops. And so they requested an upgrade, they got an M1 and they
failed to get the code base to build. You know, they spent about a week with this and
no luck. And then a couple months later, another teammate tries to get a new M1, spends a week
trying to build the code base and they fail. And then in August, we knew that there would
be two new members starting on the team and they were going to be issued M1s. And I stupidly
decided to start managing the team a couple of weeks before. And so this meant that this
was now my problem and I was going to be the one that had to figure this out. And so like
any good engineer, I completely underestimated how much work or time this could possibly
take. But, you know, I tried to be reasonably smart about this. And so anytime you embark
on something new at a company, I think it's a good idea to, you know, check the wisdom
of the masses. And, you know, Square is the biggest company I've worked at. I used to
work at some startups before this and I never really appreciated how helpful it is to have
a big company Slack. It's kind of like your own personal Reddit where you can get recommendations
for like local accountants or holiday gift ideas and everything else. And so I decided
to delve into Slack to try to see if anybody else had already run into this problem before.
And so, you know, I'm reading through Slack search and other sorts of things. And I find
these esoteric Google Docs and Notion Docs where other people had attempted to solve
this problem. And if you search through these docs, you stumble upon these incantations, right? These cryptic brew libraries that you have to install before you do anything, and these other cryptic environment variables that you have to set. You don't understand what any of it means, but you're just going to copy and paste it and run it, like everything else that you do with Stack Overflow.
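To give you a flavor, the incantations looked something like this (illustrative only, not the actual internal docs; the exact libraries and variables depend on what you're building):

```bash
brew install openssl openblas
export OPENBLAS="$(brew --prefix openblas)"          # help NumPy/SciPy builds find BLAS
export LDFLAGS="-L$(brew --prefix openssl)/lib"      # help native extensions link against OpenSSL
```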
So I did that and I thought, all right, I'm good to go; other people have run into this before. And so, the first step: having done my brew installing and set my environment
variables, now I just need to install Python. Like my team's code base is built in Python.
That's the first step. And immediately I failed because my team's code base was based on Python
3.7, which it turns out is not supported on Apple M1 laptops. And so now I find out that
we can't use our version of Python. We're going to have to change the version of Python.
Thankfully, it's not like going from two to three, but you know, I'm guessing most of
you out there probably don't like to change your version of Python too often. And so this Python version change was kind of the first node in the tangled web of shit that I had to wade through. One recommendation I have: if you want
to deal with different versions of Python, pyenv is a great little tool that allows
you to install different versions of Python on your laptop. And so then anytime you are,
you know, starting a new project, starting a new virtual environment, you can choose
which version of Python you're going to use via that. And then, you know, I had to decide
which version of Python to use. And my rule of thumb is to go Goldilocks: don't go with the latest and greatest, 3.10, because it's on the bleeding edge; don't go with the oldest, 3.8, because it's closer to the end of its support window. Just go with something in the middle that's only a couple of years old. So we chose 3.9, which made this a two-minor-version upgrade, from 3.7 to 3.9.
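For reference, the pyenv workflow looks roughly like this (a sketch; the version numbers just match the story above, and it assumes pyenv's shims are set up in your shell):

```bash
brew install pyenv            # install pyenv itself
pyenv install 3.9.13          # build and install a specific interpreter
pyenv local 3.9.13            # pin this project to that version (writes .python-version)
python -m venv .venv          # then create a virtual environment as usual
```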
So now that I've changed my version of Python, this means that I'm going to have to change
some of my Python dependencies, because certain versions of your dependencies only work with certain versions of Python. So what I mean by a Python
dependency is like a library that you are pip installing like NumPy or Matplotlib or
something like that. And so now I'm going to have to change the versions of these dependencies,
but it's not just that I have to change them because they are now incompatible with my
new version of Python. The other issue is that I have this brand new flashy sports car
of a laptop and there are no wheels. So Python wheels, for those of you who don't know, are
basically zip files where someone else has prebuilt a library for a specific platform, so that it will run wherever you run your code. And for a lot of these libraries, you need prebuilt wheels in order to run the code on your laptop or in the cloud or wherever.
And it turned out that even for dependencies whose versions worked with my new version of Python, those versions did not ship wheels that work on Apple M1s. And so now I needed to go and find which version of each dependency actually introduced wheels that are compatible with the M1.
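One way to check for a compatible wheel, as a sketch (the flags are standard pip options; the package and version are just examples):

```bash
# Ask pip for a prebuilt wheel only: no sdists, no local compilation.
# If no arm64 wheel exists for this version, this fails fast.
pip download numpy==1.21.4 \
    --only-binary=:all: \
    --platform macosx_11_0_arm64 \
    --python-version 3.9 \
    -d /tmp/wheels
```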
OK, so we've changed our version of Python. We've changed a whole bunch of Python dependencies.
And then ideally, you are not managing these dependencies yourself. I think if you're spending
all day kind of hard coding all of your dependencies into a requirements.txt file, then you're
not going to have a good time. But if you use a dependency manager, which is a tool
where you can kind of declare, you know, these are the high level dependencies that I need
for my package, then this manager will go out and find those dependencies' dependencies, and those dependencies' dependencies, and so on through this entire nested graph. And it
will make sure that it finds, you know, the exact versions of all of the sub-dependencies
such that everything will work and kind of, you know, be reproducible every time you install
your package.
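A minimal sketch of that workflow with Poetry (one popular dependency manager; pipenv and others work similarly, and the package names here are just examples):

```bash
poetry init --no-interaction --name my-analysis   # create pyproject.toml
poetry add "numpy>=1.21" "pandas>=1.3"            # declare top-level deps; the solver pins the graph in poetry.lock
poetry install                                    # later, on any machine: reproduce the exact same environment
```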
A little bit of self-promotion, but I have an entire blog post about this: my entire process for how I do data science is that every time I start
a new project or analysis, I create an entire Python package and I manage all the dependencies
with a dependency manager. And the nice thing here is that this means that, you know, six
months later when I come back to the analysis, because, you know, some stakeholder had a
question in a Google doc from an analysis that I ran six months ago, I can, you know,
reinstall my package and rerun my analysis and have everything work.
The other thing that I learned in this process is, you know, it's kind of difficult sometimes
to make sure that the versions of all of your different Python dependencies actually work together.
And so what you end up finding is that, you know, you run your dependency manager, try
to get all your versions to work together, and the dependency manager spits out some,
you know, weird error saying, you know, these two packages conflict with each other. There
is no way that you can install all of these. So I kept running into this and a very stupid
skill that I now have is that I learned how to debug this pretty well. And my debugging
strategy is that I create a minimum viable Python package. And so what that means is
that instead of starting with my huge code base, where adding a new dependency takes a really long time because you have to traverse this giant dependency graph, I create a tiny new Python package, and I first install the offending dependency that seems to conflict with the others. And then one
by one, I start to add some new dependencies until the conflict appears. And then I know
which exact dependencies are conflicting with each other. Hopefully ChatGPT solves this one day, and you can just feed it your lock file and ask it what the issue is. But until then, you now have this very painful process that you get to go through, too.
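That loop looks something like this (the package names and versions are placeholders for whatever is conflicting in your real code base):

```bash
poetry new conflict-repro && cd conflict-repro
poetry add "scipy==1.7.*"      # start with the suspected offender
poetry add "pandas==1.3.*"     # add the rest one at a time...
poetry add "somepkg==2.*"      # ...until the solver reports the conflict
```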
Along the way: we were using pipenv as our dependency manager. I never really liked it, and I had always wanted to switch to Poetry, but I don't like to change too many things at once when everything is working. Still, never let a good crisis go to waste, and so we ended up switching our dependency manager in the process. We also kind of had to, because for the life of me, I could not get SciPy to install with pipenv. I think there was a bug somewhere. So this actually ended up being a necessary part of the refactor.
So now we've changed our version of Python. We've changed a whole bunch of dependencies.
And we've changed the way that we actually install and manage those dependencies. And
so, of course, we don't do all of this manually because we are engineers and we like to abstract
stuff and, you know, build declarative systems. Enter everybody's favorite declarative system, the Makefile: we have a couple-hundred-line Makefile in our code base, and it now needed to be updated to work with the new dependency manager.
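Something in this spirit (a minimal sketch; the target names are assumptions, not our actual Makefile):

```make
.PHONY: install test

install:  # install locked dependencies via the dependency manager
	poetry install

test: install  # run the test suite inside the managed environment
	poetry run pytest
```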
But thankfully by this point, I actually got everything to build. So, you know, I was able
to build my code. I was able to run tests. I was able to, you know, train some small
models on my laptop. So I should be done, right? I fixed it for the new laptop. But
unfortunately code doesn't just run on our laptops. Code runs in the cloud. And the way
that we run our code in the cloud, at least on my team, is that we build Docker images. So we use a Dockerfile to build a Docker image, and then we ship our code up to the cloud and run it up there. And that Dockerfile is building the same code that we're writing locally, so there's a natural coupling between the Dockerfile and our code base. One of the things the Dockerfile has to do is install Python, install the Python dependencies, and other sorts of things like that. And so this meant that I needed to update all of our Dockerfiles so that they now worked with this new setup.
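Roughly this shape (a minimal sketch, not our actual Dockerfile; the base image, versions, and module name are assumptions):

```dockerfile
FROM python:3.9-slim

# Install the dependency manager, then the locked dependencies.
RUN pip install --no-cache-dir poetry
WORKDIR /app
COPY pyproject.toml poetry.lock ./
RUN poetry install --no-root

# Copy the code base itself and install it.
COPY . .
RUN poetry install

# Hypothetical entry point for a training job.
CMD ["poetry", "run", "python", "-m", "my_team.train"]
```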
But we don't just run regular code in the cloud; we run code on GPUs, because we're training fun things like language models and everything else. And for that code, we need to install CUDA, NVIDIA's GPU computing library, in both the Docker container and on the EC2 instances that the container runs on. And so now we have this other piece of our cobweb. If you go to NVIDIA's website, they have a diagram where I think they're trying to argue that it's simple to do stuff on GPUs. And maybe it is a lot simpler than it used to be. But what's not entirely clear until you look closer is that there is tight coupling in GPU-land between all sorts of layers of the stack. You have to make sure that you have the right versions of the drivers and CUDA running on the actual instance in the cloud. Those are coupled to the versions of the libraries running in the Docker container. And all of that is coupled to the application that is actually doing the programming, which in our case is PyTorch code. And it turns out that all of this coupling is terrible: it's a lot of work and a total pain. I wanted to find the words to convey this to you, but I couldn't, so instead I decided to burn some GPU hours and ask Stable Diffusion to paint me a Renaissance painting of a GPU burning in hell. And ironically, the hallmark of the GPU here is really the heatsink fans, which seem to be trying to dissipate the heat of hell. And so, yeah.
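If you want to check that coupling by hand, a quick sanity check looks something like this (the commands are standard; which versions count as "right" depends entirely on your stack):

```bash
# On the host (or EC2 instance): driver version and the max CUDA it supports.
nvidia-smi

# In the container/application: PyTorch's bundled CUDA version and GPU visibility.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```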
So if you go to the website for PyTorch, which is the deep learning library that we use, they have this wonderful little
widget where you can pick, you know, are you on Linux or Mac? Do you install your Python
packages with pip? Which version of CUDA are you on? You pick all of these pieces and then
they give you this wonderful command that if you just run it, works most of the time
in order to install these libraries. And all of this is great if you're on like the latest
and greatest version of PyTorch. But for various reasons, my team was not. And if you go back to earlier versions of PyTorch, this is the recommendation: depending on which version of CUDA you're running, whether you need GPU support or not, and which platform you're on, you have all of these different pip commands, with pip indices and other sorts of things that you have to worry about.
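For example, something like this (the versions and index URL are illustrative of the pattern, not the exact ones my team used):

```bash
# Old-style PyTorch install: a CUDA-specific local version tag plus a
# separate index where those wheels are hosted.
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 \
    -f https://download.pytorch.org/whl/torch_stable.html
```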
And so, you know, I needed this to work with our dependency manager, and I was having some trouble. So I decided to search Google for how to get PyTorch working with Poetry, and stumbled upon a 36-comment GitHub issue that was prematurely closed, because the problem was definitely not resolved. And so a new issue was opened, and that issue is definitely still open, because this is entirely unresolved right now. And so I ended up doing something terrible. You know,
I had this beautiful dependency manager that allowed me to, you know, declare my dependencies
and it would figure everything out for me. And instead, I ended up kind of just having to hack around and ruin it. What I ended up deciding to do was: in our code base, in the dependency manager, we just install the version of PyTorch that does not have any GPU support, because I don't have any GPUs locally. And then in our Dockerfile, we first export a requirements.txt file from our dependency manager, so we generate all of our requirements, kind of like running pip freeze. And then I have this terrible sed command, which just substitutes the PyTorch line with a hardcoded URL to a wheel of my exact version of PyTorch. It feels bad, but it works.
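Sketched out, the hack looks something like this (the versions and wheel URL are illustrative assumptions):

```bash
# 1. Export pinned requirements from the dependency manager:
poetry export --without-hashes -o requirements.txt

# 2. Swap the CPU-only torch pin for a hardcoded URL to a GPU wheel:
sed -i 's|^torch==.*|https://download.pytorch.org/whl/cu113/torch-1.10.1%2Bcu113-cp39-cp39-linux_x86_64.whl|' requirements.txt

# 3. Install everything as usual:
pip install -r requirements.txt
```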
And I haven't touched it since I made this change, so maybe that's a sign that it's working. And if you're struggling with PyTorch and Poetry, this hack actually does work. So sorry if you end up having to implement it too.
So fine, we've figured out how to build our code locally. We've figured out how to make sure that that code also works in the
cloud. We're good to go, right? Unfortunately, there's actually a third place that my team
runs code and that's in continuous integration. So anytime we want to merge our code into
our main branch of our repo, we run various tests and things like that in our own build
system, kind of like GitHub Actions. And so we need to do everything within continuous integration (CI) as well. And it was at this point that I realized that I've been
possibly blaming a lot of other systems out there, right? I've been complaining a bit
about GPUs and Poetry and all sorts of other things. But part of this is my team's fault
as well. And so one of the issues I realized is that we had three entirely separate processes for the places our code runs. When we're local on our laptops, we have one way of installing Python: you install a dependency manager, install the dependencies, and then run your tests or run your code. So there's a very set process for doing that on your laptop. But then in CI, there's an entirely separate process for installing Python and an entirely separate process for installing those dependencies. There's a little bit of coupling between the two; we use a Makefile for a little bit of this coupling, but not fully. And then within Docker, there's an entirely separate process for all of these. What this meant was that any time I had to change one of these blocks, I had to change it in three different places. And so this is maybe a sign that we have not architected our setup in quite the right way. And so I
realized actually after this that at previous companies that I'd worked at, we had been
a bit smarter about this. One option is to do all of your CI within the same Docker container that you're going to use when you run your code in the cloud. And this actually makes sense, because the whole point of CI is to test your code in an environment that's fairly similar to the production environment, kind of like the way you create a test set for your machine learning model: you want that test set to be representative of what's actually going to happen when your model is running in the real world. So one option is to use that exact same Docker container within CI, so that the container holds your one method of installing Python and the dependencies and everything else. This way, we at least only have two ways of doing things instead of three. You could start to collapse this even more, but I don't know, programming with Docker locally is not great.
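In its simplest form, that looks something like this (the image name and make target are assumptions):

```bash
# Build the same image you deploy, then run the test suite inside it.
docker build -t my-team-code:ci .
docker run --rm my-team-code:ci make test
```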
So what ended up actually happening? If you remember, August 2022 was when I needed to finish this by. If we go back and look at some of our timestamps, you can see that it was July 26 when I sent that tweet at Vicki. The next day, I submitted a PR. And my goal was to get that PR reviewed by August 3, for the very important reason that I was going to be on vacation, and I didn't want to have to worry about this while I was on vacation. And sure enough, on August 4 it got merged, I went on vacation, and somehow nothing broke. So that's the story of that.
And you can find me elsewhere on the internet. Thank you all.