An ML Fairytale - Vicki Boykis, Keynote
Transcript generated with OpenAI Whisper large-v2.
Hey, everybody. Welcome to NormConf. Good morning, good afternoon, good evening. I am
going to start off. So it's very appropriate that James and Jesse were talking about Bards
because what I want to do is I actually want to tell you a machine learning fairy tale.
Once upon a time in the kingdom of all users contained in a matrix of M by N in the real
number space, there lived a girl named Vectorella who was a machine learning engineer. And Vectorella
was pretty happy. She was doing her machine learning stuff. She worked for a company that
made Nutella, which, as everybody knows, is the favorite spread of hypothesis
testing fans. And as a company that made Nutella, this company shipped it far and wide across
all places, both in the user matrix space and in the real world across hundreds and
hundreds of countries. Now, as such, they had a lot of business transactions and a lot
of data to deal with. So Vectorella had to handle a bunch of different stuff as a machine
learning engineer. She had to handle API errors, time zone issues, import errors, K8s. And
she was generally pretty okay with this, but she wanted a lot more. And what Vectorella
saw whenever she went to her work Slack was past the Airflow alerts, past the memes,
past the Spark issues, there was a private channel that she could see that was called
staff MLEs. And all Vectorella wanted to do was to get into that staff MLE channel
with all of her heart. She was doing her day to day stuff, but she thought they were doing
more in there. She thought they might have been doing Stable Diffusion or ChatGPT. And
all she wanted to do was do that cool stuff instead of the stuff that she did on a day
to day basis. One day a miracle happened. Her very cool boss sent her a text message
and said, hey, I think you've been doing a lot of this great work and it's time for you
to join this channel. And Vectorella said, this is amazing. I would love to join this
channel. But then a misfortune befell Vectorella: she went through a re-org. And so what happened
was that the cool boss she used to have was replaced by an evil step-boss and two step-PMs. So now not
only did she have different work, but she had three people managing her work. And so
she was further away from that staff MLE channel than she ever thought possible. And
she asked her new evil step-PMs and step-manager, can I please join this channel?
And they said, no, we have a lot of work for you to do. You have to do these
three tasks. If you do these three tasks, maybe we will let you in the channel. Maybe
we'll have to see how well you do them. And so Vectorella thought, okay, I guess I've
already been doing a lot of this work, but I really want to see what they're discussing
in there. So she put on her hat and got to work. The first task that they gave her was
to count sales. So Nutella comes in all sorts of varieties: large jars, medium jars, small
jars. And in order to increase revenue, they also branched out into gelato and biscuits.
Vectorella's job was to count all of these. Don't forget it shipped both in the multidimensional
vector space and in the real world. So there were sales across many, many countries. So
she had sales for the U.S., Morocco, and Japan, and lots of other places. So she said, okay,
I guess I'll do this. So she took all the orders that were placed in the United States.
So there were biscuits, medium, small, large jars, et cetera. She had an enormous list.
And she took that from the transaction logs and put that into a list. Then she said, okay,
I guess I'll count these. So for every item in the orders, she added it to a defaultdict
and then she incremented the counter for each one in the U.S. until she had all her items added
up all together. She wasn't done though yet, because there are a lot of other countries
that had these sales. So she also had to do the same thing for Japan, Morocco, China,
France, Italy, and others. At this point Vectorella was getting really bored, but she
really wanted to be in that channel. So then she said, okay, well now I guess I have to
add all of these countries together. So then she incremented another counter: she took all her
global orders and added and incremented those orders as well, so that she had a global
count of everything all together.
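In Python, that counting might have looked something like this (a minimal sketch; the item names and order lists here are invented for illustration):

```python
from collections import defaultdict

# Hypothetical per-country order logs pulled from the transaction logs.
orders_by_country = {
    "US": ["large_jar", "biscuits", "large_jar", "gelato"],
    "Japan": ["small_jar", "gelato", "small_jar"],
    "Morocco": ["medium_jar", "biscuits"],
}

# First pass: one counter per country.
counts_by_country = {}
for country, orders in orders_by_country.items():
    counts = defaultdict(int)
    for item in orders:
        counts[item] += 1
    counts_by_country[country] = counts

# Second pass: aggregate every country's counts into a global counter.
global_counts = defaultdict(int)
for counts in counts_by_country.values():
    for item, n in counts.items():
        global_counts[item] += n

print(dict(global_counts))
# {'large_jar': 2, 'biscuits': 2, 'gelato': 2, 'small_jar': 2, 'medium_jar': 1}
```

And she said, okay, I guess this is it. I did this.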
This is so boring. And then she went to her evil step-boss and step-PMs and said, is this
getting me closer to getting into that Slack channel? And they said, sure, but we have
a second task for you. Now you have to turn the words in millions of recipes into numbers.
And Vectorella said, why, why do I have to do this? This is going to be so boring. And
they said, just do it. Don't ask questions. So she said, okay. And she went to look at
the recipes that Nutella had illegally scraped from the internet to work with. She wasn't
sure about that part, but she did look at these recipes and they had things like a list
of ingredients, a list of steps, a list of things that went together to make things out
of Nutella. So for example, this is a Nutella cake, and there were millions of these. And
she said, okay, well, how am I going to turn these into numbers? So what she did was first
she turned it into code and she created a big long string for each recipe. Then she
said, okay, what am I going to do next? I guess I have to clean this data so I can somehow
process these strings. And so Vectorella did what she did best, which was clean a lot of
data. So she had to lowercase all the stuff. She had to take out the contractions. She
had to remove all the symbols, strip all the spaces. This is so boring, I bet they're doing
Stable Diffusion in there, she thought as she was writing all this cleaning code.
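Her cleaning pass might have been a sketch like this, assuming the recipes come in as plain strings (the sample text and the contraction list are made up):

```python
import re

raw = "Vectorella's BEST Nutella cake!  Mix flour, eggs & 2 jars of Nutella..."

# Expand a few contractions, lowercase, strip symbols, collapse whitespace.
contractions = {"don't": "do not", "it's": "it is", "vectorella's": "vectorella"}

text = raw.lower()
for short, full in contractions.items():
    text = text.replace(short, full)
text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove symbols and punctuation
text = re.sub(r"\s+", " ", text).strip()  # strip and collapse spaces

print(text)
# vectorella best nutella cake mix flour eggs 2 jars of nutella
```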
Then she had her cleaned words and then she thought, okay, what am I going to do next?
Then she ended up mapping each word in the cleaned list to a sequential
integer. She said, okay, this is kind of something. But this is not really what they were asking
for, I think. And so then what she did next was to actually encode the words. And so what
she did was take the dictionary, and for each word in the dictionary she assigned
it a vector of zeros, and then at the position in the vector where the word actually was,
she replaced the zero with a one. And then what she ended up having was a set of one-hot encodings.
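A minimal sketch of that vocabulary mapping and one-hot encoding, using a toy five-word recipe:

```python
words = "mix flour eggs and nutella".split()

# Map each distinct word to a sequential integer id.
vocab = {}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab)

# One-hot encode: a vector of zeros with a single 1 at the word's index.
one_hot = {}
for word, idx in vocab.items():
    vec = [0] * len(vocab)
    vec[idx] = 1
    one_hot[word] = vec

print(vocab)            # {'mix': 0, 'flour': 1, 'eggs': 2, 'and': 3, 'nutella': 4}
print(one_hot["eggs"])  # [0, 0, 1, 0, 0]
```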
And she said, okay, I guess this is something. What is this? Why am I doing this? I could
be posting memes in that channel right now. People would be liking it and appreciating
it instead of making dictionaries. Vectorella was tired and angry. So she went to her step
PMs and her step manager and said, look, I've created this. I've processed them. I've mapped
my vocab and I've encoded them so that I actually have vectors for each word. Is this something?
Can I go now? And they said, no. You have one more task. Since you did these other two
tasks and they were reasonably okay, we'll see if we let you in the channel after that.
Now you have to schedule these to run on a daily basis, because we sell Nutella
every day and we also need to process it every day. And Vectorella said, oh, I just want
to be in that room. So what she did was she rolled up her sleeves again and she created
a crontab script that ran every day and that did a couple of things. First, it took all
the code that she had run on her local machine and linked it to a remote Linux server, because
Nutella only has one remote Linux server. It's a very normcore company. And then what
that script does is at the same time every day, it runs that sales script and that number
script. And she said, okay, I'm done, I've cleaned and scheduled everything.
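Her crontab might have looked something like this (the times, paths, and script names are placeholders):

```
# Every day at 02:00, run the sales-counting script, then the recipe encoder,
# on Nutella's one remote Linux server. (All paths are placeholders.)
0 2 * * * /usr/bin/python3 /home/vectorella/count_sales.py >> /var/log/sales.log 2>&1
5 2 * * * /usr/bin/python3 /home/vectorella/encode_recipes.py >> /var/log/encode.log 2>&1
```

And she said, please, please,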
can I be in that channel? And they said, okay, fine. You win. So the fairy SRE came along
and gave her all the correct admin access and unlocked the staff MLE channel. And that's
when Vectorella became staff MLE Vectorella. But when she got to the channel,
she noticed something weird. She noticed that these staff MLEs were asking interesting questions.
So the first one asked, hey, how do I run a distributed Spark job on a million parquet
files in an S3 bucket to get the total counts of receipts for my sales forecast model? And
Vectorella thought, hmm, curious. She thought, what do I know about Spark? What is Spark?
So Spark is a way to do computation through data parallelism and fault tolerance across
multiple commodity machines. What does that mean in English? It means that you can take
a program that computes the frequencies or sums of all words or items in your sales catalog
occurring in a set of text files, and prints or outputs the most common ones.
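In PySpark, that whole count-per-country-then-aggregate job collapses into a few lines; a sketch, where the bucket path and the column name are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-counts").getOrCreate()

# Read the (hypothetical) Parquet transaction logs straight from S3 and let
# Spark distribute the counting and aggregation across machines.
df = spark.read.parquet("s3://nutella-sales/transactions/")
(df.groupBy("item")
   .count()
   .orderBy("count", ascending=False)
   .show(10))  # output the most common items
```

So what Vectorella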
realized was that the Spark diagram on the left was actually what she had been doing
in adding sales per country and then aggregating them across the world this entire time. Vectorella
was shocked and scandalized, but there was more in store for Vectorella. Second, in that
super secret staff MLE Slack channel, she did not see people multiplying matrices or
working on Stable Diffusion. Well, she kind of did, but not really; not for the sake
of the narrative. What she did see was super staff MLE number two
that said, I'm trying to create a set of embeddings projected into a lower-dimensional space to
predict which ingredients I can cluster with Nutella for optimal combinations to show my
customers. What does this mean? It means you have millions of recipes and you have one
Nutella and you want people to select ingredients that they can cluster together to suggest
what they can make with Nutella. So for example, on the right hand side, if you have lemon,
peppermint and cream and they commonly cluster together, you can suggest that they make Nutella
cookies. If you have vanilla, butter and chocolate, you can suggest that they make a Nutella cake,
which requires buying four to five jars of Nutella. Buy Nutella, stock goes up. If you
have, for some reason, banana and pizza dough and crepe (I'm sorry about this one), you can
make something in the category of sandwiches, potentially, or flatbreads or something similar
that you can sell Nutella with. But for that, you need to generate embeddings out of text.
And Vectorella thought, well, what is an embedding? Embeddings are just a way to compress
high-dimensional data, such as text or images, into a smaller-dimensional space by assigning
each data point to a vector and comparing the similarity of those vectors to each other.
So she thought, okay, well, what is this? This sounds similar to what she was doing.
She was just creating vectors too. And in fact, if you go to the wonderful PyTorch documentation,
it'll tell you that the one-hot vectors that we created, which are sparse because they
only have nonzero entries at certain positions, are actually a special case of dense word embeddings,
which we use for deep learning and Stable Diffusion.
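A tiny PyTorch sketch of the same idea, reusing the toy vocabulary from before (the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 5, 3

# nn.Embedding maps each integer word id to a learned dense vector,
# replacing the sparse one-hot lookup with a dense one.
embedding = nn.Embedding(vocab_size, embedding_dim)

word_ids = torch.tensor([2])  # e.g. the id we assigned to "eggs"
print(embedding(word_ids))    # a 1 x 3 dense vector instead of a 1 x 5 one-hot
```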
Finally, a third super staff machine learning engineer was asking a different question.
Vectorella was puzzled. She thought this channel was going to be super high-level discussions,
theory, et cetera. But what they asked is, can someone validate my Airflow job to update
my machine learning retraining pipeline? And Vectorella thought, what is Airflow?
She remembered that Airflow is a programmatic way to create, monitor, and deploy batch workflows
on a scheduled basis. She thought, what does this sound like? This sounds like something
that I've done before. It sounds like fancy crontab. So Airflow is basically fancy crontab.
It's a way to schedule a lot of different complicated tasks, like, for example, your
Python scripts or your model runs.
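A minimal sketch of such an Airflow DAG (the dag id, schedule, and script paths are invented):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# "Fancy crontab": the same daily schedule as before, but with retries,
# monitoring, and explicit dependencies between tasks.
with DAG(
    dag_id="nutella_daily",
    start_date=datetime(2022, 12, 1),
    schedule_interval="0 2 * * *",  # the same cron expression as before
    catchup=False,
) as dag:
    count_sales = BashOperator(
        task_id="count_sales",
        bash_command="python /home/vectorella/count_sales.py",
    )
    encode_recipes = BashOperator(
        task_id="encode_recipes",
        bash_command="python /home/vectorella/encode_recipes.py",
    )
    count_sales >> encode_recipes  # sales counts first, then the encoder
```

So it turned out that Vectorella had been doing this also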
kind of all along. What is the moral of the story? Basically,
in the data science, machine learning, and general data space, we are all Vectorella today.
Why? Because the advertised map for what we do at work is not the true territory. The
interviews that we go through, the Medium articles that we read, the Hacker News articles
that we read: not all of that encompasses what we do on a day-to-day
basis, which is crontab, creating dictionaries, and adding things to lists. This
is the heart of the true work of machine learning. Second, what Vectorella realized is that building
machine learning systems is just building software. And building software is fragile.
It has lots of different components. It has lots of things that break. It's not glamorous.
It's not sexy. It's real true work that you need to do to get your stuff to work. And
that's what she had been doing when she started working on this.
Third, even advanced work deals with normal problems. So this is actually a screenshot
from when I tried to register for ChatGPT on the first day it came out. And potentially
they were having problems serving web requests, which has nothing to do with the really complicated
and impressive model that was happening in the back end, which goes to my last point,
which is that data work is also engineering work, and we deal with a lot of the complexities
and fiddling with all the pieces that that entails.
And finally, something that Vectorella realized was that the solid fundamentals that she had
been working on all along, which seemed very boring to her and not related to what she had read
about in the media were actually the fundamentals. And if she learned those very well, then she
could do and be present for the advanced work as well. And I think that's something that
we can all take to heart. So what Vectorella realized is that she now lived enlightened
ever after, still also in the staff MLE room, because the fairy SRE had forgotten to rotate
her credentials. And so something we can all take away is ad astra per aspera. This is
kind of like my personal motto. It means to the stars through difficulty. We're all going
to deal with this stuff. We're all going to have this happening. And it's up to us to
embrace it. So thank you. I want to say a couple of thank yous before I go. That was
my talk. First of all, enormous thank you to all the NormConf organizers, Gania,
Ben, Jeremy, and Roy. The amount of late nights, the amount of time spent, the love and effort
put into this conference is amazing. I hope everyone sees it and enjoys it. We had so
much fun planning it, and we hope you all love it too. Thank you to the emcees and moderators.
Again, these people are, to my knowledge, not being paid for any of this. Everybody
is doing it for fun and for love. And I'm excited to share this with you. Thank you
to the speakers who I also voluntold to speak. I'm excited to hear what you're going to talk
about. Thank you to our sponsors. If you are in Slack, you can come visit their booths.
And they're all on our website. Thank you to our lightning speakers who led the way
for the pregame to the conference and who had some amazing talks. Thank you to NumFOCUS,
who make the tools that we use and who we're donating the optional donations of the conference
to. And finally, thank you to you. This event would be nothing without the spirit of the
data community, with everyone that works on this stuff, that contributes to it, and that
builds on it every day. Welcome to NormConf!