What's the simplest possible thing that might work, and why didn't you try that first? - Joel Grus

Transcript generated with OpenAI Whisper large-v2.

So, I'm Joel, and I'm going to talk about what is the simplest possible thing that might work and why didn't you try that first. So first I tried to ask chat GPT to write my talk, but it gave me one of its mealy-mouthed, you know, it's complicated answers. If you're playing norm conf bingo, I hope you have chat GPT on your card a lot, because that's probably the only one that you're going to get, but you'll get it a lot of times. So anyway, like I said, I'm Joel. I wrote a book called Data Science from Scratch, which was mentioned. I also more recently wrote a book called 10 Essays on Fizz Buzz, which is interesting. You should check it out. I'm known sometimes for my live coding videos, and more infamously, I'm the guy who does not like Jupyter notebooks. And so if you don't like Jupyter notebooks, you can watch my talk and agree. Or if you do like them, you can watch my talk and not agree. And I have a podcast, Abracadabra Learning, which I didn't plug because we never make new episodes anymore, but maybe we will. So most of you have never worked with me, but if you had, you would know that my favorite people to ask questions or my favorite question to ask people about the problems is, what is the simplest thing that might possibly work? And the reason I started asking that is because people like to make things overly complicated. For example, not too long ago, I was talking with someone who wanted to do a machine learning project involving book reviews. And so they needed book reviews. And so they started asking questions about like, what's the best way to build a Goodreads scraper? And if you look, building a Goodreads scraper is a non-trivial endeavor. And so I looked in the mirror and asked myself, what is the simplest thing that might possibly work? And that was, hey, maybe someone else has already scraped this dataset and shared it. And sure enough, it turns out that they have. And so that was potentially a better solution, a simpler solution. And chatGPT knows this. If you asked it how to get the book reviews data, it will suggest downloading it. I'm very bullish on chatGPT. I would be even more bullish if it weren't broken. Half the time I asked it to try to slide, but big things are coming. And so over the course of my career, I've spent a lot of time teaching diverse teams of junior data scientists to do good data science. I've done this so much that if you ask Dally to draw a picture of someone doing this, it draws a picture that looks almost exactly like me. And so my style is to go off and let people try things. And then they try things and they come back and they tell me about them. And so then I have to ask, what's the simplest thing that might possibly work? And why didn't you try that first? And so I kept asking that a lot. But the further I go, I find that simplicity is not... Well, one, it's not easy to get an AI to draw it, but also it's not easy to define. And so, you know, here's this quote about simplicity requires hard work to achieve and complexity sells better. So maybe simplicity is sort of the opposite of complexity. Okay, well, then what's complexity? Well, you know, if you do machine learning, you've probably seen this bias-variance trade-off graph where as your model gets more complex, the bias goes down and the variance goes up. And here, complexity means something like, you know, number of parameters or model capacity. Is that a good way to think about it? That's probably not how I want to think about it, but we'll see. And you can find examples of, you know, here are some models that are simpler in the model capacity sense that do better than models that are more complex. And, you know, people will write about these. That's not where I kind of want to go. Rather, I would like to really seek what is simplicity and then you can distrust what I say. So a big part of why I ended up thinking about this problem is that for the last five plus years I've been doing a lot of natural language processing. And so I'm always talking to people about their natural language processing problems. And so I talked to a lot of people about text classification. It's come up a couple of times today already. And so if you're not an LP person, text classification is exactly what it sounds like. Trying to decide whether email is spam or not. Trying to decide whether a piece of text has positive sentiment or negative sentiment or neither. Trying to decide whether a blog comment is toxic or whether a talk is worth attending based on its title or its abstract or its presenter. Or whether a machine learning model is simple based on its description. So there's a lot of ways you could proceed. You know, one easy approach might be logistic regression. Take each email, convert it to a feature vector somehow using like bag of words or one-hot decoding or averaging word vectors and then you learn some weights, one per feature. So that's basically like one parameter for each word in your vocabulary. So you know, that's reasonably simple. Another very classic approach is what's called naive Bayes classification. I don't want to go through the math, but you chop your email into words. You do some computation about how likely you are to see a given word in the spam email or not. And then you apply this PIPA-BEBE theorem and then you know, you get it out. This is kind of the classical approach to spam detection. There's a chapter on my book. You could implement it yourself in an afternoon. And it again has like N parameters where that's the number of words. About five years ago, the state of the art would have been, you know, this less dim, tat-dim model where you take your email, convert it to a sequence of embeddings, run them through this recurrent neural network, get a final state, classify that hidden state. You know, this is now suddenly thousands and thousands of parameters for all these matrices that you're multiplying by. And you know, more recently you do a BERT model, which I'm using in shorthand for any kind of transformer model, where you take the email, convert it to a bunch of different embeddings, feed it to the transformer, get some pre-trained embedding as input to the classifier, fine tune it, and now you're talking like hundreds of millions of parameters. And so, you know, if I could see you and listen to you, I would ask you a question and I would say, which of these is the simplest approach to test classification? Logistic regression, Naive Bayes, LSTM, or BERT? And so there's no audience feedback, but I'm just going to assume that you're all voting for choice number one. So I don't know, I don't think there's a right answer. People can differ. But now here's a different question. Let's say you had a friend who was a junior data scientist and they needed to do some kind of test classification and they needed to get good results and their job depended on them getting good results. What would you recommend? And there I know what the answer is. I would recommend that they use BERT. And I know that's the answer because I keep meeting people who are doing test classification using Naive Bayes or, you know, logistic regression or XGBoost on TFIDF vectors or other things you don't even want to know about. And I'm like, why did you not just use BERT? And so it's not just that I tell them to use BERT, it's like, why are you wasting your time doing the others? But then there's this tension, right? Like on one hand, you have this angel on my shoulder saying, try the simplest thing first. And then you also have this devil that's like, why did you just not use BERT? And so I felt like there was this contradiction there, right? And so what do we do when we feel like we have a contradiction? You know, you turn to Anne Rand and she says, contradictions do not exist. When you think you have one, check your premises and one of them is wrong. So which of my premises is wrong? Well, let's take a step back. When we did that LSTM model, I said that it had thousands of parameters from these matrices that were multiplying things by and so on and so forth. But the reason that that model only has thousands of parameters is that it's kind of free riding off of whatever text embedding model you're using, GloVe vectors or Word2Vec. If you go and set out to download the GloVe vectors themselves, that's going to be hundreds of megabytes of data. So maybe those should count for part of the complexity of the model that, you know, in this model, you have millions of parameters worth of data that choose which word goes to which vector. But most of those parameters are already fixed by the time you show up. Someone had to learn those vectors once, but then you can use them over and over again for different problems, just changing some parameters here and there. So they're a black box that's not really part of our model. Do they count as complexity? They're pretty complex. Or do they count as simplicity because they move the complexity kind of out of our modeling and into the pre-processing, if you will. And so, you know, if we go back to BERT, you could also think of the BERT embeddings as a black box. So I think a BERT is something where I give it, you know, a sentence and it spits out an embedding. Okay. And now I take that embedding, use it as input to some simple classifier, logistic regression or whatever. And now I fine tune this. And that part of the model itself, that's thousands of parameters, not millions of parameters, right? And then like, you know, I'm cheating a little bit because when you fine tune, you are like updating those BERT weights somewhat, so they're not quite a black box or like a gray box. But there's actually, outside of BERT computing those embeddings for you, there's not a lot going on. In fact, there's less going on there than in the, you know, fancy LSTM model. And what's more, there's also some hidden complexity in the simple models like Naive Bayes. So, you know, imagine you're doing this Naive Bayes where I want to chop my email into words. Well, maybe I also want to consider phrases. So now I need to, you know, look at n-grams rather than words. And maybe I want to filter out stop words. And maybe I would like to do some kind of stemming so that classifier and classification show up as basically the same word. Maybe I want to split out contractions. Maybe I want to do any number of things. And so, you know, once you get into the mechanics of I want to do a simple Naive Bayes model on this, you have to make a lot of choices and you have a lot of degrees of freedom. And suddenly, if you include those degrees of freedom in your model, it's less simple with, you know, one of these transformer models. You just kind of shove the text in and it does with it what it does with it to kind of go back to what we were talking about earlier about find methods that can cope with your dirty data and just ignore it. So this is a lot of clutter. So where do we find the simplicity? And you know, I mentioned woodworking at the start. So now I'm actually going to talk about woodworking and hopefully it will be relevant and make sense. So as I mentioned, you know, in the pre-talk banter, woodworking is sort of a new hobby of mine. I only started doing it in earnest a month or two ago. But what I've been doing all year is watching woodworking videos on YouTube. In fact, I'm not bragging when I say that I'm pretty much like a savant at watching woodworking videos on YouTube. Like I'm really, you know, I'm a 10X woodworking video watcher, if you will. As a woodworker, I'm not that good. I barely know how to use the tools, but good at the watching videos. So now imagine that you and your friends are like building a dining table, right? And so like these people, I think they're people, you have no idea what you're doing, but you know, you don't want it to have these sort of ugly, simple square legs. You want your table to be beautiful and have these complex tapered legs with, you know, nice slopes that intersect each other and that look good. And so, you know, one thing you can do, obviously what's the simplest thing you can do, and why do you not do that first is you can go to Amazon and buy tapered legs. It'll cost you 175 bucks. But if you just want legs and you just want to buy them, then you can do it that way. It works, it's expensive. But it is an intricate and involved process. And there's a lot of ways that it can go wrong. If you're not like a master craftsman like this guy, like this is a video of someone, you know, making tapered legs. He's really good at what he does. And he's got all these hand tools and he makes this precise cuts and, you know, it turns out beautifully. But then you watch more YouTube videos and you start learning about jigs. And so a jig is sort of like a tool that you build to take away some of your complexity. So this here is a tapering jig. I want to cut a piece of wood at a specific angle to a specific length. So I build this little frame that I can put my wood in. It has a stop so that the wood can only go in really one place. And it has another stop that holds the wood at exactly one angle. And you put a piece of wood in, you run it through your table saw, and you get the exact, you know, slope and size that you want. And it's repeatable and it always works. And so all you have to do is, you know, however many of these pieces you want. You shove a piece of wood in it, you run it through the table saw, and you try not to cut your fingers off and, you know, you're done. And it's suggested to me that how simple something is, is not just how complex it is in theory, but also how difficult is it for you to try it? So you know, if it's 2018 and you wanted to fine tune a BERT model, you were signing up for a lot of work. This is one of the examples. This is a text classification example from PyTorch pre-trained BERT, which was like the original precursor that became Hugging Face eventually. And if you wanted to make it work on your data, you were signing up for a minor adventure. This is like line 532 of the text classification example. And so, you know, many, many hundreds of lines of code. In 2022, if you want to fine tune BERT, like this is basically it. You import a couple of things from Transformers and, you know, create a tokenizer, create a padding collator, create a model, train it, and train. And like, I've left out a few lines, but this is like basically it, right? These days, you know, in 2018, it was not simple to use BERT. It was not simple to fine tune a BERT model. Today it is pretty simple to use a BERT model. And in fact, I would argue it's simpler to get good results by fine tuning a BERT model than it is to get good results with a simpler model. And if you just want like the pre-trained embeddings out of it, it's even simpler. It's basically three lines of code. You import from Sentence Transformers, load a model, and say encode my sentences, and you're done. So, there's like nothing to it. So maybe when I talk about simplicity, I mean not just a model that's itself simple, but a model by which it's simple to get good results. So, you know, one other example that's slightly out of the ML data science space is sorting. Sorting algorithms can be really simple. You know, sometimes if you interview canonically, they'll ask you to implement, you know, bubble sort or merge sort or quick sort or whatever. You know, back when I was preparing for coding interviews back in the day, I memorized like every one known to man. Not that that's really done me any good. And Python uses an algorithm called TimSort. You know, my guess is that most of you don't know how TimSort works, because I don't know how TimSort works. It's slightly complicated. It's not obvious. It's not a simple algorithm. And yet it's simple to use, right? So if you say, if you ask a chat GPT, what's the simplest sorting algorithm? It says probably bubble sort. And I think that's right. Probably conceptually, you know, if I had to explain it to my 11 year old, I would pick bubble sort. But if you're working in Python, you should of course use the built in sorted algorithm that you used to, unless you have like a super good reason not to, right? And then if you ask it, you know, is TimSort a simple algorithm? It says, no, it's not a simple algorithm. But is it simple to use correctly? Yes, it's very simple to use correctly, right? Unless you have a comparison function like the other day in the advent of code, which it was less simple to use correctly. But anyway, the point I'm trying to make is that as our tools get better, the boundary between simple and complex changes. So how easy a solution is to implement and scale depends a great deal on the tools and abstractions you have for working with it. With a tapering jig, it's simple to implement and scale making tapered legs. If you don't have a jig like that, it's not so simple. With the Python standard library, it's simple to use a stable hybrid merge insertion sort, which is what TimSort is. You know, in other languages, it's not so simple. So I saw an interesting tweet about this from the CEO of AngelList Venture. He says, why use AngelList to run a fund? He says, my answer, AngelList abstracts away all the complexity. So he said it was starting and running a fund. And so I found that phrase really interesting, right? It's not that it eliminates the complexity. Complexity can't be gotten rid of. But it abstracts it all away and gives you an API where for the most part, you don't have to worry about it. And from your perspective, it's much simpler. So sometimes things can be simple, not because they lack complexity, but because complexity has been abstracted away. And so, you know, I guess the lesson here is that as you're doing data science or machine learning or software engineering or data engineering, whatever, think about ways to abstract away the complexity. Sometimes that can be, you know, just using pre-trained models and user-friendly libraries. Sometimes you might want to create your own shared product, project templates, you know, crafting clean APIs, using engineering best practices and using shared processes. And then if you do that, you can use complex machinery in a norm-conf way. So there was a survey that was floating around the other day about ML and AI practices where like a third of the people said they were using natural language understanding, but only a third of those people said they were using transformers. And now, you know, maybe they didn't know because they say, oh, I use Hugging Face. I don't know. But when I see that, I think, wow, that's, you know, two-thirds of the people using natural language understanding are probably not getting as good results as they could with the same amount of work. So, you know, you have my blessing to go import transformers and go nuts, and I won't criticize you. And even Chad GPDT knows this one, right? If you ask it, what's the simplest way to solve a text classification problem, you know, it's just a bag of words. But then it says another option is to use a pre-trained language model to generate embeddings for each document and then train a classifier for these embeddings. So take it from Chad GPDT. That's a simple way to solve the problem. This is one of the woodworking YouTubers that I follow. He put out a video pretty recently about making a simple, elegant table. And of course, I saw the word simple and like it keyed with me. And you know, he said simplicity is something that sometimes we only discover through experience and with confidence. And that resonated with me, right? Like knowing what the complexity, what the complex parts and the simple parts are is how you know how to abstract those complex parts away and leave yourself with something simple to work with. And so I'll leave you with a quote by Lao Tzu about abstracting complexity away. It's not really about that, but there aren't too many quotes about abstracting complexity away. Manifest plainness, embrace simplicity, reduce selfishness, have few desires. So manifest plainness, that means don't put too many stickers on your laptop. There's too many, take them off. Reduce selfishness, that means don't hog all the GPUs, share them. Other people need to use them too. Everyone needs to train machine learning models. Have few desires, I don't really understand this one. I don't know. Someone can explain it to me later. I don't get it. And then embrace simplicity, right? So if there's a moral to this talk, it's maybe that simplicity isn't simple. Things that seem simple sometimes aren't, and things that seem complex might actually be a simpler choice. So thank you. I'm Joel. You can find me on Twitter. Everyone keeps saying that Twitter is dying. It's not dying. It's still there. It's like the same as always, pretty much. So I'm still there. And that's that.