Group-by statements that save the day - Vincent D Warmerdam

Transcript generated with OpenAI Whisper large-v2.

What I'm here to talk about is group-by statements that will save the day, which I thought was going to be an interesting topic. So what I would like to do is explain what this talk is about, but I also want to explain who the talk is for, because in my mind, there are people like Alice and Bob and Vectorella who are interested in maybe getting started in this field of stuff with data. It's a very attractive field. Maybe Alice and Bob are recent graduates. Maybe they're just interested in a career transition. But it's been a while since I was a beginner, so I was thinking, you know, what is it like to start out now? And then you give it a Google and you read stuff like, hey, data drives everything and you need to get the skills you need for the future of work. And there are these listicles of 10 essential skills for data scientists. And if you really try hard, you will even find the best data science certificate programs from a bootcamp that I will not name here, but people are very eager to take your money so they can give you a certificate.

And that kind of brings me to the topic of this talk, because what I want to talk about are two stories of two datasets. These two datasets are properly beautiful because they both tell a story, but the reason why they're beautiful is also because it's a story that I think Alice and Bob and Vectorella really need to hear. It is a story to help remind everyone of the human aspect of this data work that we're doing. And hopefully it's also a story that might help remove some anxiety around this whole must-have-skills phenomenon that I'm becoming increasingly frustrated with on the internet.

So with that out of the way, let's talk about a beautiful dataset. The first dataset I want to talk about is called ChickWeight. ChickWeight comes with the language R. So if you install R, you can call this variable ChickWeight and you actually get this dataset. It comes with the language. It's a neat feature. And this dataset has a couple of columns. One column is chick, another one is time, another one is weight. And the idea here is that chicken number one at time step zero had a certain weight, and as time moves on, the weight also increases. But this chicken also had a diet. This chicken got a specific kind of food, and different chickens might have different kinds of food. You can imagine that there's kind of a use case here where it's a bit like an A/B test: we're trying to figure out which diet is best for getting the chickens to grow.

And if you're the good data science person that you're supposed to be, you do some exploratory work. So you make a chart and you kind of get this. On the x-axis, you've got time. On the y-axis, you've got weight. And you can see that there's this general pattern in the middle over here. You can definitely see the average is going up, but you can also see that there's a bit of variance. There's actually quite a bit of spread here. And that is something that you can totally see. But then you can do sort of the clever group by. And what you then notice is that when you group by the time as well as the diet, you can make a summary line for each diet. And you can also see that some of these diets perform better than others. That's something that you can conclude.

Ah, but now come the data machine learning people. And they say, we have these tools for the future of work. These are tools that you really want to use and apply.
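To make that diet-level group by concrete, here is a minimal pandas sketch. It is not the code from the talk: it pulls ChickWeight through statsmodels' R-dataset loader (which fetches it from the Rdatasets repository), and the column names (weight, Time, Chick, Diet) are those of the R dataset.

```python
# A minimal sketch (not the exact code from the talk): one summary line per
# diet, using the ChickWeight data that ships with R, loaded via statsmodels.
import statsmodels.api as sm

# Columns in the R dataset: weight, Time, Chick, Diet
chickweight = sm.datasets.get_rdataset("ChickWeight", "datasets").data

# Group by diet and time step to get the average growth curve per diet
per_diet = (
    chickweight
    .groupby(["Diet", "Time"])["weight"]
    .mean()
    .reset_index()
)
print(per_diet.head(10))
```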
So I can imagine the pressure, right? You're doing analysis here, but you need to wiggle a tool in here that's supposedly the future of work. And these are good tools, by the way, but I can imagine that there's this pressure. So then immediately your mind kind of goes, maybe I need to do something predictive here. And you can actually come up with a reasonable use case for that. You can say, well, given a diet and the time, maybe we can predict the growth trajectory for these chickens. And it's actually kind of a reasonable thing you could do. But you can also just wonder, well, is that the thing we should do? Should we immediately think about the modeling thing we want to predict here? Because maybe that's a distraction.

And I'm about to give you a very good hint. If you're in data science, there's one thing people don't do enough of, and that is that when they make a visualization like this, you need to take five minutes and just stare at the thing for a while. And you need to look for stuff that might surprise you. Hadley Wickham has this really great quote: machine learning scales very well, but it doesn't surprise you. Visualizations don't scale super well, but they do have the ability to surprise you. And in this case, if you just squint your eyes a bit, something fishy is happening here. Notice here, at this time step, it seems that there's a chicken that's actually losing weight. Because if I look at the heights, there's definitely a chicken here that seems to be losing weight. I don't know which one of these chickens, right? But that's a bit off. And it's not just happening here. It's also happening, well, here. And that's a hint.

So, okay, something weird seems to be happening there. Maybe we shouldn't model just yet. Maybe we need to think a little bit about that. Maybe we need to do another group by. And what I'm going to do now is I'm going to make the same visualization, but I'm just going to group by the chicken. I'm going to ignore the diet for now. Just group by the chicken. And when you do that, you get a growth line for each of these chickens, and you can kind of see the trajectory for each of them. And when you then zoom in again and you squint your eyes again, you will notice that a chicken stops here. And another chicken stops there. Another chicken stops there. And if we now zoom out again and we just assign a color to the chickens that stopped prematurely: okay, it seems that some chickens die prematurely. I don't know for sure, but that's a hint that I'm getting.

And now if I take a step back and I ask myself, which of these two charts is more important, right? Even if I'm using the most fancy machine learning tools and I'm displaying all the metrics that try to convince you of how good the predictive power is, we have to be very critical here. Because suppose that you have a TensorFlow model that's very good at predicting the average diet weight over here. Are we actually sure that it's taking into account that some of the chickens might have died? Unless you are aware that that's the thing you've got to model for, the model's not gonna know about it. And this, I would argue, is a really nice example of a group-by statement that can really save the day. If you didn't know this upfront, you might be pushing a model to production that predicts dead chickens. And that would be bad, I think.
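A per-chicken version of the same group by might look like this. Again, this is a sketch rather than the talk's own code, with the same assumptions about the ChickWeight columns as before: the idea is simply that a chicken whose last measurement comes before the final time step of the study is a candidate "dead chicken".

```python
# A minimal sketch (not the exact code from the talk): group by chicken and
# flag the ones whose measurements stop before the end of the experiment.
import statsmodels.api as sm

chickweight = sm.datasets.get_rdataset("ChickWeight", "datasets").data

# Last time step at which each chicken was weighed
last_seen = chickweight.groupby("Chick")["Time"].max()

# Chickens that drop out before the final time step of the study
final_time = chickweight["Time"].max()
dropped_out = last_seen[last_seen < final_time]
print(f"{len(dropped_out)} of {last_seen.size} chickens stop early:")
print(dropped_out.sort_values())
```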
So maybe we don't need TensorFlow to save the day. Maybe we need a certificate for asking the right question, because that's something we maybe need more of. It's a very simple conclusion that I have with this dataset. So, okay, that's just one dataset. Let's move on to the next one.

Now, this other dataset deserves a slightly different introduction, because I have to talk about this, I will say, unclear task first that's increasingly popular. So what you can do is you can go to this website called Hugging Face. And Hugging Face has a couple of interesting features. They host datasets, they host a bunch of models, but you can also search for these models. And one way of searching allows you to say, well, I'm interested in the task of text classification. That means text goes in and some sort of class comes out. And you can predict just about anything, like the topic of a newspaper article and whatnot. But apparently, of the top five most downloaded models for text classification, four or five are about sentiment, which is basically saying text goes in and out comes positive or negative. But then you start wondering, how do I interpret positive and negative? Because that's a pretty big bucket of stuff. I can be in love and I can be laughing, but those are two quite distinct emotions that I don't think we should pretend are the same. That feels a bit strange, and maybe the same thing goes for fear and anger. I get that sentiment is a simple concept, but it might be too simple as well, depending on what we're interested in doing with text.

So imagine my delight when, around the same time last year, about a year ago, a bit longer, I learned about this dataset called GoEmotions. They took the effort of actually writing a paper, and the Google name is attached. And this is a dataset where we're not doing just sentiment. No, we're actually doing emotions. So they took data off of Reddit and you've got texts like this: oh my God, yep, that's the final answer, thank you so much. And you can attach labels to it. This is not classification, it's more like tagging, so you can have more than one class attached to this. That may be a bit of a detail, but I think it's interesting because it also makes sense: you can have texts where more than one emotion applies. So that feels appropriate.

And what you can do here is that they also did a bit of annotation. So each example was annotated by at least three people. Sometimes even five people had to look at it. And you can read the paper and read all sorts of interesting details. So there are almost 60,000 text examples, over 200,000 annotations, and 82 people were annotating here. There were 27 emotion tags attached, from amusement to joy, to relief, to nervousness and grief. I mean, it's a huge palette of emotions that was covered. They also made this correlation chart that was interesting. So you can see that some emotions co-occur together and some don't, some just don't appear together at all. That's something that the paper did mention. There's also a lot of effort that went into pre-processing that was pretty interesting to read. So they removed a couple of subreddits just because they were too vulgar, which makes sense, it's Reddit. But they also masked the names of people as well as references to religion. Seems like a good thing. They made sure that each text had at least three tokens and no more than 30. All the annotators were English natives.
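For anyone who wants to poke at this themselves, here is a small sketch that loads GoEmotions from the Hugging Face hub. It is not part of the talk; it assumes the dataset's "raw" configuration, which keeps one row per (text, rater) pair with a rater_id column and one 0/1 column per emotion label, so double-check those names against the actual schema.

```python
# A minimal sketch (not from the talk): peek at the GoEmotions data on the
# Hugging Face hub. The "raw" configuration keeps annotator-level rows.
from datasets import load_dataset

raw = load_dataset("go_emotions", "raw", split="train").to_pandas()
print(raw.shape)              # roughly 200k annotation rows
print(raw["text"].nunique())  # roughly 58k distinct texts
print(raw.columns.tolist())   # metadata columns plus one column per emotion
```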
So, you know, a bunch of interesting things went into this, a bunch of thought. And then the paper also goes into benchmark models. So the deep learning thing, where they all show the confusion matrix. And it makes sense that certain emotions are confused with each other. And oh, there are also these awesome F1 charts for transfer learning. And here's where we need to stop again. Because we can get excited about these charts like before. We can really get the blood pumping. But what if there are dead chickens in this dataset too? And that's a question I've been asking myself more and more recently.

So when you're looking for dead chickens in a dataset, which is a really good exercise to do, and you can't use a visualization, then thinking about how the dataset got created usually gives a good hint too. So this dataset got created because a bunch of people on the internet were annotating. And if I recall correctly, I believe the paper mentioned Mechanical Turk. So it's not just the fact that we have many people annotating. We also have people annotating that aren't in the same room. You can actually assume that these people don't necessarily talk to each other. And you can kind of wonder, well, maybe these people occasionally disagree. That's definitely possible. Especially when it's emotion, which is a cultural thing, it's not necessarily universal. And it's language, that's also a thing.

Oh, and by the way, it's also Reddit data. And that brings in an interesting phenomenon. So on Reddit, you can have texts like: oh my God, those tiny shoes, desire to boop snoot intensifies. That's a perfectly normal sentence on Reddit. But I can imagine that if you're the 25-year-old or the 50-year-old annotator trying to parse what's actually being communicated here, can we really assume that everyone has the same opinion on whether or not this is about excitement? And by the way, this is an actual example from the dataset. I'm not making this up. This stuff just naturally appears.

But no, okay, maybe this is group-by time. Maybe this is one of those things I just need to check before I do any sort of modeling on this thing. Let's just see if I can come up with a nice little number. So I group by text and I just check whether or not all annotators agree. On all 27 emotions, by the way, at the same time. That's the number that I'm calculating here. And there you go. Out of the 58,000 text examples, less than 8,000 of them have every annotator agree on every emotion. So about 13% of all the examples have everyone actually agree on the emotions that are there. That's not a huge number. But I do want to add a little caveat. When there are 27 emotions, if only one of the annotators disagrees with one of the emotions, that also means that there's no agreement between all of them, right? So it's not like I expected this number to be high. But I would also expect that if I'm going to do machine learning on this, this number should be just slightly higher, maybe.

And what's more, you can also do some other really interesting group-bys here. So what you can also do is you can say, let's just look at the top three annotators. Usually these datasets, by the way, come with a huge skew: the top N% has way more than N% of the annotations. So in this case, the top three annotators annotate about 14% of all the data.
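Roughly, those two group-bys could look like the sketch below. This is not the exact code from the talk: the column names (text, rater_id, the per-emotion 0/1 columns) and the metadata list are assumptions based on the raw GoEmotions configuration, so adjust them if the schema differs, and the printed numbers will not necessarily match the quoted ones exactly.

```python
# A minimal sketch of the two checks described above (not the talk's code).
# Assumes the "raw" GoEmotions configuration with a `text` column, a
# `rater_id` column, and one 0/1 column per emotion.
from datasets import load_dataset

raw = load_dataset("go_emotions", "raw", split="train").to_pandas()

# Adjust this metadata list if the schema differs; everything else is
# treated as an emotion label column.
meta_cols = {"text", "id", "author", "subreddit", "link_id", "parent_id",
             "created_utc", "rater_id", "example_very_unclear"}
emotion_cols = [c for c in raw.columns if c not in meta_cols]

def all_raters_agree(group):
    # True when every annotator handed in the exact same label vector
    return group[emotion_cols].drop_duplicates().shape[0] == 1

# Check 1: per text, does everyone agree on every emotion?
full_agreement = raw.groupby("text").apply(all_raters_agree)
print(f"full agreement on {full_agreement.mean():.1%} of "
      f"{full_agreement.size} texts")

# Check 2: how much of the data do the three busiest annotators cover,
# and how often do they agree on the texts they share?
top_raters = raw["rater_id"].value_counts().head(3)
print(f"top-3 annotators cover {top_raters.sum() / len(raw):.1%} of all rows")

subset = raw[raw["rater_id"].isin(top_raters.index)]
shared = subset.groupby("text").filter(lambda g: g["rater_id"].nunique() > 1)
top_agreement = shared.groupby("text").apply(all_raters_agree)
print(f"they fully agree on {top_agreement.mean():.1%} of their shared texts")
```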
Together, they annotated almost 7,000 examples, and not even half of those agree with each other. So this whole disagreement thing, you didn't need 82 people to discover it. If you just take three people, it seems that they immediately disagree on a lot of this stuff. And again, I would argue, this is another example of a group-by statement that could really save the day. Imagine not running this statistical check up front. Imagine training a huge model on it. Right? And again, it's the same thing: which is more important? And I hopefully don't have to convince too many people. There's a risk that you're gonna model too early if you focus too much on the modeling tools that are out there.

And just to really drive this point home, think of the standard data science pipeline, right? You start with your data, you pre-process the thing, you model the thing, out come some metrics. And if we like the metrics, we do the prediction thing. Well, if your data is not too great, your predictions are not gonna be too great, but your metrics can still be super shiny. If people have bad labels, for example, your accuracy numbers are not gonna prevent you from putting bad predictions out there.

And I don't necessarily wanna dunk on people, but I also had a look on the same Hugging Face website. And it turns out that a whole bunch of people trained models on GoEmotions and put them out there. And some of these models got downloaded like 260,000 times. And you gotta wonder, did they run the group by? Maybe not. And I also don't wanna blame them, right? If you see a dataset from Google that's had all of this effort put into it and has all the charts, I could definitely imagine that all of that distracts you from thinking about running this one little group-by statement. I do think it's a shame, because people forget how these datasets came to be, but this whole annotation process that happens beforehand is really, really important. And I do worry that maybe if we're convincing ourselves that, let's say, TensorFlow or scikit-learn are the must-have tools, that maybe we forget about the step back that we gotta do once in a while.

And, you know, with that in mind, looking back, I do gotta admit, GoEmotions is actually kind of a beautiful dataset if you think about it. Sure, there's annotator disagreement, but I also don't wanna dunk on GoEmotions here, because a part of me actually fell in love with it. For a moment here, right? How many datasets that are used for public benchmarking actually contain the annotator information? I can't think of a lot of them. And it's a great case study, actually, if you think about it. And what's more, there are actually some really interesting papers being written about this. There's one paper that I do recommend giving a read. The title is "Are We Modeling the Task or the Annotator?". It's about annotator bias. Turns out, if there's one person who has too many of the annotations under their belt, that's gonna totally bias the aggregate label that might come out. It's a really good read.

But also, GoEmotions definitely had an almost career-altering effect on me. Before, I would say, oh my God, super cool, new dataset, let's try stuff. But now I like to think I'm a bit more critical.
Before actually putting this in production, I really start to wonder, well, maybe there's something wrong with it. At least I should check some basics. And it also led me to write a library. There's a library called doubtlab. It is a relatively simple tool with some scikit-learn-based tricks to try and find bad labels in your dataset (there's a small usage sketch a bit further down). And they're just simple tricks. They're not necessarily state-of-the-art, but there's stuff in there that is generally worth a try. And if you find that there are some weird labels in your dataset, that's an excellent time to maybe pull the plug and say, let's first check our annotation process. Maybe there's something up with that. That's gonna prevent a whole lot of harm in production. If you're interested in how those tricks work, by the way, the Explosion YouTube channel (Explosion is, again, my employer) has some content I was able to make for it. So if you're interested in the techniques for finding these bad labels, definitely check that out.

But then, if I pivot back to the topic of the talk: if we're gonna be talking about what the must-have skills really are here, I mean, I get that you're not gonna be able to buy a certificate in common sense. And I also understand that it's gonna be kind of hard to find a bootcamp in critical thinking. I also don't wanna suggest that these tools are useless. They're not. Tools can be super useful, but maybe at this point in time there's too much emphasis on learning these technical tools. And maybe it would be better if we could come up with more content that puts emphasis on this human part. Maybe we just need more anecdotes to share around, and maybe that's gonna prevent more harm.

So looking back, I hope people look at this and say, well, this was indeed a talk about two datasets. And I also hope that we all agree that both of these datasets are actually super beautiful, because they tell a very beautiful story. But most of all, I hope that the story here helps give Alice and Bob and Vectorella a sense of calm, because even if we're Googling, we should worry less about these must-have and essential skills. Maybe we should just admit that tools are just tools. It's very good to learn some, but maybe it's more important to just stay aware of what you're doing and to be the human in the loop. And when you do that, you might just have group-by statements that save the day. And that, if anything else, is what we need more of in our field.

And I also want to mention this because I do get this question a lot: it's this frustration with educational content that actually led me to make calmcode. I get this question a bit, so I figured I might just mention it. Honestly, what we maybe should be doing is focusing on making educational content that is more about tools and thoughts that make your professional life just more enjoyable. And maybe we should be less braggy about the tools that we use, because they're not necessarily super state-of-the-art. They're all just tricks to help us get through the day. And I'm mentioning this explicitly because a couple of you, dear listeners, are also content makers yourselves. You do a bit of education. Trying to keep the calm in mind is the only thing I would ask, because I do think our profession will be a whole lot better if we stop bragging about the tools and we just start sharing anecdotes a bit more. I do think the field could maybe use a bit more of that.
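For anyone curious about the doubtlab library mentioned above, here is a small usage sketch. It is not code from the talk; it follows the pattern from the library's own documentation, so treat the exact class names and call signatures as assumptions to verify against the current release.

```python
# A minimal doubtlab sketch (not from the talk), following the pattern from
# the project's documentation; treat the exact API details as assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1_000).fit(X, y)

# Two simple "reasons for doubt": the model is not confident, or the model's
# prediction disagrees with the stored label.
doubt = DoubtEnsemble(
    proba=ProbaReason(model=model),
    wrong_pred=WrongPredictionReason(model=model),
)

# Indices of examples worth a second look, the most doubtful ones first.
indices = doubt.get_indices(X, y)
print(indices[:10])
```

Rows that several reasons flag at once are good candidates for a second look by a human annotator before any model gets trained on them.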
So having said that, thanks for listening. I hope this was super interesting. And if I can give one small plug: I work for this company called Explosion, and we have a bunch of cool tools like spaCy. You might've heard of that. And I can't announce anything just yet, but I can say there's a bunch of really cool stuff in the pipeline. So give us a follow, there's cool stuff coming. Just wanted to mention that. Thanks for listening. Ask me anything in Slack, and I'll be around for the conference today.