Data is the new coffee - Peter Baumgartner
Transcript generated with OpenAI Whisper large-v2.
Thank you, everyone, for watching my talk. Data is the new coffee. I am Peter Baumgartner,
and I am just going to dive right into it. So I'm a machine learning engineer at Explosion.
We created the spaCy natural language processing library and the Prodigy annotation tool. I
also lead our consulting services, so I work on various applied natural language processing
problems. And often those problems involve the process of annotating data, which is also
called labeling or coding data. For those of you not familiar, there's an example here
on this page. In this image, we have a common natural language processing task called named
entity recognition, where we're looking to identify entities and their types within a
piece of text. So right now we have models that can do a pretty good job at this task.
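As a quick, concrete illustration, here's a minimal sketch of what running one of those models looks like with spaCy; the example sentence is made up, and it assumes you've installed the small English pipeline:

```python
# A minimal sketch of named entity recognition with spaCy.
# Assumes the small English pipeline has been downloaded first:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Berlin in January.")

# Each predicted entity carries its text span and a label such as ORG, GPE, or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```
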
But the reason we have those powerful models that can make predictions about entities is
because people have done the work to annotate thousands of documents with named entities
like this one. So one part of my talk is going to be talking about this process and exploring
this question of where does our data for machine learning come from? Specifically, what's this
process that transforms unlabeled raw data by adding annotations or labels for supervised
learning tasks? But we're going to get a little weird. We're going to take a winding road
to answer that question, because first we have to talk about coffee. So I love drinking
and learning about coffee. The idea for this talk came from this book on the right here.
It's a sort of manuscript, a very long exploration into the world of professional coffee tasting.
And as I was reading, I noticed there were just a lot of similarities between the world
of professional coffee tasting and what we do when we annotate data. It raised a lot
of interesting questions about coffee for me. It's like how does a bean travel thousands
of miles to end up in our cup without having defects? When we're at the store, how do we
actually know what kind of coffee we're buying? How does a coffee taster convert these raw
taste perceptions into a quantitative and qualitative assessment of that coffee that
can be shared? Who came up with the descriptions that we see on coffee bags and menus and how
did they do that? So my twist on the NormConf theme here is illustrating that I think
there's a lot we can learn from these mundane, habitual, very normy tasks like drinking coffee.
And so we'll start getting some answers to these questions about coffee by exploring
the world of professional coffee tasting and then see what we can adopt for data annotation.
So a typical coffee tasting consists of sampling several coffees. Depending on the purpose
of the tasting, this can result in a description of their flavors, assigning numeric values
to the qualities of the coffee like bitterness or acidity, and tasting for any defects in
the coffee. We'll see an example of this in a bit. This is a screen cap on the slide from
a video that was the world's largest coffee tasting, which was a virtual event put on
by YouTuber James Hoffmann, who I'd strongly recommend if you're at all interested in coffee.
In this video, he's doing part of this virtual tasting and he's tasting five different coffees
that have been prepared in the same way with a process that's formally called cupping,
which is different than how you typically brew coffee for consumption. You know, one
difference being that you actually are going to slurp the coffee from a spoon like he's
doing here. So on the right hand side of the frame here, there's a sheet for taking notes
and scoring the coffee. And this is where the magic happens as tasters annotate their
experience and convert it into structured data. Often tasters will use a structured
form or a protocol like this one from the Specialty Coffee Association. There's a lot
going on here. So I'm just going to split up this form into four different parts. So
one thing that's happening is they're doing a numeric assessment of flavor characteristics.
So we've got fragrance, aroma, flavor, aftertaste, acidity, body. There's some descriptive notes
of the flavors. So this one is very floral, silky body, beautiful aftertaste, extremely
balanced. There are checkboxes for any defects that are noted in the coffee. And then there's
a final score that's basically the sum of all the points minus some points for any defects.
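Just to make that arithmetic concrete, here's a toy sketch of the final-score calculation; the category names, values, and defect penalty are illustrative stand-ins, not the official Specialty Coffee Association formula:

```python
# A toy sketch of the scoring described above: sum the individual quality
# scores, then subtract a penalty for any noted defects. Values are invented
# and this is not the official SCA cupping formula.
quality_scores = {
    "fragrance/aroma": 8.0,
    "flavor": 8.5,
    "aftertaste": 8.0,
    "acidity": 8.25,
    "body": 8.0,
    "balance": 8.25,
}
defect_penalty = 2.0  # e.g., points deducted for a defective cup

final_score = sum(quality_scores.values()) - defect_penalty
print(f"Final score: {final_score}")
```
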
So coffee tasters completing this form and doing a tasting is similar to annotating data.
And the benefit for us is that they've been doing it for a long time and have these established
methodologies that appear to be working. So now I'd like to transition into talking about
practices in professional coffee tasting that I believe can be of service to those of us
who rely on annotated data for data science. I'm going to talk about five different practices
that exist within the world of coffee tasting and draw some analogies to annotation. All
right. So first up is this idea of calibration and agreement. So professional coffee tasting
sessions usually begin with a day of what's called calibration. What happens here is before
evaluations and descriptions are formally recorded, tasters work together to come to
a sense of agreement on what various descriptors and scale values mean. This is important because
everyone has different backgrounds and tasting abilities and experiences that are going to
merit the need for calibration. If everyone is proposing their own idea of what an 8.5
for flavor means, a large part of the value of the tasting is going
to be lost due to that inconsistency. There was an interesting example of this in the
book that I mentioned. So typically when the descriptor orange is used in North America
by coffee tasters, it refers sort of straightforwardly to the flavor of the orange fruit. However,
tasters from India typically use the descriptor orange as an iconic descriptor where it essentially
serves as a substitute for the idea of good flavor. So this is just one of many examples
of how interpretation and judgment can be influenced by sort of local characteristics
of a tasting. So luckily with annotated data and with digital information, we have a way
to measure how much people agree on the annotations and the labels. This is the idea that is called
inter-annotator agreement or inter-rater reliability. And these are measures that you can calculate
when you have multiple annotators labeling the same set of examples. When you're starting
a new project and you don't know how difficult your annotation task is, agreement metrics
can provide extremely valuable information. If two humans can't agree on how something
should be labeled, your model isn't going to output consistent results and you and your
users want consistent results. Additionally, you're going to bias your training dataset
based off the individual whims of your annotators. So like imagine if we had a magical machine
learning model to predict whether coffee had orange flavor or not. If half the dataset is annotated by North American tasters and half by tasters from India, what is it exactly
that we'd expect that model to produce? So starting a project with multiple annotators,
calculating agreement metrics and revising your project based on agreement, to me, seems
like a cheat code for successful machine learning projects.
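If you want to try this yourself, here's a minimal sketch of computing Cohen's kappa with scikit-learn; the two annotators and their labels are invented for illustration:

```python
# A minimal sketch of an inter-annotator agreement calculation using
# Cohen's kappa from scikit-learn. The labels are made up: two annotators
# deciding whether each of ten coffees tastes of "orange".
from sklearn.metrics import cohen_kappa_score

annotator_a = ["orange", "orange", "none", "none", "orange", "none", "none", "orange", "none", "none"]
annotator_b = ["orange", "none",   "none", "none", "orange", "none", "orange", "orange", "none", "none"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 means perfect agreement, 0.0 means chance-level
```
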
As a real example of this, here's a chart displaying an agreement metric, Cohen's kappa, plotted against the F1 scores for a model trained on Google's GoEmotions dataset. Vincent covered this dataset a little
bit in his talk. I think this dataset is sort of like a mini dumpster fire, but one of the
good things that they did do was have multiple annotators and calculated agreement measures
on their annotations. So the goal of this task, just as a refresher, is to classify
Reddit comments into one of 27 emotion categories. So each dot here is the agreement score for
a single emotion category on the horizontal axis and then the F1 score from that model
on the vertical axis. So there's super high correlation here. So hopefully it's clear
that having multiple annotators and calculating these agreement metrics can tell you kind
of early if your model's going to be trash. So if you agree with me that agreement metrics
are awesome and you're using them on a project and find out you have a sort of poorly performing
task with low agreement, what is it exactly that you would be fixing? So we'll talk about
that as it relates to structure and process. So we saw that in coffee tasting, they have
this form, this protocol. The form guides tasters towards what aspects of the coffee
to pay attention to and how to pay attention to them. Coffee tasters also go through years
of training and refining their perceptions and descriptions of taste. Essentially, there's
just a deliberate effort to document and scale this task of tasting coffees and making sure
this process is as shareable and consistent as possible. So here's another example of
a tool that provides some structure. This is a flavor wheel. I know it's too small for
you to read anything. Don't worry about the details, or go Google "flavor wheel" when you have more time. This one's from Counter Culture Coffee, and what it does is just list out
common descriptors or flavors in coffee in a hierarchical manner. So it's just a nice
tool that coffee tasters can use. So for projects needing data annotation, the equivalent tool
to the cupping form or something like a flavor wheel is called annotation guidelines. So
annotation guidelines serve as your reference for how data should be annotated or labeled.
Here's an example of one that I found by just Googling "annotation guidelines": the TimeML project. And essentially, this is just a specification for marking up temporal events within text.
So, their annotation guidelines include detailed descriptions of all their tags, instructions on how to annotate with them, and most importantly, numerous examples of annotated texts.
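To make that concrete, here's a purely hypothetical sketch of what a couple of guideline entries could look like if you kept them in machine-readable form alongside your project; the labels, definitions, and examples are invented, not taken from TimeML:

```python
# A hypothetical sketch of annotation guidelines kept as data: each label gets
# a definition, annotation instructions, and worked examples. Everything here
# is invented for illustration.
GUIDELINES = {
    "EVENT": {
        "definition": "Something that happens or occurs in the text.",
        "instructions": "Annotate only the head word of the event phrase.",
        "examples": [
            {"text": "The company announced a merger.", "span": "announced"},
        ],
    },
    "TIME": {
        "definition": "An explicit temporal expression.",
        "instructions": "Include the full expression, e.g. dates and durations.",
        "examples": [
            {"text": "She left on Tuesday morning.", "span": "Tuesday morning"},
        ],
    },
}
```
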
So, without guidelines, it's difficult to know exactly what you're evaluating when you do
something like a model evaluation. Your model could be performing poorly because of inconsistencies
in your annotations on your training or your evaluation data sets. Now, this isn't a document
that you need to have perfectly defined when you start a project. Your annotation guidelines
can be developed iteratively over the course of the project and then solidify over time.
You really want them to be finalized, though, by the time you're annotating what we usually
call like your gold or your test data set. And you also want to be sure that you're reapplying
those finalized guidelines to your training data for consistency. So, speaking of iteration,
let's talk a little bit about that. So, this one, I'm going straight to the example. Here's
a bag of coffee I had a few months ago. I've highlighted the profile of this coffee on
the bag. And it says it is apple, toffee, and milk chocolate with a medium juicy body.
So, a natural question here is, you know, is the flavor of apple actually in this coffee
once I brew it and taste it? Is it my job to extract this inherent taste sensation from
my experience while drinking the coffee? And that's certainly one way to think of drinking
coffee as a task. But the tasting notes in this profile can also serve a different purpose. So, rather than being a specific experience I'm trying to extract, this apple flavor can be used as a jumping off point for additional discoveries.
I might think of the flavor of apple, have a sip and try to locate that flavor and fail
to do that. But that experience can lead me towards another flavor that I can experience
more clearly. So, I could use that experience the next time I take a sip and see if there's
something more clear that comes through. This is the idea of operating reflexively. It's
like bootstrapping your experience by adopting an initial task and then, like, revising it
in a cyclical process. So, we can adopt this same reflexivity in
annotation and labeling data. If you have a new dataset and a new task and the concept
or idea in which you're annotating is still sort of not well defined, you don't have to
have your annotation scheme finalized or a production ready model in order to use those
tools to help you explore the problem space. You can just make up an annotation task for
yourself that will give you exposure to the data and experience with translating this
raw data and information into a structured task. It's fine to start with something simple and discover along the way that your task is actually something else as you encounter more and more data. Your goal here is to understand the data through this process.
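As an example of how low-tech that first pass can be, here's a rough sketch of a throwaway labeling loop in the terminal; the file names and the label set are placeholders for whatever you're actually exploring:

```python
# A rough sketch of a made-up, throwaway annotation task: read some examples,
# label them by hand in the terminal, and save the results for later comparison.
# File names and labels are placeholders.
import json

annotations = []
with open("examples.jsonl") as f:
    for line in f:
        text = json.loads(line)["text"]
        print("\n" + text)
        label = input("label (pos / neg / skip)? ").strip()
        annotations.append({"text": text, "label": label})

with open("first_pass_annotations.jsonl", "w") as out:
    for row in annotations:
        out.write(json.dumps(row) + "\n")
```
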
Another place this reflexivity shows up is with model-in-the-loop workflows. In this case, you have an existing model you've trained that's making predictions on the data you're annotating. Here, the reflexivity stems from using the model's predictions
to better understand what type of data you might additionally collect or annotate. In
short, you can perform a simplified task or intermediate version of your final task as
a stepping stone to your final solution. You know, don't expect to get it right the first
time.
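One common way to use a model in the loop, sketched below in generic form, is to sort unlabeled examples by how uncertain the model is about them and annotate those first; the model object and its predict_proba method are stand-ins, not a specific library's API:

```python
# A generic sketch of a model-in-the-loop idea: use an existing model's
# predictions to choose which examples to annotate next. The model object and
# its predict_proba method are assumed stand-ins for whatever you actually use.
def most_uncertain(texts, model, n=50):
    """Return the n texts the model is least sure about (probability near 0.5)."""
    scored = []
    for text in texts:
        proba = model.predict_proba(text)   # assumed: probability of the positive class
        uncertainty = abs(proba - 0.5)      # 0.0 means maximally unsure
        scored.append((uncertainty, text))
    scored.sort(key=lambda pair: pair[0])
    return [text for _, text in scored[:n]]
```
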
So this very formal, protocol-based tasting isn't the only kind of tasting around. There are actually three different categories
of tasting. I've covered descriptive tasting, which is this long process with the form.
There's also discriminatory tasting, which is like tasting for defects. And there's also
hedonic tasting, which is just discovering whether a taster personally likes or dislikes
a coffee. So these different tasting tasks get employed strategically, and we should
be thankful for that. As lay tasters, we don't want to complete an entire form to determine
if we enjoy a cup of coffee that we brewed in the morning.
Flexibility is important because it helps you not waste time solving the wrong problem.
On the left here are examples of two types of defects in coffee beans, insect damage
and fully black beans, which usually result from environmental issues like frost. Fortunately,
these can be visually inspected for issues before the coffee is tasted. But if the process
didn't include that task of checking for defects, you know, these beans could get all the way
to an actual tasting, and we'd have people wasting their time with the wrong task, which
is tasting coffee when there are known defects. In the same way, you have to think about the
suitability of your annotation task relative to sort of like the use case for your model
and the limitations of your data. So my favorite task to pick on, of course, is sentiment analysis.
So let's say we wanted to know whether a movie review was positive or negative. I've got
a real life example here from the movie Birdman, which I love. So this is just a paragraph
from that review. Now, most sentiment models would just classify this text as negative.
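For reference, getting that kind of coarse, document-level prediction is nearly a one-liner these days; here's a minimal sketch with the Hugging Face transformers pipeline, using a placeholder sentence rather than the actual review text:

```python
# A minimal sketch of coarse, document-level sentiment classification with the
# Hugging Face transformers pipeline. The text is a placeholder, not the actual
# Birdman review quoted in the talk.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("The pacing dragged and the score was grating throughout.")
print(result)  # e.g., [{'label': 'NEGATIVE', 'score': 0.99}]
```
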
And we can work with this coarse sense of sentiment if it's valuable for our product
to know generally how much people enjoy a movie. But if we're interested in more nuanced characteristics of the movie, like the quality of the musical score or the cinematography or the strength or weakness of the acting, you know, a broad sense of sentiment isn't
going to help us answer those questions. So here's an example of a paragraph later on
in that same review, which actually was positive overall, talking about the cinematography
of the movie. So you have to align your raw data, your annotations and your use case and
make sure all these things sort of line up and that you're solving a suitable task. Another
point here, one that other people have mentioned, is that you should always be assessing
whether machine learning and supervised learning specifically is the right approach for your
system. You should be thinking about machine learning as one component of a larger system,
not your whole system in itself. In the same way, you know, it's not the job of coffee
tasters to determine the tasting notes that appear on a bag of coffee or the description
you might see in a coffee shop or how a blend might be successfully marketed. Instead, they're
focused on trying to describe the coffee as objectively as possible and in a standardized
way. All right. My last point is this idea of
full stack collaboration. I apologize if this sounds like a weird management consultant
buzzword, but it was the best thing I could come up with. So hear me out. So coffee farmers
don't often taste their own coffees, actually. Typically, they're just trying to achieve
the highest yield, which just means the lowest number of defects. So generally, there's not
a lot of interest in positive taste characteristics. So farmers don't know exactly what they're
selling a lot of times. This is one of the areas that people in the coffee industry are
actually trying to change so that farmers can better understand what it is they're selling
and improve their crops. And in the same way, I think machine learning
projects kind of have some sort of communication and collaboration risk if this whole sort
of team pipeline isn't working correctly and not everyone understands what the others are doing. So the risk here is having a project that's sort of too heavily focused on one
team in this diagram. So focusing on the left, I sort of have this thought that when you look at a lot of organizations attempting to do machine learning or research projects, they think the real problem is trying to find a cost-effective annotation team and getting
data labeled for as cheaply as possible. And on the other end of the spectrum, you have
spaces like Kaggle competitions that are framed in this sort of like, sterile product management
business problem perspective, like you just get a problem and they want you to bring the
data science and you can't ask that company to annotate more data or provide additional
context. And then finally, you know, data scientists
and machine learning engineers or whatever you are, my green screen is tipping over now
because I'm stepping on it. Whether you're a machine learning engineer or data scientist,
we're also just as guilty of this, too. Numerous times, I've been like, I just want to get to coding, I'll just download a pre-trained model, get the dataset, and just go to town. And that's been to my own detriment later, I've found. So, you know, there are
downsides of focusing too much on one of these groups. If data science doesn't have insight
into the annotation task, you can waste a lot of time on a task that might make logical
sense and have good annotation guidelines, but be poorly executed or result in a data
set that's just not workable with machine learning. The data science team also needs
to help incorporate model-in-the-loop workflows and perform error analysis on early iterations of the model so the task can be refined. Data science also
needs to communicate with the product team. It's unlikely that the things that machine
learning can do out of the box with a pre-trained model or framework are going to be a direct
solution to your business problem. And it's important to have product involved in the
annotation task because the things that you annotate are going to be the outputs of the
system. You need to be sure that those outputs are relevant and meaningful to the product
that you're developing. I would say it's not out of the question for everyone on any of
these teams to annotate data at some point in a project life cycle. I think the best
way to understand the details of the data and the annotations relative to each team's
goals is to actually do that process. So by neglecting to think about where our coffee
comes from and sort of this process of coffee tasting, I think we're missing out on, you
know, delightful coffee experiences and aspects of our coffee that we weren't aware that we
could enjoy. And in the same way, I think we're missing out on a lot when we ignore
where our data comes from and fail to implement processes that make it better. It's not enough
to accept your data as a given of a project and move on. If we care a little bit more
about where our data comes from and how our data gets annotated, I think we can also have
more successful and more delightful data science projects. Thank you.