Data's desire paths: shortcuts and lessons from industrial recommender systems - James Kirk
Transcript generated with OpenAI Whisper large-v2.
I'm James Kirk. I was one of your MCs this morning, if any of you are still awake after
the early start today. But I'm back and I wanted to tell you a bit about a subject that
I really, really love. And I've gotten to work on a bunch, which is recommender systems,
and particularly a lot of the ways that building recommender systems can get really thorny
and challenging and also a lot of fun. And my favorite way to rationalize recommender
systems is as a desire path. If you're familiar with these, the desire path is what happens
when a whole bunch of people all decide to walk through a field in the same way to get
from point A to point B. And eventually the grass dies and you get a bit of a trail. And
sometimes people will come along after and say, all right, well, if people always wanted
to get from point A to point B there, then maybe we should pave it. Maybe that's
where the path should always have been. I think this is kind of a neat metaphor for
recommenders themselves for kind of an obvious reason, right? People like to get from
A to B, and it's your job, the recommender system's job, to get them from A to B. But I also think,
when you go one level more abstract, when you're building systems like
recommender systems or ML systems in general, this is kind of how you should think about
your process of building them from scratch, developing them, iterating on them, getting
them out into production and making something that you actually can leave behind to continue
to iterate on for your peers going forward. So a bit about me. My name is James Kirk.
That's what I looked like when I had longer hair and was therefore a bit cooler. I'm the
co-founder of Meru. We're building some new ways to recommend with humans in the loop, particularly
content creators, making recommendations to the people who follow them. Previously, I spent
the bulk of my career at Spotify working on recommenders, and also worked at
a couple of other companies around Boston, plus a stint at Amazon. But today, you're
not really here to learn about me. I want to tell you a story about you. This might
be a bit familiar to some of you if you've built recommender systems before, because
you're an engineer, you're a data scientist at your company. You've been there a while.
You've been working on a couple of different projects. You've gotten familiar with the
tech stack, the data. You've put some ML models in prod. And then one day word comes
down from above: somebody opened up TikTok, discovered something really cool, and wants
to know why we can't build that, or can we build that, or why haven't we already built
that, and how do we get there? How do we get recommendations like some of these products,
especially consumer products that tons and tons of people are engaging with? This can
probably make the hair on the back of your neck stand up the first time you hear it.
But I think it's also a really great opportunity because this can be a really fun space to
dig into if you've never built recommender systems before, as long as you know some of
the pitfalls that you're going to want to avoid. The first and most salient one comes
very simply from the origin here: most recommender projects, from the get-go, are poorly
framed. By poorly framed, I mean that the idea hasn't really been fleshed out to a
degree that gives you, the engineer or data scientist, enough clarity to develop something
that will be healthy either in the short or the long term. Now, when this comes from above,
this isn't necessarily leadership's fault because they might absolutely have the right
idea that the business has data that could be used in such a way to really delight your
users and it's strategically valuable to build recommender systems and improve your product
with them. But maybe there's a lot of really loose concepts or inspirations flying around
about how that recommender might really work. And when that's really amorphous, it can be
really challenging to actually build those systems and get them out into the real world.
Just as often, these poor framings come from the bottom up. Sometimes somebody reads the
new paper from the RecSys conference and it's super cool and flashy and they want to stick
transformers into the loop of their data somewhere. And then what does that look like?
Maybe it's recommenders, et cetera, et cetera. It becomes a technology or a solution in search
of a problem. That's just as thorny, and it's also your job, if you're working around these ideas,
to figure out how to take that idea, which might be perfectly aligned in the long term
to something really wonderful and valuable and form it into a project that will actually
be healthy. So you're in the jungle now. Pitfalls abound.
You have just crossed the first one, or at least you see the first one coming,
which is around the problem framing. And if your company has never built recommenders
before, there probably really are no desire paths for you here. There's not necessarily
an organizational competency around what these projects look like, how to develop them, how to iterate
on them, how to maintain them long-term. And it's your job on this team to just get it
right the first time. Get it right the first time, not perfectly, not a perfectly well-paved
highway for the team to be sending a thousand models and experiments down or anything, but
start to carve that path in a way that other people would actually be able to follow in
the future. When you're framing this and you want to make
sure it's a healthy project, I think there are really four things that you've got to look
for. And these are often missing when the idea to build recommender systems first comes
up. The first and most salient in my mind is a clearly defined user. Does this matter
to your customer, to the user, to the person who's going to be receiving recommendations?
And on top of that, do you have any specificity about what their needs really are? Is this
somebody who will actually respond to recommendations, wants recommendations in their product and
will get value out of them? A big misalignment that often happens is that one party,
like leadership, sees one group of users as the recipient of the recommendations,
while product management sees things differently. Maybe executives are
looking at the strategy and saying, all right, it's got to be for 18-to-24s in the
US that are casual users, and we want to make sure their experience is perfect. And then
product is saying, oh, it's going to be perfect for everybody. Our algorithms need to understand
every single user, cold starts need to be perfect, et cetera. You need to help get people
aligned on who exactly this clearly defined user is going to be. Otherwise the waters
are already choppy. Next up is a measurable definition of success. When you build this
system, this product, and you put it in users' hands, how will you know if it's any good?
In plain English, how will you know if it's any good? How will you know that people's
experience is delightful? And also quantitatively, do you have metrics that will tell you that
users are responding well to recommendations? And if not, how can you develop those metrics?
The burden is often on you as engineering or data science to help define this metric
of success, both in plain language, as well as quantitatively. And you need to make sure
that all these stakeholders are also aligned about what that definition is. Next up is
that you need somebody to be able to tell you in clear language, why that matters. Why
does it matter to your business, to your organization, that that recommendation is successful? There
are a lot of projects that create really great recommendations. They're really delightful
for people, but are unable to actually drive value for the business, and they languish.
And sometimes this is how these passion projects die. They never get in front of enough people
that they could have if they also had a clear relationship to what makes the business successful.
So seek that out and make sure that especially leadership of a recommender project understand
that relationship. Last, and this is squarely your domain as data and engineering, is to make
sure that the data and tech stack the company has are ready to build some kind
of recommendation system. Nobody wants to get into a project where you have to tell
them, okay, well, we need six months of explicit feedback data collection to even dream of
starting a recommender here. Make sure that whatever these concepts look like, they're something
you can deliver with the data and tech stack your company has. So this, I consider kind
of the first shortcut is just crisping up the requirements, which doesn't really feel
like a shortcut. It feels like more work, right? But the reason it's a shortcut is because
it keeps you from going in circles here. There are projects that can go on and on and on
just trying to crisp up what the actual real world definition of success is, or they can
go on and on and on trying to eventually get to the right data to implement recommendations.
And if you make sure that everybody's crisp and aligned on those last four points
before you start the engineering, you're in a much better situation to just drive through
and start building things that will drive value for real users.
As you're crisping up these requirements, you're also going to start to figure out that
there are a lot of different flavors of recommendation and ways that it could work. One of them is
your classic flat recommendation. This is not personalization or anything fancy like
infinite feeds. This is just: you are here, you're looking at, if you're me, my monthly batch
of beard butter, and you're getting recommendations of things that I might
also want to buy. To be able to make these recommendations, Amazon doesn't need any information
about me. There's no database table saying James has a beard or anything about James,
but just the context, just by being on this page, they know something about me that
is useful for making these recommendations. So knowing that you don't need to know much
about the user to make these recommendations tells you a lot about what you need to build
to satisfy them.
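To make that concrete, here's a minimal sketch of how a context-only, item-to-item recommender like this could work, assuming nothing fancier than co-occurrence counts over purchase baskets. The item names and data are made up, and this isn't claimed to be how Amazon actually does it:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy purchase "baskets": all we need is which items were bought together.
# Note there's nothing about who the user is.
baskets = [
    {"beard_butter", "beard_comb", "beard_oil"},
    {"beard_butter", "beard_oil"},
    {"beard_comb", "scissors"},
    {"beard_butter", "beard_comb"},
]

# Count how often each pair of items shows up in the same basket.
co_counts = defaultdict(Counter)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def also_bought(item, k=3):
    """Recommend the k items most often bought alongside `item`.

    The only "query" is the item on the page; no user profile is needed.
    """
    return [other for other, _ in co_counts[item].most_common(k)]

print(also_bought("beard_butter"))
# e.g. ['beard_comb', 'beard_oil']
```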
The next layer deep is when you start to go into real personalization. These tend to be
products that reflect the user back, based off of the interactions that they've
had in the past. This means that we need to engineer for the interaction data and for
user data to be accounted for in the recommender system. One of the ways that you kind of know
personalization systems are what's on the horizon for you is when it's not just about
the algorithms, but about the copy. What are you saying about the recommendations?
Are you saying it's for you? With Spotify's Discover Weekly, they're saying it's your
weekly mixtape. All of that "you, you, you" is telling you a lot about what the user's expectations will be.
And so if design, copy, and product are all telling you that "top picks for James" and "your weekly
mixtape" are critical components, you know you're talking personalization, which means
you know you're talking user data and interaction data.
And another flavor of this is called omakase recommendation. Omakase is a Japanese word
meaning, I think, "chef's choice" or "I'll leave it up to you," something like that, often used
in sushi restaurants. It's, you know, chef's choice. You take it away. You see this a lot in these infinite-feed
style recommendations where there's not really a framing or copy or anything. You just
dive in and you get some content and it just keeps flowing. It just keeps going. You also
see this a lot when you're working with voice assistants. When you see something like, hey,
I'm not going to say it because otherwise I'll probably trigger a couple hundred of
them, "hey, blank, play music," that kind of thing is a very loosely
formed request for recommendations, but the system underneath it needs to be able to satisfy
it, usually a very long session of that, at consistent quality. And that's a very
high bar for the recommender systems and data underneath.
So get crisp about which one of these you're actually building. Make sure that you in engineering,
and that product, user research, design, and leadership, are all aligned about which one
of these you're building. Make sure that you have a really strong hypothesis that this
is what the customer really wants, especially when you're starting from scratch. Has user
research indicated this is something that would delight them? Have prototypes or mockups
been shown to users that give you some feedback about how it will be received? And most importantly,
when you've launched this, how will you actually know that you've succeeded? How will you know
that this is delightful for the users that are receiving it?
So now that it's starting to solidify, you might start feeling like this recommender
system that's crystallizing in front of you is starting to feel a lot like search. Search
is a very, very similar field to recommender systems. There are a lot of things that overlap
between the two, but a very, very important thing to keep in mind here, especially for
somebody who has experience in search, is that they're not quite the same. And those
not-quite bits are where all the lift really is going to come from. When you're able to
take things that come off the shelf from search and apply them, knowing the ways that recommender
problems and products are different, that can be really, really powerful.
But keep in mind that they're not quite the same. There's good news though. I mean, they
are pretty close and that's a really great shortcut because search is quite mature. Lots
of organizations do have a competency in search. They have search technology. There are off-the-shelf
search tools that you can apply to recommender problems if you know how to apply them and
put them together in a way that builds a healthy recommender system.
The most similar thing between search and recommendations is generally that you're retrieving
from a large set of candidates. Search is looking through millions of documents. Recommendations
are searching through millions of items and you're trying to come up with a couple of
the right things to show to somebody. But often that's about as deep as the similarities
go. That's where they start to diverge a bit. On the search side, users are generating a
query. They're telling you something that they want and it's your job to satisfy that.
On the recommendation side, it's a little fuzzier. The user, the context they're in,
their history, all kinds of things kind of become the query, but you kind of have to
squint to look at it to call it a query. You end up constructing things that look a lot
like queries and can then run through search-style systems like queries. But keep in mind
again, this isn't somebody explicitly asking for something, and that can often lead you
into trouble if these queries are poorly constructed.
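As a rough illustration of that squint-and-call-it-a-query idea, here's a sketch assuming a simple vector-space retriever. The function names, weights, and embeddings are all hypothetical, not anything from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend catalog: one embedding per item, the way a search index holds documents.
item_ids = [f"item_{i}" for i in range(1000)]
item_vecs = rng.normal(size=(1000, 64))
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)

def build_pseudo_query(history_vecs, context_vec, history_weight=0.7):
    """Nobody typed a query: we construct one from recent interactions
    plus whatever context we have (current page, time of day, device, ...)."""
    history = np.mean(history_vecs, axis=0)
    q = history_weight * history + (1 - history_weight) * context_vec
    return q / np.linalg.norm(q)

def retrieve(query_vec, k=10):
    """Search-style retrieval: score every candidate against the pseudo-query."""
    scores = item_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [(item_ids[i], float(scores[i])) for i in top]

# A user who recently interacted with items 3, 17, and 42, currently on item 3's page.
query = build_pseudo_query(item_vecs[[3, 17, 42]], context_vec=item_vecs[3])
print(retrieve(query, k=5))
```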
In search, you tend to be optimizing for the first couple of things that you're retrieving
for the user. You want to get those first few results right because the person came
in looking for something and you need to get it for them. In recommendations, that still applies.
You still want the first couple of recommendations to be good pretty much all the time. But you get
a lot of other factors that start to really impact recommendations. You get slate diversity.
How broad are my recommendations that I'm seeing on this shelf? You get novelty. Am
I seeing the same things every time I interact with these recommendations? You get interactivity.
If somebody is giving you a thumbs up or a thumbs down on something that they liked or
didn't like, are you responding to that in a way that makes sense to the user? These
things that you're optimizing for are where recommendations and search tend to really,
really diverge. So if you're somebody who is experienced building search systems, you
probably are starting to feel like, oh, okay, I kind of recognize how recommender systems
work. All I have to do is take what I know, break it down into its constituent bits, and
put them all back together again to make something new. This means that you're pretty well positioned
to be effective at building recommender systems. You have a lot of these bits of knowledge
and competencies. The biggest risk is kind of standing in your own way by keeping on
thinking about recs just as search problems. I would recommend that you just consistently
re-anchor yourself. Remember that recommendations are a different kind of user experience. Think
about how your user is receiving them and why that's not quite search. And then I think
you are going to blow people away with the things that you're able to apply from the
search domain to recommendations. Use these Legos, but make sure you remember that you're
putting together something totally different than what you've done before.
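One common way to act on that slate-diversity point, offered here only as an illustrative sketch and not something the talk prescribes, is a maximal-marginal-relevance style re-rank that trades pure relevance against similarity to items already on the slate:

```python
import numpy as np

def mmr_rerank(candidates, scores, vecs, k=10, lam=0.7):
    """Greedy MMR: repeatedly pick the item that balances relevance (scores)
    against similarity to what's already on the slate, controlled by lam in [0, 1]."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = scores[i]
            redundancy = max((float(vecs[i] @ vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

# Toy usage: 50 candidates with random embeddings and relevance scores.
rng = np.random.default_rng(1)
vecs = rng.normal(size=(50, 16))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
scores = rng.uniform(size=50)
slate = mmr_rerank([f"item_{i}" for i in range(50)], scores, vecs, k=10)
print(slate)
```

Setting lam closer to 1 gives you a pure relevance ordering; lower values spread the slate out at the cost of some raw score.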
A colleague of mine named Carl said this recently. He tooted it over on Mastodon: that RecSys
is quite similar to search, to information retrieval, except there's a lot of nuance
that goes into building that quote-unquote query about the user. This is a lot of the,
for lack of a better word, artistry of building recommender systems, and it's where a lot
of people get lost when they're coming from search over to recommendations. The second
place that people sometimes get really tangled up in recommendations is simply that the offline
metrics for recommender systems tend to just be kind of bad. They're a little misleading.
They don't tell us really what we think they're telling us, and a lot of them are borrowed
from adjacent fields like search or other areas of information retrieval, but they don't
quite apply squarely to recommenders. If you've worked in search or recommenders, you might
be familiar with NDCG. By the way, I promise this is the only equation in the whole talk.
NDCG is a very commonly used metric. It's effectively asking how good are the top
few recommendations and how high up have we packed the really good stuff in that recommendation
set. It works really nicely in a lot of problems. You know it's fancy because it's got Greek
letters in it. I know that sigma is not one of the fancy, fancy Greek letters, but we'll
still give it some credit.
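The slide itself isn't reproduced in this transcript, but the usual definition (in the common exponential-gain form) looks like this:

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

where $rel_i$ is the relevance of the item at rank $i$ and $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of the ideal ordering, so $\mathrm{NDCG@}k$ lives in $[0, 1]$.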
It's often applied to recommender systems because it's very commonly used in search systems, and there are a lot of tools available for it. You've probably
trained up a recommender system offline. You have a bunch of data about historical user
reactions. You can calculate NDCG, and it gives you a number that tells you something
that looks like my algorithm is good or my algorithm is bad. Then you're going to start
to run some of these algorithms online. You're going to start to collect online feedback
data about which of these algorithms are good and bad. What I am advocating for you to do
is to plot it. Just plot it. Just take the offline scores that these metrics, especially
the borrowed ones, are giving you, and for every algorithm, plot on a scatter where
that offline metric said the algorithm performed against what an online metric, something
that really is about user satisfaction with the recommender system, tells you
about how good those recommendations were. You probably expect it to look something like
this. You start with something truly random, that pink dot. It's not very
good on offline metrics, and it's not very good online because it's just a bunch of randomly
ranked stuff, but as your offline metrics get better, the algorithms start to get better and better online.
Maybe there's some diminishing returns there, but generally, you have some responsiveness.
More often than not, when you first start a project, you will find that your offline
metric gives you no response whatsoever in your online metric. There are big, big changes
in the offline performance that might not have any appreciable impact in the online
performance of your recommender, and there are dozens of causes. It could be the product
itself. It might not be sensitive to these changes. It might just be that users aren't
really responding to these changes. Maybe the product just is poorly framing the recommendations,
so how good the algorithm is doesn't matter, but don't be surprised if you see a chart
like this. Also, don't be surprised if you see a chart like this. Sometimes what you'll
find is that your random treatment, when you're truly just picking random stuff, is not
good, and all of your recommenders are much better online, but the sensitivity within
that group is very, very poor. What that's kind of telling you is that there's not much headroom
to be gained online from just squeezing more and more and more NDCG out of your algorithms
offline. This probably tells you that NDCG is the wrong metric for your problem, and
you're going to want to find some new way offline to measure how good your recommendations
are, and you should experiment to find those right metrics. So there's another shortcut
for you. Just design those first experiments, not for the best algorithm to go out online,
although you'll maybe have some serendipity there, but design them to validate your experimental
methodology itself. Make sure that your metrics matter, and make sure that you're going to
be able to actually cut through this hedge maze, or you could get lost in it forever,
squeezing more NDCG out in a way that just truly never matters.
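Here's a minimal sketch of that "just plot it" advice, assuming you've logged one offline score and one online metric per algorithm variant. The algorithm names and numbers below are made up for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical per-algorithm results: offline NDCG@10 vs. an online
# satisfaction metric (e.g. save rate) measured for the same variants.
algorithms = {
    "random":        (0.05, 0.010),
    "popularity":    (0.21, 0.031),
    "item_cf":       (0.29, 0.033),
    "matrix_factor": (0.34, 0.032),
}

offline = [v[0] for v in algorithms.values()]
online = [v[1] for v in algorithms.values()]

fig, ax = plt.subplots()
ax.scatter(offline, online)
for name, (x, y) in algorithms.items():
    ax.annotate(name, (x, y))
ax.set_xlabel("offline NDCG@10")
ax.set_ylabel("online save rate")
ax.set_title("Does the offline metric predict the online one?")
plt.show()
```

A flat cloud on a chart like this is exactly the "no headroom from more NDCG" signal described above.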
Just a real quick one for you: does it scale, and does it need to? In a lot of cases,
especially when you're prototyping recommender systems, you actually don't have the kinds
of scale problems that you would expect if you were going to production or full scale,
and often a lot of very simple solutions will give you everything you need to experiment.
A little story here, we built Spotify's first podcast recommender. This was like five years
ago. This model's long dead. And when I built it, there were only 10 million podcast listeners
on all of Spotify, and only 10,000 podcasts. Spotify had very mature systems for running
recommenders, but we just kept it simple. We ran one Python script on one big box, dumped
10 recommendations per user to Bigtable, and served them on the homepage. And this scaled
for about half a year as both of those numbers doubled. So just keep in mind that often scale
can lead you to thinking that you need to go with a lot of technologies that aren't
really improving your speed of iteration, and the simple stuff lets you iterate very
quickly. So just keep serving the simple stuff for as long as you can.
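In that spirit, here's a toy sketch of what a one-script batch recommender can look like. The scoring is deliberately trivial (popularity), the data is fake, and the version described in the talk wrote its rows to Bigtable rather than an in-memory dict:

```python
from collections import Counter

# listens[user] = list of podcast ids the user has listened to (toy data).
listens = {
    "user_a": ["pod_1", "pod_2", "pod_2", "pod_3"],
    "user_b": ["pod_2", "pod_4"],
    "user_c": ["pod_1", "pod_4", "pod_5"],
}

# Global popularity as a trivially simple scorer; a real script might use
# item co-occurrence or a small factorization model instead.
popularity = Counter(pod for history in listens.values() for pod in history)

def top_n_for(user, n=10):
    """Top-n most popular podcasts the user hasn't already heard."""
    heard = set(listens[user])
    ranked = [pod for pod, _ in popularity.most_common() if pod not in heard]
    return ranked[:n]

# "Serving layer": one row of precomputed recs per user. The production
# version pushed these rows to Bigtable and the homepage just read them back.
recs_table = {user: top_n_for(user) for user in listens}
print(recs_table)
```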
So those are my four shortcuts for you, and my one last thought to leave you with is simply
to remember, and check a couple of these off your Normconf bingo, that recommender systems
are about people. They're about their behavior, what they want, what they need, what they
care about. They're by people: you, and the things that you think about content and about your users,
are all getting baked into your system, and that can be good or bad. And they're for people.
You want to make people happy with the recommendations you're giving them. You want it to be delightful,
and that is what matters most. So thank you so much. If you disagree with anything, we
live in a beautiful age where you can yell at me on more platforms than ever. You can
find me at any of those, and I'd love to hear from you. And yeah, let me know if you have
any questions.