All my machine learning problems are actually data management problems - Shreya Shankar

Transcript generated with OpenAI Whisper large-v2.

My name is Shreya. I'm a PhD student at the UC Berkeley EPIC Lab, and I work on data management for machine learning, so very much continuing with the theme of what we've been doing today. In previous lives I've been a machine learning engineer, and the reason I went to do a PhD is that it's a fully funded opportunity to sit back and fundamentally rethink the way we do things. So that's why I'm here, and I'm super excited to share what I've been thinking about over the last couple of weeks. This talk is brought to you by hundreds of hours of studying databases, debugging machine learning, and Stable Diffusion.
It's no surprise that machine learning is taking the world by storm in the way we develop applications, across industries and across companies of different sizes, but it's not very pretty in practice. We have a whole conference track dedicated to this: there are a bunch of bugs that happen when you have machine learning in production. I'll talk about two of my favorites. In one, you have an ML model that takes in a feature that relies on a different part of the codebase, and an engineer makes a change to that codebase that causes the feature computation to fail. How would you even know? Maybe it's failing quietly, returning a negative value for example, something that doesn't throw an error. It really begs the question, as the authors of this paper ask: do you know when your data gets messed up? Who knows? The other bug: a healthcare company was monitoring patient outcomes, taking vitals through a mobile logging app, and one patient's phone battery was slowly dying, so the app stopped sending data in. This is the complete opposite failure. It's not that the data is changing, it's that the data stays the same when you pull the latest data. Do you know when you stop getting data?
It seems like a nightmare to catch all these different bugs, and I feel like we've had this sentiment in building and debugging ML pipelines: when we train an ML model and it works well and we feel satisfied that it accomplishes what we want, unfortunately we throw the model over the wall, if you've heard that phrase before, and it becomes some sort of ops problem. And what does it mean to have an ops problem now? It means you have an engineer out there just sitting and collecting bugs, getting super overwhelmed, and most of these ML failures have nothing to do with machine learning. They have to do with some data corruption in the pipeline, some engineering bug as I mentioned earlier, but still someone is tasked with identifying that and trying to fix it as quickly as possible. And as Jeremy mentioned in the previous talk, when you try to fix things very quickly you tend to use rules, you tend to use other things off the shelf. You don't want to retrain a model for every bug you get, so you end up stitching together all sorts of tools, rules, and filters, and it creates something called a pipeline jungle. I really love this term, because it's a jungle not just of tools, it's also a jungle of people.
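To make "do you know when your data gets messed up?" concrete, here is a minimal sketch of the kind of check that would catch both failure modes described above: a feature silently going invalid, and a source that simply stops sending. The column names and thresholds are hypothetical, not from the talk.
```python
from datetime import timedelta

import pandas as pd


def check_feature_health(df: pd.DataFrame,
                         max_silence: timedelta = timedelta(hours=6)) -> list[str]:
    """Return warnings for two 'quiet' failure modes: invalid feature values,
    and a data source that has stopped sending."""
    warnings = []

    # Failure mode 1: an upstream code change makes a feature silently
    # return an invalid value (e.g., a negative count) instead of erroring.
    if (df["session_count"] < 0).any():
        warnings.append("session_count contains negative values")

    # Failure mode 2: the data stops arriving (the dying phone battery),
    # so "the latest data" is just stale.
    latest = pd.to_datetime(df["event_time"], utc=True).max()
    if pd.Timestamp.now(tz="UTC") - latest > max_silence:
        warnings.append(f"no new events since {latest}")

    return warnings
```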
This diagram gives me anxiety. I think it also gives Vicky anxiety. I'm not even going to go through it, I'm just going to put it out there. And this diagram also gives me anxiety: it's a jungle of tools. When you have all of these tools, one very terrible but natural thing is that you feel like you're getting sold snake oil. At least I've felt like I've been getting sold snake oil. Last year I went to a networking event and someone was telling me about a new tool where all you have to do is add decorators to your functions, set their dependencies, set your schedule, and in just a few lines of code you get a production ML pipeline. And I'm sitting there thinking, okay, I don't even know if I want to productionize anything yet; all I have is a Jupyter notebook that I might want to productionize. At another networking event, someone was trying to sell me feature stores, and while they might be useful in a lot of cases, I was very skeptical. I was very happy with my table of features that I was just appending to on a schedule. I didn't really know why I needed a feature store, but I felt like I was being sold one a lot. Maybe that was just me at a startup, but I'm sure a lot of people have had similar experiences.
But it's not just the feeling of being sold snake oil that's bad about pipeline jungles. There are lots of reasons why pipeline jungles suck: onboarding sucks, there are tons and tons of bugs that come up, and it takes forever to diagnose a bug when you don't even know where it is. And, at least in my opinion, it requires a PhD or the equivalent in experience, years of work and accumulated expertise, just to have any hope of maintaining these pipelines in production. You can think of many, many more. The goal of this talk is to give you, a data practitioner or software engineer without a PhD in machine learning, someone who isn't training machine learning models but is working on ML-powered software, maybe a new framework for interpreting ML bugs. Spoiler alert: it has to do with materialized views.
Before we get into that new view of ML bugs, I want to talk about how we get to pipeline jungles in the first place. I have this section called "from Jupyter notebook to pipeline jungle," and I took this example from a paper that I wrote earlier this year. The first step in ML is just to identify: is ML even the correct tool to solve this problem? Can you actually train a model that gets good performance on some set of data? So maybe we open up a notebook, we experiment, we do EDA, exploratory data analysis. This example is also just taken from the internet, scraped from GitHub. We create this notebook, and maybe we do decide that the model outputs something reasonable and we want to go ahead and productionize it. We have different components in this notebook workflow. So what does it mean to productionize something like this? One mode of production ML I call single-use production ML: I want to take the results from this workflow and present them to someone else, or have them implicitly inform business decisions.
So maybe I clean it up, get rid of my EDA, only include the relevant code, submit the results, and I'm happy, I call it a day. That's great. But there's another mode of production ML, which I'll call multiple-use production ML: I want to run this on a regular basis as the underlying data changes. I want to run it regularly, I want it to continually provide value, and I want to do it with some form of low latency. So I take the same thing and construct a directed acyclic graph, a DAG, out of these different nodes. I don't have the labels here, but it's the same data-loading step and the same model-fitting step. In multiple-use production ML there are also other stages we need, especially if we're getting continually changing data: we need data cleaning or some sort of validation, and maybe an inference node that, quote unquote, serves the model. And very, very quickly, you go from something like this to something like this. This is taken from an Uber blog post from a while back that was motivating feature stores. The idea is that you have this DAG of nodes that need to run in order to serve real-time, or at least low-latency, predictions: you have your model, your data preparation, and your inference. Maybe you start out with something like this just to meet your online requirements. But as ML engineers observe new bugs coming in over time, they need to react quickly. So how do you patch this whole pipeline up to protect against failure modes in the future?
In an interview study that I did quite recently, there's a quote I love. One team had deployed a customer-service chatbot, and someone would ask the language model something like, what time does the business open, and the model would hallucinate a time, 9am, something that looks right on the surface. But if you look at the website of that business, it's wrong; the model is making up times. So the ML engineer responsible for this filtered the output: if they detected a time in the reply, they filtered the reply and referred the user to the website, rather than trying to fine-tune the model on every single business's web page (a sketch of such a filter is below). So what ends up happening is you add filters after inference, and sometimes even before the model, if you can detect that the query asked for something like a time. You can see how these pipelines end up as this amalgamation of ML models and filters.
Now I'll argue that this kind of DAG is not really great. It gives us, meaning ML engineers, a headache for a lot of reasons. Current ML pipeline DAGs suck because, for one, they require low-level scheduling. For every task in that diagram, I need to set a schedule, maybe with Airflow or some cron thing, and I need to make sure we materialize every single task's outputs on that schedule (there's a small sketch of this below). Most of these tasks run on different schedules, and it makes sense that they do: data ingestion, for example, might run every day, but model retraining might run every week. Maybe the retraining step itself takes several weeks, so it only runs every month.
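To make that chatbot patch concrete, here's a minimal sketch of a post-inference filter. The regex and the referral message are my own illustration, not the team's actual code.
```python
import re

# Deliberately simple pattern for times like "9am", "9:30 AM", or "17:00".
TIME_PATTERN = re.compile(r"\b\d{1,2}(:\d{2})?\s*(am|pm)\b|\b\d{1,2}:\d{2}\b",
                          re.IGNORECASE)


def filter_reply(model_reply: str, business_url: str) -> str:
    """Post-inference filter: if the model's reply mentions a time, don't
    trust it; refer the user to the business's website instead of risking
    a hallucinated opening hour."""
    if TIME_PATTERN.search(model_reply):
        return f"Please check the current opening hours at {business_url}."
    return model_reply


# filter_reply("We open at 9am!", "https://example.com")  # -> referral message
```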
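And on the low-level scheduling point: whether you use Airflow or cron, the shape of the problem is a set of per-task cadences like the hypothetical ones below, where keeping the materialized outputs mutually consistent is left entirely to the engineer.
```python
# Hypothetical per-task schedules for a pipeline like the one above (cron syntax).
# Each task's outputs have to be materialized on its own cadence.
SCHEDULES = {
    "ingest_data":        "0 2 * * *",  # daily at 02:00
    "clean_and_validate": "0 3 * * *",  # daily, after ingestion
    "build_features":     "0 4 * * *",  # daily, after cleaning
    "retrain_model":      "0 5 * * 0",  # weekly, Sunday mornings
    "batch_predict":      "0 * * * *",  # hourly, with whatever model is newest
}

# The consistency questions land on whoever maintains this:
# if ingest_data fails on Tuesday, does Sunday's retrain use stale data?
# if build_features changes, do we backfill the features the model was trained on?
```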
So, given compute requirements, and organizational requirements too, these tasks end up running on different schedules. And when you have these low-level DAG requirements, people have to handle consistency on their own. What do you do when a task fails? What do you do when someone changes the code? Do you go and backfill old outputs? All of these questions come up, especially when you have rotating ML engineers; that's an incredible amount of knowledge you need to share just to make sure this works. It requires constant babysitting and monitoring, which absolutely no one wants to do.
Now I'll argue that these DAGs are the way they are because they're quite training-centric. What does training-centric mean? It means that the goal of the workflow, at the beginning, is to understand whether a model can achieve good performance. Everything is about the model, be it data-centric, model-centric, whatever: it's about getting a model with good validation-set performance. In this training-centric approach, recall the Jupyter notebook I showed you earlier, you do some data preparation, you run experiments, and this is where you spend most of your initial time. Can you even get a model that will work? Is ML a good fit here? Once you're satisfied with that, you take the best model artifact and throw it over the wall, hand off the artifact for deployment. What does this look like in the pipeline setting? It means your training job, your retraining job, is written first. Your predict job is written second, your data preparation and cleaning is written third, and finally all the online stuff, the queries themselves, is an afterthought. This is not great.
So maybe I'll present an alternative, idealistic view of how we would want to do machine learning: the query-centric approach. What does this look like? It's a way to forget how you've been taught ML in the past, to forget that training-centric approach. The completely different way of thinking about using machine learning in software systems is to think from the perspective of the query. When a new query arrives, we want to retrieve the historical examples that are similar to it, or the historical examples with the same schema, fit a model to those historical examples, maybe all of them, and then use that model on the new example to produce a prediction and return it to the end user (there's a sketch of this flow below). I'll keep this diagram up for a bit. I know it sounds crazy, the idea of training a different model for every query, but conceptually this is how we want to think about machine learning: when you get a new data point, fit a model to your existing data points and return a prediction. We can't do that. We're not even close to this yet; obviously this is the highest-latency policy you could ever imagine. There's a huge gap between these training-centric and query-centric worlds, and I'm not going to argue that you should be doing this. The goal of research, really, is to figure out how we can move closer to this world. But what is the gap between the training-centric and query-centric worlds?
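Here is a minimal sketch of that query-centric flow, using nearest-neighbor retrieval as a stand-in for "retrieve the historical examples similar to the query." The use of scikit-learn, the value of k, and the model class are all illustrative assumptions, not anything prescribed in the talk.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors


def query_centric_predict(x_new, X_hist, y_hist, k=500):
    """Idealized query-centric flow: when a query arrives, retrieve the
    historical examples most similar to it, fit a model to just those,
    and answer the query with that model. Conceptual only: as noted above,
    this is the highest-latency policy imaginable."""
    # 1. Retrieve the most similar historical examples.
    nn = NearestNeighbors(n_neighbors=min(k, len(X_hist))).fit(X_hist)
    _, idx = nn.kneighbors(np.atleast_2d(x_new))
    X_sim, y_sim = X_hist[idx[0]], y_hist[idx[0]]

    # 2. Fit a model to the retrieved examples (or to all of them).
    model = LogisticRegression(max_iter=1000).fit(X_sim, y_sim)

    # 3. Serve the prediction for the new example.
    return model.predict(np.atleast_2d(x_new))[0]
```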
In the training-centric world, fitting the initial model is really an experimental process. Can we even get a good model? What does the best model even look like? What is a good, minimal set of features? All sorts of questions like that. In the training-centric world, the model itself is the artifact. Whereas in the query-centric world, the training set, the store of all examples, is mainly the artifact we want to manage in production over time. In the training-centric world, your tasks are recomputed inconsistently: data preparation runs differently than model retraining, maybe both in batch, offline, on different schedules. But in the query-centric world, once you get a new query, all of your data is as fresh as possible, it's clean, and we have those kinds of guarantees. You've seen this word consistency, consistent data, all over my slides, so this smells a lot like a data management problem. What does it mean to have consistency in DAGs?
So now I will pitch ML engineering to you as materialized view maintenance. Don't worry too much if you don't remember what views and materialized views are; I'll do a very brief refresher. Suppose you have a table, something like this. You're all familiar with this data structure of rows and columns stored in your database. A view is a virtual manipulation of this data. It's defined as the result of a query, but it's not stored separately; you can query the view as you would query a table, and the output of the view is only computed when it's queried. So then the question becomes, what do materialized views even mean, and how do we do that? Materialized views are stored in the database, and they're computed when you initially define the view and on updates to the base table. So every time I add a row to the base table, I recompute my view. There are a ton of open problems in materialized view maintenance.
So you're wondering, how does this apply to ML pipeline land? Materialized views in the literature are really just some form of derived data. Same thing with ML pipelines: when I issue a query to an ML pipeline, that prediction is the result of transformations of the data. So in the ML pipeline world, we can think of the views as the training sets and the models that we are creating, and view maintenance, maintaining these views, is this crazy DAG that we're building to update the training sets and models as we get new data points. Views can be maintained, updated, in all sorts of different ways. You can do it immediately. You can do it in batch, which is deferred view maintenance; retraining is almost always done in a batch setting. You can materialize the view on every update to a base table. You can materialize the whole thing from scratch, so retrain the whole model from scratch, for example. Or you can write custom operators to incrementally materialize them, which is super, super tedious, because you have to maintain some state and define what to do on each update, and that's different from the initial computation, it requires custom logic.
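As a tiny illustration of those maintenance policies, here is a pandas sketch. The "events" table and the "spend per user" view are made up; the two functions mirror "rematerialize from scratch on every update" versus "write a custom incremental operator."
```python
import pandas as pd

# Base table: raw events.
events = pd.DataFrame({"user": ["a", "a", "b"], "amount": [10.0, 5.0, 7.0]})


def spend_per_user_view(base: pd.DataFrame) -> pd.DataFrame:
    """The view definition: a query over the base table. As a plain (virtual)
    view, this would be recomputed every time someone reads it."""
    return base.groupby("user")["amount"].sum().to_frame("total_spend")


def insert_and_rematerialize(base, new_rows):
    """Policy 1: rematerialize the whole view from scratch on every update
    (like retraining the whole model from scratch on every new batch)."""
    base = pd.concat([base, new_rows], ignore_index=True)
    return base, spend_per_user_view(base)


def insert_incremental(base, view, new_rows):
    """Policy 2: incremental maintenance. A custom operator folds the new
    rows into the existing materialized view. Cheaper, but now you own the
    update logic and its state."""
    base = pd.concat([base, new_rows], ignore_index=True)
    view = view.add(spend_per_user_view(new_rows), fill_value=0.0)
    return base, view


view = spend_per_user_view(events)  # initial materialization
new = pd.DataFrame({"user": ["a", "c"], "amount": [3.0, 2.0]})
events, view = insert_incremental(events, view, new)
print(view)  # total_spend: a=18.0, b=7.0, c=2.0
```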
So, food for thought: what is low latency here? If you maintain views in batch ahead of time, reads are low latency; if you rematerialize everything from scratch each time, that's easy to code up. You can think about those trade-offs later. But the whole crux of my argument is that in these ML DAGs, you have inconsistent materialized view maintenance policies. When you're training models in development, you're working off of immediately materialized features, things that let you iterate as quickly as possible. But when we're retraining in production, since we're retraining on a cadence, that's deferred view maintenance, and the cadence kind of comes out of thin air. And when we're issuing online queries, that's not the same assumption as the immediate maintenance we had in development. When you're serving in prod, maybe some of your features are immediately materialized, computed online, and maybe you're joining them with some batch features. So you have this hodgepodge of immediate and deferred policies, plus some retrained model, all these crazy mismatched policies, and it's really no wonder that we get so many bugs. If you take anything from this talk, this should be it: no wonder we get so many bugs. It makes a lot of sense.
Now, I know I have only a few minutes left, so I'll finish in two minutes, I promise. If you recast ML bugs as view maintenance challenges, you can think of some of them as view staleness problems. When you're materializing outputs offline in batch, you might get train-serve skew; you need to make sure your model is retrained frequently enough. In the interview study that I did, I love one of the quotes: the retraining cadence was just finger to the wind. I almost never see a principled way of figuring out what the retraining cadence should be. You also get feedback delays: if humans are in the loop labeling data and labels only come in every couple of weeks or so, that will make your training sets stale, and your models will be stale even offline. So how do we reason about that? Then we also have view correctness problems. If I run inference on bad-quality data, or retrain a model on bad-quality data, all ensuing predictions from that model are also going to be bad. I have a paper on that coming up soon. And data errors really compound over time: if you have an error at ingestion, the error just grows as you move through the pipeline. So you've got to implement data validation and monitoring at every stage in your pipeline (a small sketch of such a check is below). And finally, there are bugs that arise from the dev-prod gap. This is stuff that I never thought about when I was an ML engineer. Was my validation at development time equivalent to how I served in prod? For this retraining cadence, did my validation set have the same number of examples that I would serve in production, the same representation, the same subpopulations, and so on? Was my validation set at development time a contiguous sample of production queries? That was almost never the case at the company I previously worked at; I almost always saw random train-test splits, and that's not the same way you would monitor performance in production (see the split sketch below).
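A minimal sketch of what "validation at every stage" can look like, with hypothetical column names and expectations. In practice you might reach for a data validation library, but the core idea is just assertions between stages so bad data fails loudly before it compounds.
```python
import pandas as pd


def validate_features(df: pd.DataFrame) -> pd.DataFrame:
    """Run between pipeline stages so bad data fails loudly here instead of
    compounding into bad training sets and bad predictions downstream."""
    assert not df.empty, "stage produced no rows"
    assert not df["user_id"].isna().any(), "missing user_id"
    assert df["age"].between(0, 120).all(), "age out of range"
    assert df["label"].isin([0, 1]).all(), "unexpected label values"
    return df
```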
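And on the dev-prod gap, here's a sketch of the difference between the random split I usually saw and a contiguous, time-ordered holdout that looks more like how the model is judged in production. The synthetic data is purely for illustration.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical production log, sorted by event time.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "event_time": pd.date_range("2023-01-01", periods=1000, freq="h"),
    "feature": rng.normal(size=1000),
    "label": rng.integers(0, 2, size=1000),
})

# What I usually saw in development: a random split. It leaks "future" rows
# into training and doesn't mirror how the model is judged in production.
train_rand, val_rand = train_test_split(df, test_size=0.2, shuffle=True, random_state=0)

# Closer to production monitoring: hold out the most recent contiguous slice
# of queries as the validation set.
cutoff = int(len(df) * 0.8)
train_time, val_time = df.iloc[:cutoff], df.iloc[cutoff:]
```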
And finally, are you verifying this dev-prod equivalence when you're promoting from development to production, like in your CI? I'll almost skip through this, but these are really just tips and tricks if you're an ML infrastructure engineer. Validate everywhere, as I mentioned before. Version data, training sets, and code together, and make it super easy to check out old versions: if I time traveled back to last week, could I get a view of everything in the pipeline as of that week? This is super hard to do, but it's very important to treat debugging and provenance as first-class citizens (a small sketch of the versioning idea follows). And with that, thanks so much.
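To make "version data, training sets, and code together" concrete, here is a minimal sketch of a per-run manifest, assuming a git repo and hypothetical file paths. Real setups would lean on a data versioning tool or model registry, but the idea is one record tying together the data, model, and code that were live at the same time.
```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def snapshot_manifest(training_set_path: str, model_path: str) -> dict:
    """Record enough to 'time travel': which training set, which model, and
    which code version were live together for this pipeline run."""
    def sha256(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    return {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "training_set_sha256": sha256(training_set_path),
        "model_sha256": sha256(model_path),
        "code_version": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
    }


# Hypothetical usage: write one manifest per pipeline run.
# with open("runs/latest.json", "w") as f:
#     json.dump(snapshot_manifest("data/train.parquet", "models/model.pkl"), f)
```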