The physics of data - Juliet Hougland

Transcript generated with OpenAI Whisper large-v2.

Thank you all for having me here today. I'm super excited to be talking about the physics of data, two of my favorite subjects combined. So let's talk about how I actually got here. I got into computers through physics and math. The world of physics and math is so wonderful because it's modeled through equations that reflect what's going on in the world beautifully. You don't actually need to get into a lot of the intricacies of how a computer functions when you're doing those computations. When I was working in the world of physics and math and thought about working in industry, I really had this image of: oh, you don't really need to understand how a computer works, you're just working on algorithms, making models, and letting the computer compute. But then you join industry, and horrifying reality hits you one day. Someone comes up to you and they're like, why is your job so slow? Why does it cost so much? Or maybe you're trying to escape from the dinosaurs that are loose in the theme park you're in. And you realize you really do need to know how a computer functions to actually do your job.

So let's start with relative orders of magnitude, because the orders of magnitude of these operations really drive the constraints of the system. In 2012, Jeff Dean published a list of numbers called Numbers Everyone Should Know. I have it here on the left. It's kind of small, so I pulled out some of the operations that I think are interesting. There are a few things to notice. Number one, it takes a while to send data somewhere over a network and get it back, and it takes longer the longer the distance: compare a round trip within the same datacenter to a packet going from California to the Netherlands and back to California again. Reading data is also slow, but a sequential read of approximately a megabyte off disk takes only about twice as long as a disk seek, the time it takes to actually go find the data on the disk somewhere. And then I've summarized the CPU operations, the things that are close to the CPU and RAM, like an L1 cache reference or a branch misprediction in the CPU. Those are fast. So when we think about data, the first thing we want to recognize is that we want to move it around as little as possible, because moving it is incredibly slow. I've included a link down here (you can get access to my slides later) to a great reference where you can see these numbers over time. These numbers are old, they're from 2012, but it's really the relative magnitudes that I'm interested in.
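To make the relative magnitudes concrete, here is a minimal sketch using the approximate 2012 values from Jeff Dean's list (the absolute numbers have drifted since then; the ratios are the point):

```python
# Approximate 2012 latencies in nanoseconds, from Jeff Dean's
# "Numbers Everyone Should Know". Treat these as orders of magnitude.
ns = {
    "L1 cache reference": 0.5,
    "branch mispredict": 5,
    "main memory reference": 100,
    "read 1 MB sequentially from memory": 250_000,
    "round trip within same datacenter": 500_000,
    "disk seek": 10_000_000,
    "read 1 MB sequentially from disk": 20_000_000,
    "send packet CA -> Netherlands -> CA": 150_000_000,
}

l1 = ns["L1 cache reference"]
for op, t in ns.items():
    # Express everything relative to the fastest operation on the list.
    print(f"{op:37s} {t / l1:>13,.0f}x an L1 cache reference")
```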
So let's start with some very norm data processing. If you work at a company, very norm data processing means you first query data that lives in a data warehouse somewhere. Usually this is data spread across a few servers; maybe there are a bunch of different tables; maybe you have a large amount of data. Then, after some querying, you aggregate it to the point where you're going to do some math on a single machine. Maybe it's your laptop, maybe it's some math in the data warehouse or in the cloud, whatever that is. So I'm going to break this up into two parts. In the first part, I'm going to talk about querying data in the data warehouse and the implications there.

So the question is, how fast can we move data? This is kind of about the physics of data, but also about pictures that I think are funny. IBM has this photo that they are very generously letting me use; I planned ahead and asked in advance. This is four dudes moving a RAMAC 305 into a truck, because this is how you moved five megabytes at a time. Very difficult, and it can only go as fast as a truck can. Well, actually, that's not quite true: you could also load it onto a plane, and a plane would be faster. An interesting thing to note here is that the cylindrical object is actually a stack of magnetic platters that record the bits and bytes of data for this data processing machine, and that's how the data is read off. And this might remind you of something. This is a closeup of a RAMAC 305 at the Computer History Museum, where they're restoring one. This cylindrical stack of platter disks is where the database icon that you see in every architecture diagram actually gets its shape. So, much as the kids of today don't know that the save icon is a floppy disk, I didn't know these were related until someone told me. I hope you enjoy that.

You might think to yourself, oh, surely technology has improved; things have gotten so much faster. But in fact, if you have about a hundred petabytes and you need to move it into AWS, they will send you a semi truck, load all of your data into it, and drive it to their data center. So while it is much faster to move five megabytes now, there are still bandwidth limits.

So what does this actually mean for querying the data warehouse? First of all, we want to move things as little as possible, and that means doing work as close as possible to where the data lives. Netflix actually does exactly this, and I'm not going to talk about our data processing teams at all. Instead: at Netflix, you watch movies at home on your TV, and you stream those movies to your TV from somewhere. Like I just said, it's actually quite expensive and difficult to move large amounts of data around. So what Netflix has done is build its own CDN. We know that we have people all over the world who will be reading these files as we stream the data to them, a very read-heavy workload, so we feel very comfortable making copies of these files all over the world. We actually partner with ISPs, put little boxes inside the ISPs, and cache films that we think people nearby will want to see. When you request one, you actually stream it from somewhere much closer, so the distance we're sending data is much shorter. I think this is fascinating because it's a read-heavy workload, much like analytic workloads: it doesn't really matter that there are copies, because you're not trying to make changes or run transactions against them. This is exactly the concept of data locality, just distributed all over the world.

So let's talk a little bit more about data locality. Data locality means a few things. One is that you've organized your data really well, so that your processing happens close to it: good data organization. But also, the best kind of data is less data than you have today. Data is wonderful, great, but maybe think about not storing lots of it, both for data governance and data privacy reasons, but also because of what reads cost. Reading one megabyte sequentially off disk is so slow that you have plenty of spare CPU cycles for decoding and decompressing relative to the read itself. So good encoding and compression can really help us store literally less of it.
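Here is a toy sketch of that trade-off (illustrative only, not how any particular warehouse does it): a low-cardinality column dictionary-encodes down to one byte per row, and then compresses further, trading cheap CPU cycles for expensive disk reads.

```python
import zlib

# A low-cardinality "country" column, one value per row.
rows = ["US", "CA", "MX", "US", "US", "CA"] * 100_000

raw = ",".join(rows).encode()      # the column stored as plain text
compressed = zlib.compress(raw)    # general-purpose compression

# Dictionary encoding: store each distinct value once, then one small
# integer code per row.
dictionary = sorted(set(rows))
codes = bytes(dictionary.index(v) for v in rows)
compressed_codes = zlib.compress(codes)

print(f"raw text:         {len(raw):>9,} bytes")
print(f"compressed text:  {len(compressed):>9,} bytes")
print(f"dictionary codes: {len(codes):>9,} bytes")
print(f"compressed codes: {len(compressed_codes):>9,} bytes")
```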
Let's talk about what a hard drive actually looks like. In the RAMAC 305, we saw these magnetic platters. Nowadays, at least on a laptop, you'll get an SSD. Whole other story, but also, asterisk, kind of not: sequential reads are still faster on SSDs. On a hard drive, we have this magnetic disk and a reader head, that yellow area, which moves to the position it needs to read. The disk is spinning, and the head has to wait until the right position comes around again to begin the sequential read. So really, we're trying to move the reader head around as little as possible and just read sequentially.

That brings up the question: if we want to read sequentially, what are the workloads we're usually running? Let's talk about table layouts. We have a few tables, with rows and columns. We might want to aggregate: what is the average of column B? Maybe that's how much a certain transaction costs. Maybe column C is what state or country people are in, and we want to group by it. So we're doing scans over these columns. There are a few different ways we can think about actually storing this data in a file. One is to take every row and store the rows in sequence; another is to take every column and store the columns in sequence. Here we can see that if we stored the columns in sequence and wanted the average of B, we could do a sequential read, because all the values of B would sit immediately after one another.

So this gets us to: hey, have you heard of Parquet? Why does Parquet exist? It's often the default file format that you'll see all over the place in larger data warehouses, but the principles behind it apply in many situations. First, it allows a few different kinds of compression, well, three very useful lossless compression algorithms. It has a few different types of encoding, and both the encoding and the compression can be applied at a column level. That is incredibly useful for us because, again, we're usually doing column scans. And it's column oriented, because it's organized column-wise. Again, asterisk, kind of: it actually has the concept of row groups, but that's neither here nor there. I've got 20 minutes, time to move on.
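As a small, hedged example of what that buys you (pyarrow here; the column names just mirror the B and C columns above, and snappy is one of several lossless codecs Parquet supports):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Column B: transaction amounts. Column C: a low-cardinality "state"
# column that dictionary-encodes well.
table = pa.table({
    "b": [3.50, 7.25, 1.99, 12.00],
    "c": ["CA", "OR", "CA", "WA"],
})

# Encoding and compression are applied per column inside the file.
pq.write_table(table, "transactions.parquet", compression="snappy")

# Column-oriented layout means we can read only the column we're
# averaging, instead of scanning whole rows.
b = pq.read_table("transactions.parquet", columns=["b"])["b"]
print(pc.mean(b))  # -> 6.185
```

This is the same idea that lets a query engine skip every column (and, via row groups, every chunk of rows) that a query doesn't touch.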
Second section: we're going to do math on a single machine. Great, we have some concept of how we spread our data around in a data warehouse and process it there. Now we're going to do something more mathematical: single-machine computation. Probably the thing (apologies, I'm getting a bit of a cold) that comes up the most when thinking about computation on a single machine, once you've already read the data off of disk, is Moore's law. It comes up all the time, right? The density of transistors just keeps increasing. But I hate to break it to you: Moore's law is topping out. We're getting to the end of it. So what are we going to do about it, exactly?

Well, first, we're going to use wider SIMD units. SIMD units are a special type of processing unit inside a CPU. Normally in a CPU, you have registers: data comes in, an instruction is applied, and the result is written out somewhere. So say you want to take a number, put it in a register, apply an instruction like "add three to this," and then write the result out somewhere else. That's roughly three CPU clock cycles. SIMD lets you do that with groups of registers: it's like four registers and exactly the same instruction applied to all four. That's single instruction, multiple data. You apply the same instruction to these vectors, which again sounds a whole lot like the sequential story: do a bunch of things at once, in parallel. The wider the SIMD unit, the more registers it has, and the bigger the vectorized calculations you can do. You can also have multi-core processors. I think people talk a lot more about multi-core processing, so I'm going to skip it in this talk and go a little deeper into SIMD. Or you can have a higher clock frequency.

And this is where it gets fascinating and challenging, when you look at what constrains CPUs. There are a number of constraints. One is purely the size of the transistors. The second is the amount of power you need to actually do the computation. You don't want to spend a lot on energy; I mean, you already do, but you don't want to spend more than you have to. But also, when you run power across these transistors, they give off heat, and you might melt them, like literally. So data centers use a lot of HVAC, and their costs are both powering all of their computers and cooling them so nothing melts. Wider SIMD scales linearly with power, multi-core processors scale quadratically, and higher clock frequency scales cubically. So while all processors have kept increasing in all of these directions, wider SIMD is where the power trade-off makes the most sense.

Cool. Is that the only thing we can do? Absolutely not. There are lots of really cool processors that people have made to make things a little better, faster, and specifically less power-hungry. That's partly for financial reasons, but if you want another incentive: if you like polar bears in the Arctic, you should use less energy. One option, which I've just talked about, is wider SIMD units in CPUs. Second is GPUs. These are often used for video games or fancy deep learning things. They're interesting because they're still fairly generic, in the sense that they can use multiple types of instructions, so they're somewhat flexible. And then there are ASICs, application-specific integrated circuits. The fanciest example, and probably the one I think is fanciest because it's related to deep learning, is Google's TPUs. These are tensor processing units, and literally all they do is linear algebra all day. It's matrix multiplication after matrix multiplication. And there's a really great power trade-off. That's all I can say about it, because deep learning is too not-norm. I'm done. Great.

So this is a picture of what a data center looks like. This is a Google data center in The Dalles, Oregon, specifically where the power grid meets it. Power is a real concern; these things cost a lot.

So we're going down the wider-SIMD track. How are we going to do this? There are really three ways that I think are relatively common. One is using vector primitives in C: in the language, there are vector primitives that let you access SIMD vectorization. I don't think it comes up that much, but it's possible; I personally don't write a lot of raw C. Another, which all of us have probably used behind the scenes without knowing it, is LAPACK, BLAS, all of these linear algebra libraries written forever ago that are hyper-optimized and able to use the vectorization capabilities of the platform they've been compiled against.
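For example, a single NumPy matrix multiplication drops straight into whichever BLAS your install was compiled against (a minimal sketch; which library that is varies by platform):

```python
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

# One call, dispatched to the underlying BLAS (OpenBLAS, MKL,
# Accelerate, ...), which uses the CPU's SIMD units under the hood.
c = a @ b

# Show which linear algebra libraries this NumPy build is linked against.
np.show_config()
```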
And this is actually why a lot of Python packaging can be a little tricky: it's not just Python versions that we're managing, it's also these underlying C and Fortran libraries. And then the third way is LLVM. LLVM is a compiler toolchain that lets you take an intermediate representation of compiled code and optimize the assembly output. So let's talk about LLVM for a second. Impala is a distributed query engine written in C++. Skye Wanderman-Milne, an incredible engineer, worked on LLVM code generation for the queries that get submitted to Impala, with huge performance gains; there's a link to a talk she gave about it. You can also use LLVM-backed code from Python. This is super common. If you yourself are writing a scientific-computing-oriented Python library and you're like, no, I'm not going to use SciPy or NumPy for some reason, you could have your own bindings to these libraries. But much more frequently, you're going to try to install something, and if you install it correctly, this will just happen. These are the install instructions for a library called Ax. It's a Bayesian optimization library built on PyTorch, and the instructions say: hey, be careful about the order you install things in, because if you do it correctly, it'll be significantly faster, and here's why. So I think that's cool.
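As one concrete way of reaching LLVM from Python (Numba is my example here; the talk points at LLVM bindings generally rather than naming a library): Numba JIT-compiles numeric Python functions through LLVM, which can then vectorize the hot loop.

```python
import numpy as np
from numba import njit  # Numba compiles Python functions via LLVM

# fastmath=True lets LLVM reorder the floating-point reduction, which
# is what allows it to emit SIMD instructions for this loop.
@njit(fastmath=True)
def dot(xs, ys):
    total = 0.0
    for i in range(xs.shape[0]):
        total += xs[i] * ys[i]
    return total

xs = np.random.rand(1_000_000)
ys = np.random.rand(1_000_000)
print(dot(xs, ys))
```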
In 2007, Gordon Moore was invited to the Computer History Museum for an interview, and he had this quote, which I think is incredible: "The fact that materials are made of atoms is the fundamental limitation, and it's not that far away. We're pushing up against some fairly fundamental limits, so one of these days we're going to have to stop making things smaller." And I hate to break it to you, ladies and gentlemen: that day is very close. If you look at transistor size in nanometers, in 1971 it was about 10,000. This year, transistor size is about three nanometers, and a silicon atom is 0.2 nanometers. So if we look at transistor size in terms of silicon atoms, it used to be about 50,000 atoms across, and now we're at about 15 atoms. Trying to get electrons across that begins to make us take quantum effects into account, let alone the amount of power it takes to make these transistors function. It's predicted that next year we'll get down to two nanometers, which is, again, incredible, but we're hitting a real physical reality. What's kind of exciting: sure, quantum computers are being built, and this is amazing. I don't see math immediately happening on them; it seems like that would be very expensive. But again, this is a new boundary of computation. Totally different talk, because, again, 20 minutes.

So the question is: data, and computation with data, happens on computers. How do you build your skills? How do you learn about this? How do you get better at understanding what is going on when you run into a problem? Well, I think this is a skilled trade. Computers are not magical; learning is possible. You just need to figure out how to go about doing it. There are a few books, well, resources, that I think are excellent. One is Wizard Zines by Julia Evans (b0rk): great for the little Unix tools, extremely fun, delightful to read, strong recommend. And the second is Designing Data-Intensive Applications by Martin Kleppmann.

I'd be pretty interested to hear in Slack what other ways you all have best learned, or resources you've found. But again, we're skilled craftspeople, and practice is what makes perfect. So if something happens, like your job runs out of memory, don't just double the memory and call it good. Why did that happen? Form a hypothesis, test it, dig into it, learn to monitor your system. And what if you're like, I don't know how to monitor my system? Again, reading is one way. But my absolute favorite way to learn how these systems work is to pair with someone who's experienced and knows what's going on, when they're doing some kind of performance debugging (they're like, oh, this job is slow, what are we going to do?) and see what tools they use. I've learned incredible tool sets from people that way, and you pick up little things here and there that add to your speed and ability to get things done. Thank you.