Episode 33
How Does Modern AI Work?
November 29th, 2023
53 mins 34 secs
About this Episode
The panel dissects the world of machine learning and artificial intelligence. They unravel how machine learning algorithms function, using the game of Pac-Man to break down complex topics like reinforcement learning and reward-based systems. Mike dives deeper into reinforcement learning, explaining how it operates on reward signals that assess the value of a series of actions over time. Dave chimes in with his insights on AI's ethical considerations, arguing that lacking a moral framework could lead to unintended and potentially harmful outcomes.
They navigate the expansive "state space" concept in machine learning, using chess as an illustrative example, and explore how understanding all possible board states and moves can help develop an unbeatable chess AI agent. This sets the stage for a pivot to natural language processing, where Taylor provides a thorough explanation of transformer architectures, including sequence-to-sequence problems and attention mechanisms.
Dave returns to his earlier point about ethical considerations, spotlighting the importance of the paper "Attention Is All You Need," which has been instrumental in developing transformer models. He then shifts to the challenge of making AI systems more explainable. While Taylor and Mike acknowledge that the complexity of neural networks makes them difficult to interpret, Dave insists that striving for explainability is crucial to making AI more accountable and transparent.
Transcript:
MIKE: Hello, and welcome to another episode of the Acima Development podcast. I'm Mike. I'm going to be hosting today. And we have a panel with us here today of Dave, and Taylor, and Francisco. Welcome, everybody. Interesting topic we're going through today. And we're going to approach it a little bit differently than we've sometimes done in the past. We're going to be talking about the fundamentals of machine learning.
Taylor is with us from data science. So, he does this for a living day in and day out.
TAYLOR: Yep.
MIKE: And has the smarts here to talk about it. I am very interested in this topic. I have been for a long time [laughs], honestly, since I probably heard about it as a kid [laughs]. So, this has been a long-standing interest of mine. And I've played around with this. I follow the topic closely, but I am not a professional practitioner. So, hopefully, that'll qualify me to at least give you an overview. And we're going to lean on Taylor [laughs] to be the expert.
Usually, we have a heavy panel discussion. We do have a panel here today, but they will maybe be doing more question-answering and filling in the blanks. The goal here...I was talking about this in our pre-call discussion...is that I'd like somebody who listens today to walk away and say, "You know, I have a basic idea of how this works. I might not be able to implement it, but it's not magic, you know, it's not some dark magic where somehow these machines can think." You'll have an idea of how this might work.
I wanted to start by telling a story. Grounding things in real-world stories is usually very helpful to help us conceive of how things work. The story is about a hike I took a number of years ago. And I couldn't tell you what year it was. It was a long time ago. A friend of mine and I decided to go hiking up in the mountains, and we hiked several miles up a trail. And I really wanted to try something different. My buddy had not done a lot of hiking. So, my ambition was probably bigger than it should have been. Chris, if you ever listen to this, I'm sorry.
And we decided to go on this hike. And I wanted to try a new route that I'd never been on before because I knew that there were trails in two parallel canyons. And we were going up a canyon that was not that deep, up high on the ridge line. And there was a canyon parallel to the one we were in that was much lower, hundreds of feet down, at least 100 meters for anyone not in the United States, right? It was a long way down into this parallel canyon. And I wanted to see if we could find a route between these two canyons in an area that was trailless.
It was late spring. I thought, oh, this would be great. The weather's nice. So, we went up. We hiked up the one canyon, got up to the high spot, saw what was an easy place to drop over into the next canyon. And we went over the edge. And it turns out that there was still late spring snow, like, all the way down that canyon wall down through the woods that were there [laughs], and decided to do it anyway.
But it was smooshy snow. It was, like, up past our knees. So, we were just postholing, you know, stepping into it. You jump down in up above your knees with every step going down this hill. And it was hard work. We were just absolutely exhausted. We did, however, eventually find the trail that we were looking for. And the way we got there was really very simple. I knew that it was down, and I went that way. We had to walk down through the woods, you know, I didn't have a compass. I didn't really have a map of the area. This was kind of pre-smartphone era. This was a long time ago.
But I did know how to make my way down. And I knew the trail was at the bottom of this canyon. So, what I did was I just walked down. Now, there were certainly obstacles along the way. Again, it was forested. There were a lot of trees to weave our way through. But for the most part, everything was obscured by the snow; we just had to, at every point, say, well, which way is down? And take a step in that direction.
And when we finally got down to the bottom, there was the trail. And [laughs] we were finally free of the snow. It was wonderful. We got out of the snow, and not much further was the trail. And then things were much better, and we walked down and made our way back, you know, to the car, which was a bit of a walk because we had to go between canyons. Again, Chris, I'm sorry. But we did find out that there was a way.
Why am I telling this story? It's because you might think that machine learning is based on something really sophisticated and complex, but it's actually a lot simpler than you might think. And the basic math that it uses is something that you probably learned in either junior high or high school. So, think back about your math. And if you're not a math person, think about this mountain I'm talking about, or think about a road that you have to drive that's got a hill.
And there's really only a couple of things that define that hill. First of all, how steep is it? [laughs] Like, it matters how steep it is, and that's kind of a direction. And where's your starting point? Like, if you're starting most of the way up the hill, then you don't have very far to go, right? If you're starting at the bottom of the hill, there's a long way up. So, there's kind of two parameters that define that hill, which is how steep it is and your starting point. Now, I'm simplifying some here.
But in the math that you may have learned way back when and try to remember, you may have learned something like y equals mx plus b, which sounds really mathy, but that's really trying to tell you something simple. It's telling you that you have some input, and you're going to multiply by that and then add a number. Well, and that's still not very practical.
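Mike's "multiply by a number, then add a number" fits in a couple of lines. A minimal sketch, with made-up values for the slope and starting point:

```python
# y = m*x + b: take an input, multiply it by a number (the slope),
# then add a number (the starting point).
def predict(x, m, b):
    return m * x + b

# Illustrative, made-up numbers: slope of 2, starting point of 5.
print(predict(3, 2, 5))  # 2*3 + 5 = 11
```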
The second example I'd like to give is one I usually think about, and there's a few that we could think about. But the one I usually think about is, and that I've seen a lot online, is, let's say, that you want to predict how much a house is going to cost. You want to predict how much a house is going to cost. And you only have one piece of information to predict that, and it's the size of the property it's on. And there's probably a pattern there, right?
I would imagine that if you're on a tiny city lot, the value is probably going to be lower than on a large city lot. If you've got a house on a large city lot, that's probably a pretty valuable house because city lots are expensive. And there's going to be a lot of noise in that data, though, because that tiny city lot is probably worth a lot more than a tiny country lot. And, you know, maybe it's in a neighborhood where not many people want to live. How close is it to public transportation? I'm going to get back to that in a minute.
There's all of these other things that you could measure. But we're just going to think about one thing, the size in... we'll say in square meters or square feet. I'm in the United States; I'll say square feet. I know that [inaudible 06:07] allows a unit. Meters are better [laughs]. But it's what is commonly used in the United States. So, you know the square footage of your lot. And you want to take that one number and predict what the price of a house will be.
Now, let's say you're doing it within the city. Well, there's some simple math you could do there. Here's what I'm going to propose you do: you take the size of that lot in square feet, and you have a number, we'll say 1,000. That's a very small size, but I'm going to use a small number. You can say that 1,000, and you will multiply it by a number, by some weight we'll call it. And you'll get another number that you say that's the price of the house. But you're going to add one other thing, and that is, where does it start? Because there may be a floor to how low the prices will go.
It may be that even a tiny lot is worth something. It's still worth something more than zero. So, you've got a starting point that's higher than zero. Or maybe you're in, like, Detroit after they went bankrupt, and some property values are worth less than zero, right? [laughs] The city will pay you to take them [laughs]. Either way, there's a starting point, maybe above zero, maybe even below zero, that the property with no size would still be worth. And then you go up from there.
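That weight-times-size-plus-starting-point model can be sketched directly. The weight and bias here are invented for illustration, not real market numbers:

```python
# Toy house-price model: price = weight * lot size + bias.
def estimate_price(sqft, weight, bias):
    return weight * sqft + bias

# Made-up numbers: $50 per square foot, with a $20,000 floor
# even for a tiny lot.
print(estimate_price(1_000, weight=50, bias=20_000))  # 70000

# The bias can even be negative, like the post-bankruptcy Detroit example.
print(estimate_price(1_000, weight=50, bias=-10_000))  # 40000
```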
And I'm going to call that value...I didn't give it a name. In the machine learning world, they call it the bias. That's the starting point. And the other one, that multiplier you use, is usually called the weight. There's another name for it, though, that you may have heard in math when talking about a line: slope. It's the slope. If you want to get fancy and think about the steepness of the road that I mentioned before, they call that the grade.
And in math, if you're thinking about a complex grade where it's in multiple dimensions, they call that the gradient. But it's really just a fancy name for the slope, you know, how steep is it? And so I've got two numbers: the weight, which is the steepness, and the bias, which is your starting point. So, I'm going to look at Taylor here. You with me, Taylor? What gaps do you think I have left so far?
TAYLOR: No, I think you're doing pretty well. I mean, that's why we call y equals mx plus b slope-intercept form. It's describing lines and their slope and how they traverse through [inaudible 08:21] space. When you were talking about traversing down to that trail earlier, obviously, you're talking about gradient descent here.
MIKE: And I'm going to get there in a minute [laughs].
TAYLOR: Yeah. And one of our co-workers had a sticker that he gave when he left, and it was a ski resort called Gradient Descent or, like, Gradient Slopes or something like that.
MIKE: [laughs].
TAYLOR: I thought that was a really cool sticker. It's a fun play on words. But no, I think you're doing a great job of explaining these concepts so far. I could interject with a lot of, like, data science terms for these. But I think for explaining to a junior high student, then we're spot on.
MIKE: Well, excellent. And you're like, what does this have to do with machine learning? You're talking about the stuff I learned back in junior high.
TAYLOR: [laughs]
MIKE: Well, and here's where the next step is that we really get to magic. I'm going to go back to this idea of houses; then, I'm going to use this to tie things together. If you have ten houses that you want to do this with, and you say, well, here's the property values, and here's the houses, you could draw a picture, right?
You could say, well, this house has a property size of 30,000 square feet, and this one has a property size of 50,000 square feet. And then, you draw a picture, mapping them on two axes. Along one axis, you're going to put the size of the property; the higher the property size, the further it goes along that axis. And then, along the other axis, you're going to put the price of the house.
And if you kind of map those, you're going to see them following a pattern where it goes up and to the right, assuming that the higher values go to the right. And you could draw a line through that and probably eyeball it and get a pretty good number to say, hey, here's the line that I'm going to use to fit that data, you know, that line of best fit. There's a fancy, mathy word for things referring to lines. They say linear because it has the word line in it with an A-R at the end: line-ar. A line you can use to map the data.
And it's probably not a perfect fit, right? But lines are pretty simple to use. And one thing about lines is they've really only got two characteristics that you care about: how steep is it, and what's the starting point? There's starting to be a pattern here. How steep is it, and what's the starting point? Which you can express with that y equals mx plus b. The two numbers that you care about are how steep it is, which is the same thing, as we said before, as that weight, the number that you're going to multiply by the size of your property, and that bias, you know, where does your line start?
But what if you have a million? What if you have a million houses? Well, you're probably not going to do that by hand. It's going to take you a long time. Even with a dataset of a million, modern computers could probably figure out an exact line of best fit. But as you get to increasingly large datasets, it becomes less and less tractable.
Here's what gets even worse. What if those other data points that I mentioned also come into play? What if you also want to involve the proximity to public transportation? What about the neighborhood that it's in, and you want to treat that as a data point? What about how urban or rural or, you know, which urban area that it's in? You know, what about the age of the home? What about the company who built the home? There's all kinds of things that you can measure.
We're often used to thinking about things we can measure as being in the world around us, and we tend to think in Cartesian coordinates. We often think about, well, how far is it in front of me? How far is it to either side of me? How far is it up and down? So, we've got these three things we can measure: in front of me, to the side, and up, and we call those dimensions. But that word dimension just means things I can measure. So, I could also measure the distance from public transportation. I could measure the distance from the city center. I can measure in kind of a categorical sense categories of, you know, what these neighborhoods are. There's many things that I can measure.
And dimension is just something I can measure. Well, I could probably measure hundreds of things. And that's really hard to picture drawing [chuckles] when I've got hundreds of things. And for all of those hundreds of things, I want to find a best fit that kind of maps all of them in some complex multi-dimensional space. Well, that sounds really awful.
But the concept is very simple. I just want to have a weight for every one of them, plus a bias. And then, I multiply each input by its weight, add them all together with the bias, and get a number out of the other end. Well, that sounds like something I could do, although I don't know how to do it on paper. And if I've got a huge number of things coming in, I might not even be able to do it easily on a computer. But there is something I can do.
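In code, "a weight for every dimension, plus a bias" is just a weighted sum. All the feature values and weights below are made up purely for illustration:

```python
# Many input dimensions: one weight per feature, plus a single bias.
def predict_price(features, weights, bias):
    return sum(f * w for f, w in zip(features, weights)) + bias

features = [1_000, 2.5, 30]          # lot size, miles to transit, home age
weights = [50.0, -3_000.0, -100.0]   # bigger lot raises price; farther/older lowers it
bias = 20_000.0

print(predict_price(features, weights, bias))  # 59500.0
```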
Here's where the learning part comes in: the moment of magic. What if I do my math...and I'm going to go back to the example of just one, and I'm just going to draw a line completely random, no thought going into it whatsoever. None at all. I'm going to draw a random line. And then, I'm going to take my first house, and I'm going to figure out, like, some exact numbers. Like, what do I have to multiply the property size by to get to the price of that house? And that will give me a number for my weight that's different from the one I have because the one I have is totally random. I mean, the chance those are exactly the same is basically zero. They're different.
And I'm going to take my number that I guessed at random, and I'm going to move it a little bit, not much, just a little bit toward the value I got for the house. Remember, these weights can also be called the slope or, in a fancy word, the gradient. So, I'm going to move my gradient a little bit more toward the one that that house would suggest it ought to be. And then, I'm going to do it again with the next house on my list. And then again, and then again, a million times, or a billion times.
And if you think about what's going to happen to that gradient, sometimes it's going to go the wrong way because we have some rundown mansion that's actually not worth very much [laughs] and some little place that's actually really fancy. It skews it the other way. But if I keep on doing that a million times, just moving a little bit each time, there's going to be a little bit of bouncing around, but think about it as following that gradient, following that slope towards the right one. Over time, I'm going to chase that difference, right? The difference between the slope values, until I get something really close.
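That loop (guess a random line, then nudge the slope and starting point a little toward each house) can be sketched in a few lines. The houses here are synthetic, and the learning rate is a made-up small number:

```python
import random

random.seed(0)

# Synthetic "houses": lot size in thousands of square feet, price in
# thousands of dollars. True relationship: price = 50 * size + 20,
# plus noise (run-down mansions, surprisingly fancy little places).
data = [(size / 10, 50 * (size / 10) + 20 + random.uniform(-5, 5))
        for size in range(5, 50)]

w, b = random.random(), random.random()  # start with a completely random line
lr = 0.01                                # move "just a little bit" each step

for _ in range(300):                     # many passes over the data
    for size, price in data:
        error = (w * size + b) - price   # how far off is the current guess?
        w -= lr * error * size           # nudge the slope toward this house
        b -= lr * error                  # nudge the starting point too

print(round(w), round(b))  # settles near the true 50 and 20
```

It bounces around a bit on each noisy house, but over many passes the random line drifts toward the line of best fit, which is the whole idea of gradient descent.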
Think about me walking down that hill on that mountain. I didn't always go exactly towards the trail, I guarantee that. Sometimes, I was going around trees. Sometimes, I probably went uphill to dodge an obstacle, but as long as I was mostly going in the direction of the trail, I was changing my position a little bit each time in the right direction. And if I'm not being very watchful, maybe I'm going to pass it [chuckles] and then have to come back, going back up the canyon wall on the other side, weaving back and forth until I get down to that trail, which happens to be right at the bottom. And there's a fancy name for that: gradient descent.
But all that it really is...and you think, oh no, it's got to be more complicated than that. Well, actually, no, it's not. Because simple things are how you solve really ugly problems a lot of times. That is really what machine learning practitioners have been doing for a long time: they thought, well, okay, let's take a simple approach to this problem. Let's take my input data, multiply it by some weights, and add some biases.
And I'm just going to start with some random weights and biases. And now I'm going to adjust them a little bit with a batch of data, and then a little bit more and a little bit more and a little bit more. And I'm just going to do this a whole bunch of times. And if I do that enough times with enough data—and it helps to have a lot of data—I'm eventually going to get some pretty good weights and biases, right? It's actually going to be pretty good. And this is not a new idea. There is something called the perceptron, which followed this pattern, and it was invented in 1958.
TAYLOR: Wow.
MIKE: That does just this. Back then, computers were feeble compared with today's. They actually originally designed it in software, but then they built hardware for it. They built hardware that would do this, and they thought, this will change the world. This learning will change the world. Well, they didn't have very much data. And they didn't have, you know, a very big machine. So, it didn't really change the world that much. But it was able to learn from a set of data. Remember, this is the 1950s.
TAYLOR: That's amazing.
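Mike's description maps onto the classic perceptron learning rule: guess, check, and nudge the weights after each mistake. This is a hedged sketch, not code from the episode; the AND task and the learning rate are made up for illustration:

```python
# Classic perceptron learning rule, sketched on the logical AND function.
# Weights start at zero and get nudged a little after every wrong answer.
def step(z):
    return 1 if z > 0 else 0

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND
w = [0.0, 0.0]
b = 0.0

for _ in range(20):                        # a few passes are enough here
    for (x1, x2), target in samples:
        out = step(w[0] * x1 + w[1] * x2 + b)
        err = target - out                 # +1, 0, or -1
        w[0] += 0.1 * err * x1             # adjust each weight toward the target
        w[1] += 0.1 * err * x2
        b += 0.1 * err

results = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in samples]
print(results)  # [0, 0, 0, 1]
```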
MIKE: And these ideas have been bouncing around since then, and probably before. This idea of linear approximation, or linear regression, which is trying to find that line of best fit, has been around for a long time. Well, in the 1990s, some mathematicians started working more on this. One professor in particular, Yann LeCun, and a team of his started working on that. And they explored some of these ideas. Another important person here is named Geoffrey Hinton.
They kept thinking that this idea of using these weights and biases to solve the problem and making a whole network of them was going to be really useful. And at the time, you know, back in the '90s, there were actually a lot of people who thought it wouldn't work. Because they thought, well, that's too simple, you know, our brain really has these concepts of things. And we need to come up with relationships with things and make large databases of things. And that's how we'll really solve the problem.
The problem with that is you have a lot of different things. You've got a lot of different relationships [inaudible 17:04] them. It's just too hard to map out. The work involved in that is mind-boggling. And how do you do kind of fuzzy relationships between things? The whole thing just falls apart. It's just impossible to build that database, practically.
But there were a couple of really important things that happened between the '90s and the year...and I'm going to say, specifically, the year 2012. And here's the couple of things that happened. First, well, the internet. People started posting things online, which meant we ended up with vast amounts of data, huge amounts of data points that we could start doing stuff with. And the other thing that happened is maybe a little bit unexpected, which is the rise of computer gaming. And computer gamers really, really love to have good graphics. They want to have beautiful graphics that are very immersive.
And so, like, what does all this data and these graphics cards have to do with things? Well, first of all, it's really hard to learn something if you don't have very many examples to learn from. And in the older era, the only way you could get a bunch of examples of things was to kind of find them for yourself and write them down, and that's just impractical to do as a single team. But by the early 2000s, there were teams that started scraping the internet, and you could get a million images.
Well, now you've got a whole bunch of data. What can we do with this data to learn about it? And the second thing, with the gaming, is that gamers, in order to have these good graphics, use graphics cards, which is a specific, like, little computer in your computer. So, it's a dedicated processor that's really, really good at doing linear algebra. It's really good at doing this line math that we've been talking about. Well, they do that so they can simulate the way that light passes through space. But that also happens to match really well anytime you want to find a line of best fit for things.
TAYLOR: Light moves in straight lines. That's why that matches.
MIKE: So, some researchers started to use those graphics cards to start crunching a whole bunch of this linear algebra, saying, what if I take a whole bunch of data and run it through these graphics cards using these old techniques that have been around since the 1950s? And really, the ideas go back farther than that; linear regression predates computers. But the idea of using gradient descent on a computer has been around since the 1950s. What if we start applying these ideas we already knew to some of the input data?
And they start with a lot of input data because an image, you think it has a lot of pixels in there. But really, those are all just a whole bunch of data points. We talked about a lot of dimensions. Well, each of those pixels just has a brightness, right? And some position, you know, it's like an X and a Y position. It's got a position on the screen, and it's got some brightness. Well, those are just data points. Those are just dimensions.
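The pixels-as-dimensions idea is easy to make concrete. The tiny "image" below is invented purely for illustration:

```python
# A tiny 3x3 grayscale "image": each pixel's brightness is just a number,
# and flattening the grid turns the image into nine input dimensions.
image = [
    [0.0, 0.9, 0.0],
    [0.9, 0.9, 0.9],
    [0.0, 0.9, 0.0],
]
features = [pixel for row in image for pixel in row]
print(len(features))  # 9 input dimensions for the network
```

A real photo works the same way, just with millions of these numbers instead of nine.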
Now, if we throw all of those many, many dimensions into some of these little classifiers I told you about, where we just add a whole bunch of weights and biases that we picked at random, and then we adjust them: you throw in all of these numbers and say, "That is a cat," then this other big set of numbers and say, "That's not a cat; it's a dog." Well, you know, that's something that's almost impossible to think about doing as a person. But the graphics card does that all day, every day. And the combination of the two was amazingly effective.
And I've mentioned 2012 because, in 2012, the team of Geoffrey Hinton built an image recognition network, as they called it, that followed this process. And there's a couple of little tweaks that I haven't mentioned, which is that they did something a little tricky. Rather than predicting immediately, with just one set of weights and biases, that this is a cat, they dumped the output of those weights and biases, those numbers that come out of there, into another set of weights and biases. Like, wait, what? What does that even mean?
So, you put another set of weights and biases in between and then another set, several layers of these linear approximators, and then you get an output out of it. Well, that's kind of weird, right? I mean, why would you put in layers? What does that in-between even mean?
It turns out that that allows something magical to happen [laughs]. I'm going to keep using this word magic because it feels like it when it finally works. These in-between layers start not predicting the price of the house directly; they start predicting something maybe more sophisticated. Like, based on this input data, maybe I have some prediction about the value of the roof. Even if you didn't have that directly, they start learning that concept because you can see, like, how new it is, maybe the color of the roof. They start to learn these concepts. Go ahead, Dave.
DAVE: Right. Is this, like, literally, like, you're like, okay, this is a point six hectare, you know, acre lot. It's going to be a $180,000 house, and you're going to need to take the lead out of the paint because...right?
MIKE: So, that lead may never be explicitly specified. But some of these intermediate layers, because they don't have to directly predict the price of the house, can predict something in between. And remember, they're just starting with random initialization. So, maybe one of them actually started out pretty good at predicting the lead just by random chance. And by keeping with that and getting even closer to this lead idea somewhere in between the layers, it was able to improve the predictions of the network as a whole. And so, it sticks with it.
And with images, you can actually look at what comes out of these intermediate layers when they do this. And you see things like some of them recognize lines, some of them recognize circles. You know, they start recognizing interesting characteristics of the image that are not cat, but they might be important for cat, something like eyes, for example. Or circles lead to eyes [chuckles], which lead to face, which leads to, oh, cat. And just by throwing numbers at it and adjusting your gradients each time.
Now, there's a problem here. It's hard to directly adjust your gradient when you have multiple layers. And here's where I'm going to get a little bit more math-y. And this was some of the trick that some of the researchers figured out. I think it was particularly in the '90s but then in the 2000s. And this trick is...it comes from calculus. And this is where I'm going to get math-y for a second.
If you've ever done calculus, there's something called the chain rule. It comes up when you want to find the derivative, which is another fancy word, because mathematicians have, like, ten fancy words for the same idea of this slope. And it's a little more sophisticated because they can find it for more than just a line. But the idea is another version of this slope.
And if you want to find the slope of some complex curve that has a bunch of things happening to it, there's a rule you can actually use to work back from the final one to the first one. Well, if each of these layers in the network is just applying some lines, some weight, and a bias, well, we know how to find slope from that. And so we can work backwards through the layers using that calculus chain rule. If you know calculus, if you're interested in that, they call that backpropagation. And that was the trick they learned for training this thing when it got more and more layers. And adding those layers really ended up allowing this magic to happen.
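That chain-rule bookkeeping can be sketched by hand on a tiny network. This is a toy, not AlexNet: one input, one sigmoid hidden unit standing in for an in-between layer, one output, and a single made-up training point:

```python
import math

# Tiny two-layer network trained by hand-written backpropagation:
# the chain rule walks the error backwards from the output layer
# to the first layer.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1, b1 = 0.5, 0.0    # first layer (made-up starting values)
w2, b2 = -0.3, 0.0   # second layer
lr = 0.5

x, target = 1.0, 0.8  # a single invented training example
for _ in range(500):
    # Forward pass: two layers stacked, one feeding the next.
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2

    # Backward pass: chain rule, output layer first, then inwards.
    dloss_dy = 2 * (y - target)   # derivative of squared error
    dy_dh = w2                    # how the output depends on the hidden value
    dh_dz = h * (1 - h)           # sigmoid derivative
    w2 -= lr * dloss_dy * h
    b2 -= lr * dloss_dy
    w1 -= lr * dloss_dy * dy_dh * dh_dz * x
    b1 -= lr * dloss_dy * dy_dh * dh_dz

final = w2 * sigmoid(w1 * x + b1) + b2
print(round(final, 3))  # converges toward the 0.8 target
```

Deep learning libraries automate exactly this backward walk, just over millions of weights instead of four.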
So, in 2012, they made a network called AlexNet that absolutely clobbered a competition for image detection. And people were like, "Oh wow, I wasn't expecting that [chuckles]. We're onto something." And that's approximately how things have gone on in the decade since. So, Taylor.
TAYLOR: Yeah, I was just going to mention that there are standard datasets used to evaluate these models. So, if you ever have a great idea on how to, like, detect and classify images, you can test it on these standard sets of images and produce research using those. And I'm sure they're different today, but I'm sure you can still find the dataset that it was trained on and see how newer models perform on it.
MIKE: It is. It's called ImageNet, and they still use it a lot in training. And ImageNet happened because some professors...I think it was Fei-Fei Li. Is that her name? I think she's at Stanford. And her team got together a whole bunch of images off the internet and came up with a classification competition. And then the graphics cards gave us the power to do a whole bunch of line math of linear algebra, put them together, and out of it comes something that we never had before: the ability for machines to very effectively recognize images. And that really has kind of sparked the progress toward modern artificial intelligence and machine learning.
And after that, you know, things started happening like Siri and Alexa, you know, name your smart assistant that can understand language. Well, underneath, they were using these same kinds of linear classifiers built into what's called a deep neural network, deep because there's more than one layer. And it works. There's a couple of other things that I wanted to throw out there to kind of complete the story. But I'm going to pause here for a minute.
But I think we should have a little bit of discussion because we've covered the most important parts, which is you start with just trying to find some multiplier for your input, for your input dimension, your input value that will give you the output value you wanted. And you adjust it a little bit each time with your incoming data, and then you just pump a whole bunch of data at it. And that idea is the idea, and it's not a new idea.
But if you have the computers that can handle it and you have the data, you can throw a whole bunch of values at it, right? A million images is a whole lot more than five. And today's datasets are absolutely enormous. With some of the text datasets, you have, like, a trillion data points...I say data points; a trillion examples, maybe, that you'll throw at it.
So, there's just huge amounts of data that they can throw at really big networks that are big enough to learn a lot of these little nuances, and that's it. But the math behind it is actually not that scary in the small scale. It really is just finding that line. Now I'm going to be quiet for a bit because I've talked a lot. Like I said, this one's going to be a little bit different because I wanted to lay out these ideas.
Dave said he was going to approach this more as the, hey, tell me more, and Taylor is more of the field expert. Well, there's a couple of things that I want to visit. For example, I have not talked about gradient-boosted trees, which applies the same idea to decision trees. There's other ways they apply this gradient idea to something that's also pretty simple but is a little bit different.
And I haven't yet talked about transformers and about what might go in between these layers where you do something to get something that's a little bit more than a line. But those are all just tweaks to this core idea. So, Dave, I'm going to look at you first. What are your thoughts?
DAVE: I am just falling all over myself to preemptively yield the floor to Taylor.
MIKE: [laughs]
DAVE: Because I have lots of ideas and lots of questions, but the actual implementation details of machine learning is something so far outside my ken that this is just fantastic. I'm really, really, really digging this.
TAYLOR: Yeah, I honestly deal a lot more with, I guess, the implementation details of machine learning, like, in the day-to-day here at Acima. Oftentimes, that is far more important than the training of a model. Handling data in production, assembling that data, and making successful predictions is very difficult, and doing it correctly at scale is even more difficult. But, Mike, I really liked how you're describing neural networks. They are inspired by the human brain, and that's where the neural part comes from.
So, it's a really interesting abstraction of how humans learn. And that when we're born, we're sort of initialized with a really basic [chuckles] neural architecture that the weights really aren't well defined, like, outside of our instincts. And as we grow, we really start to develop more neural pathways, and we strengthen those neural pathways. We correct those neural pathways. And over time, we optimize them so we can be, I guess, contributing members of society and successful humans.
But, like, when I'm learning, I conceptualize it in that way as well. Like, you have to take an iterative approach. If you do things more than once, it's going to take a while to get those weights to where you want them to be and to get them optimized, so, yeah, we take the same approach with machine learning.
But really contextualizing the problem is a lot more difficult when it comes to machine learning. It's really hard to define all your input variables, and you have to structure everything very well, very consistently. You have to be very organized. I think, as humans, we kind of take that for granted. It kind of happens just almost subliminally. Like, we don't have to worry about categorizing data and storing it [inaudible 29:12]
MIKE: Because we've been building our model since we were born [laughs]. It feels like common sense because we have been practicing it.
TAYLOR: Yeah. So, when we're training these machine learning models, we have to be, like, very diligent and very specific. And if you're not careful, you're going to teach the model something that you didn't intend to because you didn't contextualize the problem correctly. And in machine learning, we say that the model is not going to generalize if the data you gave it isn't like the data it's going to see. So, that's yeah; it's just a fundamental part to training machine learning is getting that dataset correct such that it mimics the real world, it mimics what the model is actually going to be predicting on.
And we spend a ton of time here at Acima making sure that's the case. Because if you train a model and you think it's doing well, and your accuracy metrics are where you want them to be, but when you deploy, it performs totally different, like, that model is not successful at all. It might have done well in training, but it doesn't do anything for us if it doesn't do that same thing in production. And stuff like that's happened before, and that's when you have to debug and figure out where that dataset was corrupted or queried the wrong way.
MIKE: What you're saying there, it's hard to emphasize this point enough. I'm going to go back to the thought that this perceptron was named in 1943. But we didn't have this big AI explosion until after 2012. And remember, there are two halves to that: first one was getting the computers that could handle it, and the other one was getting the data and [laughs] everything Taylor is talking about here. That work of getting a good dataset was what it took 70 years to get together.
TAYLOR: Yes, it truly is the culmination of several different domains, like, really just getting to the point where they can mesh and do things really well. I think a lot of what we consider machine learning is just classical statistics, and it's been in use for decades at this point. We joke about the term AI in our department, and we think it's kind of funny because it's kind of just a buzzword. And artificial intelligence often isn't that intelligent. Like you said, they're fairly basic concepts that are playing out.
These newer forms of large language models are closer to what I would consider AI and more of the Sci-Fi stuff. We're still not there yet. But yeah, classical machine learning has been around for quite a while. We've had actuaries for a long time, and statistical methods have been used to underwrite risk for decades.
MIKE: I remember as a kid, a good friend of mine, his dad was a statistician, and I thought, wow, that sounds so boring. And I wanted to go into other areas of science. It turns out now, like, I work with that kind of thing for a living, and it's so cool [laughs]. And he went into other areas of science. We kind of swapped roles. But it's been around for a long time. We've just gotten better tools for applying it.
And I want to say kind of repeatedly that the basic ideas here...if you've been scared, like, I want to learn this, but, man, that sounds too complicated, but they use a lot of fancy symbols and stuff in the papers and sometimes, you know, it takes a while to wrap your head around. But the ideas here, at their fundamental, like the basics of what's going on in machine learning, are not that complex. It's mostly these weights and biases.
TAYLOR: Yeah. To add to your point earlier on, like, actually implementing these, the great thing about all these machine learning algorithms is that they're packaged really well, especially in Python. You'll see almost zero math notation when you're training a machine learning model, other than how you choose to define your variables and methods. But it's incredibly easy to sit down and get started with machine learning. There's tons of datasets online that you can play around with. And there's standards that people have come to, prediction standards on those datasets. So, you can get an idea of if you're doing something completely wrong or you're hitting the nail on the head.
But yeah, they're fairly easy to implement. We actually use a hosted solution to train and deploy all of our machine-learning models; it's DataRobot if you guys are familiar. And there's dozens of machine learning algorithms packaged up in there. It handles all your training data and iterates through different training methods, and we do some work to select the best models in there and deploy them there.
But tools like that are great, and, like, now is the best time to get into machine learning and artificial intelligence because there truly is just so much material out there to learn, just countless hours of Khan Academy or YouTube videos. Yeah, if you're looking to get into machine learning, you shouldn't be too scared.
I'm trained as a computer scientist. So, I don't have, like, a specific data science degree or even a math degree like some of the co-workers here at Acima. But, I was able to make the transition into data science because it really is just computer science mixed with statistics and mathematics, and I've really enjoyed it. And it's one of my favorite domains of computer science if not my favorite.
MIKE: You know, unless you are inventing some addition to kind of the toolset used in these models, you probably are not going to be doing a lot of math because you're not going to be inventing your own architecture of layers and such because people have already built that. And you can get it pre-packaged. You just should have some understanding of what's going on under the hood.
TAYLOR: Yeah, you can get yourself into trouble if you don't understand the underlying fundamentals. Machine learning models are very finicky. And, like I said earlier, if you feed them wrong data, they're going to learn the wrong thing, and they're not going to perform well.
MIKE: I think I said that I'd come back. There's a couple of things that I want to mention I haven't mentioned before that you're probably going to run into pretty quickly if you start studying this stuff. First of all, we've been talking about neural networks. But, there are a few other approaches that people use to do machine learning. Generally, they also use this gradient descent idea [chuckles] to learn over time. But, they may approach the problem a little bit differently than having layers of weights and biases.
There are support vector machines, which have kind of fallen out of favor. But there's another one that's actually very effective. And if you look at competitions for machine learning where you have tables of data, something that looks like it came from a spreadsheet, they actually don't usually use neural networks because they don't work as well for that kind of categorical, tabular data. You can fit something better with what's called a gradient-boosted tree.
And it follows that gradient approach, but they don't apply that to lines exactly. They apply it to decisions that are made to divide the data into sections. You take your data, and you come up with a lot of really simple classifiers that just say, well, taking the input data, I'm going to divide some of the data over onto this side and some of the data on the other side. And then, I'm going to subdivide it again and then subdivide it again until I get all my answers. They call that a decision tree.
And if you have a lot of simple decision trees, like, a whole bunch of them, and then you use your gradient descent together with your input data to emphasize, to increase the ranking of some trees versus another, you actually get a very effective model that tends to be better for tabular data. And it's a gradient-boosted tree. So, most of what you see, like ChatGPT, doesn't use gradient-boosted trees. But a lot of the machine learning that you actually use in the real world that you might not even think about is probably coming from those gradient-boosted trees because they're very effective for a lot of the less flashy but also important kinds of things that happen in business.
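That boosting idea can be sketched with depth-1 stumps and made-up data: each new stump is fit to the residuals the previous stumps left behind. Real libraries do much more (regularization, deeper trees, smarter splits), but this is the idea in miniature:

```python
# A toy gradient-boosted "tree" model: each tree is a depth-1 stump,
# and each new stump is fit to the residuals the previous ones missed.
def fit_stump(xs, residuals):
    # Try every split point; a stump predicts one constant on each side.
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lmean if x <= split else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def boost(xs, ys, rounds=50, lr=0.1):
    stumps = []
    preds = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [5, 5, 5, 20, 20, 20]  # a step shape that stumps can model exactly
model = boost(xs, ys)
print(model(2), model(5))  # near 5 and near 20
```

Note that this toy is additive boosting on residuals; production gradient-boosting libraries generalize the same loop to arbitrary loss functions.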
The other couple of things I wanted to mention is that if you have a whole bunch of lines, the best prediction you actually can get is just lines. It doesn't map to curved things very well, things that just really don't follow a line. But there's a little bit of magic you can do, which is you take your weights and biases, and you run them through another thing that isn't a straight line.
And the thing that they usually run it through, and this is going to sound ridiculous because it's so simple, is that if you get any value that's below zero, just cut it off and say it's zero. It's called a rectified linear unit, which is really a fancy way of just saying if it's less than zero, just call it zero. They call them ReLUs in the literature.
If you put that on your predictions between your layers, it means your layers aren't stuck predicting a line because, you know, a line that comes down and then stops at zero is not a line anymore. It's got this point in it that's not linear, and that discontinuous spot actually throws everything off. And it allows these models to learn something that's not a line, that actually might have sophisticated curves in it. Because now it's got something that throws the lines off and can be bendy, and they call that the [crosstalk 37:21]
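A tiny sketch of why the ReLU matters: two linear layers composed are still just one line, but putting a clamp-below-zero between them introduces a kink. The layer weights here are made up:

```python
def relu(v):
    # Rectified linear unit: anything below zero becomes zero.
    return max(0.0, v)

# Two linear "layers" composed: still linear overall.
def linear_only(x):
    h = 2.0 * x + 1.0          # layer 1
    return -0.5 * h + 3.0      # layer 2: collapses to -1.0 * x + 2.5

# Same layers with a ReLU in between: the output now has a kink.
def with_relu(x):
    h = relu(2.0 * x + 1.0)
    return -0.5 * h + 3.0

# linear_only is a straight line everywhere; with_relu goes flat
# below x = -0.5, where 2x + 1 crosses zero and the ReLU clamps it.
print(linear_only(-2.0), with_relu(-2.0), with_relu(0.0))
```

Stack enough of those kinks and the network can approximate curves that no single line ever could.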
TAYLOR: Because, yeah, a lot of times in data science, we're talking about error and our loss function. And let's say that the true line of those house prices looks like this. But your prediction might...if it's really bad, it looks like this, and you're off in every direction. And your error is really high because the point over here is really far from the true point, and your point down here is also really far. And our loss function helps determine on each iteration of training how far off we are from that true line. And there's countless different loss functions and accuracy metrics that we use.
And, oftentimes, you have to train a lot and test a lot to figure out what the right loss function is or what the right...oftentimes, we're developing [inaudible 37:59]. Like, it's not as simple as just there being a home price. We have to do some work to develop sort of what we want to predict, like, is it home price? Or is it average home price over, like, the last six months? Because home prices change, and some of that's variance and random noise.
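The loss idea Taylor describes can be made concrete with mean squared error, one of the most common loss functions; the home-price numbers here are invented:

```python
# Mean squared error: average of squared distances from the true values.
def mse(preds, truths):
    return sum((p - t) ** 2 for p, t in zip(preds, truths)) / len(truths)

true_prices = [200, 250, 300, 350]   # hypothetical home prices (thousands)
bad_guess   = [300, 150, 400, 250]   # off in every direction
good_guess  = [210, 245, 305, 340]   # close to the true values

print(mse(bad_guess, true_prices))   # large
print(mse(good_guess, true_prices))  # small
```

On each training iteration, it's this number the gradient updates are pushing down.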
But backing it up to some of the different domains in machine learning, broadly, there's three categories: we call them supervised, unsupervised, and reinforcement learning. Here at Acima, we use supervised learning, so all of our records have outcomes associated with them. We've labeled them. We've supervised the data. We've given it a supervision on what to consider good and what to consider bad. And that's what the gradient-boosted tree classifiers would fall under. Unsupervised would be more of that neural network where it doesn't have labels on how it should consider the data necessarily, but it can learn relationships in that data in an unsupervised fashion.
MIKE: Well, let's talk a little about that because people have actually seen this, and ChatGPT, which has been so famous, is actually a good example of unsupervised learning. They didn't go out and say, "I want you to predict English, and this is what English is. Match that." Did not happen. Because nobody can define English, it's too big. Nobody even knows all of it. Instead, what they did, and this might seem crazy, but they just gave it a bunch of text and said, "Predict the next word," and that's it. They do that billions and billions and billions of times. And you do that enough times; it gets pretty good at figuring out what the next word is going to be.
And, in fact, it has to learn concepts about the real world that are embedded in our language. It understands hot and cold and all these other things. It understands these ideas because it had to, to be able to figure out what the next word is. But there was no supervision to say, "Hey, this is what English is. And if you get good grammar, you're right." No, all they did was just say, "I've got some data here. Pick the next word."
Also, if you've seen a lot of the amazing image generation that's out there, they did the same thing, where they'd take an image and make it blurry. And then they'd say, "Predict what the real image is." And they did that with blurrier and blurrier images until they just gave it pure noise and said, "Ah, predict a horse." And it manages to predict a horse out of pure noise because it's learned the idea of what a horse looks like.
So unsupervised learning is, like...what are we talking about? We're just talking about trying to make sense of the world around you without having, like, a standard. You're just trying to predict based on some data that you've got that isn't classified, other than to say, well, maybe this is a category.
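The "predict the next word" idea can be seen in miniature with simple bigram counts. The toy corpus here is made up, and real models use transformers rather than count tables, but the training signal is the same:

```python
from collections import Counter, defaultdict

# "Predict the next word" in miniature: count which word follows which.
corpus = "the cat sat on the mat the cat ate the fish".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # Most frequently observed follower of `word`.
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often here
```

Notice there are no labels anywhere: the raw text supplies its own prediction targets, which is exactly the trick that makes this kind of training scale.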
TAYLOR: And I guess in the past few years, it has become less so. But, traditionally, those unsupervised techniques require a huge, enormous amount of data because figuring out the actual underlying meaning in that data takes so many examples of, like, what does a horse look like? What does the light look like? Where is it? Like, what color is the horse? So, to be able to extract that information from data requires a whole lot of data and some sophisticated machine-learning techniques.
But over time, these models have become very good at extracting, like, the underlying relationships between pixels and the full image or words and meaningful texts. And that's really been, I think, the groundbreaking revelation in the past ten years. It's just how much meaning can we extract from this data?
MIKE: And I interrupted a little bit to go into a little more detail about unsupervised learning. You're going to talk about reinforcement learning, something I love as well. Do you want to talk about reinforcement learning?
TAYLOR: Yeah, my first introduction to reinforcement learning was the game of Pac-Man. Our job was to basically build the Pac-Man game such that the ghosts try to eat Pac-Man, and Pac-Man tries to avoid the ghosts and eat the food.
MIKE: [laughs]
TAYLOR: And to do so, you have to go through thousands if not millions of iterations of the game of Pac-Man, developing rules to make Pac-Man eat as many food pellets as you can and not die as often as possible. And on every single one of these iterations, you're learning something, and you're reinforcing the behaviors in the model. Maybe the tweaks that you made didn't perform well, so you have to roll back those changes. Or maybe they perform really well, and you want to see how far you can push those changes. So, that's sort of the idea behind reinforcement learning.
And ChatGPT actually has a layer of reinforcement learning on top of it. And this is what you'll see sort of when it says it's just a large language model, and it can't do certain things. This stems from actual humans sort of, like, reinforcement learning the underlying transformer model to keep it from producing things that, I guess, the humans would prefer it not say. Maybe you guys have seen ChatGPT be hacked before, hacked in a way, or break out of its bounds [crosstalk 42:33] of that. So, you can kind of get, like, the raw output of the transformer model.
Yeah, the reinforcement learning on top of ChatGPT is really interesting. In many ways, I think it makes it more protective. Like, it keeps students from cheating on their homework, or it stops people from making weapons of mass destruction. But that layer on top is reinforcement learning. And it did require a huge amount of human capital to label output from ChatGPT and kind of weight it in a way where ChatGPT was more user-friendly.
MIKE: Right. And you use reinforcement learning when you have a reward signal, like winning the game [chuckles], or getting away from the ghost, or, you know, not saying racist things: something you can use to give a number, that was good or that was bad, a reward signal for an output that involves a series of actions over time. Generally, you know, you've got a sequence. And you need to be able to take this incoming state, well, this is what has happened already, and make a prediction of the value of a certain action going next.
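A minimal sketch of learning from a reward signal, assuming a made-up five-cell corridor world (not the Pac-Man assignment itself) and tabular Q-learning, one of the simplest reinforcement learning algorithms:

```python
import random

# Tiny reinforcement learning sketch: a 5-cell corridor with a reward
# at the right end. Q[state][action] estimates the long-run value of
# taking each action (0 = left, 1 = right) from each state.
random.seed(0)
N = 5                                 # states 0..4; state 4 is the goal
Q = [[0.0, 0.0] for _ in range(N)]
alpha, gamma, eps = 0.5, 0.9, 0.2     # learning rate, discount, exploration

for _ in range(500):                  # episodes
    s = 0
    while s != N - 1:
        # Mostly act greedily, sometimes explore at random.
        a = random.randrange(2) if random.random() < eps else \
            (0 if Q[s][0] > Q[s][1] else 1)
        s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
        r = 1.0 if s2 == N - 1 else 0.0
        # The reward signal propagates backward through time here:
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# After training, "right" should look more valuable than "left"
# in every non-goal state.
print([1 if Q[s][1] > Q[s][0] else 0 for s in range(N - 1)])
```

The key line is the update: an action's value is pulled toward the immediate reward plus the discounted value of wherever it led, which is how "good in the long run" leaks backward to early moves.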
TAYLOR: Yeah, like, in reinforcement learning, we'll oftentimes talk about the idea of, like, the state space. It's really easy to think about, like, in terms of the game of chess. The state space is the chess pieces on the board and the board itself. And depending on how those chess pieces are arranged on the board, you have a different state space. And there's certain optimal moves and the outcomes from that state space.
Like, obviously, the state space of all the possible moves and board conditions in chess is astronomical. But if you were able to calculate that entire state space and the optimal move from every state, then you have, like, basically a perfect chess agent. And that's kind of where we're at [chuckles] with chess agents. Like, I think it was back in the '90s with Deep Blue. The artificially intelligent chess agents were able to beat...is it Kasparov? I forget...the Russian grandmaster.
MIKE: Garry Kasparov.
TAYLOR: Yeah, Kasparov. I think there's a phenomenal documentary about that on Netflix or somewhere. But yeah, they developed an algorithm that was able to traverse the chess state space and determine optimal moves. And I believe that was an example of reinforcement learning. I forget what the specific algorithm they used was [inaudible 44:46] model.
MIKE: I think that Deep Blue actually didn't use...I think Deep Blue didn't use reinforcement learning. But --
TAYLOR: Yeah. They might have brute-forced it, right?
MIKE: Yeah, I think they did. It's just that if you look at enough games from enough grandmasters, you can avoid making some of the same mistakes that they did [laughs] --
DAVE: I heard an interview from Kasparov where they told him, "We just kept adding more and more memory to Deep Blue until it could beat you." And he was kind of disgusted to find out that, oh, you did brute force me. And they're like, "Well, we were going to brute force you from the beginning. That was the whole point."
TAYLOR: [laughs].
MIKE: But some of the newer models that have come out of Google's research teams, the Alpha class, AlphaGo, and AlphaZero, that are able to use reinforcement learning to do self-play, play against each other, and learn over time have gotten much better, much better than any previous modeling has.
TAYLOR: Yeah. And, obviously, with the game of Go, you weren't able to brute force it because the state space was, I think, orders of magnitude larger than chess's.
MIKE: Yeah, I think so.
TAYLOR: So, they had to develop these other techniques to extract optimal outcomes from, I guess, more abstract spaces, not specific state spaces.
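For a game small enough to enumerate, the brute-force idea the panel describes, searching the whole state space for the optimal move, can be sketched with a one-pile variant of Nim. This is a stand-in for illustration; Deep Blue's actual search was vastly more elaborate:

```python
from functools import lru_cache

# Exhaustively solving a tiny game's state space: one-pile Nim where
# you take 1 or 2 stones, and whoever takes the last stone wins.
@lru_cache(maxsize=None)
def wins(stones):
    # True if the player to move can force a win from this state.
    if stones == 0:
        return False  # no move left: the previous player took the last stone
    return any(not wins(stones - take) for take in (1, 2) if take <= stones)

def best_move(stones):
    # Pick any move that leaves the opponent in a losing state.
    for take in (1, 2):
        if take <= stones and not wins(stones - take):
            return take
    return 1  # losing position: every move is equally bad

# Positions where stones % 3 == 0 are losses for the player to move.
print([wins(n) for n in range(1, 7)])
```

Because every state is visited and cached, this agent is literally unbeatable; the whole problem with chess and Go is that the same approach runs out of universe.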
MIKE: Fascinating stuff. And we're giving you a taste. So, our listeners, we're giving you a taste here of [chuckles] what's going on here. There's one other thing I said that I was going to cover, and so I want to make sure I don't neglect it, which is transformers. There was the wonderfully named paper that came out a few years ago. Taylor, do you remember what year? "Attention Is All You Need." It came out in --
TAYLOR: 2017.
MIKE: 2017. That has really allowed some of these models to get a lot better. And I mentioned that simple ReLU layer, the one that cuts things off, that goes in between the linear layers. Well, they came up with a more sophisticated one. It still isn't that sophisticated. They use some Gaussian curves, you know, like the bell curves you're used to seeing, to try to predict where the hotspots are in the incoming data. What data is important? And to emphasize that and use that, you know, to be able to kind of focus, again, the attention on some data versus other data.
So, if you look at a lot of text, not all of it is going to be important for what your output is going to be. And predicting what is important works better with these transformer models, as they call them, that pay attention. And those have been developed and have allowed kind of the next stage, the next level of skill in these models. And some of the amazing things like ChatGPT and some of the image generation use some of these transformer layers inside.
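A minimal sketch of the attention mechanism from "Attention Is All You Need": in the paper, the weights come from a softmax over scaled dot-product scores, and the output is a weighted mix of the values. The vectors here are tiny and made up:

```python
import math

def softmax(xs):
    # Exponentiate and normalize so the scores become weights summing to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: each query scores every key,
    # and the output is the value vectors mixed by those weights.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)   # where to "pay attention"
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# The query matches the first key, so the output leans toward the
# first value (10) rather than the second (20).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
print(attention([[1.0, 0.0]], keys, values))
```

In a real transformer the queries, keys, and values are themselves learned linear projections of the input, and many of these attention "heads" run in parallel per layer.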
TAYLOR: Yeah. If you ever want to learn more about transformers, Spencer Reschke is actually the guy to go to from Acima. He's a mathematician by training. His degree should have also been, like, a computer science degree. He's got a phenomenal understanding of the self-attention mechanism. He actually showed me that paper. He's tried to implement a transformer here at Acima.
Transformers are really good at tackling sequence-to-sequence problems. So, in the case of ChatGPT, sequence to sequence is just token to token, or word to word, or letter to letter. But in the case of Acima, we might use a sequence-to-sequence model to predict how a lease traverses through, I guess, like, the rental period. So, given that the lease is this size and it's the first day, what is the probability, or what's the most probable outcome on the second day? And, like, are they going to be delinquent, or are they going to make their payment?
So, we'd like to use a sequence-to-sequence model to predict eventual yields on our portfolio. So, these transformers and sequence-to-sequence models can actually generalize to a whole bunch of other problems as long as you can format them in that sequence-to-sequence format.
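A toy version of that day-to-day idea, assuming made-up lease states and transition probabilities (these are invented for illustration, not Acima's actual model): given today's state, look up the most likely state tomorrow, and chain those lookups across days.

```python
# Hypothetical day-to-day lease states with invented probabilities.
transitions = {
    "current":     {"current": 0.90, "delinquent": 0.10},
    "delinquent":  {"current": 0.55, "delinquent": 0.40, "charged_off": 0.05},
    "charged_off": {"charged_off": 1.0},
}

def most_likely_next(state):
    nxt = transitions[state]
    return max(nxt, key=nxt.get)

def most_likely_path(state, days):
    # Follow the single most probable transition day by day.
    path = [state]
    for _ in range(days):
        state = most_likely_next(state)
        path.append(state)
    return path

print(most_likely_path("delinquent", 3))
```

A real sequence-to-sequence model would condition on much more than yesterday's state, but the shape of the problem, a sequence in and a sequence out, is the same.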
MIKE: You know, there were other kinds of models used for sequence to sequence prior to that, and they still use some. And hence the name of the paper: "Attention Is All You Need" said, well, you could use those other techniques. And, originally, self-attention was used in tandem with those other techniques. Eventually, they realized, actually, just drop the other techniques entirely. The self-attention is everything you need. And that's been transformative with these transformers. I'm going to look at you, Dave. You've said a little bit, but not a lot. It's mostly the Taylor and Mike show [laughs].
DAVE: It's been fantastic. I've just been eating this up.
MIKE: Do you have any final questions or, you know, clarification you'd like to put in here?
DAVE: I think this is either going to have a quick answer, or much more likely, it's going to be like the beginning of another episode. But I worked with a PhD gentleman. I guess we call them doctors. I worked with a guy who had a Ph.D. in data science from Stanford. And he said that through the '90s and early noughties, the AI stuff was producing stuff that was really spooky. And it was like, you know, how is this? Why, you know, da da da. And there was a shift in the late noughties, early teens into explainability, where you could literally get your doctorate in explaining why the AI arrived at this. Is that still going on? Is that still a dominant feature of the industry?
MIKE: [laughs] It's still hard --
DAVE: What's going on with explainability? It's still hard?
MIKE: It's still hard [laughs]. I'll say that. It's still hard. Again, these things are bigger than we can fit in our head. So, somehow, you have to come up with a signal that we can comprehend [chuckles] in something that's far vaster than you can fit in your head. Explainability is still an active area of research. Go ahead, Taylor.
TAYLOR: And I think that's especially true in the areas of, like, transformers and neural networks and these larger models with many layers and tons of neurons. But in terms of some of the more basic machine learning approaches, those are fairly easy to explain, and we are responsible for explaining some of our decisions using these models here at Acima. And in the case of decision trees, we'll basically say you are on the wrong side of, like, these splits in the tree. Like, your credit score was too high or too low, or your income was too high or too low.
And in the context of, like, a tree classifier, you can explain those splits a lot easier than, I guess, the inner layers of a neural network and what those individual neurons meant and why they fired one way for you and another way for someone else. So, I do think it is worth spending some time on explainability. Because if you don't understand why your model is producing some outputs, then it lacks utility in that respect, especially in the area of finance that we're in.
So, we've never deployed a neural network or something like that. We tend to stick to models that support what are called SHAP values. And that's basically a methodology we use to explain prediction output and which features or independent variables led to the output.
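A toy sketch of the split-based explanations Taylor describes. The feature names and thresholds here are invented, and this illustrates decision-path reasoning rather than SHAP itself:

```python
# Hypothetical splits a small tree might have learned; these names and
# thresholds are invented for illustration, not Acima's actual model.
SPLITS = [
    ("credit_score", 600),
    ("monthly_income", 2000),
]

def explain(applicant):
    # Report which side of each split the applicant fell on.
    reasons = []
    for feature, threshold in SPLITS:
        value = applicant[feature]
        side = "right" if value >= threshold else "wrong"
        reasons.append(f"{feature} = {value} is on the {side} side "
                       f"of the split at {threshold}")
    return reasons

for reason in explain({"credit_score": 580, "monthly_income": 2500}):
    print(reason)
```

Because each decision is a literal threshold comparison, the explanation falls straight out of the model, which is exactly what deep networks don't give you.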
DAVE: Awesome. Does explainability get into the positive side as well of, like, you get back a prediction, and it's like, oh, this lease is going to become delinquent in two weeks? And you're like, wait, what? Because they made a payment at 3:00 p.m. on a Tuesday, that's why. And you're like, okay, that's dumb. Stop...and then two weeks later, the lease goes delinquent, and they all do. It's like the data is there, and it's predicting. Does explainability get into that kind of spooky [inaudible 51:44] Ouija board stuff?
TAYLOR: I mean, sometimes there'll be odd predictions from a model that turn out to be accurate, but in retrospect, I think they would make sense. But for the most part, I think most of the decisions are intuitive. And there is some, like, some understanding you can gain just from looking at the decision. We make predictions a little differently from what you described. Like, we're not predicting delinquency. We're basically predicting if the account is good for the company or bad for the company.
And there are certain attributes that make the accounts good, certain attributes that make the accounts bad. It's pretty rare that there'll be an attribute that makes the account good for some strange reason or vice versa. And part of our job is to make sure that doesn't happen. Because that might suggest that the training data was malformed for some reason, or it's representing a relationship that's not going to hold true in production, so we'll test to make sure that's not the case. And when we deploy new models, something we're definitely looking out for is: why is our input data suddenly not looking like the data we trained on? Like, we've got to do something, maybe stop this model.
DAVE: That's awesome. I have follow-up questions, but I'll save them for the next time we talk about this.
MIKE: Taylor, do you have any final things you'd like to say?
TAYLOR: I don't think so. I think we've covered a lot. And we can definitely spend another call just covering machine learning and, especially as it pertains to Acima. I would like to extend an invitation to anyone from Acima who wants to learn more or has questions for us; even if it's just a personal project, we're always willing to help and extend an olive branch.
MIKE: Awesome. Well, thank you. Thank you to our listeners for listening to another episode of the Acima Development podcast, and until next time.