Episode 97

Standups

Your Host

About this Episode

This episode of the Acima Development Podcast centers on database performance, using the concept of indexing as its foundation. Mike opens with a story about discovering Google in the early 2000s to illustrate how powerful indexing systems transformed access to information. That same principle applies to databases: indexes act as shortcuts that make retrieving data dramatically faster, especially in large datasets. The discussion emphasizes that while indexes can feel like a technical detail, they are fundamental to how modern systems function efficiently, much like search engines reshaped how people find information.

Bill Coulam then dives into the technical side, explaining that indexes improve read performance but come with trade-offs, particularly slower writes because both the table and index must be updated. A key rule of thumb is that indexes are most beneficial when queries return a small subset of data, typically under about 25% of rows. The group explores how poor indexing strategies, like over-indexing or missing indexes on key relationships, can quietly degrade performance over time. Bill shares a striking real-world example where adding missing indexes reduced a process from taking 24 hours per record to processing millions in just a couple of hours, highlighting how impactful proper indexing can be.

The conversation broadens into database design philosophy and performance tuning. The team discusses different index types in PostgreSQL, when to use them, and how to balance read vs. write performance depending on use cases like bulk inserts or high-frequency queries. They also touch on when relational databases fall short, such as full-text search or massive write-heavy workloads, where NoSQL or specialized systems may be better suited. Ultimately, the takeaway is that effective database performance comes from understanding your data, access patterns, and trade-offs, combined with ongoing maintenance and thoughtful design rather than relying on defaults or assumptions.

Transcript:

MIKE: Hello, and welcome to another episode of the Acima Development Podcast. I'm Mike, and I'm hosting again today. I'm going to start by introducing Bill Coulam, who's with us today. He comes to us from the data team. And he's been here before, but we're going to focus on some information that he has to share. So, he's kind of the star of the show today. Also with us, returning, we've got Eddy, Travis, Justin, Dave. Mr. Perez, great having you with us. We've got Mike Perez here with us, and Ramses.

As usual, I'd like to start with something a little bit outside of our topic in order to bring it in and tie it into the outside world. And I was thinking about a story I think I've shared before. The importance of this moment early in my career keeps, like, growing as I look back to it, like, wow, that was a big deal, and I didn't realize it at the time.

So, in the early 2000s, somewhere in the early 2000s, early, early 2000s, I was working for a guy [chuckles]; I'm going to say that. He had some projects, and he didn't have enough resources to do some freelance projects, and so I was doing some of his stuff. He was outsourcing his freelance work to me [laughs]. And he had a project that was in Windows, and there was something they wanted to accomplish through the API.

And I started looking through the documentation, trying to use Microsoft's tools to search the documentation, and I spent hours. I looked everywhere I could [chuckles], and I couldn't find it. I came to the conclusion, maybe this doesn't even exist.

And I came back to him, and he got back to me, like, 30 minutes later. He said, "You know, there's this new tool called Google, and I use that, and it's amazing. You should start using it because it works really well, and it led me to this documentation." Like, wow, well, I know what I'm going to use now. I'm going to use this Google thing [laughs] because that works way better than actually going through the table of contents, and the index, and the documentation, because that's really hard to search through.

Those older forms of indexing were insufficient. Now, Google had this brilliant idea, you know, the founders of Google, that, okay, we'll index the internet. And even back then, that was, like, an impossible goal [chuckles]. And there were other sites that were doing it. There were indexes out there. What they would do is they'd look at the words on a website, and they would create an index based on those. And so, if you look for a word, they'd look for a website that had a lot of those words.

Well, people really quickly figured out how to game that [chuckles], and, of course, they did. So, they were useless almost immediately because people would go into their meta tags, and they'd just write the same word a hundred times for something that the site was really not very applicable for.

What Google did is they came up with a different sort of index, where they would index words in the links that linked back to a site, and also give extra weight if there were a lot of them, right? And so, by building a more appropriate index that suggested popularity, rather than self-determined, a self-stated importance of the page for a specific topic, they were able to come up with something way more effective.

And you don't always think about indexes, you think, index? Like, I remember going to the library. It had, like, the Dewey Decimal System, which is really kind of weird and awkward and hard to find things with, but it was way better than the alternative, which would have been nothing. You don't usually think about indexes changing the world, but that index, that PageRank index, you know, the PageRank algorithm that they use to just create an index, that's all it is, right? Link this word, map this word to a website, so that when you're searching this word or phrase, then we can find it.

It literally, like, fundamentally changed culture. It's now a verb [laughs]. Like, you Google something, even if you're using Bing for those of you out there who use Bing [laughs]...

DAVE: Use Bing to Google, yeah.

MIKE: Exactly. You use Bing to Google, because information now is accessible, and that is something that didn't exist before that. For all the digital natives who've grown up in this world, like, how did you find things before? Well, you didn't [laughs]. You suffered. You wandered through libraries.

DAVE: We just got used to not knowing things, yep.

MIKE: Exactly. That's exactly what you did. You got used to not knowing things. It changes everything when you have an effective index. And I could talk about all the times in my career when something's missing from the database, and yeah, it was the index. It's always the index. There's always a missing index somewhere. It solves all of your performance problems. And there probably is an exception, but I can't think of it [laughs]. It's always the index.

That's what we're going to talk about today. We're going to talk about database performance. And we've been wanting to, you know, Bill's been preparing this and thinking about this for a while. If we're talking about database performance, indexes are going to come up over and over again. And this could seem really dry, and this is going to be a technical deep dive, right; we're very much going to talk about indexes. We're probably going to be focusing on PostgreSQL.

But this idea of indexes is not a trivial one. It's how we operate in the modern world. Our culture, our commerce has been fundamentally transformed. Our ability to know things and outsource, you know, to this Library of Alexandria that we've got in our pockets all depends on indexes, and it's amazing.

There's my introduction, Bill. And I wanted to lead out with some weight behind what you're going to be talking about today.

BILL: I love it. That was a fantastic segue. All right. Hi, everyone. I am Bill, Bill Coulam. I've been doing this work for about 30 years now. I started as a software engineer using COBOL and mainframes, but I don't put that on my resume because I don't want anyone to ever call me back to help with that. So, I tell people I started with C and C++.

I was actually one of the first users of Java back in 1995. My company that I worked for at the time, Andersen Consulting, they wanted me to go around to their clients and tell them what I thought of Java. And, at the time, I felt like it really wasn't ready for primetime, and so I kind of voted myself out of working on that platform.

But that's okay because I ended up, on every project that I worked on, working with Oracle, and, at the time, Oracle was the 800-pound gorilla. And I was in the telecom industry, where we had some of the largest volumes of data in the world, and so I learned a lot of great lessons working on those big systems.

It's a whole other world jumping between databases that have 10,000 to a hundred thousand rows to databases that have 500 million, a billion. Performance tests in your copy of production can take three hours. It's a completely different world. Anyway, so you learn a lot of good lessons working on data that big.

I ended up sticking with Oracle for a long time. It became my bread and butter. And went from San Francisco to Denver to Houston, and then back here to Utah where I grew up. I've been here longer than I spent time in my own hometown. So, I've been here in the northern central part of Utah since 2007.

Anyway, let's go ahead and jump into it. We're going to be talking about four areas: the fundamentals of indexing, some guiding principles, the index types that are available to us using Postgres as our source database, and some indexing dos and don'ts.

Firstly, some fundamentals. An index is a shortcut to get at the data. However, because an index is a separate structure from the actual table containing the data, it requires at least two I/Os to get at the data: one to search through the index, then one to access the rows in the table. Because of this, indexing can and usually does save time when querying large tables, but it can take longer than a full table scan if the number of matching rows is greater than around 25%. That is a rule of thumb, not a hard rule.

I did a bunch of testing back in 2024 on our setup here, and it was right around 25%. So, if the number of rows you anticipate matching your query being less than 25%, an index will typically make sense. Ultimately, an index is stored in a file. And updates of index columns, keep in mind, must modify and manipulate the table and the index. That's important when you start thinking about how many indexes your table has and the effect that that will have on write time. And, lastly, matching index and table results will get cached in case the same request is made later.
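
[A small sketch of the trade-off Bill describes here, with hypothetical table and column names. Whether the planner actually picks the index depends on table statistics and data volume, so treat this as illustrative:]

```sql
-- Hypothetical table and index for illustration.
CREATE TABLE orders (
    id         bigserial PRIMARY KEY,
    status     text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX orders_status_idx ON orders (status);

-- Selective predicate (few matching rows): the planner will
-- likely use the index.
EXPLAIN SELECT * FROM orders WHERE status = 'chargeback';

-- Low-selectivity predicate (most rows match): a sequential scan
-- often wins, because the extra index I/O outweighs the savings.
EXPLAIN SELECT * FROM orders WHERE status <> 'archived';
```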

MIKE: So, I've got a couple of questions about that. Firstly, how often do you see in...and this depends on systems, right, so maybe there is no universal answer. But how often do you see indexes harm performance? Because there's this index that we probably didn't need, but now we have to write to it every time, or somebody went in and indexed 20 columns, right? There are certainly bad use cases.

Have you seen cases where there was a clear performance hit, and, you know, seeing data to show that? Is there some sort of rule of thumb where I should think, ah, well, actually maybe the index is a bad idea here? I'm also curious about those caching results. Do you sometimes get...in data sets that are growing really fast or something, do you end up with weird results from that caching?

BILL: Let me answer the second one first. The answer is no. Phil Karlton of Netscape, may he rest in peace, he was famed for saying something like, "There are only two hard things in computer science: naming things and cache invalidation." And there was some wisecrack that added to that, where it said, "There are only two hard things in computer science: naming things, cache invalidation, and off-by-one errors." But yeah, cache invalidation is tricky, but the database engine teams tend to have done that very well. So, I've never had funky results from mainstream relational database engines, so that tends to work pretty well.

The answer to your first question, the quick answer to that question is no. I have not seen indexes cause immediate harm. Like, the old analogy of, you know, the frog in the pot of water that eventually gets too hot and cooks it, adding even crazy indexes, indexes with lots of multiple columns in them, and so forth, I've never seen an immediate and obvious degradation. It's even been hard to detect it when that pot is fully boiling.

When a table has 30 indexes on it, and inserts are taking two milliseconds per row, generally, you don't notice it. Over time, as these indexes are added, the team that works with that data tends to believe this is the way things are.

DAVE: Oh, it does that.

BILL: And they don't really question: could this be three times faster if we got rid of all the unused indexes? So, yeah, to answer your question, I've never seen it immediately [inaudible 11:39] performance.

MIKE: [crosstalk 11:39] three times faster. That's probably loosely data-driven, right?

BILL: Yeah, that's very loose.

MIKE: But you didn't say a thousand times faster. But there are very much cases where if you're missing an index, it could be a thousand or a million times faster [laughs].

BILL: Yes, mm-hmm. And --

DAVE: I actually have seen a case of this, but I think I'm actually in agreement with Bill, where the definition of a database, right, developers, we always talk about toy databases, right? But the database...and you don't think of this as a database, but the system log on Linux, it's a log file, and we think of it as a log file. But you can also think of it as an unindexed table where you want to write a row fast...insertion has to be very, very fast, and you can't spend any time indexing.

Well, if you have to write fast and you're saturating the hard drive...also, this is back in the days when a hard drive seek was 12 milliseconds, and so updating the index and the file was very painful, right? If you take that blurry view of, like, is a log file a database? It's easier to think of that when you start realizing that, well, everyone's now streaming their log files up to Splunk and Datadog, and these things that are, like, pulling their log files together.

And time series databases like Grafana now exist where you're supposed to log, log, log, log, log, log, and then, over time, they start compressing the old stuff. Like, they start batching it up historically, and you start losing data. It's kind of like compacting context for an AI. So, like, 100%, I agree that, like, if you're talking to a real database, you've usually got a lot of structure, and everything's, like, really, really solid.

But I have, back from the battle days when I was doing a lot of MySQL, we would have to sit down and go, is this table fast-write, and we don't care about search? Or do we need fast search on this? And if so, can we pay the cost to index it?

MATT: I think it's specialized, right?

DAVE: Mm-hmm. Mm-hmm.

MATT: There's certainly cases where you will see this. If you have one insert and that insert is inserting four and a half million rows, and I see this, it's a problem. But I think it's a more specialized case and more one-off. But if you have 30 indexes on columns in a table that you're inserting 4 million rows at a time, you definitely see performance degradation for sure.

EDDY: So, that's kind of interesting, right? Because I think, historically, what we've done is we try to remove unused indexes, right, like when they become unnecessary. I think the rule of thumb is clean up to avoid degradation, right? But then it's kind of interesting because I think Bill's response to that is, I haven't seen that in practice slow down any inserts or writes, right?

So, I'm curious, like, is it just, like, the thought process is clean up after yourself always, regardless of whether it's actually causing degradation? Or is it...

MIKE: So, I asked the question because I think that there's so much power in indexes and, you know, it seems like that cost to write isn't that high, but I think there are other costs. Even if it was negligible, even if there was no impact at all, I think if you want to understand what's going on and you've got 30 indexes, you're in trouble. Like, I think that that cleanup matters just for, you know, we don't write in machine code because code is written for humans, right, and then compiled for the machine to understand. And I think the same thing applies here.

There's a human aspect to this that unless you actually needed those, I think that you're doing active harm to the users of that, you know, to understanding or even trying to fix if there's a problem, if you've got a bunch of junk there that you don't understand. I think that regardless of whether it doesn't have much impact, I think that there is still, like, a reason to keep it clean. Well, that's my thought. What are your thoughts, Bill?

BILL: You really need to know your data and know your anticipated access patterns. Matt was talking about a scenario where you want to insert 4 million rows in the telecom industry, or sometimes you'd need to bulk load 300 million rows. That's a different access pattern than inserting a single row. And you need to approach things differently.

In some of those bulk load scenarios, it was much faster for us to drop all the existing indexes, do the load, then re-add the indexes than it would've been to leave the indexes in place and let the database engine do all the maintenance. So, everything is an it depends answer, right? You need to think through those things.
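
[A sketch of the drop-load-rebuild pattern Bill describes; the table, index, and file names are hypothetical:]

```sql
-- Drop the indexes up front so the engine isn't maintaining them
-- row by row during the bulk load.
DROP INDEX IF EXISTS call_records_subscriber_idx;
DROP INDEX IF EXISTS call_records_started_at_idx;

-- Bulk load the data with no index maintenance overhead.
COPY call_records FROM '/tmp/call_records.csv' WITH (FORMAT csv);

-- Rebuild the indexes once, after the load completes.
CREATE INDEX call_records_subscriber_idx ON call_records (subscriber_id);
CREATE INDEX call_records_started_at_idx ON call_records (started_at);
```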

But the second thing that I want to talk about, which is highly related to the indexing fundamentals I just went over...And these aren't, you know, industry standards. These are just, according to me, some guiding principles of index design. The first one we've kind of already talked about...actually, both the first and the second. That is that an index will make data access faster, but an index comes with a cost to write times.

So, just like Goldilocks and the Three Bears [chuckles], you know, you can't have too many or too little. It needs to be just right. In order to make it just right, you really need to understand your application, the business requirements of that application, the data that you're working with, the quality of the content of it. You need to understand the access patterns that are anticipated on that data, queries that you already know about, or anticipate. And by having all of that context, you can design a much better database schema and indexing strategy.

DAVE: One of the things that kind of blew my mind, like, 5, 10 years ago as I was getting into, like, NoSQL, and schemaless databases, and document databases, and that sort of thing, was somebody pointed out to me that SQL is actually a terrible language for reporting. It's not built for reporting. SQL is an ad hoc query language. It's how you get at your data when you have no plan to get at your data. And all the cool stuff, all the bells and whistles, like the query planner and indexes, are basically trying to get around the fact that you weren't prepared to do this.

And if you do know what you want and you've got that report well defined, you can make it so, so slick, whether it's indexing in advance, or materializing views, or shoveling everything over to the data team and letting them stick it in gigantic vertical tables.

EDDY: So, I think I've always just gone hand in hand with saying, "Oh, you want reports? You want SQL because that is the most efficient language used in any sort of database," right? Are you suggesting, or do I understand that correctly, that you're saying that that isn't its original intention? Because you're blowing my mind if that's the case.

BILL: If you want something that's super efficient, it's NoSQL, because a collection is built to match an anticipated access pattern; it will blow relational out of the water. I love that David brought that up because that was the next line of my presentation.

DAVE: Yes!

BILL: Is that one of the greatest strengths of relational databases is its ability to handle ad hoc queries. So, you try and totally understand your system and try and anticipate the queries that will be required of it. And some you will know right off the bat because they're right there in the requirements documents, but there are still plenty of ways in which that data may be used in the future.

And you just use your experience and your gut instinct to say, "Okay, we're probably going to need an index for these three columns here, because of what they're named, and what I expect, you know, the end users to want. This JSON data column, it's unlikely they're going to be indexing off of that, you know," so you make your best guesses.

But yeah, a relational database is really good at ad hoc queries. And if you have done your very best to index intelligently, then it will be able to handle most of those ad hoc queries well. And the ones that don't will be generally immediately apparent, unless you're on a tiny system, you know, trivial system. And then you can redesign things; maybe toss an older index and redesign a new one that's composite or partial or, you know, fancy in some way that matches your needs.

MIKE: I've got an anecdote related to this as well when you talk about NoSQL.

BILL: Yeah [inaudible 19:27]

MIKE: Some years ago in my career, I was working at a place that did content management, largely for newspapers. And you think about what an online newspaper page is; it is effectively searching the latest content. You don't usually think about that. Like, that's search? Well, yeah, it is. You want to just get the latest things. It's a feed, right, and then the oldest things drop off. That's more obvious with something like social media, where there literally is this feed where content comes in from the top.

The newspapers, even old paper newspapers, you have the latest content, and the most important stuff comes to the top, and other stuff flows down to page eight. And we organized our presentation that way. It really was doing searches. And it got slower because a lot of that relied on text. You know, you're looking for this kind of text, well, this is the weather, right? We want to look at the weather stuff.

And to make our system efficient, we had to get out of relational databases. And we used a full-text index; we were using Solr at the time. It's similar to Elasticsearch or OpenSearch, all these descendants of the old Java Lucene library that allow you to efficiently build an index into your data. But it also is effectively a NoSQL database because it's searching the data.

In fact, you can even cache the data in that index, and so you never even hit your database. And we would do that sometimes, where we'd never even hit our relational database at all. We just used that to store the data before we indexed it, and then it came out of that other system. And we could not run. We absolutely could not run off our relational database because it was way too slow; it was unworkable. We had to use, you know, that NoSQL database in order to work.

BILL: Yep. And there was a telecom company I worked for in 2021 where they needed wickedly fast writes. And so, there we went with Cassandra. We weren't really worried about indexing or reading.

Now that I'm clear that this is not a presentation, I'm going to possibly just fly through the middle portion of it, which is where I train the listeners on the different sorts of indexes available to us in Postgres. Maybe I'll just mention them briefly and then get to the dos and don'ts at the very end.

In Postgres, we have a number of basic indexing types available to us, many of which are covered in your basic computer science courses, like B-trees and hashes. We also have some index types called GIN, which stands for Generalized Inverted Index, and GiST, which is a Generalized Search Tree.

The GIN indexes can support many different user-defined indexing strategies out of the box. We typically use them when indexing columns that are arrays on a Postgres table. But it is also used to aid in the implementation of full-text search on Postgres. And it comes built in with various operators that let you do things like nearness, and contains, and stemming, and some other things that a typical B-tree doesn't allow you to do.
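
[A sketch of both GIN uses Bill mentions — an array column and full-text search — with hypothetical table and column names:]

```sql
-- Hypothetical table with an array column and a text body.
CREATE TABLE posts (
    id   bigserial PRIMARY KEY,
    tags text[] NOT NULL DEFAULT '{}',
    body text
);

-- GIN over the array column:
CREATE INDEX posts_tags_idx ON posts USING gin (tags);

-- A "contains" query the GIN index can serve:
SELECT * FROM posts WHERE tags @> ARRAY['postgres'];

-- GIN is also the usual choice for Postgres full-text search,
-- here as an expression index:
CREATE INDEX posts_fts_idx
    ON posts USING gin (to_tsvector('english', body));

SELECT * FROM posts
 WHERE to_tsvector('english', body)
       @@ to_tsquery('english', 'index & tuning');
```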

GiST Index is fairly similar in that it allows you to build your own indexing strategies. It supports nearest neighbor searches, geography, spatial, and other very specialized types of indexing. This is the index most typically used for features within an application that have mapping features, allowing you to see how far away you are from the pizza origination point and things like that.
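
[A sketch of the nearest-neighbor use case Bill describes, using the built-in `point` type and its distance operator; table and column names are hypothetical. (PostGIS layers richer geography types on the same GiST machinery.)]

```sql
-- Hypothetical table of store locations.
CREATE TABLE stores (
    id       bigserial PRIMARY KEY,
    name     text NOT NULL,
    location point NOT NULL
);

CREATE INDEX stores_location_idx ON stores USING gist (location);

-- Ten closest stores to a given point. The distance operator <->
-- lets GiST satisfy this as an ordered (nearest-neighbor) scan.
SELECT name
  FROM stores
 ORDER BY location <-> point '(40.76, -111.89)'
 LIMIT 10;
```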

MIKE: Well, you know, that's interesting. And even I think, well, that's a very specialized case, but the most common query in our biggest application is one of those geographic queries, in order to find your [crosstalk 23:26]

BILL: That's in merchant portal, right?

MIKE: Exactly.

BILL: We're using that there. Then there's a really specialized version, which I have never used, called Space-Partitioned GiST, SP-GiST. The official documentation says it's non-overlapping. It lets you build your own indexing strategies, just like GIN and GiST. It's very flexible. It permits implementation of a wide range of different non-balanced disc-based data structures, such as quad trees, k-d trees, and radix trees. If you guys know what those are, that's awesome. I did not go through computer science, so I'm not even sure when those would be helpful. But, apparently, it is also used in geolocation-type applications.

The other sorts are a little bit more used. I've only used BRIN once. BRIN stands for Block Range Indexes. These are best for columns whose values correspond with their physical order in the rows of the table, so think of, like, now-serving number at a queue or a kiosk. As that number is monotonically increasing and going to some database column, that number is very close to the value of the row that preceded it. That is a perfect column for a BRIN index. And the reason you would want to use it is it uses far less space than a typical B-tree because it uses ranges instead of individual values.
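
[A sketch of the BRIN case Bill describes, with hypothetical names. The payoff is on range predicates over a column whose values track physical row order:]

```sql
-- Hypothetical append-only table; created_at only ever increases,
-- so its values correlate with physical row order.
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- BRIN stores block ranges, not individual values, so it is a
-- fraction of the size of an equivalent B-tree.
CREATE INDEX events_created_brin ON events USING brin (created_at);

-- Range predicates over the ordered column are where BRIN pays off:
SELECT count(*)
  FROM events
 WHERE created_at >= now() - interval '1 day';
```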

Uh, let's skip over that.

MIKE: Do those ever get used for, like, primary keys or anything like that, for efficiency, or not really?

BILL: No [chuckles]. Most every system I've ever worked on defaults to B-tree for a numerically based primary key or surrogate key. But, in theory, it could be used for one of those. Yeah, I've only ever used it once, and, typically, the B-tree is fast enough. I don't need to eke out another hundredth of a millisecond by using a BRIN. So, typically, the default is used.

MIKE: Interesting. Maybe if you had some massive data set of sequential data or something.

BILL: Yeah, because this happened to me in Telecom, where we were trying to eke out every millisecond we could find. This might have been one of the strategies we tried.

DAVE: We had a really fun...out here in Utah, we used to have Mountain West RubyConf for about 15 years, and it was a fantastic conference that got put on. And James Golick came out. He was the CTO of FetLife. Don't Google that. It's an adult website, social media for naughty stuff. And because it's naughty stuff, it was very, very popular, and he was running, like, a terabyte of notifications through their database, like, every single day. And this was in, like, 2009, so, like, a terabyte was a lot back then. So, imagine somebody shoveling around a petabyte or, you know, half an exabyte trying to get that through.

And they were using a popular document database; it's not my fight to have, so I won't say which one. They bogged down so hard. They kept backing up and backing up. They bogged down so hard that he had to physically pull the cord on the server. Like, he couldn't shell into it to stop the server. He couldn't, like, I don't know if he had a bash prompt. He couldn't get the keyboard to respond. Could not get ACPI power button, that's when you hold down the power button on the front of the case, could not get that to respond. The document database was just spooling everything; it had just backed up and backed up and spooled out.

He ended up writing Friendly ORM, which is based on FriendFeed. And if you want to know how a document database works, go tear Friendly ORM apart, if you like Rails, because it's built on SQL. It runs on MySQL or runs on Postgres anything. And your data, your documents go in a table that has an ID and a blob column. And all of your indexes are tables, and every table has an index, a record ID, and whatever data you want to go look up, and it's got an index on it. And he just handled that in the ORM.

And when you talk about writing something in anger, he ripped out that document data store that same day. Like, it was on a Thursday or a Friday, and on Monday, they were running on Friendly ORM in MySQL. It was insanely angry.

So, yeah, if you want to know how NoSQL works, like, under the hood, it's fantastic because you get into it, and you go, wait, is this all there is to it? And yeah, that's all there is to it. All the stuff about, like, crawling over a database and indexing it and then searching back through, like, the problems of searching a document database when you don't have an index, it's very obvious because there's only, like, four moving parts. It's really, really cool.

BILL: Fabulous anecdote. I saved the B-tree index type for last because it's the most common; it's covered in computer science courses. But I just wanted to cover just a couple of nuances in Postgres, a couple of which I had to learn the hard way a year or two into using Postgres. One of which is that the B-tree is really good for less than, greater than, less than or equal to, greater than or equal to, and equals. It can also support some other equality and range comparisons, like the LIKE operator, BETWEEN, IN, IS NULL, and IS NOT NULL. But there are a couple of patterns it doesn't support out of the box, and one of them involves the LIKE operator.

If you do a LIKE comparison and you feed it a pattern that starts with a wildcard, it can't use that. It nullifies the use of the index for that comparison and will do a full scan on the table. In order for the database to handle a pattern like that, you would need to use the GIN index with trigram ops — an operator class, I think it's called. Anyway, each of these index types has the basic default syntax for creating an index of that type, and then it has a whole bunch of optional things.
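
[A sketch of the trigram workaround Bill mentions, via the `pg_trgm` extension; table and column names are hypothetical:]

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical table.
CREATE TABLE customers (
    id   bigserial PRIMARY KEY,
    name text NOT NULL
);

-- GIN with the trigram operator class:
CREATE INDEX customers_name_trgm_idx
    ON customers USING gin (name gin_trgm_ops);

-- This pattern starts with a wildcard, so a plain B-tree on name
-- is useless here, but the trigram index can still serve it:
SELECT * FROM customers WHERE name LIKE '%oulam%';
```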

If you want to really know your stuff, get into the Postgres documentation and look at those options sometime. That's where you see some of the richer things, like, for the B-tree, it has some operator classes called text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops that I didn't even know existed until about three years ago. I won't go into those right now. But just know that there's a range of flavors of these indexes that you can activate by knowing what those options are and knowing when they'd be useful.

So, with B-tree, there are a variety of flavors of the B-tree index. There's the one that we use the most often, which is a single-column default B-tree. I won't talk more about that. The second flavor is a multi-column one. This can be used for indexes, sometimes referred to as keys, which are composed of 2 to 32 columns. You're limited to 32. I've honestly never seen any with more than 5.

This sort of multi-column index is used for queries where two or more columns are always or frequently used together in the WHERE clause. During index creation, you know, you say CREATE INDEX. You give it a name ON table_name, and then in parentheses, you list the columns that you want indexed. You list those columns in the order of selectivity. So, if you had, for example, a table of people, or employees, or citizens, which would you put first: social security number or eye color?

KYLE: High cardinality first.

BILL: Yeah, yeah. So, the column that would return the fewest matches goes first, and that would be social security number, which is unique. So, yeah, higher selectivity goes first; lower selectivity goes towards the right.
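A minimal sketch of what that CREATE INDEX looks like, using the hypothetical table from the discussion:

```sql
-- Higher-selectivity column first: ssn is unique, while
-- eye_color has only a handful of distinct values.
CREATE INDEX citizens_ssn_eye_color_idx
    ON citizens (ssn, eye_color);
```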

MIKE: That's an interesting one because I think that those multi-column indexes don't get used as much as they could. A lot of the big, gnarly, slow-running queries do query against several, you know, they query against a number of things. How much benefit do you get from using a multi-column index rather than having several columns indexed independently?

BILL: A lot. The trick is knowing when you should have it. If you look at some of our queries on our tables and you run EXPLAIN ANALYZE on them, and you see in the query plan that it's doing a lot of BitmapAnd operations...a BitmapAnd combines single-column indexes together in order to arrive at the answer quicker. If it's doing a bunch of BitmapAnds, and it's doing that over and over again, it's possible that you have a very common query that should have those two or three columns put together in a multi-column index.
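As a sketch of the pattern Bill is describing (the table, columns, and plan fragment are illustrative, not taken from a real system):

```sql
-- With only single-column indexes on status and region, the plan
-- for this query might show the planner combining bitmaps:
--   BitmapAnd
--     -> Bitmap Index Scan on orders_status_idx
--     -> Bitmap Index Scan on orders_region_idx
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'open' AND region = 'west';

-- If this query runs constantly, a multi-column index lets the
-- planner answer it with a single index scan instead:
CREATE INDEX orders_status_region_idx ON orders (status, region);
```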

But if that same query has, you know, 50 flavors of queries that are being thrown at it, you wouldn't want 50 multi-column indexes to match each of those queries. So, it's that balance we were talking about at the start. You have to know which of those queries are the most important, which ones are being hit a million times a day, and which ones are being hit four times a month, and plan accordingly. And that's something...one of the 15 projects I'd love to do here is optimize that.

MIKE: So, you're going looking through your slow queries, you know, using whatever tool you're using. It sounds like you'd, you know, have that in your toolkit at the ready. If you see queries that are, like, oh wow, that's checking against four columns in this table, you should probably have an index on those if it's doing those BitmapAnd operations you're talking about.

BILL: Yeah. And if that query is being hit many times per day, it's a good use case for them.

MIKE: You know, most of the queries that tend to run really slow are doing joins, so I'm going a bit far afield here. So, what if the data's across six different tables, but you're running it all the time? That's a slightly different case. Do you have an approach for that specific situation?

BILL: Well, you first try to optimize that query. By the way, I have a few cardinal rules about query performance. And the first rule is asking whether or not this query should even be done. You would not believe how many times, when something was really, truly awful, we asked that question: do we even need this feature, or should we even be issuing this query? And how often the answer was no.

The second cardinal rule of query performance is, if it can be done in SQL, do it in SQL, instead of, you know, dragging the data out of the database and trying to replicate a database in the middle tier. And the third cardinal rule of performance tuning has to do with the indexing that we're talking about, and with making sure that the application data model has been well designed. Usually, when I had a really terrible performance problem, it was because the data model was not good.

So, I've covered two of the things that most commonly underlie massive performance issues: work that didn't need to be done at all, and business requirements that weren't well understood.

Once those things have been accounted for, and your data model's good and clean, and you've made sure that everything's indexed well so the joins can be efficient, well, you've got this 6-, 8-, 14-table join. You've done everything you could, but it's still not fast enough. That's when you start exploring denormalization. And that typically leads a relational database person to materialized views. In Oracle, that was really beautiful because it had a built-in facility to keep a materialized view refreshed upon commit.

Postgres is just getting to that now with an extension called pg_ivm, Incremental View Maintenance. I think it's coming standard with 17 or 18. But, anyway, when you've done everything you could and dotted all your i's and crossed all your t's, and it's still not fast enough, that's when you need to look into materialized views. And if that doesn't work, then you're probably on the wrong database engine for your use case, for your application.

MIKE: That makes sense.

WILL: Generally, it goes back to, like, database performance in general. I'm not a database engineer. I know, like, what an index and a join are and how all this stuff works under the hood. But I suppose the biggest question that I've got for somebody who makes databases their trade is, like, if I'm looking at a database performance dashboard, what am I looking at to diagnose performance issues? Like, how are you looking at a database and how it's running, right?

I know if I have a server and it's like, oh, it's using too much memory, okay, there's a problem. My queue depths are starting to, like, get really big, okay, that's a problem, right? But, like, when you are looking at, like, sort of, like, the dashboard of a database, like, what are you looking for to say, like, oh, okay, this is a problem; this isn't a problem, you know what I mean? I'm just curious, like, how do you sleuth out these performance issues?

BILL: Yeah, it's not too bad. A mature, well-instrumented database engine usually comes with some facility that allows you to peer into the resources being consumed by all the queries in the system, and it'll show you front and center what the hotspot is. I mean, if the database is really hurting, it's usually pretty obvious. The times when it wasn't obvious, it was usually due to the network or something else.

But yeah, usually when you peer into a dashboard, there's a big, old bar, a big, old spike, a flame, that shows you exactly where most of the runtime is being consumed. And you're able to click into that, and it will usually tell you which query it is. Now, there, a lot of the dashboards kind of let you down in that they only give you a piece of that SQL. And very often, you need to see the entire SQL in order to figure out what the culprit is.

Once you have the entire SQL, then you're able to run it either through EXPLAIN, which gives you an estimate of what the database would do, or, if you are able, run an EXPLAIN ANALYZE, which will show you exactly what the database is doing when it's pursuing the data. And that is where it's really critical to know both the database engine's indexing and your data, in order to determine whether the query path that the planner is showing you in the explain plan is the plan it should be using.
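For reference, the two forms Bill contrasts look like this (the query itself is a placeholder):

```sql
-- EXPLAIN estimates the plan without running the query:
EXPLAIN
SELECT * FROM payments WHERE customer_id = 42;

-- EXPLAIN ANALYZE actually executes the query and reports real
-- row counts and timings, so the planner's estimates can be
-- checked against reality:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM payments WHERE customer_id = 42;
```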

So, you look at all these steps, and you need to know how to read it. Okay, it's doing this one first, then this one, then this one. And if you know your data and you know what it should have been starting with and what it should have been doing next, and you look at that plan and it's not doing that, then you know you have an issue. You know you're missing statistics, or you're missing an index.

Or some table got accidentally blown out with 5 million rows the other day. It was a bug. And they got rid of those 5 million rows, but they forgot to reduce the high watermark. But the database still thinks it's a massive table, and so it's making the wrong join choice. That's where the expertise comes in. That's why you get paid the big bucks: being able to combine all those things and figure out, yeah, the database is not doing the right thing here, and here's what it should be doing. And how do we get it to do that?

WILL: How can you tell, like, differentiate between just a hot query that's just a hot query? Like, a lot of people want the homepage, let's say, you know, like a [inaudible 38:49] example, right? How do you say, like, oh, this is just, like, everybody wants the homepage, versus, oh, the homepage, you know, is misconfigured, right? Like, how do you tell the difference?

BILL: The vast majority of the systems that I've built have been well normalized, and designed, and indexed, and so forth. So, when we had an issue, it was because something changed, and it was more reactive. Someone noticed an issue, they called us. We looked, oh yeah, yeah, like that scenario I just described, where a table got blown completely out of proportion and shrunk the next day, and it changed the nature of the query path.

Ideally, you would have a more heuristic system that learns from what is typically running on that database so that when something's out of the ordinary, it alerts you ahead of time. I've never lived in such a world; that would be lovely. I have not seen it. They probably exist.

WILL: Oh, don't worry, don't worry. If the database starts going south, we'll call you.

[laughter]

BILL: I might be conflating this with my previous client, but there's a tool called...there are several, but one that I've used most recently was called SolarWinds. I don't know if any of you...Kyle if...

KYLE: Yeah, that's the one we use here.

BILL: Okay. And I haven't been using that, or I haven't had a need to use that heavily here. But I believe it has some facilities like that to tell you the difference between one that is frequently hot and heavily used, versus one that's not been seen before and is consuming all the resources [inaudible 40:19]

MIKE: You know, Kyle, I've been meaning to ask you...because we're talking about the monitoring, because you get asked those questions. People come and say, "Hey, DevOps team. Everything's on fire. What do I do?" And you're like, "I don't know your system. I'll pull up a dashboard," and you usually manage to find something [chuckles]. Like, what's your tactic, Kyle, for finding database problems?

KYLE: Database problems, I usually look for high I/O, disk queue depth, CPU, memory, and then connections. I'd say spiking connections would tell me quite often that there is a problem. And then that queue depth, if that queue depth gets very large, we know we've got a gnarly query in there somewhere. And then, at that point, that's going to trigger me to go look at a tool like New Relic, or, you know, something that can do the APM analysis from the service side and tell me, like, what that query might be. And then, from there, generally, we're able to say, oh, we're missing an index here. You guys should go add this index, and that'll increase performance again.

BILL: It's when Kyle and DevOps reach that point that they usually involve me. So, that's why I wasn't able to answer your question [laughs] terribly well, because I'm usually getting skipped until that point.

WILL: So, if you had, like, a lock or something that was deadlocking on a database, or, like, a, you know what I mean, like, some kind of table lock, like, how would that manifest itself?

KYLE: So, that'll show in your performance insights tool. I did skip over that. That's another one that we commonly look for. We go in, and we see if there's a query that's got a lock on it.

MIKE: [inaudible 42:01]

WILL: How does that manifest, like, a bad lock where you're stuck, versus like a good lock, where it's just, like, business as usual? You got to lock a table; that'll happen.

KYLE: Yeah. Most of the time, I throw that back on the engineers. But if it's been locked for, you know, I've got a query that'll look for any locks that are over five minutes. And if it shows up in that query, I think we've got an issue.

MIKE: Makes sense, long-running locks. A good lock is a short lock, yeah.

BILL: There are a few preventative parameters that we could be using in Postgres that we're not, that can prevent idle transactions from hanging around too long, statements that take too long, and can log and notify when some of these things happen. It's one of the things I'm going to be talking to the engineering managers about in the near future.
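Bill doesn't name the parameters here, but in Postgres the usual candidates for this kind of prevention look something like the following; the values are placeholders, not recommendations:

```sql
-- End transactions that sit idle while holding locks:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';

-- Cancel statements that run too long:
ALTER SYSTEM SET statement_timeout = '30s';

-- Give up quickly instead of queuing behind a lock forever:
ALTER SYSTEM SET lock_timeout = '10s';

-- Log anything slower than a second so it can be investigated:
ALTER SYSTEM SET log_min_duration_statement = '1s';
```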

Just to finish off the theme of the B-tree indexes, there are three other flavors. One of them is covering. It's kind of an interesting name. I prefer to call them payload indexes, but Postgres calls them covering. And that is where you index a column or columns that you want to match on or to quickly narrow down your matching data. But you also include a couple or more columns that are part of the select list. You're not necessarily matching on them, but they're part of the data that you're looking for.

And by doing that, you can potentially get what are called index-only scans, where the query doesn't even have to touch the table. It's able to satisfy everything the query wanted in its WHERE clause and everything it wanted in its SELECT clause, just from the index columns and the payload in the INCLUDE portion of the index. So, those are called covering indexes.
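A minimal sketch of a covering index, using hypothetical table and column names:

```sql
-- Match on customer_id, but carry status and total in the index
-- so the query below never has to visit the table itself:
CREATE INDEX orders_customer_covering_idx
    ON orders (customer_id) INCLUDE (status, total);

-- Can potentially be satisfied by an index-only scan:
SELECT status, total FROM orders WHERE customer_id = 42;
```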

Another flavor of B-trees are called partial indexes or conditional indexes, and that is where you get to use a WHERE clause in your index creation. And that is where you only index a row if the row matches a certain condition that you have. And that can be valuable when you have a 700 million row table and only 5 million of them match a certain criteria, and those are the only rows of interest to you anyway. So, you'd only index those 5 million rows that match that criteria, that way, you're not indexing 700 million rows, and 695 million of them are a waste.
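A sketch of the partial-index idea, again with hypothetical names:

```sql
-- Index only the rows of interest; the hundreds of millions of
-- rows that don't match the WHERE clause never enter the index:
CREATE INDEX orders_pending_idx
    ON orders (created_at)
    WHERE status = 'pending';
```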

Finally, we have function-based B-tree indexes, and these are used where you know that your access pattern needs to compare the column after the column has been manipulated by a function. The more commonly used example is where you want to compare a given search term that was obtained from a field in a web app or a mobile app, looking for a matching email, and you don't want to deal with all sorts of possible email variants that the customer might have fat-fingered into the database. And so, you want to normalize the data.

Ideally, you'd normalize it before it gets written, but let's say you didn't. And so, you want to wrap the email column with a lower function. Well, now you've just excluded yourself from using the index on the email table or the email address column, I should say, because you wrapped it in a function. But you can index the application of the lower function on the email address column, and that's called a function-based B-tree index.
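The email example Bill gives would look roughly like this (table name is hypothetical):

```sql
-- A plain index on email can't be used once the column is
-- wrapped in a function, but an index on the expression can:
CREATE INDEX customers_lower_email_idx
    ON customers (lower(email));

-- This comparison can now use the function-based index:
SELECT * FROM customers WHERE lower(email) = lower('Jo@Example.com');
```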

And there's all sorts of functions. Eddy and I were exploring the use of full text search in merchant portal, and that requires a call to the to_tsvector function. And to make that quick, you would want to create an index on the to_tsvector of the textual columns that you're full-text searching or allowing a full-text search upon.
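As a sketch of the full-text case (a hypothetical table, not the merchant portal schema):

```sql
-- Index the tsvector form of the text so full-text queries
-- don't have to recompute it for every row:
CREATE INDEX products_fts_idx
    ON products USING gin (to_tsvector('english', description));

-- The query must use the same expression to match the index:
SELECT * FROM products
WHERE to_tsvector('english', description)
      @@ to_tsquery('english', 'wireless');
```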

There were a couple of things I wanted to cover, some dos and don'ts, some gotchas about indexing. Again, I mentioned this in 2024, but I want to mention again to anyone who's tuning in to the podcast.

The first one is that you should index each key. Now, you don't have to worry about primary keys or unique keys. If you, in your data model, are declaring a certain column or a combination of columns to be your primary key or your unique key constraint, the database will automatically create an underlying index to support that uniqueness check.

Now, the foreign key constraint, you should also index by default. Some of those don't end up getting used, so we can clean those out, but they should be indexed. And the database does not index a foreign key constraint automatically. And that's why that's one of the things I'm checking for when I'm doing data architecture reviews.
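A minimal illustration of the point, with hypothetical tables: the foreign key constraint and its supporting index are two separate statements in Postgres.

```sql
-- Declaring the foreign key does NOT create an index on it:
ALTER TABLE order_items
    ADD CONSTRAINT order_items_order_fk
    FOREIGN KEY (order_id) REFERENCES orders (id);

-- The index has to be created explicitly, so joins and parent
-- deletes don't have to scan the whole child table:
CREATE INDEX order_items_order_id_idx ON order_items (order_id);
```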

Just a little anecdote to go along with this. One company I went to work for in 2019, when I walked in the door, they had multiple dumpster fires in their flagship Oracle database. They had had a data architect up until 2015. They had been doing without one since then and had lost sight of a couple of best practices, one of them being indexing your foreign keys. It turns out that they had two primary causes for all their performance issues, and the biggest was the lack of foreign key indexes.

You know, the system had been evolving and growing; features had been added; columns had been added. And, over time, they had added 53 columns that were child columns logically related to parent tables, and none of those 53 columns had indexes on them. And that's normally not a huge problem if you're not querying on those columns; you're not joining on those columns.

But if you try to delete a row from a parent table that, through a foreign key constraint, is related to data in a child table, when you go to delete that parent row, the database has to scan through, ideally in an indexed manner, all the child tables related to it, to determine whether or not it can safely delete the parent row or whether it's going to create orphans. If it's going to create orphans, it says, "I can't. There are child rows that still pertain to this parent value."

Well, at this company, they were trying to adhere to the GDPR regulations because they had customers who had employees in Europe. And when those employees would leave, GDPR says, "You should be able to request that all your user data be removed." Well, because all of these foreign keys had been added without supporting indexes, their attempts to remove user data had been getting slower and slower.

The last time they'd been able to run it had been two or three years before I got there, and it had taken 24 hours to remove a single user, and then they just gave up. When I got in there, we added the missing foreign key indexes, and immediately, we were able to catch up on 2.4 million user deletion requests in two hours. From 1 user taking 24 hours to 2.4 million in 2 hours. Indexes can make a huge difference.

So, what else should be indexed? Index each column used in filters, otherwise known as WHERE clauses or predicates. Index each column used in a join. And if the join is to a multi-column key, that's when you want to index the columns together, of course.

If you have a multi-column key and a particular flavor of query that you're sending at this table doesn't use the first column in that key, but it does use the second column, that's not a problem in Oracle because they have a feature called skip scanning. I'm not sure how exactly they implemented it, but it can skip over that first column and use the index through the second column in the multi-column key or multi-column index.

It turns out that many users of Postgres have wanted that for many years, and it is now a native feature as of Postgres 18. So, that was some good news I wanted to share with you. We're currently on 17.4, but I imagine that 18 is not too far off for AWS.

What should be avoided? Over-indexing. Back when I thought this would be a presentation, I wanted to demonstrate that we have a number of tables in our systems that have well over 25 indexes. And one I was looking at the other day has 31 indexes on it. And of those 31 indexes, 15 have never been used.

MIKE: So 50% basically.

BILL: Yep. There's a whole lot of cleanup that we could be doing. So, that's why it's important to monitor indexes over time to make sure that you're not leaving a bunch of crufty indexes around that aren't touched.

Let's see. Avoid indexing a column more than once in the leading position of indexes on the same table, and we have a lot of that going on as well. Don't index columns with very low cardinality. So, if you have a table with a hundred million rows, you wouldn't want to index the active flag column, where 50 million are Y and 50 million [inaudible 50:54]. That's not going to do you any good to index that.

Avoid indexing mostly null columns. We talked about that when we were talking about partial indexes, where you can use a WHERE clause to avoid indexing those columns that are mostly null. And avoid indexing columns that are heavily updated; that one involves some trade-offs and understanding of your system.

WILL: So, what's the drawback to, like, if I have a column, let's say, you know, like, I don't know, date of birth or something, right? And I don't want to have an index off of date of birth twice. So, I don't want to have an index, like, date of birth and, like, zip code, and then also date of birth and phone number area code, I don't know, whatever, you know what I mean? Like, I don't want to have that.

If I understood what you're saying, like, correctly, I don't want to do that. I don't want to have date of birth in X, and then date of birth in Y, and then date of birth in Z. What's the issue, or what's the correct way to approach that, right? Because I could think of scenarios where that'd be relevant.

BILL: Yeah. So, let's just use some aliases for some columns in a table. If you looked at your indexes and you see an index on A, another index on A comma B, and another index on A comma B comma C, you would want to get rid of the first two and keep the third one because that satisfies all three. If you instead had looked at your indexes and you had index on A, index on B C A, index on D F G A [chuckles], that's where you really need to understand your system, your queries, which queries you use most frequently. Do you go ahead and, you know, allow all of them?

I don't have any really astute advice there other than do your homework and understand that, you know, if A is being used as the leading column in 3, 4, 5 indexes, it's very likely that a few of them can be eliminated. Sometimes, though, they can't, you know, like in that example where you're seeing B C A or D F G A. You may need to keep all of those around to satisfy some query-specific access patterns. In our systems, we do have a lot of instances where we have an index on A, an index on A B, and an index on A B C, and those first two can be eliminated. We've got a lot of instances of that.
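Bill's alias example, sketched out (t, a, b, and c are placeholder names):

```sql
-- The first two indexes are redundant: any query the one-column
-- or two-column index could serve can also use the three-column
-- index, because a is its leading column.
CREATE INDEX t_a_idx     ON t (a);        -- redundant
CREATE INDEX t_a_b_idx   ON t (a, b);     -- redundant
CREATE INDEX t_a_b_c_idx ON t (a, b, c);  -- keep this one

-- An index led by a different column is NOT redundant:
CREATE INDEX t_b_c_a_idx ON t (b, c, a);  -- serves different queries
```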

A common mistake, and one that I frequently make as well, even when I'm doing the reviews, even though it's the...I think it's the second or third bullet point in the checklist. It says to make sure that your table has a natural key on it, which is a unique key constraint, unless duplicates are expected and welcomed. And even when I'm doing reviews, even though I wrote that list, even though I try to live by it, I still forget. When I'm looking at table designs, if I see a primary key, my mind says, yep, it's good. And I tend to forget to put a natural key on it to make sure that duplicates can't accidentally slip in there. So, that was something I wanted to get across.

Another little tip in Postgres is to make sure you're using the keyword CONCURRENTLY on large index creation and rebuild, so that we aren't locking things up. And make sure that you test before and after index creation to make sure that you're getting your intended results.
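The CONCURRENTLY tip, sketched against a hypothetical table:

```sql
-- CONCURRENTLY builds the index without taking a lock that
-- blocks writes; it's slower and can't run inside a transaction:
CREATE INDEX CONCURRENTLY orders_created_at_idx
    ON orders (created_at);

-- The same keyword applies to rebuilds:
REINDEX INDEX CONCURRENTLY orders_created_at_idx;
```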

And that is the end of what I wanted to say.

KYLE: So, my question...I feel like a lot of this, of course, comes from the viewpoint of a software engineer, right? And we've kind of discussed, you know, generally, indexes are good, with, you know, more wins than losses. But I'm also very aware that, from a software engineer's standpoint, infrastructure is free.

So, as someone who cares that infrastructure is not actually free, at what point would somebody on the infrastructure team start getting nervous or start questioning the amount of indexes that we're adding? Because I assume this isn't going to be free. This is going to impact CPU, memory, I/O. And then the one that I'm thinking about the most, correct me if I'm wrong, but this will elongate the time that, like, a vacuum will run, right? And that's always a hidden cost under the hood when an autovacuum kicks off in the middle of a query issue.

BILL: Yeah, unless you're going nuts with indexes like we are with some of our tables...Because, honestly, the most indexes I'd ever seen on any table before I came here was 17, and I thought that was crazy. And we have a number here that have 29, 30, 31. So, unless you're going nuts with index creation, you're generally not going to see a big drawback. The exception to that is when you start to get to massive scale: billion-row tables with lots of indexes on them. Now we've got to do a cleanup. For some reason, we need to do a VACUUM FULL, or we need to do a pg_repack.

In both of those cases, it has to create a copy of that table and all of its indexes before it swaps them at the last second. And so, whatever space that massive object is occupying, let's say the table is occupying two terabytes, you now need to have double that space in order to make that operation even work. That's where...massive scale is where things start to really show up and matter in cost.

KYLE: Okay, so large, large databases is when you're saying it'd become a problem, okay. [inaudible 56:26]

BILL: I typically don't notice the blips until the table and its indexes are occupying more than, say, 200 gigabytes. That's when I start noticing. That's when I start feeling the pea underneath the mattresses.

MIKE: I appreciate all the deep dive, you know, and the feedback, you know. You came prepared with this list of things, and we've been grilling you on specific use cases to get down into the gritty details. I mean, there's going to be more, right? We could go on forever. But is it mostly just about following the rules that you've mentioned, and then you cover almost everything, and then the weird cases, well, they're going to be weird?

BILL: For a relational database engine, yeah, I think I've covered most of the tips and tricks. So, if one can get good at the things I've talked about today, I think you can call yourself a full stack developer [laughter].

MIKE: People will call themselves a full stack developer.

[laughter]

BILL: The reason that I kind of chuckle at that is because since about 2010, most of the students that I've seen applying for positions that I've been hiring for have maybe done a hundred thousand row table on Mongo in a course in college, and they're calling themselves a full stack developer. I think they need to be hardened by database scars before they can call themselves a full stack developer.

WILL: I think you should be able to build a mobile app, full stack developers. I see you all on your phones.

MIKE: [laughs]

WILL: Nobody knows anything about it though [laughter].

MIKE: Yeah. Then you've got to build the frontend and the backend.

BILL: Well, thanks for having me on your podcast.

MIKE: Yeah, thank you, Bill. I really appreciate it. You know, I started by talking about the importance of indexes and how they, you know, transformed our modern world, before we did the deep dive. Maybe I'll come back to that as we sign off.

We got deep into technical details, and it's easy to think, oh yeah, you know, I'll worry about that sometime. But as Bill said, you know, you pay attention to these things. You go through your checklist, and then you don't end up with a table you can't delete rows from for years [laughs] because it's not possible. It's like hygiene and conscientiousness. It's brushing your teeth, and if you do that, your teeth are healthy. You end up having a much better life and far fewer calls at 3:00 a.m.

Thank you, and until next time on the Acima Development Podcast.