Episode 95

What Do Data Engineers Do?

April 1st, 2026

1 hr 2 mins 40 secs

Your Host

About this Episode

This episode explores the role of a data engineering team within a company and how it differs from traditional application development. While app developers focus on performance and real-time systems, the data team is responsible for collecting, syncing, and organizing data from many sources into a central warehouse (like Snowflake). Using tools such as Fivetran, data is continuously pulled from dozens of systems and stitched together into a unified view that business users, analysts, and dashboards can actually use. A major challenge discussed is how microservices (great for engineering) create fragmented data that must be carefully reconstructed to tell a complete story, such as the lifecycle of a customer or lease.

A large portion of the conversation focuses on “data transformation,” which is the process of turning raw, scattered data into meaningful insights. This involves complex pipelines of queries and scripts that combine, clean, and interpret data across systems. The speakers emphasize that this work is far from simple—it requires deep understanding of both the data and the business context. Done well, it enables decision-making (like tracking revenue trends or customer behavior), but done poorly, it can lead to incorrect conclusions that impact the entire company. They compare transformation to cooking or even building a rocket: the output is fundamentally different from the raw inputs, and small mistakes upstream can cascade into major issues downstream.

The group also discusses practical challenges in data modeling, system design, and collaboration between teams. Topics include the tradeoffs of normalization, handling schemas across evolving systems, and frustrations like poorly defined enums or lack of communication when engineers change databases without notifying the data team. Security is another key theme, especially around controlling access to sensitive data (PII) and preventing misuse. Ultimately, the episode highlights that data work sits at the center of the organization: it depends on upstream engineering decisions and directly influences downstream business outcomes, making clear communication, documentation, and thoughtful design essential as systems scale.

Transcript:

DAVE: Hello and welcome to the Acima Developers Podcast. We've got a fun group today. I've got Eddy. We've got Kyle. We've got Thomas. We've got Mike and Justin. We've got Bill, and we've got Zach. Now, Bill and Zach are infrequent. Bill's our DBA, and Zach is the...what are you? The head of the data team?

ZACH: Technically, my title is Senior Manager, Data Architecture and Governance. But that's a fancy way of saying that I am heading up a data engineering team. Yep.

DAVE: They made you widen the column size to fit that job title in.

ZACH: Yeah, pretty much.

DAVE: Yeah. Yeah. So, for people that don't know, I've been at Acima for almost five years, six years. I don't keep track of numbers. I worked in engineering for a couple of years, then I went over to work with Zach on the data team for a year. And then he got rid of me and sent me back to engineering. And I've been back over here for, like, a year and a half now.

And I think it's really, really fascinating the different ways the teams work. Like, app dev focuses on latency, and we love to do everything with compute, and we're very scarce with storage. And the data team is kind of the other way around. You've got the great big warehouse. Storage is free. Compute is crucially expensive. It's like, you've got a table that has all the integers in it, and you look them up by ID because you can't calculate anything. That's a joke.

But people don't believe me when I tell them you have a days table that is literally every day from 1970 forward. We don't want you to calculate the name of the day of the week. Just look it up in the table. We don't want you calculating the first letter of the day of the week. That's a separate column in that table, right?

ZACH: Yeah. I don't think that that table was originally built for that reason specifically. I think a lot of people used it for that reason. There's a lot of really good days logic built into, like, Snowflake, Redshift, and all of the warehouses. However, when Acima first started, warehousing was a little bit newer, and so maybe a lot of those functionalities didn't exist.

Now it's more like, what's a holiday [laughs]? And that's the main reason we're using that table is, what is a holiday? And that table is not always the most accurate on what a holiday is, either. But it's way more accurate than if we didn't use it [laughs]. And it's a data source that my predecessor exported from somewhere a decade ago and runs all the way through, like, 2060. So, I'll probably never adjust it, you know. It’s just --
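The days table Zach describes is a classic date-dimension table. A minimal sketch of how one might be generated (the 1970-to-2060 range comes from the conversation; the `is_holiday` flag here is a placeholder, not the real holiday logic):

```python
from datetime import date, timedelta

def build_days_table(start=date(1970, 1, 1), end=date(2060, 12, 31)):
    """Generate rows for a simple date-dimension ("days") table.

    Each row precomputes attributes so queries can look them up
    instead of calculating them: the day-of-week name, its first
    letter, and a holiday flag (left False in this sketch).
    """
    names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
    d = start
    while d <= end:
        name = names[d.weekday()]
        yield {"day": d, "day_name": name,
               "day_initial": name[0], "is_holiday": False}
        d += timedelta(days=1)

# Peek at the first row: Jan 1, 1970 was a Thursday.
first = next(build_days_table())
```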

DAVE: That was going to be my question, so when do we even run out of days?

ZACH: It doesn't matter to me. It'll be long after I've, you know --

EDDY: Is that only taking into account local holidays, or now that you're considering, like, international growth, like, does the table also consider international holidays, or is it only local?

ZACH: It's not been updated to consider international holidays. We don't have to do a ton with holidays on the data team. Really, that's going to be on our production systems, right? Like, we are consumers of data. We are not...Well, I mean, we generate data, too, but we're mostly consumers of data. If you look at the flow in, it's mostly data coming in.

So, it's really important for, like, LMS to understand what a holiday is in every single country that they're in. Not as important for the data team because the events that should not happen on holidays, there should be no data for because they didn't happen, right? But no, I've not expanded that table for, like, Mexico or Canada or any other country. It's just U.S. And even then, like I said, it's not fully accurate.

DAVE: I remember when I started here, we had no plans to go outside. We were just U.S. company, and so don't worry about it. And businesses pivot and grow. Zach, I got a question for you. I jumped straight into some detail, but I don't think a lot of people know what a data team does. We were talking about this in the pre-call. Like, the DBA does the architecture, but you guys...you said CrossFit.

I work on Merchant Portal. My job is to help keep the merchants happy so that they can give leases to customers and get the product out the door. That's an application database written in Postgres. Where does my data go after, you know, like, every night, what happens to my data? What do you do with it, and who do you give it to, and what do they do with it?

ZACH: Yeah, so that's a loaded question. Every 15 minutes, it syncs to the warehouse. We use tooling for that. That tooling is Fivetran. They're a great company. They have a bunch of people like me and smarter than me focusing on just, how do we sync data from data source to Snowflake or Redshift or a data destination, basically? So, it's the best way, in my opinion, to sync it. We used to have an in-house solution. It would miss data. We didn’t focus on it a lot because we have a bunch of other stuff. So, now it syncs into the warehouse.

And especially in a system of microservices, which I know are great for software engineers, they're terrible for data engineers because the next piece of the puzzle is I have to stitch all that data together. A lease record, for instance, or really any record, is not going to be wholly in one service. So, now I need to create transformation tables so that our business users, our end users, our BI analysts, and the people viewing their dashboards can see the holistic view of the lease.

Because, as you know, there's a certain point where Merchant Portal just doesn't care about it anymore, and it moves on to LMS. And then LMS doesn't necessarily care about all the nitty-gritty of what's happening behind the scenes in all the other microservices for, like, payments or anything like that. So, we really become the place where we're stitching that together. In the last count I had, I think there's 68 Postgres databases syncing into the warehouse today.
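The stitching Zach describes can be sketched as a join across synced schemas. The table and column names below are invented for illustration, not the real Acima schemas:

```python
import sqlite3

# Two services each hold part of a lease's story; the warehouse
# stitches them together into one "transformation table" view.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE mp_leases  (lease_id INTEGER, merchant TEXT);
    CREATE TABLE lms_leases (lease_id INTEGER, status TEXT);
    INSERT INTO mp_leases  VALUES (1, 'Acme Furniture');
    INSERT INTO lms_leases VALUES (1, 'active');
""")

# The holistic view of lease 1, spanning both source systems.
row = con.execute("""
    SELECT mp.lease_id, mp.merchant, lms.status
    FROM mp_leases mp
    JOIN lms_leases lms ON lms.lease_id = mp.lease_id
""").fetchone()
# row -> (1, 'Acme Furniture', 'active')
```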

DAVE: Wow.

ZACH: We do not care about all of them [chuckles], to be frank. We do care about around 30 of them, and we use them for transformations. And then there's a bunch of just, like, batching, right? Like, I don't want, and you guys don't want, nobody wants the production customer-facing services spinning up jobs in the middle of the night to grab thousands or hundreds of thousands of records to throw them in a CSV and shoot them off to, like, a company that needs that information, right? Like a third-party company, maybe that we integrate with.

And so, the last time I recorded, there was something like 50 third-party integrations that we're also handling. That data will go into those companies; data's coming out of those companies. Maybe the data goes into those companies in real-time events through the production consumer-facing services, but I am siphoning them into the warehouse so we can start to see, like, is this third-party company worth using? What are the effects that we're having here?

Or maybe those companies are enriching our data, and then we look at that on the back end, and we let that adjust business decisions. And so, all that's got to come together in a singular place. And it's a lot. Like, the last time I checked, it’s...I keep saying, “Last time I checked,” I don't watch this like a hawk. But we had, like, 13 and a half thousand tables in the warehouse. So...

EDDY: So, Zach, you mentioned something interesting, and I kind of want to elaborate a little bit. So, you said you have about 60-plus tables that have data, but you only care about half of them. What's the point of us --

ZACH: 68 schemas. So, like, Merchant Portal is a schema. Merchant Portal has, like, 218 tables. I care about those 218 tables, right, or however many it is.

EDDY: What's the point of, like, writing into a warehouse if you don't care about that data? Like, what's the benefit? Even though you don't care about it, it's still valuable to receive?

ZACH: Yeah, so there's a couple of things. Like, when I say I don't care about it, I'm not running transformations on it. It's not being used for business.

DAVE: So, you want the data, but you don't have to mess with it.

ZACH: Yeah, I'm a data engineer at heart, which makes me a data hoarder. I want all the data [laughter]. I want every last scrap of the data.

However, a huge use case that we did not have until moving to Snowflake is now we have a place where the software engineers can go in and look at the data in a 15-minute lag and start debugging, right? Like, think of console access to production. It's insanely limited, and it should be, and most people shouldn't have it. But now you can get a user inside of Snowflake, and I will let you see the production data in a 15-minute lag for debugging purposes. And that's massively huge, even for all those schemas that I'm not transforming on and the business doesn't want to see.

JUSTIN: So, I just want to give my two cents on this from a security point of view. I have a colleague whose name is Dan Hamilton. He said, “Data is the most...” well, let me rephrase that. PII data is the most toxic data that you can have in a system. So, anytime that you're, like, propagating that, whether it's to Snowflake or to any of those other systems, it's something that you got to think about in terms of who has access and how long they have access, and is it auditable, and everything else like that. So, it's an interesting point of view because data is awesome, but data is also, you know, it's what makes a company valuable. And if that data gets exfiltrated, that's something you've got to be concerned about.

Unfortunately, I've got to drop. But something that's, like, bread and butter for me every day is just like, hey, who's playing around with data? Who has access? And are there ways that it could be exfiltrated? And so, you've just got to keep an eye on that, so...

ZACH: Thanks, Justin.

DAVE: Very cool.

JUSTIN: Thanks, guys.

DAVE: Thanks, man. Take care.

ZACH: To expand on that real fast before we move on, that's an argument that I have a lot here, and that's why the structure is the way that it is for the teams here that are used to it. Mike, I ran this past you, right? Like, the way for security for data is limitation, right? And everybody wants access to more.

MIKE: Yes.

ZACH: And you have to draw a line somewhere. You can't just give everybody access to everything. And so, we have those lines drawn here, and we stick to those lines. Not everybody likes it, but it's what you have to do to try to keep your data safe, so...

MIKE: Well, that's an interesting point. I have access to some of the raw, untransformed data, but not necessarily other transformed data. And sometimes people from the BI team will say, "Oh yeah, go look at this table." Like, well, no, I don't have that one. But we can usually work things out.

I, about a week ago, was helping debug something and was pulling in data from three different databases, you know, from different systems, logging from mobile app, and stuff from Merchant Portal, and over from our contract funding, and tying it all together in this amalgamous stuff, which ended up being crazy helpful, and the mobile team needed that. So, I had enough. But, you know, I think it's the right choice. Keeping the privileges limited, sure, it's a pain. But you know what's even more painful? Giving somebody a privilege they really shouldn't have and having them abuse it.

ZACH: Exactly.

EDDY: It’s actually --

DAVE: We base this not on the value of getting it right, but on the price of getting it wrong, right?

EDDY: I was going to say...I'm so sorry.

DAVE: It’s all right.

EDDY: I was going to say it's actually made my life a little easier because I used to have access to even tables from other teams, right, from where I worked on. And so, when that got presented and said, “You're only going to be given access to the immediate team that you're working on, and that's it,” it was kind of bittersweet. I'm like, well, that sucks. Like, I want to be able to look at other data, and it makes my job easier.

What actually made it easier was me saying, "I don't have access to that data. Give it to me, [laughs]” and then we'll figure it out later. And so, it ended up being, like, a blessing in disguise in a sense, where I'm just like, well, now that I don't have access to the data that you're asking for, I could just punt and say, "Hey, ask this person. Once that person gives it to me, then I'll answer your question." But --

ZACH: And you can do it that way. The other thing is, like, there's a certain level and above that has this elevated access that Mike's talking about. And that was a lot of pushback that I think we got. "Well, there's going to be a bottleneck." Well, I haven't seen that be the case, actually, right? There are people on your team before you get to Mike that can do those cross queries. You just happen to not be one of them.

BILL: Zach mentioned earlier that he has to stitch together data from a number of systems just to be able to compose a whole picture of certain entities, like a customer. We were talking about that the other day, how one of the guiding principles I teach in my modeling classes is that duplication is evil. Try to avoid it at all costs unless you absolutely have to. And, unfortunately, microservices encourage duplication a lot.

And there are times when I really miss monolithic systems. If you needed to debug something, it was all in one place. You could stitch together. You didn't have to wait for data to sync. It was just there. But, obviously, there’s some benefits to some microservices as well. You mentioned CrossFit earlier. I'm thinking data engineers are more like craftsmen, plumbers, and chefs.

ZACH: We had a member on the team that wanted to change our team name to The Data Plumbers because he thought about, like, the pipelines that you're putting together. Some of the team wanted to be Data Wranglers, and that was outvoted from Data Plumbers [chuckles]. I'd say CrossFit with data because that was a popular thing when I started becoming a data engineer. And it makes sense, right? We pick up data here. We put it down over here.

The thing I didn't get into with a lot of people, especially the non-technical people, is all the transforming and the difficulty that comes behind that, right? Like, you're working inside of a software application, and you're working with row-level data. You just have to know that you're working with this customer maybe, and this item, and that's what matters there.

You get into, like, data engineering, well, I might be writing a query that affects millions of people, millions of items. And I need it to be extremely performant because I can't be running 18-hour queries against the warehouse. There are people that do that [laughs]. And so, then I have to also work with them on how to not do that.

But yeah, so, it really becomes, like, an idea of understanding the compute, how the memory on that compute works, how to narrow down your scope as much as possible. And when you do narrow it down, you know, there's window functions. There's a bunch of compute options on data that could slow you down. And how do you effectively do that?

And then just understanding because, like, when we talk about warehouses or compute, right, it's actually a cluster of machines, and they all have their own different tasks. So, like, having an understanding of that and how your data flows through those is extremely helpful, too, not entirely necessary. You can do a lot of damage on a warehouse without knowing that and still be just fine, but it helps to understand how all those flows happen.

DAVE: That's actually a good difference between app dev and data engineering, is that on the application side, the thing we never want to see is a query go out without a limit. Like, we don't want you to say, you know, "Select first name from applicants semicolon" like, that's going to burn the whole freaking table from top to bottom, like, all the way down. And then I got to the data team and, like, Casey, who worked...is he still over there? He [inaudible 16:54] the whole team.

ZACH: Yeah. So, he left quite a while ago. He's back again working with Rob. So...

DAVE: Awesome. Very, very sharp guy. But I remember him sitting us down and saying, "Please don't ever do select star from table, even limit one, because every column is in a different server, and you just spun up the entire data center to get one row of data."

ZACH: Yeah, it's not really a different server. Like, think of a disc, right? And I know we're on SSDs now, and those are awesome. But things are still stored in different places on them, and you have to go find them, right? But think of a spinning disc. And if you think of a spinning disc and you think of, like, a Postgres system, or a MySQL system, or these row-level systems, one file on that disc is that entire row.

So, when you do "select star from table where ID equals 10," it only has to go one place on that disc. But if you do that to, like, one of my transformation tables that has 250 columns, it has to go find all 250 files, count down x amount of numbers so that they match across those 250 files, and then stitch that back together because it's all columnar instead of row-level, right?

And that's why it can be really fast when you do summarizations because you go to one place, find that file, and then sum it, right? Or even when you limit it a little bit, you go find three different files, figure out the line numbers you care about, pull them out from the other two files, and then summarize that, do your group-bys or whatever. So, those operations are really fast, where those same operations on, like, a row-level system are really slow because now you're doing the opposite. You've got to go find all these row-level files, and then pull the right column out of it, right?
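Zach's row-versus-columnar point can be illustrated with plain Python data structures. Real warehouses add compression, metadata, and micro-partitions on top, but the access pattern is the same idea:

```python
# Toy illustration: a row store keeps each record together ("one file
# per row"); a column store keeps each column together ("one file per
# column"). Data is invented for illustration.

rows = [(1, "Ada", 100), (2, "Bo", 250), (3, "Cy", 175)]   # row store
cols = {"id": [1, 2, 3],                                    # column store
        "name": ["Ada", "Bo", "Cy"],
        "amount": [100, 250, 175]}

# Point lookup ("WHERE id = 2"): one row file vs. stitching the same
# position back out of every column file.
row_store_hit = next(r for r in rows if r[0] == 2)
i = cols["id"].index(2)
col_store_hit = (cols["id"][i], cols["name"][i], cols["amount"][i])

# Aggregation ("SUM(amount)"): the column store touches a single
# file; the row store must scan every row to reach that column.
col_sum = sum(cols["amount"])
row_sum = sum(r[2] for r in rows)
```

Same answers either way; what differs is how many "files" each layout has to touch for each kind of query.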

And that's why warehouses are incredible for, like, analytics, but you wouldn't want to point any of your applications at the warehouse, at least not unless you're paying for Snowflake's...they've got this new thing; it's pretty cool. They'll store all the data in the table, right, and you can point your application to it, and it's row-level data. I read something about it. I don't know where it's at, but it's kind of a cool little idea.

DAVE: I think I checked will it fit in RAM a couple of weeks ago, and I think they're up to, like, 128 terabytes now will fit in RAM. It's not cheap, but we could make it go.

BILL: How many of you are aware that Snowflake doesn't even have indexes, well, not the ones that we're used to?

DAVE: I just figured it was magic.

BILL: [chuckles] It looks like it.

DAVE: So, when I was on the data team, what I discovered is, you can do, like, a 75-table join, and it will come back in, like, two and a half seconds. And you can say, "select first name from an applicant, limit one," and it takes two and a half seconds because it's got to go through all the military-grade, weapons-grade query planning. How do I distribute? Oh, just one. And then once it's done all that selection, then, oh yeah, here's your data. That was [inaudible 19:52] to bring one piece of data, one teaspoon over.

But when you say it's not indexed, is that because the data's organized, like, almost, like...I’m going to say physically, but you know what I mean, like the spinning disc, like, partitioned out differently to be pre-indexed?

BILL: That was the teaser. I was hoping Zach was going to expound on that.

DAVE: Oh, dang it.

ZACH: Sorry, what was I expounding on? I was looking up and fact-checking myself, trying to find [laughter] the row-level thing that I had mentioned, and I can't find it. So, maybe I dreamed that, but --

BILL: Yeah, I teased the audience with the --

MIKE: He mentioned that Snowflake wasn’t indexed. Yeah, go ahead.

BILL: I was teasing the audience with the factoid that, in Snowflake, you don't have to worry about designing indexes for your tables.

ZACH: Yeah, no, I was on a call with them one time, and they said they probably do it better automatically than we will. At Redshift, you had to do compound indexes, sort keys. Actually, they weren't really indexes; they were sort keys, right? You can put indexes, like, you can do it if you need to.

And we've found a couple of tables that probably make sense for us to figure out what we would rather have it sorted by. And they're not necessarily considered, like, it's not like, "create index" inside of a warehouse. It's like, "sort it by this," because then when you query it by that, so you sort it by date, and you have, like, thousands of dates in there, and you're just looking for these six months, then they're all going to be in the same area of the file. And it gets an idea of where that's going to be. So, they're more like sort keys, and you can do it in Snowflake. It's just that we don't at all.
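The sort-key idea Zach describes, roughly: if each chunk of a table records the min and max of its sort column, a date-range query can skip chunks entirely instead of scanning them. A toy sketch with invented partition metadata:

```python
from datetime import date

# Hypothetical "micro-partition" metadata for a table sorted by date.
# Sizes and boundaries are invented for illustration.
partitions = [
    {"min": date(2023, 1, 1), "max": date(2023, 6, 30),  "rows": 1000},
    {"min": date(2023, 7, 1), "max": date(2023, 12, 31), "rows": 1000},
    {"min": date(2024, 1, 1), "max": date(2024, 6, 30),  "rows": 1000},
]

def partitions_to_scan(lo, hi):
    """Return only the partitions whose date range overlaps [lo, hi]."""
    return [p for p in partitions if p["max"] >= lo and p["min"] <= hi]

# A six-month query touches one partition out of three; the other
# two are pruned using metadata alone.
hits = partitions_to_scan(date(2024, 2, 1), date(2024, 5, 31))
```

If the table weren't sorted by date, the same six months could be scattered across every partition and nothing could be skipped.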

BILL: In Oracle and Postgres, that same sort of thing is called a cluster, where the data is ordered and clustered really close together.

ZACH: Yeah. And the other thing, Bill, that I, while I wasn't paying a whole lot of attention, I thought you were mentioning is, like, primary indexes, right? Like, how in a Postgres system you do a primary key, and it's, like, an incrementing number, and you can't duplicate that. Snowflake does not support that either. I could do that, and it could increment. But let’s say I add 1, 2, 3, well, I could go enter 2 back in there, and it doesn't care. It does not enforce those.

BILL: [inaudible 22:03] integrity and primary key integrity and --

EDDY: I'm so glad you guys are the ones that have to deal with data and not me [laughs].

ZACH: And if you go and look through a lot of our tables, our primary keys are actually multiple columns, right? A lot of times, our primary keys are not just one column, like an ID column. Our primary key will be, like, lease number, date, and then something else that makes that table unique. And we enforce that through code.
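Since the warehouse won't enforce that composite key, the uniqueness check ends up in pipeline code. A minimal sketch, with illustrative column names (not the real schema):

```python
# Warehouses like Snowflake accept primary key declarations but do
# not enforce them, so composite-key uniqueness is checked in code.

def check_composite_key(rows, key_cols):
    """Raise ValueError if any two rows share the same composite key."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key in seen:
            raise ValueError(f"duplicate key: {key}")
        seen.add(key)

rows = [
    {"lease_number": 101, "as_of_date": "2026-04-01", "source": "mp"},
    {"lease_number": 101, "as_of_date": "2026-04-02", "source": "mp"},
]
# Same lease number twice, but the full (lease_number, as_of_date,
# source) tuple is unique, so this passes.
check_composite_key(rows, ("lease_number", "as_of_date", "source"))
```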

EDDY: So, Zach, I've actually wanted to ask you something really interesting. What are some of your biggest pet peeves that we engineers do that really pisses you off that you wish you could change, but we're so fine-tuned doing our own thing, you know, that it's kind of fighting an uphill battle? You basically are, like, throwing the table and being like, "I'll just work around whatever you guys are doing."

ZACH: I think the biggest one for me is Ruby on Rails has an enum system, right? And this doesn't get used a lot anymore because I fought [laughs] these battles with software engineers. But it just puts numbers in the database, and the references to what those numbers actually mean are only in the code. I'm not a Ruby engineer, and I don't want to go look through 68 different repos to figure out what all these numbers mean. And I don't want to manage a table that maps that for me because when a new number comes along, and I'm not told about it, I don't know what it is. And so, that would be, like, my biggest pet peeve.

And it's not just Ruby on Rails that does it. It's every single ORM has some sort of functionality like that. But, like, Django and Python would do it, too. But you could specify, like, string, string for your enum instead of, like, it being a number, and then the string is only relevant in the application itself. I would say that's, by far, my biggest one that frustrates me when I'm in the warehouse.
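The pet peeve in miniature, sketched in Python (status values invented): the ORM writes bare integers, the meaning lives only in application code, and the data team ends up maintaining a mapping just to make the warehouse copy legible.

```python
# The integer-to-meaning mapping lives only in the application.
# Anyone reading the raw table sees 0/1/2 and nothing else.
STATUS = {0: "pending", 1: "approved", 2: "declined"}

# What the synced rows look like in the warehouse.
raw_rows = [{"lease_id": 1, "status": 2},
            {"lease_id": 2, "status": 1}]

# The transformation the data team is forced to maintain. String-backed
# enums, or a foreign key to an enum table in the source system, would
# make this step unnecessary; an unmapped new number would raise here.
legible = [{**r, "status": STATUS[r["status"]]} for r in raw_rows]
```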

DAVE: Yeah. Well, and, to be clear, like, the BI team, the business guys, come over to you, and they say, "Give me all the leases that have this type." So, they're actually asking you to actionably query on those numbers, right? If those enums were just in that database, you wouldn't care; it wouldn't matter. But you're actually being asked to make intelligent decisions off of those enums, and we'd much rather have an enum table with a foreign key at that point, right?

ZACH: Yeah. Correct, yeah. Like, if you're going to go that route, then in the source system, have a foreign key to an enum table, and I'm fine with that. But since I don't end up with that data at all, because it's just in the codebase, then it creates a need for us to create these transformation tables so that people downstream from me, which there is a lot, right, the whole business is downstream from me. I'm downstream from all the software engineers and all of our third parties, and then there's more downstream from me that is actioning on this data. And so, it causes us to have to do a lot of, like, transformation tables just to make the data legible.

DAVE: We had two tables that had enums that they were effectively the same enum, but one of them started at one, and one of them started at zero. And it was the same three fields: 1, 2, 3 and 0, 1, 2. And there was some parking lot therapy [laughter] where we cornered an engineer, and we explained some things.

MIKE: One thing that...you keep on talking about transformation. And I want to call out we don't want to undersell, "Oh, you're just transforming data. What's the big deal?" I was thinking, if you want to make a rocket engine, well, you just start with some rocks and transform them, right? And you get a rocket engine. That shouldn't be that big a deal, right? You just start with your ore, melt it down, go through some processing. You can build a rocket engine. Well, that's just transformation [chuckles].

ZACH: Yeah, that's a good call out, right? Because, like, I feel like, and maybe if there's any other data engineers listening, or data analysts, or data people, right, like, “Oh, it's just pulling data,” and it’s like, it’s not. It's understanding the requirements of what you want because the hardest part about data is you could have all the right data and make all the wrong decisions if you don't understand it, right? Or if you put it together wrong.

And I was just talking to an analyst today, and he was like, "Yeah, well, people don't understand. It's like, 90% of the job is just making sure it's right and that you've got the right metrics so that the company actions correctly.” And it's the same thing with, like, these transformations, right? Something goes wrong in the transformation upstream where we are, everything downstream is broken. The decisions made are no longer good. Or maybe a happy accident happens, and they're great [laughs]. It could go either way, I guess.

But you're right, like, transforming the data, it's not a simple thing. It just sounds simple because we go high-level when we talk about it.

EDDY: So, what do you mean by transforming data? Like, I understand. For someone who's listening in to this and doesn't have a concept of transforming data, what do you mean by that?

ZACH: Yeah. So, we have multiple sources that a customer can get into our system, right? We have partners. We have a mobile app. We have a website. We have emails that get sent out. We have all these different things. I don't know if you guys are aware of this, but our consumer-facing systems are very bad at telling me where a customer's coming from.

And so, one of the transformations I do is this massive statement where I'm checking across six to seven different systems just trying to figure out where did we get this lease from, right? And that would be, like, a transformation. And those are hard, not only because of the logic that's involved, right, which any programmer is going to understand that logic can be hard.

But, like, you have to have a serious understanding of that data, right? So, you can't just say, "Oh, well, we're just going to plug this big case statement in," or "We're going to do this summarization here." You have to understand what that data is, or else we would be telling everybody the wrong origination.
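The origination transformation Zach describes can be sketched as a priority-ordered lookup across source systems, like a big CASE statement. System names and fields here are invented:

```python
# Check each source system in a fixed priority order and take the
# first one that knows where the lease came from.

def origination(lease_id, sources):
    """Return (system, channel) from the first source with an answer."""
    for system, lookup in sources:          # priority order matters
        channel = lookup.get(lease_id)
        if channel is not None:
            return system, channel
    return None, "unknown"

sources = [
    ("partner_api", {7: "partner"}),
    ("mobile_app",  {7: "mobile", 8: "mobile"}),
    ("website",     {9: "web"}),
]
# Lease 7 appears in two systems; the priority order decides which
# one wins. Getting that order wrong reports the wrong origination
# to everyone downstream.
```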

Another good example of that is there's a very complicated functionality that we have. I won't go into a lot of detail over it, but it essentially has to check every record for every single day that it's open and, like, go in a very specific order because things are changing, and it has to recalculate it, right? And not only does it take a long time, it's one of those ones that needs fixing, but it's extremely complicated and uses a ton of window functions. So, you have to realize that, like, when you're selecting this, you're actually talking about the row behind it, or the row in front of it, or we're summarizing up until this point, or, you know, there's some complication into that as well.
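The daily-recalculation pattern Zach describes resembles SQL window functions: each row's value depends on the previous row (LAG) or a running aggregate (SUM ... OVER) within an ordered set of rows. A toy sketch with invented balances:

```python
# Per-day balances for one lease, already ordered by day.
balances = [("2026-01-01", 100), ("2026-01-02", 90), ("2026-01-03", 70)]

result = []
running = 0
prev = None
for day, bal in balances:                            # ORDER BY day
    change = bal - prev if prev is not None else 0   # bal - LAG(bal)
    running += bal                                   # SUM(bal) OVER (...)
    result.append({"day": day, "change": change, "running_total": running})
    prev = bal
# result[2] -> {"day": "2026-01-03", "change": -20, "running_total": 260}
```

In the warehouse this runs per lease, per day, over millions of rows, which is why ordering, scope, and performance all matter so much.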

DAVE: That's awesome. So, related to transformations, I remember we have a bunch of tables in the warehouse that start with MP, and that's the Merchant Portal side, the data that came from there. We also have an f leases table, right, that's, like, is that aggregated? I know it's got way more stuff on it than we have over in Merchant Portal. Is that just a combination, or a transformation, or both?

ZACH: Both. So, that table is the way that we can allow our data scientists and our business intelligence people to see what a lease looks like across all of our systems that are important to a lease, right? And so, it's also got that functionality that I was talking about, like, where did this lease originate from, right? So, there’s those transformations in there.

And then there's a lot of like, well, okay, Merchant Portal knows until this point, and LMS knows after this point, and, you know, these other systems over here know a couple of other things. Let's put them all in one place so that we can look at this new, transformed leases table, and say, oh, this is everything we know about this lease. To an extent, right, there are some tables that that joins to that helps fill in some gaps.

But, yeah, it's really just the merging of all the microservices, which is why in the beginning of this, I said microservices are great for software engineers, but they suck for data. Luckily, here we have a really good global identification system. I've seen places that don't, and then it gets even harder to get this data together. So, it's easier here than it might be in some other places.

DAVE: It gets fun when you've got a record that has a proxy key that's just your integer primary key auto-increment, right, and a GUID, and a public-facing one because we don't want a customer writing down a 64-byte, you know, token thing, and then something else for, like...we've got a table that's got, like, four different IDs, and it's not stupid. Like, there's a different role for each of those IDs.

MIKE: You’re talking --

ZACH: Yeah, there ain’t much more to comment on that one [laughs], so I got --

DAVE: Okay. [inaudible 31:18] Is that a question?

MIKE: Eddy, you were talking about transformations, like, what are they? I was thinking about cooking. When you're cooking, you combine the ingredients. You can look at the recipe and say, "Oh, well, I'm just combining these things." But what comes out the other end is fundamentally different in character than what went in. Like, sometimes you combine things, and you get something. Well, you say, "It's just made of these things." And chemically, that's true, right? It's just made of those parts.

The outcome, you know, some eggs and flour or whatever, you know, having a cake come out, a cake is a different thing than just a pile of eggs and flour. The combination actually matters. And I think that when you're thinking about that data, the putting things together and maybe performing some operations on them, mathematical things, you know, some summing, some averaging, you're going to get something out the other end that is fundamentally different in character than what you started with.

Zach keeps talking about making decisions. I can look at a list of records. I can't make a decision with that. There's no way I can look at a bunch of tables of records, you know, think about them as just a bunch of spreadsheets, then say, "Oh yeah, I've got lists of customers, and I've got a list of leases." I can't make any business decisions off that. That tells me nothing.

But if you do the right processing out of there, you can see, "Oh, our revenue is going up, or our revenue is going down, and it's because of this thing over here that changed." And that is fundamentally different, even though it starts from the same place, right? You're starting with those ingredients. What comes out the other end really is a fundamentally different thing. And I think that it's important to recognize that. You think, “Well, yeah, I mean, I'm just changing it a little bit. I'm just combining stuff. Does that really make a big difference?" Well, yeah.

If you're thinking about that cooking, you know, a cake really is different than what went into it. Likewise, here where you're doing even more steps, being able to make a key business decision based on some limited numbers is fundamentally different and a critical business function that's completely impossible with what you started with. And it's not a simple step between those. You probably have 50 steps between those in some cases.

ZACH: Yeah, I was going to say, and to follow that up, it's like, we're not talking about, like, oh yeah, a script runs, and there are some transformations, and now you have f leases, the table that David was talking about. What ends up happening is you have 15 to 20 scripts run, and then you get f leases, and then I need 15 to 20 scripts after that to make it more actionable. And they all build off of each other, and there are all these dependencies on these tables, right?

So, it is, it’s a pipeline. You have to think of it as a pipeline, and each step in this pipeline is a script or SQL that's building the next thing that might come into this next table or give us more insights, right? So, --
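The pipeline idea Zach describes, each script depending on tables built by earlier scripts, is a dependency graph run in topological order. A rough sketch with made-up step and table names (the real orchestration is presumably more involved):

```python
# Each step declares the tables it depends on; steps run in an order
# where every dependency is built before the step that needs it.
# All names here are illustrative, not the actual pipeline.
from graphlib import TopologicalSorter

steps = {
    "mp_leases":      set(),                        # raw tables synced in, no deps
    "lms_leases":     set(),
    "f_leases":       {"mp_leases", "lms_leases"},  # merge the microservice views
    "lease_metrics":  {"f_leases"},                 # transformations built on f_leases
    "revenue_report": {"lease_metrics"},            # insights built on those
}

order = list(TopologicalSorter(steps).static_order())
print(order)  # raw tables come first, reports come last
```

This is also why one broken upstream script can quietly poison everything downstream of it: the later steps consume whatever the earlier ones produced.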

BILL: So, I really like the chef comparison earlier. Because, like you were saying...I know you said CrossFit, and I think that's a great one as well. But, for me, I think almost like of culinary arts, right? The structured alignment of these different resources coming together, kind of like what you're saying, Mike. But then also it's an art, right? Because it's presentable. It's got to be presentable to a person that might not understand the basics of data or something, you know. They're able to pull it, access it, and still be able to analyze and acknowledge what that data houses, you know, just kind of, like, in layman's terms, I guess.

DAVE: And if you're getting data from me, when I was on the data team, it was omakase. It was a surprise, and you got what the chef gave you [laughter].

EDDY: You know, one of the things that kind of rings the bell when I was asking you what's the biggest gripes that a software engineer does that really, like, rubs you the wrong way, and I sort of answered my own question, but I kind of paused because I wanted to see what your biggest gripe was.

But I want to challenge that a little bit, and I want to ask you if this is something that maybe infuriates you even more than dealing with enums in a database. You ready? Having software engineers, right, treating a database as an application detail and not as a shared contract, right? So, let's say, for example, we go in there and manipulate our own schema, our column names, right? We drop tables, and we just don't tell you about it [laughs]. We just don’t tell you about it. Like, suddenly, right, I'm assuming, right, that that has some detrimental side effects on your team, right, because we didn't delegate any of those ones.

ZACH: That is accurate. That's also something that we've worked on here since I've taken over the data team. I've worked on getting closer with Mike and the other engineering directors and working top-down like, "This is our new process." Everybody here, the GitHub auto-assigner puts Bill and either Ricky or Kim as approvers, right? That’s our way past that. So, like, if we went back to that world, Eddy, where I woke up and nothing that I wanted to run had run, and DevOps was reaching out to me saying, "Hey, you're taking down Merchant Portal," yeah, that is my biggest gripe.

But we are multi-years removed from that at this point, so it's not my biggest gripe anymore. It's pretty well solved. We've had a couple of issues recently; we put in some more stuff to get past that. And really, that is a lack of communication, right, and is what that boils down to. So, we've bridged that gap very well here at Acima. So...

DAVE: If I recall, it was Casey, or maybe somebody else early on...this blew my mind when I came. Because I'm like, yeah, that was my question too, Eddy, when I went over to the data team. I'm like, I don't see you guys doing anything with our migrations, and I know we're migrating the database every single day. And Casey was like, "Eh, it's just a Tuesday for us." And in the list of reports that run every night, one of them is "Go deal with all the schema migrations and just update the warehouse,” and down the road you go.

ZACH: Yeah, we've come a long way. The other thing that helps us out with those a lot is Fivetran. Fivetran is non-destructive, unlike our homegrown solution that we had back when David came to join the team and help us move to Snowflake. That one, it would break. You dropped a column, it would break it. You updated an entire table with, like, a backfill, I'd take your system down by accident without even meaning to [chuckles]. And then we moved to Fivetran.

DAVE: Sometimes you meant to.

ZACH: [laughs] Nope. You won't get me to admit to that, ever.

DAVE: You never meant to, but sometimes you didn't feel too bad [laughs].

ZACH: But Fivetran is very...it’s not destructive. You drop a column. I don't drop a column, which can be hurtful in another way, right? If you guys were to drop a column or stop writing to a column, and I didn't know we stopped writing to the column, and I was transforming off of that column, well, now you could have just made 37 tables have a null feature for no reason and break some reporting. And then I have to hear about that from the business, and it's my fault, you know [laughs], and so... And it's never, as everybody here probably knows, it's never a good feeling when somebody off of your team comes to tell you about issues on your team.

DAVE: Yeah, I remember one of the cool things about having worked in data and then going back is, we had a thing where we had some tables where it's like, oh, we just need a phone number, just stick it on, right? This is how databases go straight to second normal form, right? Oh, now we need a work phone; now we need a cell phone. And we let it get out of hand, right? And so, we had, like, 11 tables that had phone numbers on them and three different kinds.

All right, we need a phone numbers table. And that came through, and I was looking at this, and I'm like, okay, we can build a table. We'll export it. And this is going to take a while to get everything off. So, we're going to do triggers that go back and forth, Rails triggers after, you know, after hooks on the code. If you update this one, we update the master. You update this one; we update the outward record. Okay, great.

And then I put a note in the ticket: go talk to the data team because they have reports that go off of this table, and if we stop writing to this, they're going to be very upset. And I remember talking with Casey, him tapping me on the shoulder and saying, “The dev team are changing the encryption keys, and we need to be able to decrypt this information.” And I said, “Okay, how soon do we need this?” And then I said, “Wait, let me guess: they've already changed it, and we can't decrypt data and give it to the call center.” And Casey said, “Yup.” And I'm like, yeah, so I got to go back to engineering and yell at Adam and say, “Okay, what happened?” And he thought he had communicated it, and it just...yeah. So...

ZACH: Yeah, I remember that because there was a lot of late nights and tagging another software engineer who was very smart with encryption. Because it's not just that, like, we changed the encryption algorithm, right? We have to convert what Rails is doing to Python.

DAVE: To Python.

ZACH: And understand what it's doing under the hood so that we can recreate it. And we've had a lot of problems with that in the past, that being one of them, and from one of the systems that's the most important.

But going back to your second normal form, I found a table one time, and I got a lot of pushback about changing it, but it was essentially...and, Bill, you might have been working here at that point, but it was tokenization, right? And it was, like, a company name token, company name tokenization at. And then there was a column like that for every single one of the companies that we've ever used for tokenization, and we were adding another one.

And so, there are these, like, 10 columns, and I'm like, what are we doing? This is horrible data architecture. And we wouldn't even need to make this migration at all if we had just set it up properly, right? Like, just get a tokenization table that links back to this other record and then make it very dynamic. And so, that was, Eddy, to your question, too, another frustrating experience because I was completely ignored on that one, and two more columns got added to the table, and who knows how many since then, so...

DAVE: Now I want to go look [laughs].

EDDY: Well, I don't have the access to, unless it was [inaudible 41:45]

DAVE: Not fair [laughs].

EDDY: [laughs]. You were talking a little bit about, like, the phone numbers table, Dave, and it got me thinking. I guess it's really easy to kind of just think, oh man, if multiple tables can have a name column, why not just create polymorphic tables, you know, with ownerships, and then just make everything that can be polymorphic, polymorphic? So, where do you draw that line, right?

So, for example, phone numbers, you can have a phone numbers table; email, you can have an emails table, right? Address, you can have an address table, et cetera. But, like, I'm assuming you don't want that for name maybe, right? Or do you want that for date of birth, for example, et cetera? Like, is the default always...if multiple tables can share the same data, does it just make sense to always make it polymorphic? Where do you draw that line, you know, even if you are repeating yourself in multiple tables?

ZACH: I’ll do the simple answer, and then let Bill come in with the more complicated [chuckles] answer if he wants to correct what I say. Phone numbers make sense. You, Eddy, can have multiple phone numbers. That is a one-to-many relationship. But you, Eddy, are one person, so, like, you have a date of birth. You have all these facts about you that sit on your customer record, but you could have multiple phone numbers. And so, you put that into a secondary table, and you just match back. And you can have multiple emails. You can have multiple bank accounts. You can have multiples.

So, when you could have multiple things, that's when I would do that, because when you start finding yourself doing things like I was saying, underscore one, underscore two, underscore three, that needs to go somewhere. And Bill's going to have probably a better explanation than that, but that's where my idea was at, yeah.
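Zach's rule of thumb, one-per-person facts stay on the parent row, one-to-many things move to a child table, looks roughly like this in schema form. A sketch using SQLite, with invented table and column names:

```python
import sqlite3

# Facts that are one-per-person (name, date of birth) sit on the
# customer row; things a person can have several of (phone numbers)
# go in a child table keyed back to the customer.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (
        id            INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        date_of_birth TEXT NOT NULL      -- one per person: stays here
    );
    CREATE TABLE phone_numbers (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        kind        TEXT NOT NULL,       -- 'home', 'work', 'cell', ...
        number      TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Eddy', '1990-01-01');
    INSERT INTO phone_numbers (customer_id, kind, number) VALUES
        (1, 'cell', '555-0100'),
        (1, 'work', '555-0101');
""")
rows = con.execute("""
    SELECT c.name, p.kind, p.number
    FROM customers c
    JOIN phone_numbers p ON p.customer_id = c.id
    ORDER BY p.kind
""").fetchall()
print(rows)  # [('Eddy', 'cell', '555-0100'), ('Eddy', 'work', '555-0101')]
```

Adding a third phone type is then just another row, not the `phone_number_3` column migration the "underscore one, underscore two, underscore three" pattern forces.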

DAVE: Like, how far to go down, right? Like, the extreme case to be like, should we have a first names table and select, you know, like, Bob belongs to these three applicants and you just, you know, first name...Is that the logical conclusion, Eddy?

BILL: That’s it.

DAVE: Of, like, way too far? How much is too much? This has become a form joke of, like [crosstalk 44:05] ID.

BILL: After you've done it enough, you just get a feel for it. The example you just offered, that would be one of those times where you're like, this is just ridiculous. This is, like, fourth, fifth normal form. No [laughs], going too far. If you have a repeating attribute, like Zach was talking about, like multiple types of emails, multiple types of phones for a given person, that's pretty simple; you normally stick that in a child table.

But you were talking about polymorphism, a single record being able to represent multiple types of things, which you frequently find in, like, event tables and whatnot, where different sorts of things can be stored in that same table. That's usually about the only place I use polymorphism. It is a case-by-case basis. It's mostly art and less science. I actually don't have a really good answer for that, when to use polymorphism. I almost never use it. I'm actually surprised at how often we use it here.

DAVE: I might have a good follow-up, then. So, the way to know the right answer is experience, and what is it? Good judgment is how you get experience, or the other way around: experience comes from good judgment. Judgment comes from...you know the quote, right? Experience comes from bad judgment; that’s what I was saying. What does it feel like when you burn your hand on the stove, when you have over-polymorphized or over-normalized your form?

BILL: Nobody likes to work with your schema. Developers hate it. Now, in general --

DAVE: Mike, I think I may have over-normalized my form.

BILL: [laughs]

DAVE: [inaudible 45:31] of my data.

BILL: I have found that developers have a...you asked earlier one thing that is a pet peeve of ours. Mine is that developers have an unnatural fear of joins. If the data model is well-modeled and solid and doesn't go beyond third normal form, a relational database loves that. And I've had tables with billions of rows, and joining them is not a big deal, sub-second response time. So, that’s something. I wish developers would not fear joins. That's somehow related to what we are talking about, and I have since lost my train of thought.

DAVE: It's all good. I think --

MIKE: I had a thought about the normalization. A phone number has a defined structure. It's an entity with clearly defined structure where that internal structure matters, right? Like, you could conceivably have a phone number type in the database even, right? And I'm sure some databases probably implement that. There are probably some telecom [laughs] companies that very much do have a phone number type in their database. Likewise with an email address, right? It's an entity with a clearly defined type.

Whereas a first name, it's just a string. There is no internal structure. There's no expected internal structure. In fact, it varies across cultures. It varies in language. You really, really don't want to impose structure on it because that would be a really bad idea. It's important that you recognize that as just a string. Also, the number of them is unbounded. You can have arbitrary strings there, right? I mean, you might truncate it at the end if you have something ridiculous but, you know, it's just arbitrary data.

I feel like that's fundamentally different in character than the other things we've talked about. An address is something, you know, it's its own...it's got its own little schema, right? An address is a thing that has a clear definition that represents a concept. Now, a first name does represent a concept, right? But it’s not in and of itself anything other than just a string, right? It is just a blob of text, no different than any other paragraph, right?

And somebody probably has done something ridiculous by putting a whole paragraph as their first name. And [chuckles] that's perfectly legitimate for that, which is different than the kind of thing we're talking about with an address. There's a meaning to the address in the way that there's not on that first name. Not that first names aren't important, not that they don't have meaning within, you know, cultural meaning, but they don't have a meaning in terms of the data in that respect, other than it’s just a string that’s an identifier. And --

BILL: A little [inaudible 48:06] of a thought that I have to add to that.

MIKE: Please.

BILL: Sometimes the decision about how far to go in normalization and going crazy with your data modeling depends on the business context. My first eight years of my career was spent at telecommunications companies. And there, a phone number had to be split out. So, you had separate fields for the international code, the area code, the exchange, and then the line number. But at most companies, you don't need that. There’s no reason. So, sometimes that’s the answer. What is the business –-

MIKE: And that makes the phone number...And you just answered my question, like, yes, it does exist, right [chuckles]? It does matter on the business context. And now that you mention it, I bet that if you were working for a company that was doing, like, genealogical work, like ancestral stuff, then maybe there are some last names, for example, where you might actually care a lot, and you might care about normalizing those. Like, you might want to represent some of those as special, if there are some high-frequency ones. I haven't really thought about this. I’m just talking [crosstalk 49:10]

BILL: There were some hard lessons I had to learn when I worked for MYFaith [SP] for 11 years because they operate in 281 countries. And I bet this is found in the link that Dave just shared there in the chat. But there were some things I did not know about names in certain parts of the world. Like, some countries, you have a single name. It's not a surname. It’s not a first name; it's just your name. And we had modeled our data to be very Western-centric. It expected you to have a first and a last. There's all sorts of fun stuff that you can run into when you’re modeling.

DAVE: For those listening at home, you can Google "Falsehoods Programmers Believe About Names," and it's a list of, like, shocking things that you believe: oh, they'll fit within 30 characters. Oh, they'll fit within 50 characters. Oh, they'll fit in ASCII. Oh, they'll fit in Unicode.

BILL: [laughs]

DAVE: People have names at birth. People have names within a year of birth. People have names within five years of birth. That is not always true. Like, again, you're getting into a pretty esoteric data set at that point. But yeah [crosstalk 50:12] people have names.

ZACH: Yeah [laughs]. I was going to say, that's good.

DAVE: The author got challenged on that. He said, "Oh, come on, show me an example where people don't have names, where it's a large data set." And he said, "Cataloging mass graves." And I'm like, ooh. Yep.

BILL: And that's one of the things I love the most about data modeling is using experience and knowledge like this to anticipate problems and avoid them in the initial stages of design.

ZACH: Yeah. And the cool thing about it is you want to avoid them all. So, you're always learning [laughs], and there's always going to be something that you didn't expect, some user input. It's like a video on LinkedIn, right, where it says, like, "Programmer watching QA," and it's, like, one of those boxes with the different shapes. And they're like, "Where does the square go?" "Yeah, in the square hole." She’s like, “Yeah.” And then it’s like, "Where does the circle go? That’s right, in the square hole." They’re like, "No [laughs]." Especially if you work for a company like this with a lot of user-inputted data, like, you have to be careful with that.

DAVE: SQLite. Let me finish on this real quick, Eddy. SQLite, I discovered this this week: everything is a square hole in SQLite. SQLite uses a variant type underneath the hood, and it uses type affinity to determine what type it is. And it literally does not care in the schema what column type you declare. I literally tested this; you can try this at home. Create table test, open parenthesis, ID as banana or ID [inaudible 51:48]...ID banana comma name banana phone number banana, and then insert into it a number and a string and some other, you know, whatever you want. And when you select it, it will come back in that type. My faith as a programmer is broken. Nothing makes sense anymore [laughter].
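Dave's experiment is reproducible. SQLite accepts any identifier as a declared column type; an unrecognized name like `banana` just falls through to NUMERIC affinity, and each stored value keeps its own storage class, which `typeof()` reveals:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# 'banana' is not a real type, but SQLite creates the table anyway.
con.execute("CREATE TABLE test (id banana, name banana)")
con.execute("INSERT INTO test VALUES (42, 'hello')")
con.execute("INSERT INTO test VALUES ('not a number', 3.14)")

rows = con.execute(
    "SELECT id, typeof(id), name, typeof(name) FROM test ORDER BY rowid"
).fetchall()
for r in rows:
    print(r)
# (42, 'integer', 'hello', 'text')
# ('not a number', 'text', 3.14, 'real')
```

Each value comes back in the type it went in as; the declared "type" only nudges conversions (text that looks like a number gets stored numerically under NUMERIC affinity), it never rejects anything.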

EDDY: Well, like, and mobile apps use SQLite? So, I can only imagine on, like, how detrimental [laughs] that can really be. So --

DAVE: Sorry, I cut you off a minute ago, Eddy.

EDDY: Oh no, I wanted to say something, but I also didn't want to cut anyone else off. I wanted to kind of expand a little bit because I'm actually really curious. When I first started and I started to really understand, you know, like, data modeling and data types, you know, and, like, non-nullables, and, you know, and constraints and all this stuff, right, my default thought at one point, and I know the answer to this, but I want to ask it just because I want to see you guys’ reaction. I was just going to say, like, why don't we just store everything as varchar, right, just to be safe, you know? And that way, you don't have to worry about what data they send you, you know, and you can just now worry about schemas. And why is that bad, I guess?

DAVE: SQLite's saying, "Preach it, brother [laughter]."

ZACH: Yeah, yeah, Matt, you do, and I give you crap about it all the time. And your response is always, "Oh, this was just for me.” And I don't care [laughs]. I don't care if it's just for you; do it right [laughter].

So, a really large reason is data quality, right? What if you're expecting a number and you get a string and everything is just a varchar? Or, like, what if we're expecting, and I know we do this a lot here, and I do it a lot, too, where, like, Postgres doesn't use up all the space. Like, MySQL, if you said, like, varchar(250), it's using 250 bytes, right? If you do it for Postgres and you put two bytes worth of data in there, it's using two bytes. And it's the same with Snowflake. Now, Redshift works the opposite way, where I had to be careful about the sizing of varchars.

But, like, let's say state code, for instance, right? If you're operating only in the U.S. and you're doing state code, you want that to be two characters, and if it's something more than two characters, you want that to break. You don't want that to go in, and then you want to catch it in the application. Because the best place to do data quality checks, especially when you have humans giving you the data, is the application. And you want your data types to match that for data quality.
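Zach's "you want that to break" point can be made concrete. SQLite does not enforce varchar lengths, so this sketch uses an explicit CHECK constraint; in Postgres, declaring the column `varchar(2)` would itself reject longer values. Table and column names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE addresses (
        id         INTEGER PRIMARY KEY,
        state_code TEXT NOT NULL CHECK (length(state_code) = 2)
    )
""")
con.execute("INSERT INTO addresses (state_code) VALUES ('UT')")  # two chars: fine

rejected = False
try:
    con.execute("INSERT INTO addresses (state_code) VALUES ('Utah')")
except sqlite3.IntegrityError:
    rejected = True  # the bad value breaks loudly instead of slipping in
print("rejected:", rejected)
```

The bad row never lands, so no downstream transformation ever has to guess what a four-character "state code" meant.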

And that's a huge [inaudible 54:31] about data quality, and that's not the only reason. Previously, it was faster, and it probably still is to some degree, but computers have grown a lot since then. But it's why, you know, you've got a lot of relational tables, and you would do enums that point out to another table: numbers are faster to look up. That's less the case now, but I'd still argue varchar-ing everything is a terrible idea, even for look-up speeds [laughs].

BILL: When you use the right data type, and if the business rules require it, constrain that column to a certain length; you get built-in data integrity checks for free.

DAVE: I did a lot of geolocation at a previous job where we were, like, trying to find, you know, pins on a map. And k-space indexing, like, two-dimensional geospace indexing, if you're just throwing JSON strings in there, good luck. You're just going to have to scan the whole database if you want to find anything in the U.S. But if you index it based on, you know, geolocation, by having that in a special format, you can index a lot better.

ZACH: Your geoms. I've never done, like, mapping things out until I came here. So, like, geoms, and, like, all the functionality inside of warehouses, they'll let you, like, plot locations on a map. There are some that will take into account the curvature of the earth, and some that are just, like, as the crow flies. And they have their own data types of geoms, which is very foreign and very fascinating.

EDDY: I'm just throwing out a bunch of data because I'm taking advantage of the fact that I have data people here who can just answer my questions.

DAVE: I love it. I love it.

EDDY: And so, [inaudible 56:20] right? So, this is a genuine question, right? Why would you ever want to use ints for IDs, right?

ZACH: Yeah, you don't. You never want to. You want to use BigInts, because if you just use ints, you run into a problem that we've seen, where you run out of numbers. And also --

BILL: It’s happening at LMS right now.

ZACH: Yeah. And so, like, if you ever sit there and think, oh, I’m making an ID; let's just do an int, that'll get you a ways, sure. Why not? But then you're the reason we all have to struggle and figure out a creative solution to turn that into a BigInt. So, [laughs] Eddy, BigInt IDs always.
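The "running out of numbers" ceiling is easy to quantify: a signed 4-byte int tops out just past 2.1 billion, while a signed 8-byte bigint goes to roughly 9.2 quintillion, which is why "BigInt IDs always" is cheap insurance:

```python
# Maximum values for signed 4-byte and 8-byte integers, the usual
# representations behind SQL's int and bigint ID columns.
INT_MAX    = 2**31 - 1   # 4-byte signed int
BIGINT_MAX = 2**63 - 1   # 8-byte signed bigint

print(f"int max:    {INT_MAX:,}")     # 2,147,483,647
print(f"bigint max: {BIGINT_MAX:,}")  # 9,223,372,036,854,775,807
```

A busy table inserting a few thousand rows a second can burn through the int range in well under a decade, and the mid-flight conversion to bigint is the "creative solution" Zach is dreading.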

EDDY: Is that just fair to say that's the default? Like, even if you don't fully expect that table to grow exponentially.

ZACH: Yes.

EDDY: Is there a cost for associating a BigInt versus an int?

ZACH: No, not in Postgres, which is what you're working in because Postgres is only using the amount of bytes that it’s stored. The cost can happen in systems. Like, if I remember correctly, and it's been a while since I worked in it, MySQL, you fill that space with, basically, think of it as bases, right? Like, you say 64, or is it 126? I can't remember which, but for BigInt, right? I think it's 64. And so, like, if you have 2 numbers in there, it will fill...think of it as filling the other 62 with spaces, and it uses that in storage, but Postgres does not.

MATT: MySQL allocates it.

ZACH: Yeah, there is no cost [inaudible 58:03]. So, I would argue that even with the cost, it's worth it [laughs] to not deal with running out of numbers.

DAVE: And again, like, on the data side, storage is free and compute is expensive, right? We're on app. It's the other way around. So, we're like, oh, conserve space, conserve space. That is awesome. When I worked on the data team, I used to tell people that we were in charge of the numbers, and last Tuesday, we almost ran out of sevens [laughter]. Yeah, so...Oh my gosh. Anybody have anything to wrap on? This has been fantastic, Zach, Bill. I hope we can have you guys come back. This has been fantastic.

ZACH: I think the idea was floated around where we get, like, my whole team, and I think we should. We should.

BILL: This has sparked a number of ideas for me as well.

MIKE: Oh, nice.

DAVE: Fantastic, fantastic.

BILL: I'd really like to start talking about partitioning.

DAVE: Oh yeah.

BILL: Because we have a number of systems with tables in the billions, and, normally, you start partitioning when you hit about 100 million.

DAVE: That is fun.

EDDY: What I think, Bill, you started to introduce, at least at Acima level, is, I think you take for granted, you know, that you're working under your schema for so long that you just understand it. But when you're coming in fresh, and you're expected to understand what all that data means, right, and we don't document, right? Because you had a big push, like, guys, add comments to everything so that I know what you mean on what you're storing here, right? I think you really opened up, like, a fresh perspective on, like, guys, we don't all work in your table and in your schema, right? Like, please be nice and tell me what that is, right? And --

BILL: Those comments are meant to trickle all the way down to analysts, scientists, and users. Yes, it's definitely not just me. And this is just the base bedrock layer that's needed. On top of that, there's...I don't know if you've seen the articles on LinkedIn lately, but on top of that is the semantic model, the ontology, the decision tree. There's so much context that goes around a company's data, and just the basic definition of it is the most bare-bones thing I could request right now. But there's a whole lot more to it. And once we have that kind of meaning, we can turn AI loose on our data and do amazing things.

ZACH: That's what this initiative is for, right? But previously, and I'm saying previously, six years ago, right around the time that I started, up until, like, four years ago, probably, we had fewer microservices; we had less data. We had a person named Casey that you could go ask what things meant, and he was so entrenched in the data that he would be able to tell you. We've far outgrown that. I don't know everything here. And Casey no longer knows everything here, and he's still here. There's just still a lot of unknowns because we've grown too much, and that's, you know, the comments in the databases. All the stuff Bill's talking about, those are part of growing pains. You've got to make sure that people understand what the data is.

And I get asked all the time in some data channels, like, “How do I do this?” And I go, “I don't know. Maybe you should go ask Merchant Portal.” And then that question gets put into Merchant Portal for the data owners to actually answer, right? Because you work on a microservice. You are the owners of that data, where I'm the consumer of the data. So, I'm not going to speculate on what it is. But once all this documentation is done, they can go look. They can see what it says. And if they have questions at that point, go ask, and then you know you have to update your documentation because it's not good enough [laughs] --

DAVE: We're going to get to a point where instead of asking what the lease is doing, we can ask how the lease is doing.

ZACH: Yeah.

DAVE: I would love that. This is probably a good place to wrap. I would love to have you guys back, even just for a SQL show, SQL, pun unintended [laughter]. But this is a great spot. Thank you, guys, so much for coming. Let's wrap here, and we can move into an after-call.

This has been the Acima Developer podcast. And thank you for coming, and hope you'll listen to us next week.