Episode 68

DevOps / ZeroOps

March 19th, 2025

1 hr 1 min 36 secs

About this Episode

The latest episode of the Acima Development Podcast takes a deep dive into the evolution of DevOps, platform engineering, and the concept of ZeroOps. Host Mike and panelists Justin Caldwell, Kyle, Justin Ellis, Will Archer, and Eddy discuss how infrastructure should function like a well-designed vending machine—providing a seamless interface where developers don’t need to worry about the internal mechanics. The team explores how DevOps has transformed over the years, moving from traditional sysadmin roles to cloud-driven automation, enabling developers to focus on delivering code while minimizing manual operational tasks. The discussion introduces ZeroOps, a modern approach that seeks to eliminate the need for infrastructure management by automating processes and providing a frictionless experience for development teams.

Justin Ellis provides historical context on IT infrastructure, explaining how organizations transitioned from manual punch card systems to today’s highly automated cloud environments. The panel highlights best practices in infrastructure management, emphasizing the importance of standardization, automation, and observability. They discuss the shift from managing infrastructure as “pets”—custom, hand-configured systems—to treating it as “cattle”—standardized, scalable, and disposable resources. Tools like Terraform, Kubernetes, Docker, and Prometheus play a key role in creating resilient systems that support modern software deployment strategies. By implementing these practices, companies like Acima have been able to scale deployments from quarterly releases to multiple times per day, improving reliability and developer efficiency.

The episode wraps up with a discussion on future innovations in DevOps, particularly the pursuit of fully dynamic environments where developers can spin up isolated test environments on demand. The team acknowledges challenges like managing stateful applications, dealing with legacy infrastructure, and ensuring effective observability, but remains optimistic about advances in AI, cloud-native tooling, and automation frameworks. As organizations continue to refine their DevOps strategies, the goal is to build systems that are not only scalable and efficient but also reduce complexity and operational overhead, allowing developers to focus on building great software.

Transcript:

MIKE: Hello and welcome to another episode of the Acima Development Podcast. I'm Mike. I am hosting again today, and I've got a panel with me here today of Justin Caldwell, who's coming to us from the cloud engineering DevOps side. Kyle who has been on here a lot before also from DevOps. We've got Justin, Justin Ellis, who has been here before, and we've got Will Archer. And we've got Eddy.

As usual, I'm going to tell a story to lead us in here to kind of launch our discussion. Oh, I don't know how many years ago, a lot of years ago [laughs], I worked at a sheet metal shop where we made bubble gum machines. You've probably seen a vending machine where you put in a quarter, and the bubble gum ball does a lot of things before it makes it down. We made, like, roller coaster-type things. So, the bubble gum ball would drop in the top, and it would roll around and do some loops and stuff, and it would come out.

And it's interesting to actually make one of those. There's sheet metal around the side. We'd use wire, and we'd actually stretch it out so it would bend better [chuckles] and curl it around a jig. Anyway, there was a lot of work that went into it. And then when you run it, you go through all the work to see that bubble gum ball roll all the way down the machine and come out the end. It's exciting to watch. People do it not so much for the bubblegum ball, or maybe they do, but to watch all of the machinery going on inside because it's fun, right? People like to watch that.

And the interesting thing about this is what you get out at the end is exactly the same as if you didn't see the machinery at all [laughs]. In either case, you got a mechanism that drops a bubblegum ball, and it comes out. And the interface is exactly the same. They probably don't have that many coin-operated [chuckles] vending machines anywhere, you know, a lot of times, you put in a card, but same deal, right?

You have a mechanism. You press the button or turn the dial, and a bubblegum ball comes out. It really doesn't matter what happens in the machine. What you want to have, and this is really important...you don't want to have a weird vending machine where you have to put in something other than the quarter, right [laughs]? You don't want to have some vending machine where it asks you for something in a foreign currency, for example. Like, well, that's kind of cool, but I don't carry pesos, right [laughs]?

And this is kind of the hard thing about switching over to cards, like a lot of things have done. Well, do I have a card with me? Do I have a coin? And now it's kind of flipped to the other side. Like, who carries cash [laughs]? So, it's flipped. You want to have that common interface because, otherwise, it's awkward, and nobody knows how to use it, and then you have to know what's going on inside.

If you have to know how the machine works, you've failed. The idea is to have this as a simple interface. You don't care what's going on inside. Of course, with a bubblegum machine, you care a lot because the main reason you're looking at it is to watch the ball come down. But in the end, if it doesn't work inside, then that's a fail. What you want to do is have this simple interface that never changes. You just want that bubblegum ball to come out.

And I say this because it applies directly to how, well, to a lot of things, the interfaces between systems and software. Today we're specifically going to talk about the integration with DevOps. You may call it by different names, whatever interface you use to get your infrastructure. Because there are different ways of handling this. There are times when we have people embedded into the team, so each team has their own person, and they build their own stuff. And there's a way of approaching this called ZeroOps, where you just press the button and the thing comes out [chuckles].

DevOps provides a common interface that everybody needs to use, and nobody needs to know how that bubblegum makes it down the track because it simplifies systems so much. And we've got Justin here who's been passionate about this idea for...we were just talking on the pre-call. I've worked with Justin for 15 years [laughs], so he's been passionate about this for a long time and has built this more than once and made multiple companies really successful in terms of infrastructure because of this idea.

And I'd maybe ask Justin first to take over from here and give some thoughts, kind of launching from [chuckles] that introduction. And tell us what you think the interface with a cloud infrastructure, however your infrastructure is, maybe you have an onsite infrastructure, what is that interface supposed to look like? And why is that important?

JUSTIN ELLIS: Yeah. Thanks for the introduction. I have thought about this problem really for the past 15 years about how to build systems better, how to make software development easier. On the system side, you're thinking about availability, reliability. You're thinking about security. You've got to think about a lot of different aspects of it.

But at the end of the day, really, our main job is to enable software developers to be able to deploy code, push code out, manage their code. And they don't really care about the mechanics underneath, right? They have a job to do, and then we have a job to do. And I think about if you go back to the very first instances of kind of our field, in the '70s, people used to write on punch cards. And they'd basically create their punch cards. And then, they would create a huge stack of all these punch cards, and then they would pass these punch cards to someone called an operator.

And that operator had a very specific job. They had to, A, keep those punch cards in a very specific order. And then, they had to feed them to the machine very carefully and keep the order. You couldn't drop them. You couldn't have any of those issues. And then, the program would get executed, and then they would get the output, and then they would help return that output back to the user.

Things have changed quite a bit, and I think there's been some major movements. There's kind of those traditional IT systems where you'd have sysadmins. And then, that kind of moved into the cloud world when AWS was introduced to the world. And that also kicked off this new movement called DevOps, where it was like, hey, instead of passing it to the operator and then having the operator do everything, let's get everybody kind of involved in this process and enable developers to be able to pass this code and then also understand what's happening, what's getting tested, kind of move that forward, right?

And then, since the DevOps kind of movement mindset idea, however you want to call that, that's kind of moved into, you know, SRE is still a big style of managing systems, site reliability engineering. And then, that's kind of moved into now this new model which is very similar and kind of inherits from a lot of the previous models, which is kind of platform engineering.

And the goal of platform engineering is to create kind of a standardized platform that developers can interact with. They can put the quarter in the gumball machine, and then their code gets built and tested, everything gets checked to make sure it goes well, and then that gets output to a system that you run.

So, long story short, we've gone through a lot of different movements, and I think that we're still perfecting and getting to a better place. And I think every single iteration has had new innovations and new additions to the systems, things such as orchestration, which we have with Kubernetes and all of that kind of stuff.

And so, through that whole journey, I think we've learned a lot, and I think we're still, not early days, you know, we're well into it. And we're trying to provide kind of a standardized platform the developers can develop against. But it has been interesting to kind of see that evolution over time, and then see kind of new processes, new ideas on how these things get going and get worked.

I will say the one interesting part of our job is that there's not a standard way to do things. I guess that's how software development is, too. There's not one golden path that you can just say, "Hey, just take this golden path and do that up." It all kind of depends on your requirements, and the people involved, and all of those kinds of things.

MIKE: Thank you. That was a very comprehensive answer.

JUSTIN ELLIS: [laughs] Okay.

MIKE: But I think a lot of us…you talked about some of these old days with sysadmins and punch cards. My father-in-law actually spent his career as a computer operator [laughs]. And he would move the tape reels, you know, the magnetic tape, from one place to the other and ran this big computer inside a bank. It was a thing [laughs].

JUSTIN ELLIS: Yeah.

MIKE: And the thing is, having to know about those details, having to know specific details. I remember, back in those days, I would be configuring a server. I remember one time configuring a server, actually, over at my parents-in-law's house (it was, like, a two-hour drive), banging away on the computer on a remote connection, configuring this server along the way.

And every time you do that, it's going to be a little bit different than the next one because you make some little difference, right? And especially if you've got physical hardware, maybe this one's not exactly like the other one. And so, you have this pool of snowflakes. Every one is a little bit special, and every one is a little bit different. And managing that requires somebody with deep in-depth knowledge, who knows how those systems work. And that person becomes a point of failure. But then every system is hard to work with. It's brittle. It's very challenging to expand. And you're talking about a very different way of thinking about it, Justin.

JUSTIN ELLIS: Yep. Yeah. Yeah. Kind of within the whole DevOps world, we talked a lot about pets versus cattle. And the idea here is that you can have a pet, which is your pet server, that you've got to go in and you've got to care for. You've got to patch. You've got to update. You've got to make sure that everything is going well.

And then, once you have 10 servers, once you have 20 servers, once you have 50 servers, it's the same thing as having 50 cats. It's a lot of work to keep up on it. It's a lot of work to continue that forward. With the cattle idea, it's like, okay, what is an application? What are the types of workloads that we run in our system? And can we standardize, and can we build to a specific type of target?

Now, there's been a lot of tools introduced like Docker containers that kind of allow us to kind of define those targets and define where we're going so that every application that we build, whether that’s an application or a service, that could be kind of bundled up and then pushed out in just the same way as all the other services. And then, we can kind of scale those out independently.

So, you might have one service that has a lot of traffic on it and another service that has very little traffic. This allows us to think about those applications very similarly, build the same tooling for how we do that, but then you can kind of scale those independently. So a lot of power there.

MIKE: You know, I'm going to ask a question for not Justin, or not Justin Ellis here, and maybe particularly to people who've been around for a while, who've worked with systems that do not treat things like the bubblegum machine, like the cattle, like just hand it off and it works. What are some of the failure modes that you've seen [chuckles]? Because that kind of provides some context as to why this is so important.

JUSTIN CALDWELL: So, I worked for Fidelity Investments for a number of years on their identity and access management team. And we would have a team of DevOps engineers who were very, very familiar with their login infrastructure, and that was all they did, and it was their cat or series of cats.

And we would only be allowed to do an install once a quarter, and it was nuts because we'd have to schedule it out. We could only do it starting at around midnight, our time, and we only had a window until about two or three o'clock in the morning. And if we didn't make the install during that time frame, we would have to start the rollback process and try to get it another night.

But the thing about it was that only they knew all of the arcane secrets. And only they knew, like, where all the servers were, where they physically were, and what version of whatever was on there. And we were at their mercy in terms of an application team. And if any of them, God forbid, if any of them left the company, you knew that that next install was going to be crappy [laughs]. So [chuckles], it was nuts. And that's my little story. Yeah.

JUSTIN ELLIS: And I think it's great to have Kyle here because, you know, I'm trying to tie it back to the gumball machine as you kind of watch it go through. But as a developer kind of pushes some code, they also need to understand what's going on in the system. They don't need to know the fundamentals, but they need to understand, okay, how is my build going? How did the deploy go? Once it's deployed, is it successful? Are there problems kind of going forward?

And so, kind of a big part of the platform, a huge part of the platform and something that Kyle has been focused on for quite a while, has been kind of how we observe it once it's running in the system and everything's going well, but giving feedback to the developers. Because developers are best suited to understand what's going on in their application, best suited to understand what bugs come up, or, you know, maybe that last deploy didn't go as well as we wanted to.

So, I do see our job as kind of also completing that feedback loop so developers can understand what's going on in their system and all of that kind of stuff. And that goes back to also this other idea that's kind of fundamental to continuous deployment, which is application ownership, where engineers are really the ones that are running their applications in production because they know their code best. They're the best person to kind of address those. But having engineers kind of involved in the process, again, they don't need to know the fundamentals, but kind of involved there.

So, that's Kyle. Kyle has built an amazing observability stack that allows us to kind of see that gumball as it moves through the machine, which gives the developers the power and the understanding to see what's going on in the system.

KYLE: I would say, with that, on the DevOps side, that's one of those things where it's almost insulting if the developer can come to me and tell me what's wrong with the gumball machine mechanically, right? I want to know, first, that there's something wrong mechanically with the gumball machine.

If their gumball comes out and it's a result of what they put into the machine and they got a funky gumball out that way, that's fine. They can debug that. But if it's something fundamentally wrong with the gumball machine, that's another side of the coin where that type of information needs to come to us directly so that we can get that addressed so that it is not hindering the engineers.

MIKE: It is true that sometimes the gumballs would literally go off the rails [laughs]. It's a little bit bumpy, and there it goes. And that observability is important. You want to be able to look in there and say, "Okay, it is sitting on the bottom. That did not work." Now, it would be far preferable if the machine could automatically detect that and throw a new gumball down, right [laughs]? It makes a huge difference if you can make a machine that can do that.

JUSTIN CALDWELL: So, one thing I do want to address, classically, DevOps were not viewed as engineers in the classic sense. A lot of them were viewed as, like, oh, point and click like ClickOps. That was the phrase and everything, or back to the mysterious arcane people that have a bunch of bash scripts and things like that.

But in the modern day, like, in the last five years or so, or maybe even a little bit longer than that, with the introduction of Terraform and the introduction of other IaC solutions out there, it is nuts to me how much engineering is going on. Basically, to me, as a software engineer, I look at what the DevOps teams are doing and I'm like, you guys are just coding [laughs]. You guys are full-on engineers. You have your process. You have automated tools checking your code just like application engineers. You have your history. You're checking in code. You're doing code reviews.

There is no more ClickOps. And if you are still doing ClickOps, I think people recognize that and they are like, oh, let's move this to Terraform or some other IaC. And it's just like, you're engineering complete systems. You're having data flow diagrams. You have the code that controls all that.

And I look at it like, man, I need to go learn that because you're speaking my language, and it just looks awesome. Rather than putting boxes up on the front end like a web interface or anything new, you're dealing with whole infrastructures and everything. And I think it's so cool to see that work being done that way.

EDDY: I was going to say the appreciation that I have towards DevOps has only been the exposure here at the company. But I hear things, like, terminology tossed around all over the place. Like, that was foreign and still is, right? "Oh yeah, we need to run Terraform," you know, or, like, "Our clusters are weird," and, you know, like, "Oh, that container is really odd or whatever." To me, having a team that understands that at a very deep level and a granular level that's just less things I have to worry about as a developer.

So, when the infrastructure is well built and you have a whole team dedicated specifically for that and it goes smoothly, that's just less things on my plate that I have to worry about. So, I just want to say the value that DevOps provides is enormous.

KYLE: To go along with what Justin was saying there, with this infrastructure as code, I know our team, and I hope other teams are the same way, if you run into something that was done, ClickOps, the entire room shudders. We're all just like, "Oh, that's disgusting. Like, why would you do that? You better have a good reason, right?" Maybe a proof of concept. But if you don't have a good reason, otherwise, it's just like you get chastised. You're like, "Why would you even consider doing it that way?"

So, we have gone completely to this infrastructure as code world. And it is like Justin saying: we have linting; we have formatting; we have code syntax. Some of these are their own specific languages. Like, over here at the company, we use Terraform, and Terraform is historically in HCL. However, now going forward, they're creating support for your popular languages. You can do TypeScript. You can do Python. You can do Ruby. To go along with that, yeah, you are. You're putting Terraform functions and Terraform resources inside your code.

I know here we heavily use, what, Bash, Python, Ruby. We even have some Go. That's a CLI. But the other three, like, those are integrated with our HCL code. It's generating HCL code. It's help functions, you know, those types of things. So, to go along with that, I don't think the idea that DevOps is not engineering anymore really fits. And it is kind of like, so, where do you put them? And I think that's kind of where the community is landing on this idea of we need to call them something else. We need to call them, you know, platform engineers, or SREs, wherever you're landing in that community, but they are engineers now. They're part of the engineering department.
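
To make concrete what Kyle describes about writing Terraform in a general-purpose language, here is a minimal sketch using CDK for Terraform's Python bindings. It assumes the cdktf package and the prebuilt AWS provider bindings are installed; the stack name, bucket, and region are all made up for illustration:

```python
from constructs import Construct
from cdktf import App, TerraformStack
from cdktf_cdktf_provider_aws.provider import AwsProvider
from cdktf_cdktf_provider_aws.s3_bucket import S3Bucket


class MyStack(TerraformStack):
    def __init__(self, scope: Construct, id: str):
        super().__init__(scope, id)
        AwsProvider(self, "aws", region="us-east-1")
        # Declaring a resource in ordinary Python; cdktf synthesizes this
        # into Terraform-compatible configuration under the hood.
        S3Bucket(self, "artifacts", bucket="example-build-artifacts")


app = App()
MyStack(app, "my-stack")
app.synth()  # writes the synthesized Terraform config to cdktf.out/
```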

JUSTIN ELLIS: Yeah, we call them SREs at my place. And the other thing we think about a lot is the interface that developers have with us. Now, I feel like we should probably be more mature on the interface side of things, you know, maybe even have a web user interface to allow folks to kind of see what's going on in the system. So, right now, it's all CLI-driven, and then also additional tools like GitHub Actions and Jenkins to kind of provide that interface for us.

But we thought a lot about how do we interface with developers? And one of the things that we've done is thinking about if you're running an application or a service on the platform, how do we allow developers to say, "Hey, I need a new bucket. I need a database. I need a new resource for my thing"? And a lot of times this takes a lot of steps, a lot of coordination between those. And I've always said if developers have to understand Terraform code, or are writing Terraform code, or if they have to know what Kubernetes objects there are, or how the fundamentals of Kubernetes works on a deep level, then we’ve failed our job.

So, we've kind of created what we call kind of an application manifest, which is basically very simple. Well, for some applications, a very simple application manifest that allows developers to kind of list out the resources that they need. And I think that does give feedback to developers, too. So, they can see, okay, what databases do we have? Do we have a bucket? How do we communicate with Kafka? All of those types of things.

So, we kind of want to continue to kind of lean into that style of giving developers that feedback and also empower them to be able to kind of build their applications with components that we've added: security, and guardrails, and all those good things, high availability and everything else. But allow them to say, "Hey, I just need a database. Here's some configuration options for that database," and then we can kind of take it from there. We can make sure that scales properly and kind of goes forward.

So, I think we've done a good job so far, but I think we need to continue to kind of lean into that idea of, like, how do we make that interface for developers as easy as possible to give them legibility to the system, understanding of how their application works? And when the new people onboard, they can kind of see, oh, this is how this application is. This is the database and all the configuration that they have for their application.
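
To illustrate the idea, a hypothetical application manifest of the kind Justin describes might look like the sketch below, written as a Python dict with a tiny validation pass. Every field name and value here is invented for illustration, not an actual schema:

```python
# Developers declare what they need; the platform team turns it into
# real infrastructure with guardrails. All names below are hypothetical.
manifest = {
    "app": "payments-service",
    "resources": {
        "database": {"engine": "postgres", "size": "small"},
        "bucket": {"name": "payments-exports"},
        "kafka": {"topics": ["payments.events"]},
    },
}

REQUIRED_KEYS = {"app", "resources"}


def validate(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest looks sane."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS - manifest.keys()]
    for name, spec in manifest.get("resources", {}).items():
        if not isinstance(spec, dict):
            errors.append(f"resource {name!r} must be a mapping")
    return errors


print(validate(manifest))  # -> []
```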

MIKE: And you're describing a very proven pattern. You look at the Unix, Linux world; you have tools, and you have a configuration file that's around that tool, and that's the interface that you use. You don't point and click. And if you do point and click, it's going to generate the configuration file. It's a very standard way of communicating. And then, it's configuration, right? It's checked into your version control. It is a form of code, but it's something that is clearly well-structured and that you can learn and fill out the config. Much better than having to speak with the arcane wizards and have them build something out.

EDDY: So, I have a generic question, and this might be a little difficult to answer. But how much control, typically, should you give an engineer, like, a developer, over your configuration file?

JUSTIN ELLIS: Yeah, as much as possible, so as much as possible. But then you still need kind of those checks and balances in there. So, every kind of configuration change that gets made, just like with code that gets reviewed, and PR'd and quality checked and all that kind of stuff, we try to do that same system. So, if people can make a configuration change because they want to make that modification, it doesn't go directly to production. An engineer looks at that, like a PR, understands what the diff is going to be, and then approves it for an application within kind of an environment. So, I always like to empower developers as much as possible.

I've worked places before where, as a software developer, I didn't feel informed, or I didn't understand what other applications there were out there. So, I've always felt we need to empower. But then, obviously, you have other things like security and guardrails and all that kind of stuff and then costs as well. You need to understand like, okay, do you actually need this gigantic database to run this new application? So, there needs to be some back and forth there. But yeah, to answer your question, as much as possible, for sure.

WILL: Listen, I just need 10 more instances, okay? It's the last time I'm coming to you, 10 more instances will do it. This is the last time. Everything's going to be 10 more. And I promise, it's the last time, 10 more, and it'll be cool.

MIKE: [laughs] Nailed it.

JUSTIN CALDWELL: I do want to point out, Eddy, the mechanism. I mean, from a security point of view, the mechanism for controlling that is through the CODEOWNERS file in your repo. And so, you can go ahead and create a PR that has all of this new infrastructure that Will wants. Will wants 10 more instances, so he goes and modifies that. But one of the code owners will be the SRE team or the DevOps team. And they'll be forced to review that and say, "Sorry, Will, you only get five."
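
Concretely, the mechanism Justin mentions is a couple of lines in the repository's CODEOWNERS file; the paths and team name below are placeholders:

```
# Hypothetical CODEOWNERS entries: any change under these paths
# requires a review from the SRE team before it can merge.
/terraform/   @example-org/sre-team
/k8s/         @example-org/sre-team
```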

MIKE: And, hopefully, you have somebody...and to go back to Kyle and the observability, they say, "Well, we gave him five, and now it's using half the cluster's worth of resources. Something is wrong here [laughs]." Maybe we should go talk to Will, say, you know, "The problem's not on our side [laughs]. We need to take a look at your code. What is the reason the database is running constantly at 90%? And you've got all this queuing for your requests. Something is not right here." And that conversation can happen. Go ahead.

JUSTIN ELLIS: Oh, no, I was just thinking, you know, so kind of a theme of this is ZeroOps. And just to kind of give you some background, Upbound as a company, generally, we haven't really talked about ZeroOps. This is kind of a new initiative. And I actually have just recently changed roles to kind of focus on ZeroOps specifically. And I think, just like anything, naming is hard. ZeroOps doesn't mean that you literally have zero operations. In my mind, it means that you have as few operations as possible. So, that means that we're not running sysadmin-style on servers and keeping those VMs kind of managed. This is more and more moving towards containers and orchestration.

So, when I think about kind of ZeroOps across our enterprise today, I've kind of worked on developing that strategy and roadmap for where we're going with ZeroOps. And part of that is understanding where our operations costs are coming from. We kind of have a company where there are different lines of business that have different maturities. As far as kind of platform engineering and infrastructure as code go, we have a lot of legacy stuff where things still live on VMs. We have a big on-premise infrastructure, right?

So, when I think of ZeroOps, I really do think kind of like increasing our capabilities and moving more towards a platform engineering style, moving towards an interface for developers, for data, for security to allow them to write security rules and then having kind of that same PR process across all of the organizations.

So, just like we've treated developers, kind of start to treat those other folks, whether that's QA or security, kind of moving forward and allow them to kind of go through that same process, where they're able to write new security rules, new QA processes. And then, they have a really clean interface to test out their code in non-production environments and then, eventually, kind of roll that stuff out to production. So, that's kind of ZeroOps. And I'm kind of curious, when you guys hear ZeroOps, what comes to your mind? What is the general feeling about ZeroOps?

WILL: It sounds like ops has been running really good, and the gumball hasn't gotten stuck in a while. And so, some very smart person is looking at a spreadsheet, and thinking like, why do we have all these ops people when ops is running so good? And I feel like it’s people maybe being a victim of their own success, you know what I mean, sort of like, well, gosh, that spare tire is so heavy.

MIKE: [laughs]

WILL: And we don't need it. I mean, the emergency parachute, two parachutes? Why do I have to pack two parachutes? So, I mean, I don't know, I spent a little bit of time last week with the gumball stuck in the machine. It's not that I can't do that kind of stuff, but I'm not as good as you guys are because why would I be if I do it once a year when it's, like, the gumball stuck? Who's got the key? Oh, crap. What's going on?

So, I mean, efficiency is efficiency. One developer could do more work, right? And one ops engineer can do more work. So, you're doing more with fewer people because you have the tools that allow you to have that increased productivity and leverage. But it does give me a little bit of anxiety, a little bit of stomach acid just because I know how good I am at it, and I'll figure it out eventually [laughs], but eventually being the operative word.

JUSTIN ELLIS: Yeah, ZeroOps, to me, means...and this might be a way of thinking of it: less SSH-ing into servers, fewer manual interventions, moving towards more automation. So, instead of, you know, you have a problem that pops up, right? And instead of solving the problem kind of on that local level of just like, hey, let's just get this issue fixed and then move on, it's like, how do we provide a system to prevent all of these errors kind of going forward? So, how do we make sure that this problem kind of never pops up again? How do we build a tool to kind of watch this and intervene, maybe do self-healing? Whatever that may mean, but kind of more sophisticated automation.

And automations are tricky because, especially with interventions and all of that, things can go terribly wrong. So, you have to be very careful and very structured and think about it more as a framework when you do that automation. And then, also, there are systems that are much better automated than others.

We think about kind of stateful systems or stateful applications versus stateless applications, meaning that a stateful application would be something on a VM that requires a disk, and that state is kind of on the disk. And if you need to scale up that server, you've got to go in and change the VM, which is going to require maybe some downtime and all those kinds of issues.

When you move towards more of a stateless model, that means that all the storage is abstracted away from the application itself. And so, that just gives you a lot more tooling and a lot better ability to add automation pieces so that you can scale out. You can address problems, and you can fix those kind of going forward. So, I think that kind of stateful versus stateless distinction and moving more into kind of a stateless direction really does help a lot to add additional automations, and then giving those platform engineers the ability to work within a framework and push our technology forward.
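
As a toy illustration of the self-healing idea Justin mentioned a moment ago, here is a deliberately tiny watchdog sketch in Python. The endpoint and container name are made up, and in practice this job belongs to the orchestrator (for example, Kubernetes liveness probes), which is part of his point about working within a framework rather than hand-rolling interventions:

```python
import subprocess
import time

import requests  # third-party; pip install requests

HEALTH_URL = "http://localhost:8080/healthz"  # hypothetical health endpoint
CONTAINER = "payments-service"                # hypothetical container name

# Watchdog loop: check the health endpoint; if it stops answering,
# restart the container instead of waiting for a human to SSH in.
while True:
    try:
        requests.get(HEALTH_URL, timeout=2).raise_for_status()
    except requests.RequestException:
        subprocess.run(["docker", "restart", CONTAINER], check=False)
    time.sleep(30)
```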

MIKE: There's a couple of questions I'd like to ask, because there's something that you touched on, Justin, that there's legacy systems because that's the reality pretty much everywhere. If you're starting on day one, great, but that's almost nobody, right? Because that only happens on day one. And then, after that, you've got a legacy system that maybe you wish you'd redone on day one. And I'd like to hear about the process. But before we do that, I want to start with a little bit of the goal.

We heard before about a company where you could install or deploy once a quarter, and if you fail that, you're out of luck, right? Maybe you can do it the next day, but you still get once a quarter. So, I'm going to ask our platform engineer folks here, and I'm putting you on the spot a little bit, but how many times do we deploy a day at Acima? I'm going to ask about the Acima line of business specifically. So, how many times do you think Acima deploys a day? And it doesn't have to be exact, but approximately.

JUSTIN ELLIS: Across all of our applications, I would guess we probably...more than 30 times, sometimes 50 times a day. But somewhere around that range in the multiple dozens of times a day across all of our applications. I would imagine we've had days where we've deployed out a hundred times a day. So yeah, quite a bit.

MIKE: Okay, so up to a hundred times a day. I would argue that that is a much better situation to be in than once a quarter. If I've got a bug, I'd sure like to hear that we're deploying a hundred times a day. I'd much rather hear that than, well, we can get that for you in Q4 [laughs]. It's hard to even express the dramatic difference. And those of us who have kind of seen the industry move here, you saw that happen.

There was a time when we thought, deploying more than once a day, oh, that is my worst nightmare because you knew how bad the deployment process was. But if you actually build the infrastructure, it completely changes the paradigm to where deployment is nothing. And the infrastructure is there; you don't have to think about it. You put in the quarter; you get your gumball. Why would anything else happen? The system just works. And it really is a game changer.

So, you've seen that example of how, I mean, I don't know how many orders of magnitude that is. I'd have to start thinking about it, once per quarter to a hundred times a day. That's a lot of zeros of difference. How do you get there? Because it doesn't start there. How do you take a real company that maybe has some legacy stuff and get them there?

JUSTIN CALDWELL: Yeah, so I would say, if you're in a company where you have a lot of legacy systems; you have a lot of stateful systems; you have a lot of systems that kind of break down; or you may be a place that deploys once a month, right, and you want to change that. I think you really have to start at the fundamentals and kind of the systems level and start to sketch out how you want to handle your kind of cattle going forward. Start to understand what the patterns are across all the applications.

And then, my recommendation is always start small. Start with maybe one cluster that's kind of a new cluster. This may provide orchestration. But use the mature frameworks and platform tools that we have today, and maybe transfer that into a stateless application that you can run, right? Follow that process. Think about the fundamental abstractions: What is an application? What's a service? How do we think about scaling this out? How do we secure this, right? Take that one application and try to modify it to get to that state. And that's not always possible, right?

But there are always applications that you can do that on, where you can start to think about abstracting it out into its fundamental pieces and fundamental parts. And that may mean, okay, let's get this database off of a VM and maybe move to a managed service like RDS because we're not Postgres experts, or we're not whatever else. But we know enough to handle a Postgres instance on something like Amazon, or AWS RDS or GCP database. So kind of extract out, think of it in pieces, use modern tools.

So, I think that would be my general strategy when tackling those legacy projects. And then, sometimes if you're running like, say, commercial off-the-shelf software, you have no choice. You always have to run a VM, right? Think about you might have to migrate to a new tool to get to this more stateless fast-moving thing.

And then, the last thing I'll say, and I know I'm going long, but when you do have deploys that are spread out by weeks or maybe even months, what you're going to do is build up a lot of changes. If you think about all the changes that a developer makes across an entire month, and then you think of all the developers across all the teams making changes to an application, those changes build, and build, and build. And then, you're going to deploy all of that out at once, which means you're not going to have a really good idea of how it behaves.

Once you have multiple applications talking to each other, it's a complex system, and you're going to get side effects that you never anticipated and you never thought about. So, when you have that deployment style, you're going to run into more failure modes, and they're going to be larger failure modes, where if you deploy once a day, twice a day, three times a day, you're going to get that quick feedback. You're going to make those changes smaller. So, you just see a lot of benefits by moving quicker and having a solid rollback strategy.

WILL: Yeah, but also, I would say that I think rapid deployment, generally, is a best practice, and, for sure, better than quarterly or even daily. Like, just run it. But what if I happen to put two quarters in the gumball machine? What about three quarters? What about five quarters? You know what I mean? What about 30 quarters? And here they go.

And this is maybe less of a, you know...if you have your release strategy for production worked out, for most people going to production, it's not the Wild West, no matter how good your pipeline is. But for stuff like your test environment, your staging environment, your QA environment, I have seen many very Wild West kinds of situations where it's an absolute free-for-all.

And you lose a lot of time on that, where it's like, why are my GraphQL queries...just no GraphQL today. Sorry. Oopsie. We made an oopsie. There will be no GraphQL today. It's like, okay, all right. So, it's too much of a good thing. How do you manage access to the gumball machine when you don't have an operator, let's say.

JUSTIN ELLIS: Yeah, you know, this makes me think of application size. So, I know there's this big monolith versus microservices architecture, right? Some styles are better suited for certain environments. And I wouldn't ever advocate a really tiny microservices infrastructure where you have these tiny, tiny microservices.

I think there's a right size for applications. I think once applications get too big and you have too many engineers, say, working on a single application, once you get there, then you're starting to think about queuing, and dependencies, and ordering, and all of those kinds of things.

I think once you get to that point, I think it's probably wise to start thinking about, do we need to break this application up? Is this one application that, say, 20, 30 engineers are working on, right, and the amount of code that they run on? Is it time to start thinking about maybe extracting that out into multiple services? So, if you're starting to queue, I would say maybe it's time to start thinking about separating those concerns. I think it's always smart to start with a monolithic architecture and then extract pieces from there kind of as you go.

WILL: Oh no, no, I'm not talking about a queue, although that is a problem. I'm talking about I'm developing my microservice, and you're developing your microservice, and Eddy’s got his microservice. And Ramses and Mike are working on theirs. So, we've got five microservices. So, we're not going to go to production until we have our ducks in a row.

But we're all on the staging environment, the QA environment, and we're all pushing up changes, and things are getting weird because we're all just trucking along hoping, expecting, right? Because with all these microservices, these complex applications, we have successfully split them up. But we still interdepend on each other, and if everybody is moving in different directions at the same time, like, things are not always…

We've got a microservice architecture, but we're all pushing our changes up to staging and not for nothing. Like, the data environment is big enough and complex enough so that I can't just...it doesn't just fit on a box. I work for a large retailer, and there's a lot of background knowledge around pricing, and availability, and SKUs, and all this stuff which is way bigger than anybody can have. But I have to have it so that I can get my changes out. And I've seen the end state to a lot of this advanced DevOps stuff, and it can be really challenging in a different way.

MIKE: There's something...and I'm not on the platform engineering team, but I've heard something from them [laughs], and I've talked with them a while about this. You're touching on something, and the problem is the shared environment, right? And you talked about the complex data, absolutely. And this is not an easily solved problem, but it's something that we're actually working on. Because if you can come up with a replicable data environment, create this complex data environment, take a snapshot somehow, right? And that may be through some sort of seeding, or maybe you actually do have an actual snapshot that you copy.

If you can get to the point where you can replicate that data and on the platform side you can replicate the environment, then you can get to the point where there isn't a QA environment, spin up your environment. All the other services are known working based on the most recent build that's in production of the other services. And you're just testing your own in that environment. And that seems like the holy grail.

That's something we've worked toward for a while. And kudos to the [laughs] platform engineering team for getting us there, because we're actually pretty close at Acima. We're actually just kind of right at the cusp. I think we'll be launching that maybe in the next couple of months, to where we can have that, and then there isn't a QA environment. You get the environment you want, and it's isolated. It's a dream [laughs].

JUSTIN ELLIS: I can't tell you how many times...I've been talking about dynamic environments for a long time, but dynamic environments are still the goal. It would be great for QA to spin up a whole new environment and have just your feature branch in that environment. Everything else is on the main branch, with data that's there, that's relevant, that's correlated. I agree with you 100%; that is the holy grail. We're working towards it.

But it is a wicked problem. You have to think about the data side. That's one of the primary things. You have to think about how it interacts with exterior systems, say, maybe, like, a third party that you're interacting with, how that data gets all shuttled through the system, so agree 100%.

MIKE: So, there is a way. There actually is a path through this. There's kind of been a common thread through all of this: things that require interdependencies, coupling, dependencies between each other, are hard to deal with. Anytime you have something shared like that, it's hard to deal with. And a shared environment, yeah, you're coupled tightly, and you've got to deal with that. If you can figure out how to decouple that, so, no, it's not a shared environment. Everything else is not shared. And then, you just put up your own thing. It changes the paradigm, right? It switches the way we think about the problem.

Like, you have, oh yeah, well, deployment is hard, so I'm only going to do it every now and again. Well, now I have all of this entanglement of things and all of the interdependencies that I have to deal with. Well, instead of you disentangling, I just put up one thing at a time, then I get rid of that problem. And you have to rethink the way you approach the problem to eliminate the complexity, and then the problem becomes much easier. What were you going to say, Will?

WILL: Oh, well, you know, I had a question. We were talking about stateful versus stateless systems, right? And I think everybody has bad ops in the beginning for the exact same reason, which is a good reason: because you can't afford it. You're just getting up however you can. You don't have money for ops. You've got whoever can get their EC2 instance up and running and rolling, and then you go with that. And then, you get ops, hopefully, when you can afford it, but probably more commonly when it really starts to exert pain and pressure on the organization.

And so, what makes a service stateful, right? Like, this has to be stateful. This is a stateful thing, where it has to be like that. And it isn't so much of like a well, I'm just going to throw up one instance, and I'll just roll the database in with the instance because I only have one monolith, and there you go.

We'll just put it all up on the same box because that's relatively easy to do. But where it's just like, okay, I'll move it off this box. And now I have two boxes, and then we go and do the thing. But what are the really tough nuts to crack in terms of applications that want to be stateful and they don't want to move into that stateless process?

JUSTIN ELLIS: To me, it's all about the disk, right? So, if that specific application and how it's run and that process needs access to, say, one specific disk, if it's like a regular, say, virtual machine that maybe runs Linux that writes files directly to the disk, that's where you run into problems. Because what you want is, you know, stateless is kind of a misnomer in that you are still storing state. It just happens to be in an external system. And a lot of times we'll use managed services for those external systems.

So, really, it's trying to abstract out and push this stuff into a database, maybe into object storage, and kind of getting that out. Once you've got that stateless, then you can kind of scale up that one process into 10 processes, and they can all use those same data sources. They can still all use that same state. They can all reach out to the same Redis instance. And then, that Redis instance is the one that's providing that state going forward. And a lot of applications just generally depend on the disk. I don't know, Kyle, what else? Can you think of anything else (I'm kind of putting you on the spot) that kind of drives the stateless versus stateful distinction?

KYLE: The one I'm sitting here thinking of is observability-minded, but I'm thinking of Mimir and how they have that, right? They've tried to get out of being stateful. They use an object store, S3. You can use that as your backend, but you still have some of that that is still very stateful. And that's any of the current-time metrics that you're wanting to store. But kind of like Justin is saying here, any of that, as soon as you can just move it off into the object store or anything that's managed, then you kind of get it out of the way so that you can become stateless to a sense.

I think the other thing that is helpful is generally a quorum. If not everybody can be talking to it, not everybody can be writing to the same table at the same time, or whatever it happens to be, you need a quorum to know who's talking to it at the current time, because that's another system, or who's going to be accepting even incoming requests.

Maybe you shouldn't be receiving incoming requests to everybody, load balancing. So, anything like that to where that one source of truth, I guess, is kind of your state, anything not to overload that to allow those frontend replicas or whoever's interacting with that stateful process to allow those to become more stateless to a sense. Hopefully, I made sense on that spiel [laughs].
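
A minimal sketch of the pattern Justin and Kyle are describing: push the state out to a shared external store (Redis here, via the redis-py client) so that any number of identical, effectively stateless replicas can serve the same traffic. The host and key names are placeholders:

```python
import redis  # pip install redis

# Every replica talks to the same external Redis instance, so none of
# them carries state on its own disk or in its own memory.
r = redis.Redis(host="redis.internal.example", port=6379, decode_responses=True)


def handle_request(user_id: str) -> int:
    # Any of the 10 identical processes can serve this request, because
    # the counter lives in Redis, not in the local process.
    return r.incr(f"requests:{user_id}")
```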

JUSTIN CALDWELL: Yeah. And so, you mentioned Mimir, and Mimir is basically just a way to manage Prometheus and kind of scale it out. So, I'm kind of curious, on the Mimir side, are they using...I know that we use kind of an EFS volume. One of the challenges you have with not writing to local disk is that object storage can be slow; for disk I/O, you've now got to go over the network and all that kind of stuff. So, I'm curious, does Mimir use it just for kind of the disk I/O bandwidth generally, or is it just an architectural consideration?

KYLE: Sorry, I guess I'm not understanding your question.

JUSTIN ELLIS: So, when I think about kind of, you know, back to Will's question of what are those main challenges when you are trying to move from a stateful system to a stateless system, what things are going to be difficult to move into kind of more of a stateless model and getting those into, say, object storage or whatever else?

The one thing I can think of there is that having a disk on hand, having these new SSDs available really does give you a lot more bandwidth. So, data proximity really kind of is a big issue. And when you can write directly to disk, that's going to be so much faster, so much quicker than, say, making a network call out to S3 and writing that file. And you want to read those files as well, and that needs to be quick and fast, right?

So, in an application that really does require consuming a lot of data, processing a lot of data in a quick fashion, which I would imagine something like Mimir does, that may be one of the reasons that there are kind of stateful bits throughout their general architecture.

KYLE: Right, right. So, on both the read and the write process, to my understanding, you kind of want that local timeframe, meaning that timeframe that you're going to be using the most of. And it kind of requires a little bit more of an understanding of Prometheus itself. Prometheus data itself gets chunked. And so, you can change that chunk, but I think, by default, it's three hours.

So, when you get that chunked, that chunk can then be sent up to your object store. However, this recent one that has not been chunked, that's the stuff that's being currently written to, so you need quick access to that. So, that's very stateful, whereas that chunk can be up there and be the, quote, unquote "stateless" in S3.

Same thing on the read side. Anything that's been queried recently, we're going to keep that down locally, and the long queries happen when you're having to go past that short time frame. When you're querying for a day, two weeks, you know, a month or longer, then it's going to have to go and pull this down. And that's either landing, like Justin is saying, on a disk, or there are also caches in the Mimir stack to have that in memory, at least for the time being, until that needs to be flushed.

MIKE: So, when there's rapid memory access involved, you're saying it's tricky to send out to something managed [laughs].

JUSTIN ELLIS: Yes, right. Yes. Applications like AI applications need to read a lot of data at a specific time. So, there are unique systems that do require a lot of disk input/output in kind of a quick fashion, and those ones are going to be a little difficult. You're going to have to think more in terms of distributed systems. And distributed systems are hard to manage. And going forward, you're thinking about disk. You're thinking about quorums. You're thinking about ordering and all those kinds of complex problems.

Some of those are solved. Some of those are still yet to be solved. But we do really do rely a lot on kind of managed services, especially for distributed systems. But we also can run distributed systems in-house as well, and they are just a little different to manage, a little bit more difficult.

KYLE: And we have kind of a new technology, new-ish technology that's coming down the pipe, where we're also investigating the utilization of streams. Specifically, we're looking at Kafka, you know, what can that provide us? Because you can just assign multiple listeners to a Kafka topic, and then get the stream as you need it, and that can make things stateless as well.
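
As a small sketch of what Kyle describes, using the kafka-python client: several services can each attach their own consumer group to one topic and read the same stream independently, so adding a new listener doesn't require the producer to know about it. The topic, group, and broker names are placeholders:

```python
from kafka import KafkaConsumer  # pip install kafka-python

# A second service would use the same topic with a different group_id
# and receive its own independent copy of the stream.
consumer = KafkaConsumer(
    "orders.events",
    group_id="email-notifier",
    bootstrap_servers="kafka.internal.example:9092",
)
for message in consumer:
    print(message.offset, message.value)
```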

EDDY: I was going to say, I thought, maybe as we're inching closer to the end of this...we touched a little bit on holy grails in DevOps, so I want to elaborate a little bit on that. What are some things that you'd like to have in DevOps that aren't quite possible yet but would be there in an ideal, holy grail system?

JUSTIN ELLIS: That's a great question. One thing on my wish list, I think it is possible today, but there has been a lot of movement in what's called an internal developer platform. This could be something like Spotify's Backstage, which is an open-source project, which kind of provides the fundamentals of an internal developer platform.

What that is is, basically, instead of having CLIs and having web services like GitHub Actions and Jenkins, you have kind of, like, a central kind of observability coordinator that can tell you the status of your services and your applications. It can tell you what build is out there today instead of having to go to a CLI, and it kind of integrates all the data from the systems to give developers access to be able to see things and perform basic functions. So, I think that is possible today, but we're not there quite yet. Dynamic environments were the other one that we want to get to. And then, that's all I can think of right now, but I'm sure there's probably 10 more [laughs].

JUSTIN CALDWELL: So, I was going to ask, if you guys could just bullet fast, just list through the tools you guys use every day that you love.

JUSTIN ELLIS: Yeah. So, Terraform, that's what we use every day, and we do love it. Terraform really is a great system going forward. An alternative to Terraform is CloudFormation. And there are some even newer ones that work a little bit differently, maybe within Kubernetes, like Crossplane. But I still think Terraform is great for infrastructure as code, and something that's awesome.

Another tool that we use heavily and we use every day, and this is kind of our orchestration framework, is Kubernetes. So, Kubernetes really does have some great fundamentals. Kubernetes comes out of Google, which originally managed its systems with an internal system called Borg. And Kubernetes is kind of an offshoot of Borg and has a lot of the fundamentals that you want when you're running a system, as far as ingresses, deployments, applications, and scaling those out. Those are kind of my two favorites today. Kyle, I'm interested to hear what your favorites are.

KYLE: I'm going to surprise everybody here and say, in general, the Grafana Labs tools specifically. It's a great product. Most everything that they have is a great product, and they are open source. They serve their open-source community really well for monitoring. It has been needed in my tool belt. I would also throw out Prometheus. Prometheus is something that was needed: a high-performance time series database. And to go along with Justin: Terraform and Kubernetes.

I also find that I have to throw out my IDE, and that's VS Code. VS Code is something I could not live without. I've tried to use some of the other IDEs out there, and I'm not as comfortable with them. So, there's that. And I'm trying to think if there's anything else. Oh, Lens is a visualization tool to see into Kubernetes. Right now, I'm currently using OpenLens, the open-source version, but any visualization tool to see into Kubernetes and be able to manage that. AWS' offering is getting a bit better. But I still really rely on OpenLens in order to do my day-to-day.
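
For readers who haven't seen the Kubernetes fundamentals Justin listed (deployments, identical replicas, scaling those out), here is a minimal sketch using the official Kubernetes Python client to declare a Deployment. All names and the image are placeholders:

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # reads your local kubeconfig

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="gumball-api"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # scaling out is a one-field change
        selector=client.V1LabelSelector(match_labels={"app": "gumball-api"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "gumball-api"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(name="api", image="example/gumball-api:1.0")
                ]
            ),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```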

JUSTIN ELLIS: And I'll just add, obviously, Docker and containers just in general.

KYLE: Oh yeah.

JUSTIN ELLIS: That's kind of our bread and butter. I know that developing in Docker sometimes can be a pain. But for what we do when we manage clusters and applications, Docker really just gives us a single target to build towards, and then we can basically put that anywhere. And then, I'll just also say AWS, generally, GCP I really like as well. These cloud platforms have become so mature and so great.

It does require a lot of kind of...this is why I don't want developers to necessarily have to understand AWS. When you go into the console, you see all the tools that they have and all that kind of stuff. So, I've always wanted to create kind of an opinionated framework about what tools we kind of use going forward. But AWS is just phenomenal. I really like GCP as well. Modern cloud platforms are really just pretty awesome and just a game changer. And something so fundamental I don't even think about, you know, AWS has only been around since 2008, or something like that, but really just super fundamental to what we do.

MIKE: And that does bring us, I think, really well to a good conclusion. We've talked about a lot of things, about how to bring legacy systems forward, about the critical nature of having the easy plug and play [laughs]. It does what you expect it to do. You have a configuration file and done, so you don't have that complex interface. And then, the issue of reducing complexity all around and how that applies in a lot of different parts of our work.

Any final words particularly from the platform engineering folks that you'd like to share as people finish this podcast?

JUSTIN ELLIS: I will just say, I think it's a really exciting [laughs]...I know it sounds weird, but it's a really exciting time for infrastructure with this new AI tooling and increased capabilities there. I'm just really excited about the future and how we kind of manage infrastructure going forward.

I think a lot of those fundamental pieces that we were just talking about are now in place. There are some really impressive open-source projects. I'm just really optimistic about the future and how we manage stuff going forward. I think we're just in a great spot in that these ecosystems are going to continue to grow.

Kubernetes has kind of spawned what's called the Cloud Native Computing Foundation, which is kind of very similar to the Linux Foundation. But if you go there and you look at the landscape, and all of those tooling, and SaaS, and frameworks, and open-source projects, it's just really an exciting time in infrastructure, and I think we're in for a lot of improvements in the near future.

MIKE: Great. Well, if you're listening [laughs] and you'd like to get here, it's possible. There's so much opportunity to dramatically improve the experience and the return on investment of your infrastructure by approaching it with some of the tools that are available. And kudos to the team here for having made that happen.

JUSTIN ELLIS: Thank you.

MIKE: With that, until next time on the Acima Development Podcast.