Episode 36
Debugging Tools
January 3rd, 2024
45 mins 42 secs
About this Episode
The panelists discuss the nature of debugging in software development. What even is a bug? Various perspectives are shared, highlighting that bugs often result from unexpected behavior in software despite the software doing exactly what it was programmed to do. Bugs, though usually seen as errors, can sometimes become features or lead to new insights.
The discussion shifts to more philosophical and methodological aspects of debugging. It's pointed out that encountering bugs in software implies a need for more understanding of the program or the problem it solves. This realization opens avenues for deeper exploration and learning rather than fixing a fault. The role of testing in avoiding bugs is emphasized, particularly the importance of understanding and aligning with user expectations. The group also touches on the evolution of debugging techniques over time, moving from simple console printouts to more sophisticated methods but still acknowledging the effectiveness of basic approaches in specific scenarios.
Additionally, they talk about the complexities of debugging in a modern, service-based architecture, where bugs might involve multiple services and systems. Strategies like dividing the problem space, creating narratives or models to understand the issue, and ensuring environmental consistency are discussed. The conversation ends with a focus on the human aspect of debugging, including the necessity of taking breaks, the value of teamwork and peer review, and recognizing one's own cognitive limitations during the debugging process.
Transcript:
DAVID: Hello, and welcome to the Acima developer podcast. I'm David Brady, and we are going to talk about debugging today. And we put out the call that we're going to talk about debugging, and we got quite a quorum. So, it's exciting. We've got just a whole pile of people on. When you jump in the call, just tell us who you are, and we'll just go from there. Debugging, why?
TAD: Well, should we even get more simple than that?
DAVID: Sure.
TAD: What exactly is a bug? What would you call --
DAVID: Okay, yeah.
TAD: Is it bad values? Is it bad process? Is it bad behavior? Like --
EDDY: I would like to premise that bugs, historically, have led to some really good features.
DAVID: Minor trivia bit: the very first bug was an actual bug. Google that if you want. Grace Hopper fished an actual moth out of one of the relays, one of the very first computers. And there's a picture of it online. She taped it in her journal and saved it for posterity.
MIKE: Well, that says something about what a bug is, though. She went and pulled that out of the relay because there was unexpected behavior.
DAVID: Yes.
MIKE: [laughs] And the software does exactly what we tell it to do. The problem is sometimes that doesn't match our expectations. And that can be a minor inconvenience, or a neutral, even a positive thing, or it can be catastrophic [laughs]. I worked at a plant nursery many years ago. And my boss asked me, "What's a weed?" And I tried to give some explanation. And he says, "A weed is any plant that's growing where you don't want it." Bugs is kind of the same thing. I think it's unexpected behavior.
DAVID: One of the best things that somebody told me really early on in my career is kind of an intellectual ego check. He said, "Whenever you have a defect in your software or a bug in your software, it means you don't understand your program." And I'm like, "But yes, I do," oh, no, I don't. Provably, I do not because I think it will do this, and the computer is doing something else.
And that kind of opens you up to the wonder and puts you in a positive frame of mind of like, well, I need to figure out why this is doing this, rather than, like, I'm dumb, or the computer is dumb. Just like, okay, I wanted this, but I got that. Let's figure out why.
MATT: And that could also mean that you don't understand the problem that you're trying to solve.
DAVID: Or the solution, right? You don't understand the solution you're trying to do.
TAD: I thought it was really interesting...last week, David and I wrote some code together, and we put it in production. And they had us pull it back out. And some of the people in management are like, "Well, you should have tested it better." And the thing is, we had written a bunch of tests for it, and the behavior was exactly what we thought the behavior should be.
But when we put that in production, the people who were using it were like, "Wait, this isn't what I wanted. This wasn't what I expected." This wasn't [laughs]...like, the tests proved that the behavior was exactly what we thought it should be. But it was unintended, unexpected behavior by the end user. And so, technically, is that a bug? I guess if it's unintended behavior, and it's from somebody's perspective that's using it.
MATT: And maybe we can talk about how do we prevent these bugs, right? I think it starts with a clear understanding of requirements, a clear definition of requirements, and testing to those requirements. Because a lot of times in software, the business side will have a very different expectation than we do on the engineering side of what should happen, right? And I think if you can clearly define those things, then we run into fewer, what we call, bugs.
DAVID: That's an interesting definition. One of the early things that we talked about when unit testing was getting popular in the early days of XP, so, like, we're talking, like, Y2K, like, 20 years ago, they mentioned that one of the great things about a unit test is that it restates the problem or, you know, it kind of describes the behavior at a higher level.
And it gives you this ability to go to, like, the end user and say, "Is this not what you wanted it to do? Because if this is what you wanted it to do, I've got green dots. I can prove that it does this. But this is that, right?" It's like, "No, that's not what I wanted it to do. I wanted this other thing."
MATT: Yeah. You know, many, many, many moons ago, before all of these testing frameworks existed, we were writing code based on our expectations, right? The early days, old PHP days, Perl days, those are the languages I was in primarily, you know, 20-plus years ago. We were creating our expected behavior, and it was really hard to meet the needs of the requester because there was no clear test case to prove that you're meeting those needs. So, you run into a lot of unintended consequences and a lot of downstream problems.
JUSTIN: One of the things I've seen as well in my career was, initially, I was programming to meet the requirements. And then, I was programming to meet the requirements, assuming that the customer is an idiot. And now I'm programming to assume that the customer is malevolent. So, it goes along to, you know, really define what your requirements are and what your limits are.
DAVID: You've moved from programming Murphy's computer to programming Satan's computer, right?
JUSTIN: [laughs]
DAVID: The actual quote, from a security talk I went to a while ago, is that with Murphy's computer, anything that can go wrong will go wrong. With Satan's computer, everything that can go wrong will go wrong, at the same time, at the worst possible moment, because that's what an attacker will engineer, right?
MATT: That being said, what happens when we introduce a bug? How do we track them down? What tools are we using to find them? And what is everyone doing? Because I imagine everyone in this room and on this call uses some different techniques.
MIKE: Can I say something here on this one? In some groups, I'm an outlier; in some groups, I'm not. Early in my software development career, I learned to be kind of shy of debuggers because they sometimes got in my way [laughs]. You know, they're great when they work, and other times, they're slow, and they break. And then, you could spend a lot of time wrestling with the tool rather than solving the problem.
So, for almost all of my career, I've been a print-a-line-to-the-console debugger more than anything else. I'll go in my code, have it print something out. It works in every language. You can have a consistent process where you just know what to expect. You can get good at it. You can get it to print something out. You run the code. You exercise it. You know that you're exercising that code because you get something printed out. You get good at filtering through logs. It's a really straightforward process that anybody can do.
I don't want to say that debuggers can't be useful, and sometimes they can be really useful. I might have used, like, a C debugger to debug Ruby to try to get down to some really gnarly stuff before, but I do that very, very rarely. And I find that just printing out stuff to the console is extremely effective. Because, for me, the most important thing I'm wanting to do when I'm debugging is figure out what's going on somewhere. If the bug is something that's unexpected, then there's something in there that I don't understand.
And so, I want to document, you know, to have some documentation thrown out from the code saying, "This is what's going on here," so I can figure out where my expectations are not being met. And I think that we'll probably talk about more sophisticated tools than that. But I think that kind of helps set the stage for further conversation is that what we're really trying to do is figure out what's going on. And simple tools, honestly, I've been doing this for a long time, and they've served me very well.
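The print-to-console approach described here can be as small as one helper. The sketch below is only an illustration: `debug_puts` and the `order` hash are hypothetical names, not anything from the panelists' codebase.

```ruby
# Hypothetical helper for throwaway print debugging: prints a labeled,
# inspectable view of a value and returns the value unchanged, so it can be
# wrapped around an expression without altering behavior.
def debug_puts(label, value)
  puts "DEBUG #{label}: #{value.inspect}"
  value
end

# Hypothetical data standing in for whatever is being investigated.
order = { id: 42, total_cents: 1999, status: "pending" }

# Wrapping an expression: the assignment still receives the original value.
total = debug_puts("order total", order[:total_cents])
debug_puts("order after lookup", order)
```

A fixed prefix such as `DEBUG` makes the lines easy to filter out of noisy output and easy to find again when it is time to delete them.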
MATT: Seeing your output, right? Because, generally, that's the bug is you're getting output that is unexpected. And just seeing the output as you step through your methods or your classes it's powerful. And that's generally how I track them down as well.
TAD: Some of the hardest bugs I've ever had have been bugs with, like, timing issues, where the IO of a put statement or a print statement throws off the timing. And so, putting print statements in fixes the problem, and taking the print statements out reintroduces the problem.
MATT: Nice.
DAVID: Nice, a wicked Heisenbug.
EDDY: What's a Heisenbug?
DAVID: Oh, sorry --
MIKE: Ooh, you're learning a beautiful, new word [laughs].
DAVID: Yes. So, it's a corruption of the Heisenberg Principle, which is that at the quantum level, you cannot observe a thing without changing it because the act of bouncing a photon off of an electron moves the electron, right? You can't observe something, so... And so, yeah, a Heisenbug is a bug that, when you look at it, changes or goes away.
My favorite demonstration of this is some code, and you're teaching somebody to step through it in the debugger. So, you write a loop. It's like ten times do, and then you seed the random number generator with time dot now. And then you say, print, you know, puts, you know, rand, and it puts rand 10, right? And so, you run it, and you step through it in the debugger, step, step, step, three, step, step, step, five, step, step, step, nine, right?
And then you take out the debugger, and you run it, and you get 7777777. It's all sevens. And you're like, what? Time dot now returns time down to the second. If your entire program runs inside one second, you're seeding the random number [inaudible 09:24]. But if you're standing there going, 'next instruction' in the debugger, you're burning down the clock. And that changes the state of the time input.
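A rough reconstruction of that demonstration, for readers who want to try it. This is a sketch under assumptions: the seed is taken as `Time#to_i` (whole seconds), and the clock is passed in as a callable so both the full-speed run and the "stepping in the debugger" run can be simulated without waiting. The name `draws_with_reseed` is hypothetical.

```ruby
# Re-seeds the random number generator from a clock before every draw,
# which is the mistake the demonstration relies on.
# `clock` is any callable returning a Time (a stand-in for Time.now).
def draws_with_reseed(count, clock)
  Array.new(count) do
    srand(clock.call.to_i) # whole seconds only, so the seed repeats within a second
    rand(10)
  end
end

# Full speed: every iteration sees the same second, so every draw is identical.
frozen = Time.at(1_700_000_000)
fast = draws_with_reseed(10, -> { frozen })
puts fast.join # ten copies of a single digit

# "Stepping in the debugger": each iteration lands in a later second,
# so the seed changes between draws and the output varies.
tick = 0
slow = draws_with_reseed(10, -> { Time.at(1_700_000_000 + (tick += 1)) })
puts slow.join
```

Seeding once, outside the loop, removes the dependence on how fast the loop runs.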
EDDY: So, it's actually really interesting. Mike, you mentioned...you said that you, early in your career, you printed stuff out, right? Do you actively use that as part of your go-to resource in order to debug something, or has that --
MIKE: It's almost my exclusive resource still, after decades. [crosstalk 09:50]
JUSTIN: So, when you print stuff out, at what logging level are you printing it out?
MIKE: Oh, good question. So, guys, nothing against using log levels, right? Using a logger saying...my default, it depends [laughs]. It depends. If I'm in a local environment, I'll probably use debug. That way, it's harmless if it actually went into production [laughs] because production doesn't have debug turned on. It's a nice little safety thing.
JUSTIN: I'd like to push back on that just a little bit because I have seen releases to production with the debug turned on because it didn't understand which environment it was in. And that resulted in passwords being [laughs] put out in the logs.
MIKE: Oooh.
DAVID: Oooh.
JUSTIN: So, that was a fun time. But I think that even if you are debugging or you're doing a debug log level, you shouldn't ever debug with PII data. And so, either mask that or log something else that will help you debug. That's my only advice there, so...
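The two points in this exchange, that debug output is normally dropped in production and that sensitive values should be masked anyway, can be sketched with Ruby's standard `Logger`. The `mask` helper and the sample value are hypothetical, and keeping the last four characters is just one possible masking policy.

```ruby
require 'logger'
require 'stringio'

# Hypothetical masking helper: keeps the last four characters for correlation
# and hides the rest. Adapt to whatever counts as sensitive in your system.
def mask(value)
  text = value.to_s
  return "*" * text.length if text.length <= 4

  ("*" * (text.length - 4)) + text[-4..]
end

log_output = StringIO.new   # stands in for a log file
logger = Logger.new(log_output)
logger.level = Logger::INFO # a typical production setting: debug lines are dropped

ssn = "123-45-6789" # fake data for illustration

logger.debug { "lookup ssn=#{mask(ssn)}" } # not written at INFO level
logger.info  { "lookup finished" }

logger.level = Logger::DEBUG # the misconfiguration described above
logger.debug { "lookup ssn=#{mask(ssn)}" } # written, but still masked

puts log_output.string
```

Masking at the call site means a wrong log level leaks less, because the raw value never reaches the logger in the first place.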
MATT: And for context, to our listeners who are not going to visually see this, that was Justin who was just speaking. And he is on the Acima security team. [laughter]
JUSTIN: Yes.
DAVID: You guys are actually implicitly talking about something that I think is really, really interesting. I will debug with puts statement. I won't even use logger.debug. I'll use puts.
MIKE: I do, too.
DAVID: It's sloppy. It works great from localhost. And we have a code quality checker that will say, "Don't use puts; use logger." I'm like, no. What I want is to delete this, right?
MIKE: Exactly.
DAVID: But I don't --
MIKE: That's actually what I usually do as well, too [laughs].
MATT: I do the same. I use Awesome Print because it formats for me and makes things look a little nicer.
DAVID: Oooh. Awesome Print is a tool I haven't looked at. That, I'll have to check that one out.
EDDY: Can you explain what Awesome Print is? When you're done, Dave.
DAVID: Yes. I was just going to say the converse is that if I'm running something on preflight, I don't have access to the IO console, or if I'm...well, in production. This week, I got access to preflight, so I can see things that go to the console there. But, like, in production, I don't have that. So, going to a logger is really good.
But you know what also falls down in production? Is running the debugger because I can't stop the server, you know, remotely. I mean, I can. Right now, I can. Justin would stop me if I did, at least in production. But, like, that's a bad idea, right? Bringing down the entire website just so you can debug something in one thread, probably a bad idea. So, there's trade-offs here. It's fantastic. Eddy, you were asking about Awesome Print. Matt, tell us about Awesome Print.
MATT: It's a Ruby thing. It's a gem. It will just output to your console, but it formats for you. It'll color-code hashes, you know, just make things really readable and legible. I use it every single day, and that's my go-to usually. Second would probably be Pry. But I also still do some really old-school techniques, and that is just tail dash f my development log, and that way, I can see live output.
DAVID: There's a lovely thing...we got to see if we can track down Charity Majors and see if she would come on. She loves talking on and on and on about observability. And she works at a company that does logging and metrics, and so she's all about that. But when you talk to somebody about this, you start realizing that's all debugging really is, right? Is we've got all these questions, like, what's going on, and how can I look into this? Where can I get some observability into my data, or pretty-printing, console debugging, and logging, right?
I will say, as a person who came from, like, the C world and debugging, I spent a ton of time down in, like, Ruby debug land. I love using, like, Byebug because I can watch a variable. I can do a thing where it's like, I don't know who's changing this, but I can stop the program when they do. And I can, you know, put a breakpoint on this. And then I can walk away and go do another thing. And then boom, it hits. It triggers. That's a little harder to deal with.
But, I guess, with the puts, you could go in and say, you know, "Print when this happens." But if it's like a variable, like a change on a variable, you can track that. And that's very, very exciting to watch something, you know, live, but have the thing actually fire a trigger and say, "No, no, no, no, hang on. We want to talk about this." And that's, like, a great observability tool.
MATT: Yeah, and for things like Ruby, you can step through with debuggers, and a lot of the IDEs just provide them. And that's nice. I rarely use them because, generally, the things that I am looking for I know where to look. But that's really powerful if things are happening deeper in your stack, and you don't know what you're looking for.
DAVID: Yeah. The downside compared to Pry of, like, Byebug or Ruby debug is that your REPL, your little read-eval-print loop thing, is just kind of, like, a one-line thing. You can evaluate a statement in Byebug, and it will return you the value, but, like, modifying code on the fly is a little bit trickier.
I know that the folks that use Pry love this because it basically gives you an IRB prompt. You can rewrite a function and then rerun it. I am not super fluent in that. Is somebody here...does anybody on the call like to do it that way? If not, we can tell people to go Google it. But does anybody here debug that way, that's good at that way, that can talk to that?
MATT: I've done it a little bit. If I have an idea of a change, if I'm in the right console, occasionally, I'll make that change, and then see if the output is what I expect, and then go make the change in the code. I wouldn't say I do it often, but I certainly have done it.
EDDY: I think I use it religiously, honestly.
DAVID: You do?
EDDY: Yeah. Like, I'll inspect something, and I'm like, cool, is this outputting whatever I expect it to output? Or, you know, if I have a check in place, you know, like, if current user dot permission granted, or whatever, and like, I'm trying to make sure that it does have the right rules or whatever. Or if I'm checking for a bool output, I want to make sure that that [inaudible 15:43] class is doing what I expect it to do. It could be a typo, you know; it could be checking something that I wasn't expecting. So, being able to manipulate the data inside that session is pretty nice.
DAVID: Fantastic. Does anybody here use, like, the code rewriting ability by Pry? Because that gives an IRB prompt. You can literally just, like, open a class and jam in a method. There's a related technique that I also don't do, which is writing code at an IRB prompt or at a Rails console. Does anybody use that technique?
MATT: That's what I was referring to. Like, if I have a class method that is doing something unexpected and I have an idea for a change, I'll just put the entire method in and rewrite it inside of the console.
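What that looks like in practice: Ruby lets you reopen a class and redefine a method in the running session. The sketch below uses a hypothetical `PriceCalculator` class; in a real console session only the second definition would be pasted in.

```ruby
# Hypothetical class standing in for the code under investigation.
class PriceCalculator
  def self.total_cents(subtotal_cents, tax_rate)
    subtotal_cents + subtotal_cents * tax_rate # surprise: this is a Float
  end
end

before = PriceCalculator.total_cents(1000, 0.07)

# Pasted at the console: reopening the class replaces only this method
# for the current session. The file on disk is untouched.
class PriceCalculator
  def self.total_cents(subtotal_cents, tax_rate)
    (subtotal_cents * (1 + tax_rate)).round
  end
end

after = PriceCalculator.total_cents(1000, 0.07)
puts "before=#{before.inspect} after=#{after.inspect}"
```

Because the change lives only in the session, the fix still has to be copied into the source file once the output looks right.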
DAVID: Okay, yeah.
MIKE: I do a REPL all the time, you know, that's why I use simple, you know, print statement out to the, you know, standard out. And then when I want to test something, I'll pull up, you know, the interactive console and play with it. Software isn't quite tangible. That's about as close as you get to tangible [laughs].
DAVID: Oh yeah. It's [crosstalk 16:42]. Yeah, you can actually manipulate it.
MIKE: I'll often write methods where I put...and you almost never use semicolons in Ruby. You're like, you can't use semicolons in Ruby? Well, you can, but nobody does. But when I'm doing interactive coding with a console, I do that all the time. I'll have my method written with semicolons between it, so I just tap my up arrow, go back and modify it, run it, tap my up arrow, go back and modify it. It's just really easy.
MATT: We use the semicolons a lot in the production environment. So, we don't get output when we're loading models.
DAVID: Oh yeah. It's like load user semicolon nil.
MATT: Yep. And I'm finding a little bit of a pattern here. I'm noticing that those of us who have, and, you know, I don't want to state age, but those of us who are a little bit older and been around for a long time use a lot of very similar tactics. And we have what one may refer to as old-school methods.
DAVID: Yeah. I'm actually coming from a different side of the old school, right? Where for me, I almost never code at the console or an IRB. For me, it's the code, compile, test mentality. So, like, if I want to play with something, I'll open it in the editor. I'll tweak it, and then I'll rerun it, and then, you know, tweak it and rerun it. And then I'll, like, edit the test and then rerun the spec. And that's an instinct to me.
Modifying something in IRB is great. Like, if I'm trying to explore, like, I don't know how this thing works, I love going into IRB and doing this. But, like, if I'm actually trying to write the function, I want to be in my editor so that when it does work, I'm done, rather than having to pull it out and drop the code into an editor. Like, I want to be done. I don't want to have an extra step.
MIKE: For me, I care about how long is it? If it fits in one line, I'll do it in IRB.
DAVID: In IRB, sure.
MIKE: I'll do it in IRB. If it doesn't comfortably fit there, then I will pull up a file and do it that way because I don't want to have to restructure everything and fight with that. So, for me, it largely [chuckles] comes down to just I want to do less typing, you know, lazy developer.
TAD: So, Dave, is that debugging? Or is that just writing code, or is that both?
DAVID: For me, both. With debugging, it will be a lot...I guess what I'd say is if debugging, I probably won't edit code. It's, like, I don't program at an IRB kind of thing. But I'll crack up an IRB to explore or a Rails console to explore and find out, like, where is this coming from? And who is doing this to me? That sort of thing.
It's very, very rare that I will grab object dot method, colon, method name dot source location. I'll do that at a console all day long. It's like, you know what? Ask Ruby to tell me where this source code is so that I can track it down. I think I've done that once or twice from inside code, but very, very rare.
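The call being spelled out here is `Method#source_location`, which is part of core Ruby. A minimal sketch, using a hypothetical `Invoice` class in place of whatever object is under investigation:

```ruby
# Hypothetical class; in a real session this would come from the app or a gem.
class Invoice
  def total
    0
  end
end

# Ask Ruby where the method was defined: returns [file, line].
file, line = Invoice.new.method(:total).source_location
puts "Invoice#total is defined at #{file}:#{line}"

# instance_method gives the same answer without building an object first.
same_file, same_line = Invoice.instance_method(:total).source_location
```

For methods that are not defined in Ruby source (for example, ones implemented natively), `source_location` returns `nil`, so the destructured `file` and `line` would both be `nil`.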
JUSTIN: One question I have for the group is, like, what is your strategy if you get handed a bug from production?
MATT: That was a little bit of where I was about to go.
JUSTIN: Oh, perfect.
MATT: What I was going to say is we've talked a lot about a bug inside of our codebase. However, we operate in a large ecosystem of services. What happens when you get a report of a bug from your production environment that is hitting five or six different services? And then, how do we debug that, and what kind of things can we do and put in place to make that easier on us?
DAVID: I have a first technique that I will go to, which is Newton's approximation, which is to split a thing in half and then subdivide it. And, man, I learned this soldering radios back in high school working, like, ham radio stuff. If you get a circuit, a long-convoluted circuit, and it doesn't work, and there's 1,000 components on it, you pick a point roughly in the middle, and you inspect. Like, you get in, and you basically say, "The data at this point, is it messed up already, or is it still good?" If it's still good, the 500 things upstream are probably okay. The bug is probably downstream at that point. I've just eliminated 500 candidates from the list.
And if it's bad already, then I know it's upstream, right? Then you split again 250. You split again 125. You keep splitting, splitting. And if you've got 1,000 components in 10 splits, you can come down to the line of code or to the specific component that is causing that thing. And I learned that with a soldering iron, you know, trying to isolate, you know, is this a bad capacitor or a bad transistor? And it translated so easily onto, like, systems programming. And I was like, well, yeah, it makes sense.
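The halving strategy described above is more commonly known as bisection or binary search. A minimal sketch, assuming the components run in a fixed order and that once the data goes bad it stays bad downstream; `first_bad_stage` and the 1,000-stage pipeline are hypothetical.

```ruby
# Finds the first stage whose output is already bad.
# The block answers: "is the data already messed up after stage i?"
# Assumes at least one stage is bad (the final output is known to be wrong).
def first_bad_stage(stage_count)
  lo = 0
  hi = stage_count - 1
  probes = 0
  while lo < hi
    mid = (lo + hi) / 2
    probes += 1
    if yield(mid)
      hi = mid     # bad already: the fault is at mid or upstream of it
    else
      lo = mid + 1 # still good: everything up to mid is cleared
    end
  end
  [lo, probes]
end

# Hypothetical pipeline of 1,000 components where number 637 is the culprit.
faulty = 637
stage, probes = first_bad_stage(1000) { |i| i >= faulty }
puts "fault at stage #{stage}, found in #{probes} probes"
```

Each probe discards roughly half of the remaining candidates, which is why 1,000 components need at most 10 inspections.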
JUSTIN: Did you understand the entire thing? Because sometimes, with a big, complicated system, it's hard to understand, like, all the parts to it, right?
DAVID: Mm-hmm. Mm-hmm.
MIKE: For me, I think a lot of the place I start is...and I don't know if I took this quite explicitly. I'm thinking about the conversation I [inaudible 21:18] earlier. What's the story here? You're probably not going to do very well; just kind of random cherry-picking. Well, might it be over here? Might it be over here? Instead, you say, "Well, what was the process that led to this? Who was the user that caused this? What steps did they go through to get it?" You know, and part of why steps to replicate are so critical to being able to diagnose problems. How did they replicate it, and what was the exact outcome?
And you can start thinking through what's possible and filtering down with that story. Narrative structures really work well with our brains, I think. And having that structure gives you a place to start. So, if you don't have the story, you've got almost nothing, you know, you've got a whole bunch of letters and numbers, and that doesn't do you very much. But if you have a story, well, I can do something with that.
DAVID: If you don't have a story, your brain has a working memory of about seven items. So, you know, a system with seven components, you can debug. You need a story if you want to go bigger than that.
EDDY: I think the most challenging part [crosstalk 22:15] of debugging stuff is when you can't replicate stuff in lower environments, and you're just playing with, like, I'll just shoot up in the air, and hopefully this fixes it, right? Like, how do you go about debugging something like that that isn't reproducible in a controlled environment?
TAD: Can you debug something that you can't reproduce, that you can't tell the story about?
MIKE: Those are the hardest ones [laughs]. But I think the answer is still the same. It's about becoming a really good storyteller [chuckles], honestly. You think, well, I don't know what's going on here, and I can't reproduce it somewhere else. But I can think about what's going on in this environment. And then you start asking questions, well, how does that environment differ from my local environment?
MATT: Yeah, there's a Latin term called ceteris paribus used all the time in economic principles, and what that means is all things being equal. So, one of the first things—and we have learned our lesson a few times with this—is we need to check our environments. Are they equal? Are they set up the same? Do we have the proper feature flags in place in one versus another environment? You know, we've been caught a few times with that. That's a really important thing to take into consideration.
And a lot of times, we overthink a problem when we're trying to solve it instead of thinking in the simplest terms. And if you can think of the simplest things first, a high percentage of the time, that's the cause.
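Checking whether two environments are "equal" can start with something as simple as diffing their settings. The sketch below is an assumption-laden illustration: the flag names and values are made up, and `config_diff` is a hypothetical helper, not an existing tool.

```ruby
# Returns the settings that differ between two environments as
# { key => [value_in_a, value_in_b] }. Keys missing on one side show up as nil.
def config_diff(env_a, env_b)
  (env_a.keys | env_b.keys).each_with_object({}) do |key, diff|
    diff[key] = [env_a[key], env_b[key]] unless env_a[key] == env_b[key]
  end
end

# Hypothetical feature flags and settings for two environments.
staging    = { new_checkout: true,  ruby_version: "3.2.2", queue_workers: 2 }
production = { new_checkout: false, ruby_version: "3.2.2", cache_ttl: 300 }

config_diff(staging, production).each do |key, (a, b)|
  puts "#{key}: staging=#{a.inspect} production=#{b.inspect}"
end
```

An empty result does not prove the environments behave identically, but a non-empty one is a cheap first place to look.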
TAD: Is this just science where I observe, I form a theory, I test my theory, I observe, I form a theory, I test my theory?
DAVID: Yeah, the method of hypothesis makes sense.
MIKE: I think it is. I was going to say something slightly different that there's been a number of times and probably a lot of times when I've had a bug that I couldn't understand, and I said, "Well, this is impossible." And I think every time I've ever thought that I was right [chuckles]. Going back to what Matt said, I was absolutely right because my mental model of the situation had a critical flaw [chuckles]. And if it had happened the way I was thinking it was happening, it wasn't possible.
So, going to what Matt was saying [chuckles], stepping back to think, well, all this is equal; apparently, something is wrong. Something is different here that I am not thinking about. And I have to think about, well, if this is impossible, that means I'm the one who's wrong. I'm thinking about this problem wrong. And I have to start exploring options as to what might be possible, which means that my focus is too narrow. And it's usually kind of a meta thing. Like, maybe the environment is different [chuckles]. Well, what is different about my environment?
TAD: I actually had a manager, and he'd say something like, "Well, did you drop the apple?" right? But that meant, did you test that gravity is working right now? Something that you assume is always true. Like, I'm going to drop this apple. It's always going to fall. Gravity is always working. And it's always attractive, da, da, da, right? Like, a fundamental assumption that I totally believe.
And it's like, okay, I've established that assumption. Okay, maybe I need to back up and drop some apples and test all these assumptions that I'm making and see, oh, this assumption that I believed was always correct is 90% correct, right? Or there is something that I assumed to be absolutely the thing isn't actually the thing.
DAVID: There's a thing that I'll do in RSpec called writing a cockroach test, and it's the opposite of a canary in the coal mine. A canary is any little breeze kills your spec, right? It's literally a flaky spec. A cockroach spec is if you've ever worked on any of the projects I've worked on, you will find me doing expect 42 to equal 42. I have seen that test fail, and when that test fails, it means gravity isn't working.
Okay, to be fair, it wasn't that 42 wasn't equal to 42. It was that RSpec was broken. And we all thought it was working the way we thought it was, and it wasn't at all. Like, the whole framework was broken. So yeah, dropping the apple, I'm going to steal that. I like that.
MATT: There are also some things we can do to make our lives easier, especially in service-based architecture. One of the things we do here at Acima is we create a request ID for any request that gets passed along to all of our services. So, if we go look in our logs, we can search by that request ID and track it through each of our systems, you know. And I think being proactive is also something that we need to be really conscious of as well.
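A toy sketch of the request ID idea, with services simulated as lambdas in one process. Everything here is illustrative: the service names are made up, and the `X-Request-Id` header is a common convention assumed for the example, not a detail confirmed in the conversation.

```ruby
require 'securerandom'
require 'logger'
require 'stringio'

log_output = StringIO.new # stands in for a shared log store
logger = Logger.new(log_output)
# Keep the format minimal so the request ID is easy to search for.
logger.formatter = proc { |_severity, _time, _progname, msg| "#{msg}\n" }

# Each hypothetical "service" logs with the incoming ID and forwards the same headers.
call_service = lambda do |name, headers, downstream = nil|
  logger.info("request_id=#{headers['X-Request-Id']} service=#{name}")
  downstream&.call(headers)
end

billing = ->(headers) { call_service.call("billing", headers) }
orders  = ->(headers) { call_service.call("orders", headers, billing) }
gateway = ->(headers) { call_service.call("gateway", headers, orders) }

# The edge of the system mints the ID once; everything downstream reuses it.
request_id = SecureRandom.uuid
gateway.call({ "X-Request-Id" => request_id })
gateway.call({ "X-Request-Id" => SecureRandom.uuid }) # unrelated traffic

# Searching the logs by ID recovers one request's path through every service.
trace = log_output.string.lines.grep(/request_id=#{request_id}/)
puts trace
```

The important design choice is that no service generates a fresh ID for a request it received: it only passes along the one it was given.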
JUSTIN: I like how you said proactive, and I'm a big believer in logging and logging all the paths you're on, and especially logging that enables monitoring. So, logging enables monitoring so you know when things go wrong. And, you know, logging your request IDs is good, but also, looking at the path that the user went through at each level is really helpful for understanding the history of that user or that request and seeing if it took the anticipated path. Of course, any errors are really useful with that as well. So, I think the request ID is attached to that error case, isn't it? When an error gets logged.
MATT: It depends on where we're logging and what we're looking at. For instance, in, like, our Grafana logs, that request ID is attached to everything.
JUSTIN: Perfect.
MATT: Rollbar may or may not be. It depends on how the engineer logged it when they were catching errors.
MIKE: And Rollbar catches exceptions [chuckles], so things that are unexpected. You may not have access to everything you wish you had.
DAVID: Matt, when you say proactive, what do you mean?
MATT: Exactly what it sounds like. Put things in place so if you do run into bugs, it makes them easier to track down.
DAVID: Okay, okay.
JUSTIN: Catching and logging those exceptions.
MATT: Yeah, but --
DAVID: Knowing we're definitely going to need this. Let's go ahead and do it now.
MATT: Right. Right. The more preventative you can be, the fewer bugs you're going to run into. And let's be real with each other; bugs are inevitable at some point. I have never seen a piece of software that is bug-free, again, because we are humans, and we build software. But the easier we can make those bugs to track down, the easier it is to fix and the fewer we're going to have.
DAVID: There's an interesting epiphany you guys have given me as we talk through this, which is a thing that I got told years and years ago is that debugging doesn't stop when you run out of answers. Debugging stops when you run out of questions. And the epiphany that I'm getting from talking here in the call is that the best debugging skill is knowing which questions are going to eliminate the most false candidates from a thing. And I'm liking this.
Yeah, it's like, if I need to trace this down and I don't have the request ID, and if I have to go at every step and go, okay, when you pass this to this system, you got back that ID, okay, now we have to switch to that ID to trace through. Make it easier to track down the data you need. That totally makes sense.
MATT: And a lot of it just comes with experience, you know, it's trial and error. You find what works for you. Everybody's workflow is going to be a little bit different. And we find what works for us, and then we just go with it. And we improve on it. And listen to your peers; other engineers are an amazing resource for this stuff.
EDDY: I was actually going to say that one of my tools that's always my go-to is seeking help from people who are more experienced than me because they have a leg up. They've been through that before, you know, and they've had that ingrained in their head on, like, what their go-to is to debug something. So, not only is rubber duck debugging, for me, supercritical for debugging something, but, like, interacting with someone else in my team who is also involved and can help me get unstuck.
MATT: And then after that, you're more experienced, right?
MIKE: Experience helps. I liked the framing that Tad gave. It's just a science [laughs]. It is. In modern science, they don't use the word...they do talk about theories and that that is real, but a lot of times, they talk more about modeling. What we're doing is modeling the problem. You observe. You try to create a model that describes the behavior, and then you start testing your model. Does it actually work? I'm not saying that theory isn't a thing, but model is kind of a broader idea. It captures not just your idea of what's going on or your hypothesis, but it's provable. It's something that is...not provable but disprovable [laughs]. Disprovability is the key aspect.
Does this work in this scenario? Does this work in this scenario? And you start looking for holes in your model, right? And, hopefully, you find some because that means that there's new things to learn. And, likewise, when you've got a bug [laughs], you're like, well, what's the problem here? And then you already have some mental model, right? But maybe it is weak. And so, you try to come up with a better model, and then you test that model. Does this actually explain the behavior? And you try to disprove it, right? [laughs] And if you can't disprove it, then it's pretty reliably going to be your answer.
And thinking about that methodically, in that way, about problem-solving, I think, is extremely valuable, and it can be a shortcut. Because otherwise, again, you just got a sea of text and a problem, and there's no relationship between the two. And modeling is that process of giving it structure in your mind and maybe even having something written down. You know, what's the structure here? So that you can understand it and start trying to test your model for reproducibility.
EDDY: And staying calm.
MIKE: The hardest problem I ever had to debug was working with a library that...it was a messaging library that was intended to run in a kind of a sidecar thread. So, you had your request come through, and it would fire off a new thread as soon as the request came through. And it actually tried to be persistent. It lived permanently between requests. So, this thread lived in the background, and it would spawn multiple other threads. So, it had multiple threads running concurrently to do messaging. And so, it ran in the background. And the publishing was asynchronous...it was concurrent like that.
And it was also grabbing messages by pulling from the remote server with this concurrent set of threads. But it was also distributed because you're communicating between your system and another system. So, you had multiple systems, you know, there were multiple separate systems, generally, two, which is nice. It didn't have, like, eight, but you had two separate. So, you had a distributed system that was also concurrent at both ends.
And I was asked to fix a problem with the messaging. There was something that wasn't working: figure out what's going on. And it turns out that the library I was using had multiple critical bugs in it. And the messaging was unreliable. We were dealing with messaging that was a little bit flaky. So, the messaging was unreliable. I had to deal with concurrency, something distributed.
And [laughs] the code that I was expected to depend on that was supposed to be doing the work didn't always work. And I did get through it. But it took me about a month to do something that I thought would take me a day or two. And every day, I was just so defeated [laughs]. Like, at first, I'd, like, you know, after a couple of days, I saw, like, okay, this is a hard problem, but I can get through this.
But to your point, Eddy, about staying calm, I've believed this for a long time: that software development is about frustration management as much as anything else [laughs]. Just keeping on looking for flaws in my modeling of this problem. Just saying, you know what? I'm just going to keep on working on it and not get broken by it. It eventually got through, you know, that was a trial. You know, that's kind of an extreme example. It will probably lead to other discussions, too, where people say, "Oh, this is the other one I fixed." But it illustrates some of those things.
Sometimes, you're working with some hard things that don't line up, so your log messages could come in random order. You can't trust the code that you're calling. You have to jump from system to system. The transfer layer is unreliable. And there are so many things that can go wrong, and that randomness in the system is one of the hardest things. Anything that has randomness in it is really hard to debug, but continuing to just work at it and not give up [laughs].
And you think, well, I wouldn't actually give up. My job is to keep working on it. But it is; it's easy to, at least some part of you, to just kind of tune out. You're like, well, there's nothing I can do. And just taking a moment to calm down [laughs], take a step back, maybe go for a walk, and think, well, how do I tackle this problem next? Matters a huge deal.
JUSTIN: So, I'm curious, Mike. What was your solution there? Is this something that you could summarize very quickly?
MIKE: Yeah. So, that's a great question. So, what I ended up doing was focusing on smaller parts of the problem. I knew that there were problems in messaging between two systems. And that's just so ambiguous, you know, sometimes the messages weren't getting through. And the systems were running slowly, too. And the unit tests I was using weren't really unit tests. They were all integration tests, too. They were actually calling the live message [inaudible 34:50] [laughs]. So, I was handed that kind of testing.
So, there's a couple of things I did that were really critical. One of the things I did was essentially throw out the unit tests that I had been trying to get working because integration tests run incredibly slowly, and that was costing me a ton of time and not giving me what I wanted. I needed to get information and instead, they weren't really giving information.
So, I started with a smaller...I mocked out the actual messaging so I could focus on just what I was looking at. And by mocking that out, I restricted the scope of the problem I was looking at. That helped a lot. And then, I could look at just one particular request. What's going wrong with this particular request? And then look even smaller, like, well, sometimes this works, sometimes it doesn't. Why would that be? Which eventually led me to look into the library and think, well, again, this is impossible if the library is working.
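As an illustration of mocking out the messaging to shrink the problem, here is a minimal Python sketch using the standard library's `unittest.mock`. The `OrderNotifier` class, the `orders.shipped` topic, and the check functions are hypothetical; the episode does not name the real library or application code.

```python
from unittest import mock


class OrderNotifier:
    """Hypothetical application code that publishes through a messaging client."""

    def __init__(self, client):
        self.client = client

    def notify_shipped(self, order_id):
        if order_id is None:
            raise ValueError("order_id is required")
        self.client.publish("orders.shipped", {"order_id": order_id})


def check_notify_publishes_once():
    # The Mock stands in for the real messaging library: no network, no threads.
    fake_client = mock.Mock()
    OrderNotifier(fake_client).notify_shipped(42)
    fake_client.publish.assert_called_once_with("orders.shipped", {"order_id": 42})


def check_notify_rejects_missing_id():
    fake_client = mock.Mock()
    try:
        OrderNotifier(fake_client).notify_shipped(None)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    # Nothing should have been published for an invalid order.
    fake_client.publish.assert_not_called()
```

With the live message server out of the picture, a failure in these checks can only come from the application's own logic, which is the narrowing of scope Mike describes. It also means a check that passes here while the real system still misbehaves points the finger at the library or the transport.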
So [laughs], I go into the library and put in some logging there and find the line like, oh, wait, this is spawning up a new thread. I don't remember the exact details of it. But, I found the problem in the library: it was using an array rather than a set. So, things weren't actually unique when we thought they were, something like that. And I fixed the library, and that freed me up a little bit, right? [laughs] And then I looked for...well, is it working now? No, it's not. There are still some other aspects of the problem.
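Mike says he no longer remembers the exact details, so the following Python sketch only illustrates the general class of bug he names: a collection assumed to hold unique items that is backed by an array (a list, in Python) instead of a set. `ListRegistry`, `SetRegistry`, and the subscriber names are hypothetical.

```python
class ListRegistry:
    """Buggy variant: a list happily stores the same subscriber twice."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, name):
        self.subscribers.append(name)

    def deliver(self, message):
        # One delivery per stored entry, so duplicates mean duplicate deliveries.
        return [(name, message) for name in self.subscribers]


class SetRegistry:
    """Fixed variant: a set keeps each subscriber exactly once."""

    def __init__(self):
        self.subscribers = set()

    def subscribe(self, name):
        self.subscribers.add(name)

    def deliver(self, message):
        # Sorted only to make the output order deterministic.
        return [(name, message) for name in sorted(self.subscribers)]


buggy, fixed = ListRegistry(), SetRegistry()
for registry in (buggy, fixed):
    registry.subscribe("billing")
    registry.subscribe("billing")  # e.g. a reconnect that subscribes again
```

After the double subscription, `buggy.deliver("hello")` yields two deliveries to `billing` and `fixed.deliver("hello")` yields one.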
So, then I focus down on the smallest component I possibly can. The bigger the problem you're trying to solve, the bigger the scope, the more you have to deal with. And getting out as much stuff as you possibly can so you can narrow down a tiny, little slice of the problem helps a great deal. And then find something wrong in the app in the way it's calling the library. I remember fixing some of that.
It actually ended up firing off more than one thread when it should only have been firing off one. Fixing that solved a problem because then things started running faster because you only had one thread running multiple. And it still wasn't working, again, just working on piece by piece. Focusing down on the smallest possible piece you can work on was incredibly helpful.
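One common way to guarantee that only a single background thread is ever started is to guard the start with a lock. The Python sketch below shows that pattern under assumptions; `BackgroundWorker` and `ensure_started` are hypothetical names, and this is not the fix Mike actually applied, which the episode does not describe.

```python
import threading


class BackgroundWorker:
    """Starts at most one long-lived background thread, however often it is asked."""

    def __init__(self):
        self._lock = threading.Lock()
        self._thread = None
        self._stop = threading.Event()
        self.starts = 0  # how many threads were really started

    def ensure_started(self):
        """Return True if this call started the thread, False if one was already running."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False
            self._thread = threading.Thread(
                target=self._run, name="messaging-worker", daemon=True
            )
            self._thread.start()
            self.starts += 1
            return True

    def _run(self):
        # Placeholder for the real work; here the thread just waits to be stopped.
        self._stop.wait()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=1)
```

Every request can call `ensure_started()` safely: the check and the start happen under the same lock, so two requests arriving together cannot both conclude that no thread exists yet.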
DAVID: You said focusing down, and you mentioned at the start that you had, like, a many to many, like, many distributed in, many distributing out on it. And coming from the debugger side where I want to jump onto a console and I want to interrupt the code flow, you can't do that if there's 20 nodes. And you don't know which node is going to... logging would be a good solution, right? We just put logging on all of the nodes. Is that --
MIKE: It is. Logging actually works.
DAVID: Yes.
MIKE: So, this is something a debugger doesn't [laughs], and logging does.
DAVID: Yes.
MIKE: And you think, well, I need a more sophisticated tool, and the answer is no. The last thing you need is a more sophisticated tool. You need the simplest tool possible. It was really counterintuitive. You think, for a tricky problem, you need a more sophisticated tool. And, no, for this problem, I needed the most basic kind of tool possible because it would work [chuckles] in this kind of environment. That's a really good observation.
JUSTIN: And the log entry would contain all the information you need, like the thread ID.
MIKE: Right.
JUSTIN: The time entry and things like that.
MIKE: Yeah. And knowing which thread it is in matters a great deal, right? [laughs] And being able to say that and then logging the threads, like, ooh, I see. This actually happened in this thread. I thought it was going to happen in this one, or these are out of order. That means that some of my locking and the semaphore stuff was going wrong. And bringing those kinds of things out, yeah, is a huge deal.
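Python's standard `logging` module can put the timestamp and thread name on every line through the formatter, which is roughly the kind of log entry Justin and Mike are describing. This is a minimal sketch; the logger name, the `publish` function, and the `publisher-N` thread names are made up for the example.

```python
import io
import logging
import threading

log_stream = io.StringIO()  # stands in for a real log file or aggregator
handler = logging.StreamHandler(log_stream)
# asctime/msecs give the time entry; threadName says which thread wrote the line.
handler.setFormatter(
    logging.Formatter(
        "%(asctime)s.%(msecs)03d [%(threadName)s] %(message)s", datefmt="%H:%M:%S"
    )
)

thread_logger = logging.getLogger("messaging.threads")
thread_logger.setLevel(logging.INFO)
thread_logger.handlers.clear()
thread_logger.propagate = False
thread_logger.addHandler(handler)


def publish(message_id):
    """Hypothetical unit of work done on a background thread."""
    thread_logger.info("publishing message %s", message_id)


workers = [
    threading.Thread(target=publish, args=(i,), name=f"publisher-{i}")
    for i in range(3)
]
for worker_thread in workers:
    worker_thread.start()
for worker_thread in workers:
    worker_thread.join()
```

The lines may arrive in any order, which is the point: with the thread name and time on each line, out-of-order or wrong-thread events become visible instead of being guessed at.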
EDDY: You know, I kind of want to propose a scenario if we're okay. Let's suppose that you do merge something in production, and it's a bug that wasn't caught. Our natural instinct would be, oh, shoot, roll back [laughs]. Let's roll back. Let's go. Let's get rid of the bug. Let's make sure that we step back and, you know, we can work on this.
But what if it's already further along? You know, you've gone a couple of weeks, and it wasn't caught. And suddenly, rolling back after it's been so long, you know, just isn't really feasible, right? But suddenly, this bug has turned into something huge. Let's say, for example, you're, like, leaking PII or something. And you're like, shoot, I need to get this out quick, right? In that sense, are we okay with bypassing unit testing and throwing out a potential fix to fix it? Is that good practice? Or should we still be concerned about writing passing specs?
RAMSES: How would you know it's working if you don't have tests?
MATT: I would say without your tests, you open up a really big door to introduce more bugs and different bugs.
MIKE: There's a phrase that people use, and it's derogatory, cowboy coding, to try to describe that kind of thing. Oh, I'm just going to throw something out there and see what happens. It's a figure of speech, and there's a metaphor there. I don't know if this idea of the cowboy in the way that it's described ever truly existed in history [laughs]. But it's this image from movies about somebody who does things their own way. They don't really care about the consequences. They shoot with their guns, and they don't really care what happens.
Well, if they're going out and shooting with their guns, some people are going to get killed [laughs]. There's a lot of chaos and sometimes tragedy that happens for others and for your environment. And the best usage of that phrase speaks against that kind of action, that if you have this lone [inaudible 39:56] that thinks they can control everything, that they can get away with anything.
And maybe in some scenarios, you're just building out something that doesn't have any real-world consequences. You haven't deployed yet. Then, that kind of development might be okay. But when there's consequences, people are going to get hurt. In your scenario, it's already been out for two weeks. Waiting an extra 30 minutes, waiting an extra hour to verify your solution, you know, you think about the cost-benefit analysis, it's probably going to save you some in the long run.
DAVID: If you're hemorrhaging money and the [inaudible 40:26] actively damage by leaving the server, you know, running in that mode, another good option is to take the server down, right? You're just, like, stop production. We're going to stop the company from making money, but that's okay because we're also stopping the company from hemorrhaging money, you know, that kind of thing.
I think pushing an emergency hotfix into production without a second pair of eyes on it I think that's a very, very bad option. I can imagine a world in which, at one shining, dark moment in time, that was your best option. Sometimes, you know, the, like, cowboy coders that mentality, of, you know, running around firing your guns everywhere, it's like, sometimes, the environments need clearing out. But that's you're in a bad situation, to begin with, when that's your best option.
But to answer the question behind your question, possibly, Eddy, like, what do we do in this situation? I would not sit down and spend 10 hours writing a test suite to verify all the new edge cases that we think, you know, it's like, oh, you know what? This is a problem. This needs to go out. We're not hemorrhaging money, but we're not making money. We're losing, you know, conversion. We're losing sales. Let's get this turned out.
And there are things that you can do very quickly and low cost that give you a huge mitigation on the risk of pushing a new bug, like, grabbing a co-worker and saying, "Look at this," right? Like, unit testing is just automating peer review; that's all it is. So, having another human, especially, like, a senior developer who has a very complex mental model of how the system works. You say, "Does this make sense?" And they'll take one look at it and go, "Oh no, don't do that," right?
MIKE: [laughs]
DAVID: You say, "Okay, good. I'm glad I didn't push this fix," right? Now, we're solving your solution down the road. Or they look at it and go, "Yeah," right? And the next thing you can do to mitigate this risk is, after you push it out, stick around and watch the server, right? Try it out, you know, confirm. Eventually, like, by this time next Thursday, there should be tests for exactly the problem that you have found and that prove that we have solved it, and whether or not we need to get, you know, cash flow protected in the moment. That's a valid question. But it's definitely a different operating mode than a stable, steady, patient maximum risk mitigation development.
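As a sketch of what "tests for exactly the problem that you have found" could look like for Eddy's PII scenario, here is a small Python regression test. Everything in it is assumed for illustration: the `public_profile` function, the field names, and the sample user are hypothetical, since the episode never says what data or code was involved.

```python
import unittest

# Hypothetical field names; the episode does not say what data was involved.
SENSITIVE_FIELDS = frozenset({"ssn", "date_of_birth"})


def public_profile(user):
    """Return only the fields that are safe to show outside the system."""
    return {key: value for key, value in user.items() if key not in SENSITIVE_FIELDS}


class PublicProfileRegressionTest(unittest.TestCase):
    """Pins down the exact leak so the same bug cannot quietly come back."""

    def setUp(self):
        self.user = {
            "name": "Pat",
            "ssn": "000-00-0000",
            "date_of_birth": "1990-01-01",
        }

    def test_sensitive_fields_are_not_exposed(self):
        profile = public_profile(self.user)
        self.assertTrue(SENSITIVE_FIELDS.isdisjoint(profile))

    def test_safe_fields_survive(self):
        self.assertEqual(public_profile(self.user), {"name": "Pat"})
```

A test this small takes minutes to write, which fits the point made in the discussion: verifying a fix costs far less than the ten-hour test suite nobody wants to write mid-incident.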
This has been a fantastic call. I could go for another hour. This was awesome. Any other thoughts on debugging and, like, how and why? Just don't write bugs?
TAD: I thought Mike said something interesting about go for a walk. It's an interesting debugging aspect.
DAVID: Reset the scope in your brain.
MATT: Yeah, get away from your tunnel vision because --
EDDY: Getting a good night's sleep [laughs] [crosstalk 42:53]
MIKE: That's so true.
DAVID: There's a paper trail in an email somewhere that I sent years...I don't have it anymore. But literally, I have written the sentence, "The most effective thing I can do to solve this problem is to unplug and get my head down for six hours." And that's what I did. And I got up in the morning, and I had the solution and was able to get it, but I never would have come up with it.
Specifically, going for a walk, Brain Rules by John Medina is a fantastic book. And he talks about if you get your quads moving, the biggest muscle in the body gets oxygen moving; this gives, like, a 30% boost to the Broca's area of the brain, the part where you do executive function, which is your ability to reason through difficult problems and make decisions. So, going for a walk actually makes you smarter. So, if you need to be smarter for a problem, there's the door.
MATT: Yeah, many of you have probably heard me say, "I have not gotten much sleep. I am not going to try and solve this right now," because that's how mistakes are made, and we introduce more problems. So, it really is; rest is important.
DAVID: We could do a whole separate show about recognizing the mental unit tests that we run on ourselves, right? Because a microscope can't look at itself. And the first judgment that gets impaired when your judgment gets impaired is your ability to determine if your judgment is impaired. So, finding those things like knowing I'm running on no sleep. I made this call last night, and I knew, and now I have to recognize that I feel fine. I feel tired, but I feel like I'm here. But you know from experience if I keep typing, pain and suffering will come of this. That's a really dark place to end. Somebody say something brighter.
MIKE: I can go darker if you want.
DAVID: Okay, fine. That works. As long as I'm not the one ending us on a dark call, somebody else could do it.
MATT: Lights out for a nap.
DAVID: That's right.
MIKE: My father-in-law had diabetes. And he was in a serious accident because he was not thinking clearly, and they sent him home from work. And so, he rode home on his motorcycle. Our ability to understand the consequences of our choices is impaired when we're impaired in some way when our thinking is impaired. And having some sort of checklist almost to say, well, should I be doing this that is external to yourself, I think, is really valuable.
DAVID: That is fantastic. We get into a go mindset, right? We gotta go, gotta go, gotta go. We stop checking to see, wait, is it safe? Is the smartest thing to stop? The best way to go might be to stop.
MATT: Avoiding burnout is important and possibly a topic for another Acima podcast.
DAVID: That is a fantastic idea, yes.
MIKE: It's a fantastic idea.
DAVID: 100%. All righty, folks, this has been a fun, fun chat. Thank you all for coming. And thanks for the contributions. I really appreciate it.