
Ep. 107: Exploring Business Resilience

SUMMARY

Peter and David discuss the significance of non-functional requirements and business resilience on the Definitely, Maybe Agile podcast, emphasizing the importance of preparedness against disruptions and the role of observability in system changes. Listen on your favorite platform and join the conversation for more insights.


Description

Have you ever pondered the intricacies of business resilience? Do you fully grasp the importance of non-functional requirements for your business? Peter Maddison and David Sharrock demystify these topics on the Definitely, Maybe Agile podcast, guiding you through the nuances of scenario planning, business continuity, and the disaster recovery processes that are pivotal for preparing against potential disruptions. Events like pandemics, environmental issues, and economic changes have underscored how essential it is to understand the key parts of your business and the resources necessary for recovery.

Furthermore, we'll spotlight non-functional requirements and their vital role in business resilience. We'll show you why it's crucial for your teams to be autonomous and how to manage system changes without triggering unanticipated impacts. We'll also touch on the rising observability movement and its role in making non-functional requirements more visible. From design and architecture to system latency and change, we hope to provide valuable insights into non-functional requirements.

This week's takeaways:

  • Don't overlook non-functional requirements - they're crucial for the resilience of your business.
  • Understand the impact of changes on your system with observability at the forefront.
  • Take a comprehensive approach to business resilience planning, considering all aspects of your business.
  • Make sure everyone is aware of the business resilience plan and knows their role in executing it.

So, buckle up for an enlightening journey through the nitty-gritty details of business resilience! Be sure to listen to Definitely Maybe Agile on your favorite platform, and remember to subscribe. For additional resources and to join the conversation, visit our website and contact us at feedback@definitelymaybeagile.com with your thoughts, questions, or suggestions for future episodes. Stay tuned for more exciting content!

Transcript

Peter: 0:05

Welcome to Definitely Maybe Agile, the podcast where Peter Maddison and David Sharrock discuss the complexities of adopting new ways of working at scale. Hello and welcome to another exciting episode of Definitely Maybe Agile with your hosts, Peter Maddison and David Sharrock. How are you today, Dave?

Dave: 0:22

Doing well, doing well. It's always good to get together. We were just prepping this conversation and beginning to think we just needed to hit record.

Peter: 0:30

We should have just recorded the prep call, because none of the things we say for the next 15 minutes are going to be anywhere near as insightful as what we were saying right before the call started. That's always the case.

Dave: 0:43

We should always remember that, and we should remind our listeners too. So what was the topic? How do we set the topic up?

Peter: 0:50

So the topic is business resiliency, or what we used to call business continuity or business recovery, and all of the processes involved around that. It's very much a topic du jour, if you like; it's top of mind for a lot of people for a whole variety of reasons, but it's one of those things that organizations often seem to struggle with, and I felt it would be a good conversation to have.

Dave: 1:19

So just as we're diving into this, Pete, I think it's really easy for us to say, hey, pandemic. We were all very aware of it, because every organization's, every company's protocols around disaster recovery and business continuity or business resilience were revisited, implemented, and acted on, so the gaps and the coverage became very clear from the pandemic. But you said there are quite a few different reasons. What are the different things coming together to drive this?

Peter: 1:49

We've also got environmental problems, larger economic issues, and changes in the environment, all of which require different reactions from the organization. There are lots of things that might trigger the plans, and one of the things we were talking about before we started is that, very often, the plans that have existed historically were created based on a worst-case scenario, something where we've lost an entire site: a data center blew up, a meteor hit us, or something like that. The kinds of problems we're running into now, where we want to use these plans as a method of moving the business forward or continuing to operate the business, need to cover many, many more types of scenarios.

Dave: 2:37

So I think, and this is a realization I've certainly seen working with a number of organizations, that the scenario-planning piece ties exactly to something we talk about a lot: complexity, and how an exponentially changing environment has generated a desire to be much more resilient in the way we approach problems. If you're in a calm, slowly changing, rarely changing environment, the scenarios you're envisaging when you're pulling together the plans, what you need to be able to cover, which essential services have to be handled through those periods of time, stay fairly small in scale. But the scope of these different scenarios has really accelerated in the last few years. It's almost like every time you turn on the news, there are more of these things happening. I'm sitting in BC with the fires all summer, and this year saw something like twice as many fires as BC has ever seen before. We're seeing the impact of these things over and over again, whether it's supply chains, climate events, or the market you mentioned as well. So how do those scenarios come together? Where do organizations start with business resilience and the planning that goes around it?

Peter: 4:02

So the way I've done this in the past, and one of the first things I did in one of my first roles, way back in the 90s, was business continuity planning and disaster recovery planning. That meant not only designing and putting the plans into place, but doing business impact analysis: working out which applications are most critical and which business processes require which applications to support them, so we knew what we needed to recover. And you've got the whole set of acronyms, right? RTO and RPO, recovery time objectives and recovery point objectives: how quickly do we need to be back, and how much data can we afford to lose? Understanding all of that is one piece of it; that's where you want to get to. But even before that, there's an exercise where you sit down with the key business leaders and work out what we actually consider critical to run the business, and how we identify the key parts of it. Then you go through a tabletop exercise of understanding what happens if something fails: what would we do in those events? From that you can start to derive, okay, this is what we need: a separate site with this many desks, the same machines, and this many pads of pens and paper. We were doing all of those pieces many years ago, because all of those pieces need to come in. But it starts with that kind of initial tabletop exercise: what starts to happen when these bits of the system go away?
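
To make those acronyms concrete, here's a minimal sketch of the kind of check a business impact analysis feeds into. The application, its objectives, and the schedule numbers are all hypothetical, purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class RecoveryObjectives:
    """Per-application targets from a business impact analysis."""
    name: str
    rto_minutes: int                # Recovery Time Objective: max tolerable downtime
    rpo_minutes: int                # Recovery Point Objective: max tolerable data loss
    backup_interval_minutes: int    # how often backups or snapshots are taken
    estimated_restore_minutes: int  # measured time to restore from backup

def check(app: RecoveryObjectives) -> list[str]:
    """Flag gaps between stated objectives and actual capability."""
    gaps = []
    # Worst-case data loss is one full backup interval.
    if app.backup_interval_minutes > app.rpo_minutes:
        gaps.append(f"{app.name}: backups every {app.backup_interval_minutes} min "
                    f"cannot meet an RPO of {app.rpo_minutes} min")
    if app.estimated_restore_minutes > app.rto_minutes:
        gaps.append(f"{app.name}: restore takes {app.estimated_restore_minutes} min, "
                    f"exceeding an RTO of {app.rto_minutes} min")
    return gaps

# Hypothetical example: a payments service with tight objectives.
payments = RecoveryObjectives("payments", rto_minutes=60, rpo_minutes=15,
                              backup_interval_minutes=60, estimated_restore_minutes=90)
for gap in check(payments):
    print(gap)
```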

Dave: 5:41

Now, isn't that where it starts, though? I've had some experience in this as well, more around startups prepping for IPOs, where they need to have some of this paperwork in place and some thought put into it, and also around data centers: just making sure you can operate should something catastrophic happen, either within the data center or to the data center, as an example. But doesn't all of this start with some understanding of the scenarios you're going to be modelling around, whether it's a fire or some other event, and how long that event might cause problems, right?

Peter: 6:19

Right, exactly. What's going to drive it? What's going to cause those systems to go down? What are the scenarios you're going to plan for? And, as I was saying before, the scenario people typically plan for is a meteor hitting a particular place.

Dave: 6:34

So is there now a shift in thinking around that? Have you seen that? Because I can imagine, 10 or 20 years ago, the scenario was one of these catastrophic events happening to a building or a location, and the question was how the organization, whether international or local, works through that. Those were very isolated events, and I just wonder, given things like the pandemic, the supply chain challenges we've seen, and other things happening now: when do they become strategic, versus scenarios that you plan for and bring out an action plan for when they do happen?

Peter: 7:16

Well, the advent of cloud technologies and the capabilities they provide, along with more people working from home and the impact that has had on the workforce, has driven very different sets of scenarios and behaviors, because there isn't one place that potentially gets blown up that you have to plan around. It's more a case of: how do we deal with redundancy? Very often it now comes down to where the data is going to be. One mistake I often see younger organizations make, and to be fair, I see older organizations make it from time to time too, is confusing high availability with disaster recovery: the idea that because my entire system is mirrored from one site to another, I don't need disaster recovery, because if this site vanishes, it'll recover over there. But what happens if your data is corrupted? The corrupted data gets mirrored too, and you now have no ability to recover back to a known good state.
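
Here's a toy illustration of that distinction, assuming nothing about any particular storage product: a synchronous mirror replicates a corrupting write instantly, while a point-in-time snapshot preserves a known good state you can roll back to.

```python
import copy

class Volume:
    """A toy storage volume with both a mirror (HA) and snapshots (DR)."""
    def __init__(self):
        self.data = {"balance": 100}
        self.mirror = {"balance": 100}  # synchronously replicated copy (high availability)
        self.snapshots = []             # point-in-time copies (disaster recovery)

    def write(self, key, value):
        self.data[key] = value
        self.mirror[key] = value        # HA means corruption propagates immediately

    def snapshot(self):
        self.snapshots.append(copy.deepcopy(self.data))

vol = Volume()
vol.snapshot()                 # capture a known good state
vol.write("balance", -999999)  # a corrupting write: a bug, a bad migration, an attack

print(vol.mirror["balance"])         # -999999: the mirror faithfully copied the corruption
print(vol.snapshots[-1]["balance"])  # 100: only the snapshot can restore good data
```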

Dave: 8:27

It's not an instantaneous hit either, right? There are transactions in progress and various other things that have to be elegantly worked through rather than just chopped off, and you have to figure out how to do that. And do you see organizations, I mean, we're still really talking about immediate or unexpected events over a short period of time that hit a region. Are you seeing organizations start thinking about more widespread disruptions that they have to be able to handle? That was one of the key things in the pandemic, wasn't it? It didn't hit one city; it hit everywhere all at once, so it wasn't a localized event.

Peter: 9:01

Yeah, well, it certainly created circumstances that organizations had mostly not planned for. A lot of business recovery plans assumed that, to keep these business processes running, we'd need this many people able to access the systems remotely. So things like VPN concentrators, through which everybody logs in to the organization, were scaled in a particular fashion, and organizations still using them suddenly found: I don't need to handle a few hundred people, I need to handle thousands. That caused problems where the technology simply didn't have the capacity to scale up to the volume of people now trying to access from home. Those were the kinds of scenarios that were never planned for.

Dave: 9:58

And that is obviously shifting nowadays. But I guess my thought there is: at what point does it become a business strategy to have redundancy, high availability, and these other services, and at what point is it part of a business resilience plan?

Peter: 10:15

Well, one of the jokes from the technology space, and joke is probably the wrong word, is that one consequence of the pandemic was that a lot of the plans people had been pushing for years, to put in place more virtualized or scalable systems or to use cloud technologies, suddenly became acceptable, whereas previously there had been all sorts of reasons why they couldn't be done, and cost was often not even one of the biggest. Suddenly there was a shift in focus to "oh my, we have to do this," and that drove many years' worth of digital transformation in a very short period of time, as I think the saying goes. The technology itself had been around for a very long time, though I always feel a little hesitant when I say these things, because we work in a time where technology is advancing so quickly.

Dave: 11:11

But yeah, so what do you see as the checklist to watch for? I mean, we talk about digital transformations all the time, and one of the key things that's changed is that development, operations, the whole product development piece, now has a much shorter time to market. Things are changing much more rapidly. There must be consequences for business resilience and disaster recovery. What's the impact? How often should they be aligned?

Peter: 11:36

So it's one of these interesting pieces. Historically, you would be required, usually by the governance of the organization, to test your business recovery plans on maybe an annual basis. But if I'm making changes to the systems multiple times a day, then potentially some of those are breaking changes that significantly change the nature of the systems I'm deploying. One of the ways you deal with that is by pushing the responsibility for recovery down into the platform, so that the platform takes on much of the responsibility for ensuring that the systems running on it, and all the data associated with them, are properly backed up and stored off-site in regular increments, giving me some way of recovering. If you look at cloud technology, it will be snapshotting on a regular basis. I'm massively oversimplifying; obviously there are impacts on large data sets if you start to snapshot too frequently, but that's not something we need to worry about here. So there are system architecture approaches that can be used to ease this burden from a development perspective. Where the difficulty often comes is that not everything an organization has, especially an organization that's been around for some time, is able to take advantage of those technologies. You may have dependencies between critical parts of your system and older parts that haven't been updated or changed recently, and the failure of those parts can cause a cascading effect into other parts of the system. That's where you run into the difficulty of understanding what it will take to recover. One of the common patterns I've seen in midsize organizations is that you can usually recover a lot of the external parts of the system fairly easily, the stuff that's more modernized, but eventually you get to whatever the critical processing engine is in the middle and find it's just a massive plate of spaghetti. At that point you have to recover it, and everything it talks to, all at once. That's where all of the effort is, because everything is so tightly coupled that nothing can operate without everything operating.
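
One way to see why that tight coupling inflates recovery effort is to model the dependencies as a graph and ask what must be restored together. A minimal sketch, with entirely made-up service names:

```python
from collections import deque

# Hypothetical dependency map: service -> services it cannot run without.
depends_on = {
    "web_frontend": ["api"],
    "api": ["core_engine"],
    "reporting": ["core_engine"],
    "core_engine": ["legacy_db", "message_bus"],
    "legacy_db": ["core_engine"],  # mutual dependency: the spaghetti in the middle
    "message_bus": [],
}

def recovery_set(service: str) -> set[str]:
    """Everything that must be restored before `service` can operate."""
    seen, queue = {service}, deque([service])
    while queue:
        for dep in depends_on.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# A modern, loosely coupled edge service recovers on its own; anything touching
# the coupled core drags the whole tangle along with it.
print(recovery_set("message_bus"))   # {'message_bus'}
print(recovery_set("web_frontend"))  # all five services, recovered together
```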

Dave: 14:04

Yeah, well, the other side of it, as I've certainly seen it, is that testing of your disaster recovery or business resilience plans becomes so important, because invariably there are parts of those scenarios which are just not documented or really understood. Every organization has some aspect of its service which is legacy; by definition, things become obsolete or aged at some point, and they need to be understood and reincorporated into how things are working.

Peter: 14:36

But they often get missed, or at least downplayed.

Dave: 14:39

The risk associated with them gets downplayed. And then all of a sudden, when you try to do the testing, and that's why the testing is so important, the thing that breaks is not normally where you spent all your time making sure it doesn't break; it's the assumption you made about the server beneath somebody's desk that didn't get addressed.

Peter: 14:57

Exactly. That thing always runs; it's never had a problem. Well, that's because you never turned it off before.

Dave: 15:06

Exactly, exactly. Now, what I always find interesting when we're talking about this is that we spend a lot of time talking about what the teams are up to, where they're working, and how to work closely with the teams. We want those teams to be very autonomous and self-directed in many ways. And yet, at the same time, although in my experience all teams still talk about definition of done, stories, and so on, non-functional requirements, the topic that some years ago was always front and center in those conversations, is one I don't hear discussed as frequently anymore.

Peter: 15:44

Yeah, and I think that actually drives a lot of the problems we see. This goes beyond just resiliency in the sense of how do I recover from a disaster: if I'm making lots of changes to the system, am I ensuring that the system can still recover after I've made those changes? That ties back into things like design and architecture, and how we put all these different pieces together. Are the changes I'm making material in nature? Are they going to significantly change the way my system functions? And if they are, do I need to consider what the impact might be? The normal day-to-day change I'm making will, for the most part, not be of that nature; there's typically more oversight and thought put into the ones that are. But another common one I see from a business resilience perspective is latency. Say I'm running a latency-sensitive system where a card is swiped at a checkout, the transaction has to go through to the back-end system, and it has to complete within a certain number of milliseconds to be considered valid. I might make a change to a subsystem somewhere that gets called along the way and inadvertently slow down that transaction system. I've seen that type of change happen more than once: I don't think the change I'm making is detrimental, but when I actually make it, it has some unexpected effect on the overall performance of the system.
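
As a sketch of how that kind of latency requirement can be guarded rather than discovered in production, here's a minimal release gate. The 200 ms budget, the percentile, and the sample numbers are hypothetical, purely for illustration.

```python
# Hypothetical NFR: the 99th-percentile card authorization must finish within 200 ms.
LATENCY_BUDGET_MS = 200.0

def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of observed transaction latencies."""
    ordered = sorted(samples_ms)
    rank = max(1, round(0.99 * len(ordered)))
    return ordered[rank - 1]

def gate_release(samples_ms: list[float]) -> bool:
    """Block the deployment if tail latency has drifted over budget."""
    tail = p99(samples_ms)
    print(f"p99 = {tail:.1f} ms (budget {LATENCY_BUDGET_MS:.0f} ms)")
    return tail <= LATENCY_BUDGET_MS

# A change to a downstream subsystem inadvertently slows authorizations.
healthy = [120.0] * 98 + [180.0, 190.0]
slowed  = [150.0] * 98 + [240.0, 260.0]
assert gate_release(healthy)      # within budget: release proceeds
assert not gate_release(slowed)   # over budget: release is blocked
```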

Dave: 17:24

And certainly this conversation is a reminder to me to take this back to the teams we work with, to reinforce the placeholder for that discussion around non-functional requirements. Everything you're describing is about understanding the impact a change has elsewhere in the system and being cognizant of the consequences it can generate.

Peter: 17:50

I think a lot of what the observability movement is doing is aimed at making non-functional requirements more visible, by exposing the business functionality of systems in such a way that we can measure it and respond in a more timely manner. Otherwise it's a bit cart-before-the-horse: if we end up pushing changes into production faster than we can operationally manage the system we're changing, then we will have problems.
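
As a sketch of what exposing business functionality as a measurement can look like, here's a minimal example using the open-source prometheus_client library for Python; the metric name, the buckets, and the simulated back-end call are all hypothetical.

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Measure a business operation, not just CPU or memory: how long checkout
# authorizations take, with buckets placed around the expected latency budget.
AUTH_LATENCY = Histogram(
    "checkout_authorization_seconds",
    "End-to-end card authorization latency",
    buckets=(0.05, 0.1, 0.2, 0.5, 1.0),
)

@AUTH_LATENCY.time()  # records each call's duration into the histogram
def authorize_card():
    time.sleep(random.uniform(0.05, 0.25))  # stand-in for the real back-end call

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        authorize_card()
```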

Dave: 18:19

Well, it's interesting, because this is all about making more information available to the teams that are changing things. They obviously have to have a broader awareness of the systems being made observable, of course, but also of how to interpret that information when making changes. We were just having a conversation about work transparency, and the same thing holds: people using that information need to know where to get it and how to use it. The same is true here, right? And you need that to be able to make decisions in a shorter cycle.

Peter: 18:52

Exactly.

Dave: 18:55

So, wrapping things up: three things. What would you draw our attention to, Peter, from the conversation we've had?

Peter: 19:00

I think the first is understanding what happens in the event, and understanding the scenarios you want to plan for, and considering broadening them. If your business recovery plan still consists of "a meteor strikes the data center" and you don't have data centers anymore, you need to update it and understand the implications of the different types of things that might actually happen. An example from earlier today: one of my colleagues was working with a client, going through the results of a tabletop exercise, and identified that nobody was actually accountable for ensuring that, when there was a disaster, somebody owned the execution of the plan. There was basically no crisis manager, nobody whose responsibility or accountability it was to ensure that the plan would even function. So the second thing is making sure it's understood who owns this. And the third is being conscious that you've got to think about the whole system. It's not just about the IT systems; it's about which business processes you need to recover that are supported by those IT systems, because you don't necessarily want to recover everything, and what the scale and complexity of doing that is. What else would you add?

Dave: 20:25

So what you're describing is a lot of the how-to-make-it-happen bit, and what I'm drawn to very specifically is: at what point do the scenarios we're discussing morph into business strategies we have to address? I think that has more to do with the fact that things are happening and changing more and more frequently. We need to be aware of that, and some of it just means it's business as usual; we need to be able to work within those fluctuations. So there's a pretty interesting conversation about at what point those scenarios shift to being business scenarios we have to deal with, rather than disaster scenarios, and I don't know where that line would be. I'll go and do a bit of a search on that. I think that's quite an interesting conversation.

Peter: 21:14

It is, because it'll come down to the risk tolerance of the business leaders. In the case of BC, do they want to work out how to make their systems recoverable, or do they just buy fire insurance?

Dave: 21:29

Well, yeah, there are some interesting conversations there, right? Risk is always about: do you insure, or do you...

Peter: 21:36

Yes, insure against the impact, versus manage and address it, right.

Dave: 21:40

So yeah, I think that was one of the things. And then maybe the other one is that this is such a timely conversation, in that we have a tendency to build silos around wherever it is we work. For me, this was that reminder: the conversation around non-functional requirements, around observability, and around some of those things, just like we've said before about engineering practices. It's worth bringing these things up and getting them to the top of the pile of conversations, rather than assuming they're being covered when probably they're not.

Peter: 22:13

Oh, they're not. They never are, probably, in my experience; all the stories I can tell. Fair enough, fair enough. So, with that, if anybody would like to send us feedback, they can reach us at feedback@definitelymaybeagile.com, and don't forget to hit subscribe. We always like new subscribers. For sure. Until next time, Peter. Thanks again. You've been listening to Definitely Maybe Agile, the podcast where your hosts, Peter Maddison and David Sharrock, focus on the art and science of digital, agile, and DevOps at scale.