Swati Raju (Head of Confluence Experience Engineering at Atlassian) shares what she learned on Confluence Cloud about building a reliability habit in the team. This goes beyond DevOps best practices and focuses on putting reliability front and center in team processes, tools, and decisions.
Like what you see here? Our mission-aligned Girl Geek X partners are hiring!
- See open jobs at Atlassian and check out open jobs at our trusted partner companies.
- Watch all Elevate 2021 conference video replays!
- Does your company want to sponsor a Girl Geek Dinner? Talk to us!
Transcript
Angie Chang: We’re going to have Swati from Atlassian join us here. And she will be sharing about building resilient products while we all grab a coffee or a tea and kind of get re-energized for the afternoon. I know it’s a long day of talks, so yeah. Please help me welcome Swati. She is the head of Confluence Engineering at Atlassian.
Swati Raju: All right, good afternoon everyone, just want to clarify, I head up Confluence Experience Engineering, which is a subdivision within Confluence Engineering. But anyway, thank you so much for having me here today. I work at Atlassian and like Angie said, I’m going to be speaking about building resilient products. Before I get started, I would love to take a moment to wish you all a very happy International Women’s Day. I also want to acknowledge our female predecessors in science and technology have really opened the doors for us and paved the path for each one of us. I can’t think of a better way to spend International Women’s Day than in this group with all you wonderful women engineers. And thank you for folks who are hanging on for the coffee break to spend your coffee break with me and to learn about building resilient products. So today we’ll be talking throughout this conference a lot about resilience in ourselves and in our teams.
Swati Raju: Those ideas are all very close to my heart, but I would like to speak to you about a different type of resilience. Resilience and reliability in cloud products. So the learnings that I’m about to share are really based off of some of the work that my colleagues and I have been working on in the last year or so. But just to set the context since I’m not sure how many of you are familiar with the name Atlassian. Atlassian builds software tools for team collaboration. The tools that we build help teams of all types. So it could be like sending the Mars rover up to Mars, or it could be startups that build the next concept electric car. So you may have used some of our products like Jira, Trello, Bitbucket, Opsgenie, or perhaps Confluence Cloud, which is a product that I work on.
Swati Raju: The organizers asked me to briefly touch upon my own background, and I would love to share with you real quick before we deep dive on our topic. I have a bit of an unconventional background because I studied architecture in my undergrad. After that I moved to the United States to do my Master’s in Design Knowledge and Computation. I worked in a few companies in the Valley, from Yahoo! Search to Groupon. And then most recently I was heading up engineering for a small startup called Traveling Spoon, which incorporated three of my absolute passions: technology, food, and travel. And now I am at Atlassian, where I work on Confluence Cloud. For those of you not familiar with Confluence Cloud, it’s a space for teams to do knowledge sharing and collaboration. So when I’m not working, I’m usually busy being a mom of two young boys and pretty much chasing them around all day long.
Swati Raju: Some of you might relate to that. So the common thread in my career from studying architecture to working on Search, to working on Confluence now, has been this idea of building something that’s enabling, empowering and really meaningful to the human experience. So when we talk about products that are truly critical to your day to day, resilience and stability need to be part and parcel of those products. Let me give you some concrete examples.
Swati Raju: So on the first working day of 2021, Slack had a three hour outage. No disrespect to Slack, but if any of you were impacted, you recall how disruptive it can be when a critical tool like that goes down. Similarly, Confluence in 2019 had a bad outage. What we define as a severity 1 based on how many customers it impacted. What our users saw was this chilling screen and users could no longer access their wikis or collaborate with their teammates because of a code change that just didn’t work when it was pushed to significant load.
Swati Raju: The point is that if you are involved in building products that are critical to someone’s day-to-day working, the reliability of those systems becomes critically important. So I’m going to share with you three principles and practices that can help you improve the reliability of your systems.
Swati Raju: So let’s dissect each one of them. The very first one is accurately measure the customer pain. So if you’ve heard of Peter Drucker, arguably one of the most influential thinkers in management, he said, “If you can’t measure it, you can’t improve it.” This is true in all aspects of life, and especially true for the reliability of our systems. So now traditionally, most measures of reliability were based on server-side metrics or what we call uptime. This is not always representative of what the customer is actually seeing. Reliability, on the other hand, measures what the customer is actually experiencing. A lot of cloud companies, including Atlassian, promise our customers very high reliability as well as uptime.
Swati Raju: So it’s a combination of both. But why is reliability, and specifically this idea of an overall reliability from the customer’s perspective versus uptime, a much harder metric to nail, especially in distributed systems where you have a lot of dependent microservices? Let me share an example here. So take for example, this lamp at my desk, right? It is dependent on the reliability of the bulb, the power cord, the lamp arm, the lamp plug, the shade. Hence, when you calculate the reliability of this lamp, it would be the reliability of each of its components multiplied by each other. The more components you have in the system, the greater the likelihood of lowering the reliability.
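As an illustration of the lamp example (not part of the talk), here is a minimal Python sketch of multiplying component reliabilities. It assumes the components fail independently and that the lamp works only if every component works. The 0.99 figures and the `series_reliability` name are made up for illustration.

```python
from math import prod


def series_reliability(component_reliabilities):
    """Reliability of a system that works only if every component works.

    Assumes component failures are independent of each other.
    """
    return prod(component_reliabilities)


# Hypothetical per-component reliabilities for the lamp in the example.
lamp = {
    "bulb": 0.99,
    "power cord": 0.99,
    "lamp arm": 0.99,
    "lamp plug": 0.99,
    "shade": 0.99,
}

overall = series_reliability(lamp.values())
print(f"Lamp reliability: {overall:.4f}")  # prints "Lamp reliability: 0.9510"
```

Even with every part at 99%, the whole lamp comes out at roughly 95%, which is the point about long chains of dependent microservices.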
Swati Raju: So we really need to think about how to build systems that have great end-to-end reliability. Let me give you an example of how we have approached this at Confluence. So what we’ve defined is something called key user journeys. And these might be the most important things that a user does on your product. So for example, viewing a page is a really important key journey for us, and the team that owns that key user experience is responsible for understanding the operational metrics of the action, right from the action the user takes to how the request gets sent and all the dependent services that it then touches. So holding ourselves accountable not only for our own code but for the key user journeys, and for all the different parts of the code that they touch, is critical.
Swati Raju: And then having visibility and alerting not only for the components that, say, your team owns, but also for the underlying dependencies, becomes super key. What we ended up with then is a whole lot of rich dashboards and alerting for exactly what the customer was experiencing versus server-side reporting. So moving on, before I jump into the next principle.
Swati Raju: We collected data at Confluence and found that bugs in software are the majority of the root causes for incidents, or what we call it when a customer has an outage. As much as 50% of our outages were caused by bugs in code. So this whole idea of move fast and break things does not always work. So don’t get me wrong. There might be situations and places where the mantra of move fast and break things works really well. When I was at a startup, trying to get something out into the market, where my goal was to prove my hypothesis on product market fit, move fast and break things absolutely works.
Swati Raju: However, when you’re working on a product that users depend on for their livelihood and revenue, this strategy needs some rethinking. For a B2B product like Confluence, where our users depend on us for timely collaboration and real-time reference documentation, taking such an approach is, dare I say, irresponsible. So in fact, Mark Zuckerberg, the CEO of Facebook, who famously coined this phrase, or at least is credited with having coined this phrase, move fast and break things, announced in 2014 at F8, which is Facebook’s annual developer summit, that they had changed their mantra. So wait for it, move fast with stable infra.
Swati Raju: So unfortunately this phrase doesn’t pack the punch of the original move fast and break things, but you get the point. It’s addressing the reality of a product that has to support its users at scale. So the point I want you to take away is that speed of change should always be balanced with, A, the ability to detect and recover quickly and, B, the ability to limit the blast radius when something breaks.
Swati Raju: So some of the ways that I’ve seen to successfully balance moving fast with ensuring reliability have been progressive rollouts, where a very small percentage of users incrementally get changes, so we gain greater confidence in the reliability of that change. Another way would be longer soak times, where we leave the newest production version on our internal instances for enough time that we can exhaustively use it internally. And then the last one is really this idea of early detection, looking for anomalies in the production pipeline.
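To make the progressive rollout idea concrete, here is a minimal Python sketch (not from the talk, and not a description of how Atlassian does it). It assumes each user has a stable string ID and hashes that ID into one of 10,000 buckets, so a user who gets the change at 1% still has it when the rollout widens to 10%. The function name `in_rollout` and the feature name are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage.

    The hash makes the decision deterministic per (feature, user), so a
    user never flips in and out of the rollout between requests.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # e.g. 1% covers buckets 0..99


# Usage: widen the rollout in steps and count how many simulated users see it.
users = [f"user-{i}" for i in range(100_000)]
for percent in (1, 10, 50, 100):
    enabled = sum(in_rollout(u, "new-editor", percent) for u in users)
    print(f"{percent:>3}% rollout -> {enabled} of {len(users)} users enabled")
```

Including the feature name in the hash means different features do not always pick the same users as their early cohort.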
Swati Raju: Additionally, what I found incredibly useful is moving to this idea of shift left in our approach to software development, where we shift the effort for improving the quality of the software earlier in the development process. So what we found at Confluence is that 89% of our hots and incidents could have been avoided by just adding more detection in the predeployment testing.
Swati Raju: So that should make it really clear. Investing in prevention early in the development pipeline is absolutely crucial. This brings me to my next principle: when we ship changes, we must always expect the unexpected and anticipate and plan for failures. So we need to build systems that can embrace failure as a natural occurrence, even if we do not know what that failure might be. So one of the methods that we have been using is load testing.
Swati Raju: It’s a really good idea to load test significant features with what might be a simulation of peak traffic, so you can identify bottlenecks for example. Throttling and limiting. So when planning operations in the cloud, we want to know what are the upper bounds and limits that can be consumed. This can be critical for us to design our systems. In some cases, we want to do throttling so that only a very, very small percentage of users might get a bad experience, the 0.0001%, and the 99.99% get an awesome experience.
Swati Raju: And obviously you want to work with your product teams to figure out what that best trade-off would be. And then the last one is around reducing blast radius. So it is important to manage the components that are impacted without the need for the overall system to go down. So we need to build this into our fundamental planning, so that failures never reach the point where they impact the overall health. It’s always very, very local. So I’ve shared a ton with you.
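As an illustration of the throttling idea (not from the talk), here is a minimal Python sketch of a token bucket, one common way to cap how fast a client can consume a resource. The class name, the rate and capacity values, and the injectable `clock` parameter are all assumptions made for this example.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so tests can control time
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage: a hypothetical limit of 5 requests per second with bursts of up to 10.
bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]
print(results.count(True), "allowed,", results.count(False), "throttled")
```

Requests that are rejected are the small share of users who get the degraded experience, and the rate and capacity are the trade-off to agree on with the product team.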
Swati Raju: You can do all of those things, but here’s some really bad news for you. Things will still break and shit will still happen. The best that we can do in these situations, A, is to recover our systems quickly. And B, is to show some customer empathy. Designing and planning for error states really shows that you are bringing empathy to the most rotten situations.
Swati Raju: So here’s an example of what the failure pages looked like in Amazon Prime, 2018. When all else fails, bring the dogs and the cats, right? And then finally, every outage or incident is an opportunity to learn. It helps us think about how we could have avoided it in the first place, how we can mitigate its impact in the future and how we can reduce the blast radius in the future. So here’s a recap of the three principles.
Swati Raju: And I hope you can remember these: accurately measure customer pain, move fast and break things does not always work, and expect the unexpected so you can anticipate and plan for failures. Finally, I want to leave you with this thought. So no matter what your role, you should be thinking about your reliability. If you are a developer, really think through all the unexpected things that can happen when your code goes to production. What are the absolute crazy wild things that could go wrong?
Swati Raju: Because let me tell you this, they will go wrong. And if you’re an engineering leader, go beyond just thinking about tooling, metrics, processes, and habits that your team needs to do for the stability of your product. And really think about building that culture of reliability, because that truly shows how much you care for your customer.
Swati Raju: With that, I will wrap up, but I will make a quick plug for teams at Atlassian that are hiring. I put a link at the bottom there. If you’re interested in working to support teams that will tackle some of the next big challenges for humanity and really care about working in a culture of awesome diversity, do go ahead and check those out. Well, that’s all I had. Thank you so much for your time. Enjoy the rest of the conference.
Sukrutha Bhadouria: Thank you so much, Swati.