Quick experimentation in your application can turn feelings-driven development into data-driven wins. Beyond blue buttons or red buttons, A/B testing can answer key questions on how best to serve your users by offering sometimes surprising insights into how they interact with your product. In this talk, Dena Metili Mwangi will get the first experiment up and running with PlanOut, a Python-based open-source framework. It’s cheap & easy to be A/B testing with open source.
Sukrutha Bhadouria: Hi, Dena, how are you? You’re muted, so you need to unmute. So, hi everyone. I’m Sukrutha. A couple of housekeeping notes as we’ve been doing through the day. We’re recording all these sessions and we’re going to have the videos ready for you in a week. I saw some of you asking questions about the previous sessions. We will also have the slides for you to be able to view. Please share the fun that you’re having, whether it’s through the content, or selfies of your viewing party, or you watching at your desk on social media using the hashtag GGXelevate.
Sukrutha Bhadouria: We’re going to do Q and A at the end if we have the time. So, please post your questions. At the bottom, there is a button for Q and A. If we don’t have time for it, we’ll do it over social media, and I’m sure Dena would be willing to do that for us. Also, we have a job board on our website, GirlGeek.io/opportunities. So, please check it out.
Sukrutha Bhadouria: Now our next speaker is Dena. I’m super excited. She’s a software engineer at Sentry where she works on the growth team. Fun fact, she graduated from Hackbright where she learnt–did a ten week program studying Python. Before that she was a graduate from Duke University and was working as a research analyst at World Bank. Her talk today is about A/B Testing: Cheap and Easy With Open Source. And I’m sure everyone is excited to learn more about this.
Sukrutha Bhadouria: So, thank you so much, Dena, for taking the time.
Dena Mwangi: Hey, thank you so much. I’m going to go ahead and bring up my slides. Can you hear me okay?
Sukrutha Bhadouria: Yes, we can hear you.
Dena Mwangi: Excellent. Okay. Hi, guys. It’s so nice to be here with all of you. Happy International Women’s Day. Today we’re going to be chatting a little bit about A/B testing and specifically how to do it cheap and easy with Open Source. You can find me on Twitter as Dena Mwangi, or on LinkedIn. Feel free to connect.
Dena Mwangi: So, before I jump in, just a little bit about me. As was mentioned, I’m a software engineer at Sentry.io on the growth engineering team. I did go to boot camp, and that’s how I got into tech. I went to Hackbright and I think there are a few Hackbright grads in the audience today. So, hi to all of you. I’m also a data enthusiast. I am into quantified training. So, I really like thinking about data and-
Sukrutha Bhadouria: Dena, sorry to interrupt you, they are soft on your volume. Can you speak up or move the mic over.
Dena Mwangi: Yes.
Sukrutha Bhadouria: You may need to start again.
Dena Mwangi: Okay.
Sukrutha Bhadouria: Yeah, this is better.
Dena Mwangi: This is better? Okay.
Sukrutha Bhadouria: Yeah.
Dena Mwangi: Thank you so much for the heads up.
Sukrutha Bhadouria: Thank you.
Dena Mwangi: Awesome.
Dena Mwangi: Yeah, so software engineer, bootcamp grad, studied economics. So, I really like thinking about the world in terms of data, which is how I ended up in the role that I’m in now. I also really like thinking about diversity, and inclusion, and how to do tech for good. So, I really liked the talk that we just had about AI. So, if you want to talk about any of those things, feel free to connect as well.
Dena Mwangi: Our agenda for today, we’re going to go through what and where is A/B testing? We’ll talk through the general MVP requirements, if you want to build your own. And then we’ll talk a little bit about PlanOut, which is an Open Source framework that you can use to help you out.
Dena Mwangi: So what is A/B testing? Simply put, it’s just a way of comparing two or more versions of a thing to determine which performs better. And the magic sauce that lets us do that is we are able to randomly assign samples of people to each variation and use statistical analysis to evaluate how legit our results are. If we do this correctly, we’re able to take the insights that we get from our small samples and say something meaningful about our larger population, which is what we’re really interested in. You’ll also hear this called split testing or bucket testing.
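As a rough illustration of the statistical step described above, here is a minimal sketch that assumes the outcome is a yes/no conversion (a click, say) and compares two variations with a pooled two-proportion z-test. The function name and the counts are made up for the example; they are not from the talk.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variations perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 100 of 1000 users converted on A, 130 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the gap between the two samples is unlikely to be noise, which is what lets you say something about the larger population.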
Dena Mwangi: Now, where is A/B testing? And the answer might freak you out. It’s everywhere. So, as you are using your applications, as you’re surfing the web, tons and tons of organizations are running A/B tests on us all the time. But, for the most part, it’s because they want to make sure that we’re getting the best out of their products that we can possibly get.
Dena Mwangi: So, one example of this is Netflix. So, while you’re Netflix and chilling, Netflix is running tons of experiments. One of these is what image they show when you’re surfing and trying to figure out what show you want to watch. So, they’ve played around with the title they show you, the image that they show you, and they run experiments to see which one gets the most clicks and which one ends up with more people watching it. The quick example of that is with a show that I love called Sense8. If you haven’t seen it, you should.
Dena Mwangi: So, they ran this when they first had this show out. And this is just three of quite a few variations and buckets of this that they experimented with. So, you’ll notice that they’re playing around with the text, they’re playing around with the image that they’re showing, and they sent this through all their markets. So, if you look at this, try and think about which one of these appeals to you the most. And, in the U.S., if you chose the middle one then you’re in the majority.
Dena Mwangi: So, most people in the U.S. ended up picking the middle one. So, most people who saw this ended up clicking on it and actually watching the show, which is what Netflix cares about. But as with A/B testing, you’ll find that, once you start digging into the data, there are often quite surprising insights to be found. So, while the middle one did the best in the U.S., all of these were winners in different markets. So, the last one won in Germany, the right one won in Brazil, and this actually tends to make a big difference.
Dena Mwangi: So, they saw, once they started running these experiments, a 20 to 30% lift in engagement with people clicking on these titles and actually watching the shows. So, you can make a difference.
Dena Mwangi: One more example for that, quick note, this stuff is hard. Computers are really hard. They do this with tons and tons of different shows and this is one where it kind of went awry. If you’ve seen Tidying Up with Marie Kondo, maybe this is the vibe you get, probably not. But this was a case where they kind of mismatched the image that they were showing in their tests.
Dena Mwangi: So, one other quick example of where A/B testing is is an example that’s kind of famous with Google where they weren’t quite sure which shade of blue they were going to use. And I think things like this are why A/B testing actually has a bad rap, because people think, really? Are we going to spend our time thinking about shades of blue? And actually, yeah, we are. Because this actually translated, by figuring out which one worked the best for their users, it translated to an increase of 200 million dollars in ad revenue. So, A/B testing can end up being quite profitable.
Dena Mwangi: So, if I’ve convinced you that perhaps A/B testing is something that could be useful for your organization, what do you do next? How do you even begin? So, let’s talk through some of the MVP requirements. Really, it boils down to two things. You want to think about how you’re going to bucket people and how you’re going to do it correctly. And the second thing you’re going to want to think about is your data, the data that you’re getting out, because you need to know which of your variations performs the best.
Dena Mwangi: So, for the first bit, you want to think about randomization. You’re going to be randomly assigning your users as they come through, but they’re going to come through your website multiple times, hopefully. And so, you want this randomization, these assignments, to be deterministic, and counting is hard. So, this is a nontrivial task. As you scale out your experiments, you’re also going to want to account for parallel or iterative experiments. So, if you have a user that is going to be exposed to multiple parts of your site, you want to be very intentional about what you’re showing them.
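One common way to get assignments that are random across users but deterministic for each user is to hash the user ID together with an experiment name. The sketch below assumes that approach; the function name, experiment names, and user IDs are hypothetical, and this is not a claim about how any particular tool implements it.

```python
import hashlib


def assign_bucket(experiment, user_id, choices):
    """Deterministically map (experiment, user_id) to one of the choices."""
    # Salting the hash with the experiment name means the same user can land
    # in unrelated buckets for experiments that run in parallel.
    key = f"{experiment}.{user_id}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()
    index = int(digest[:15], 16) % len(choices)
    return choices[index]


choices = ["photo_a", "photo_b"]
first_visit = assign_bucket("profile_image", "user-42", choices)
second_visit = assign_bucket("profile_image", "user-42", choices)
print(first_visit == second_visit)  # the same user always gets the same bucket
```

Because nothing is stored, a returning user sees the same variation on every visit without a database lookup.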
Dena Mwangi: As far as the data, you want to think about how you’re getting the data out for analysis so you can actually decide who wins. You want to think about how you’re linking it to your internal metrics. So, like with the Netflix example that we saw, they really care about people actually watching the show, and they really care about the people who are paying them, how much they’re paying them. So, you want to have a way of linking the success of your experiments to your internal metrics like activation and paid users. And you want to think about how are you going to be seeing this? What does your analysis look like? Do you need dashboards to make that easier for you and your team?
Dena Mwangi: When we thought about this, we had to make a decision between whether we were going to build something or whether we were going to buy something. And there’s pros and cons to both of these situations. So, with buying, of course, it costs money. That’s a downside. These can be pretty pricey. They run up to 40 to 60K sometimes. But, on the plus side, they’re almost ready out of the box. Bit of a negative is you have to do a little bit of extra work to link them to those internal metrics, and you have to also think about do you want to send all your sensitive information about your users out to a third party? If not, which you probably don’t want to do, how do you get that information from the experiments back and connected into your internal metrics?
Dena Mwangi: As far as building it, the downside is, well, you have to build it. So, you have to customize it to your exact use case, which is great, but that takes engineering resources, building and maintaining it. And again, counting is hard, so you have to think about how you’re going to be implementing that correctly and validating the results that you’re getting.
Dena Mwangi: So, when we thought about that at our institution, we decided to use Open Source for the first section, for the first problem of how to bucket people correctly. We don’t want to think about that. We figured if there was someone who has already done the work of implementing that, why reinvent the wheel? So, for the first part we used Open Source, but for the second part, we kind of had our data pipeline already in place. And so, we were able to leverage our existing infrastructure and just hook that into place.
Dena Mwangi: So, things that you get with PlanOut: you get random operators, you get deterministic assignments through hashing, you get namespacing. And to do all this, it’s really simple. If you’re familiar with Python, you’re able to create new experiments simply by inheriting from a base experiment class and modifying the assignment logic.
Dena Mwangi: But, what you don’t get is you don’t get a GUI. So, everything is code based and every time you want to create a new experiment you have to write it out in code and write it out in Python. You also don’t get any post experiment analysis assets. So, the nice dashboards to help make your analysis life easier, those don’t really exist, and that’s something that you have to implement on your own.
Dena Mwangi: So, I find it best to learn about a new tool by walking through an experiment. So, we’re going to walk through a really quick one with a pet adoption profile. So, suppose you had an app that was trying to get a pet adopted. Suppose it’s this guy. And you think that, if you play around with the image that you’re showing, you’ll be able to have more interest and more clicks on this lovely cat’s profile. You also want to have a blurb with it because why not? So, we’re going to have these two images and these two blurbs, which gives us four options that we’re experimenting with and randomly showing to our users as they flow through.
Dena Mwangi: If we wanted to run this with PlanOut and actually have an experiment up and running, this is pretty much all that it would take. It’s some code, and it’s always scary when you see code on your screen, but don’t fear. We’ll walk through it really quick. So, basically what this is is it just pulls from PlanOut’s SimpleExperiment class, and it gives you all your random operators all in this one thing.
Dena Mwangi: What you have to do on your end is tell it the required rules of engagement. So, tell it what you’re trying to do, who you’re trying to experiment on. In this case it would be a user ID. Tell it what you’re varying. In this case we’re varying an image and a blurb. Tell it also how you want to vary this. And, in this case, we’re going to be using uniform choice. We don’t really care, 50/50 split with each one. And that’s all it really needs to know.
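The slide code is not part of this transcript. The sketch below is a plain-Python stand-in that only mirrors the shape described above (an experiment class, a user ID as the unit, and a uniform choice per parameter). It does not use PlanOut itself, and every name, image file, and blurb in it is hypothetical.

```python
import hashlib


def uniform_choice(choices, unit, salt):
    """Pick one choice with equal probability, deterministically per unit."""
    digest = hashlib.sha1(f"{salt}.{unit}".encode("utf-8")).hexdigest()
    return choices[int(digest[:15], 16) % len(choices)]


class PetProfileExperiment:
    """Stand-in experiment: varies the image and the blurb independently."""

    name = "pet_profile"
    IMAGES = ["cat_sleeping.jpg", "cat_playing.jpg"]
    BLURBS = ["Loves naps in the sun.", "Will fetch your socks."]

    def __init__(self, userid):
        self.userid = userid
        self.params = {}
        self.assign(self.params, userid)

    def assign(self, params, userid):
        # A separate salt per parameter keeps the two choices independent,
        # so all four image/blurb combinations can occur.
        params["image"] = uniform_choice(self.IMAGES, userid, f"{self.name}.image")
        params["blurb"] = uniform_choice(self.BLURBS, userid, f"{self.name}.blurb")

    def get(self, key):
        return self.params[key]


exp = PetProfileExperiment(userid=42)
print(exp.get("image"), "|", exp.get("blurb"))
```

Creating a new experiment in this style means writing a new class and changing only the assignment logic.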
Dena Mwangi: But where does this actually go in your code? So, if you played around with Flask, for example, wherever it is that you’d be using this image and this blurb, regardless of what language you’re using, that’s where this would go. So, in this case, if you have a route, then you just throw in your assignments and you’re able to pull directly from them and put them into your template.
Dena Mwangi: But okay, so you did the thing, but where’s your data? For this, all you have to do is tell it how to do the logging in your setup. So, you tell it where you want to log all the things, what file you want to send it in. Whether or not you have a data pipeline, you have this option of just throwing it into JSON. So, as people are flowing through your website and seeing all the options that you’re showing them as you’re randomly assigning them into a particular variation, all of this is getting logged and put into a JSON that looks like this that will make it easier for you to pull from it later on.
Dena Mwangi: And the important bit here is that you’re able to see what the image was that they were assigned and what the blurb was that they were seeing. Also who they are and what time it was, but really these are the two main things that you care about. And that really is it. That’s your first experiment, and you’re ready to go forward and A/B test all the things.
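The logged JSON shown on the slide is not in the transcript. As a hedged sketch of the idea, the example below writes one JSON line per exposure and reads the lines back to count how often each combination was shown. The field names (`event`, `name`, `inputs`, `params`, `time`) and the helper `log_exposure` are assumptions for illustration, and an in-memory buffer stands in for the log file.

```python
import io
import json
import time
from collections import Counter


def log_exposure(log_file, experiment, userid, params):
    """Append one exposure record as a single JSON line."""
    record = {
        "event": "exposure",
        "name": experiment,
        "inputs": {"userid": userid},  # who was assigned
        "params": params,              # what they were assigned
        "time": int(time.time()),      # when it happened
    }
    log_file.write(json.dumps(record) + "\n")


# In-memory stand-in for a log file on disk.
log_file = io.StringIO()
log_exposure(log_file, "pet_profile", 1, {"image": "cat_sleeping.jpg", "blurb": "Loves naps in the sun."})
log_exposure(log_file, "pet_profile", 2, {"image": "cat_playing.jpg", "blurb": "Loves naps in the sun."})
log_exposure(log_file, "pet_profile", 3, {"image": "cat_sleeping.jpg", "blurb": "Loves naps in the sun."})

# Read the log back and count exposures per image/blurb combination.
log_file.seek(0)
counts = Counter()
for line in log_file:
    record = json.loads(line)
    counts[(record["params"]["image"], record["params"]["blurb"])] += 1

for combo, shown in counts.items():
    print(combo, shown)
```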
Dena Mwangi: But, before you do, I will leave you with a few A/B testing sanity tips that we’ve learned on my team that have made our lives a lot easier. The first being, you really want to have well defined metrics of success before you start running your experiment. I think a lot of teams get really excited and they think, obviously, this is going to be great. I’m sure it will be a success, but they’re not very clear on what success looks like. So, before you run any experiment, be very clear to write this down and know what your metrics of success are.
Dena Mwangi: The second thing I would advise is to make sure that you’re doing all your experiments in small, measurable iterations instead of doing large sweeping changes. Sometimes this isn’t always possible. For example, if you have an experiment that’s being run that requires a lot of design or it’s very greenfield, then you might have to do a lot of front end work, a lot of front end cost work. But, for the most part, you really want to be doing this in small measurable iterations. That way you’re able to attribute what exactly changed to give you the lift that you might be seeing in your data. Otherwise, it gets very confusing. Was it the button color that you changed? Was it the language that you changed? It’s unclear. So, do small, measurable iterations.
Dena Mwangi: The last thing is, A/B testing is not a silver bullet. Data is one thing in your toolbox. It’s not the entire toolbox. So, this really should inform your decisions. It shouldn’t be the one guiding light. So, if you see a lift in an experiment, for example, you really want to think about it and look at it in context of the whole picture of your application and what you’re trying to answer. So, with the Netflix one, for example, they could have said, oh okay. This one particular one won in the U.S. Let’s do this everywhere. But instead, they dug deep and they were able to disaggregate and see that, actually, they had different winners in different markets. And they were able to leverage that information and go forth with that and be a bit more successful.
Dena Mwangi: Thank you so much for your time. It’s been so great chatting with you. Happy International Women’s Day. If you have any questions, I’m happy to answer them.
Sukrutha Bhadouria: Thank you, Dena. This was amazing. What a fun image at the end. All right. So, there are a few questions. So, let’s roll through them. Susan asks, do the logs include the end action that is the click event of user wanting to adopt the cat?
Dena Mwangi: That’s a great question, and no it doesn’t. So, this really logs the exposure. So, you would have to do the extra step, which is sometimes non trivial, of having to connect the exposure with the actual action of interest. So, what we do is we have lots of different analytics events. So, in addition to the exposure one, we think about what we want them to do, and what success looks like, and we log that as well separately. And, when we run the analysis, then we combine the two.
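A minimal sketch of that combining step, assuming exposures and the action of interest are logged separately and share a user ID. The records, field names, and the helper `conversion_rates` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical exposure log: which image each user was shown.
exposures = [
    {"userid": 1, "image": "cat_sleeping.jpg"},
    {"userid": 1, "image": "cat_sleeping.jpg"},  # repeat visit, same assignment
    {"userid": 2, "image": "cat_sleeping.jpg"},
    {"userid": 3, "image": "cat_playing.jpg"},
    {"userid": 4, "image": "cat_playing.jpg"},
    {"userid": 5, "image": "cat_playing.jpg"},
]

# Hypothetical analytics events: users who clicked "adopt".
# User 9 clicked but was never exposed, so they are ignored below.
adoption_clicks = {1, 4, 9}


def conversion_rates(exposures, converted_users):
    """Join exposures to conversions by user ID and return a rate per variant."""
    exposed = defaultdict(set)
    for record in exposures:
        # Sets count each user once, however many times they were exposed.
        exposed[record["image"]].add(record["userid"])
    return {
        variant: len(users & converted_users) / len(users)
        for variant, users in exposed.items()
    }


rates = conversion_rates(exposures, adoption_clicks)
for variant, rate in rates.items():
    print(f"{variant}: {rate:.2f}")
```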
Sukrutha Bhadouria: Got it. How do you know what is a good sample size of data to test with?
Dena Mwangi: So, this actually, there’s equations that we run. It’s pretty standard, like statistical modeling that you can just put numbers in, figure out what power you want, figure out what level of statistical significance you want that you’re comfortable with, and play around with that. And that will spit out what sample size you should be going with, at the bare minimum.
Sukrutha Bhadouria: All right. And is there any project too small for A/B testing to be useful? For example, a small app in private beta with only 20 users?
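The talk does not say which equations the team uses. One standard option is the normal-approximation formula for comparing two conversion rates, sketched below. The function name, the baseline rate, and the target rate are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed in each variation for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical goal: detect a lift from a 10% to a 12% click rate.
n = sample_size_per_variant(0.10, 0.12)
print(n)
```

Smaller expected lifts push the required sample size up quickly, since the effect size is squared in the denominator.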
Dena Mwangi: Yes, unfortunately. So, with the statistical significance to be able to say something that’s truly meaningful, you would have to run the numbers and see like what the minimum number would be. But I think 20 would definitely be too small. You want something in the hundreds and up.
Sukrutha Bhadouria: Yeah, that makes sense. And finally, what did you find most challenging when you transitioned from bootcamp grad to working full time as an engineer?
Dena Mwangi: That’s a great question. That should be a talk all on its own. I think for me it was still like a steep learning curve, but it was getting really comfortable asking questions and asking one more question than you feel comfortable asking, and getting over that fear of being seen as not knowing enough or all the imposter syndrome things that come with being a bootcamp grad and being in your first tech role. I think, honestly, that was the biggest thing is just getting over that and saying, it’s fine. I just need to learn the things. So, I’m going to ask the questions.
Sukrutha Bhadouria: Thank you so much, Dena.