On April 2, Zachary Henderson, Lead Solution Engineer at Catchpoint, spoke at our Test in Production Meetup on Twitch.

Zachary explained how proper RUM and synthetic data (monitoring in production) can be leveraged as a way to also test in production. He also shared that, for actionable end-user monitoring, you need a system that can ingest large amounts of raw data, slice and dice that raw data with any number of dimensions, and visualize it appropriately.

Watch Zachary's full talk. Join our next Test in Production Meetup on Twitch.


FULL TRANSCRIPT:

Zach Henderson: I'm very good. Thank you Yoz for having me. It's a pleasure to talk to you all today.

Yoz Grahame: Excellent. Thank you so much for joining us. And you are going to talk to us about end-user monitoring in production.

Zach Henderson: That's correct.

Yoz Grahame: Which is useful for all kinds of things, especially when you're testing in production. We will start you off in just a moment. Just a reminder that Zach will be presenting for roughly how long? About 15, 20 minutes, something like that.

Zach Henderson: Yeah, just about.

Yoz Grahame: Just about. And we have the Twitch chat live, and myself and Kim, our producer, will be fielding questions in the chat. But we'll also be taking questions with Zach after his talk on all matters of testing in production, and especially monitoring in production. So I will switch it over to your slides. Take it away, Zach.

Zach Henderson: Great, thank you. Yeah. And so good day everyone. Happy Thursday. As Yoz was saying, Zach Henderson here. I'm a solution engineer at Catchpoint. We are a monitoring company. So today's talk is all about end-user monitoring. With the idea of this group, testing in production, there's a whole lot that goes into testing in production. Monitoring is just one key part of that, usually after release, post-release. This is where you're trying to guarantee that the systems and services that you built are actually working for the people that you built them for. I know there's a whole lot to testing in production, so today's talk is mostly on that monitoring piece. But if you have any questions about other pieces, or want my thoughts on them, I'm happy to talk through at least what I know on those angles as well.

And so the general idea of end-user monitoring in production is the fact that it is in production. You're essentially monitoring the world, and the world is asking, Hey, can you tell me how your real users actually perceive your app or service's performance and its availability? Because if you understand how real users perceive it, that's basically the key to optimizing the right things, finding the right issues to fix, and improving your user experience overall. And when we say production and monitoring of it, it's a very big world. The app that you've developed, that you've worked hard on, will run on thousands of different devices, from smartphones, laptops, TVs, gaming consoles, different websites, web browsers, all those things. And those devices that you're trying to get an understanding of, from around the globe, in different countries and different regions, all have a different set of capabilities, resources, and constraints. And there's a whole variety on the network piece.

The internet is not just one big network; it's a bunch of little tiny networks talking to each other. And so the question is, can you monitor that type of interaction for your users? It's just a lot of variables to consider. So in general, the idea is that those different things all work together for a user. You have your app or system that you've developed, you've pushed it to the cloud, or multiple clouds, maybe you have multiple data centers, or you're co-located somewhere, for example. Your users want to access those apps, the way you've delivered those apps. And so they'll come from their homes, their coffee shops, their offices, their environments, with different browsers, different machines, from anywhere around the globe. And they'll access things through the Last Mile network that they pay for internet on, and those Last Mile networks carry traffic over the backbone of the internet.

And they'll talk to your DNS providers, your DNS servers, your content delivery networks, security vendors you have, even third-party services that you integrate into the app. And for a user, they don't just see your app in the cloud or in the data center. They see this whole complex user journey. And so it's a lot of things to think about. It's overwhelming a lot of times, just thinking, Hey, I'm building something. Is it actually going to work for the users? So I need to go ahead and monitor that. And the idea is that you can actually boil it down to these four general concepts. These are kind of pillars, if you think about it, which you can build different strategies off of for monitoring your end-user experience. So the first is availability.

Is your application simply up? If it's down, it's going to be down everywhere regardless, right? But then another idea is, is it reachable? Because your app can be 100% up, but if your users can't get to it, then to them it's down. If they can't reach it, it's not available for them. So that's a really big concern that people have to think about. And then there's the idea of reliability: can they actually trust it? If you're developing something that they need to do their work, to see and talk to their friends and family members, all these kinds of different things you use the internet for, can they trust it to do what they expect whenever they use it? And then finally the idea of performance. If you're up, you're reachable, and you have some trust built into those factors.

Is it fast enough to do what they need to do? And your app is going to be completely different than the app someone else is developing, but users expect speed, they expect quality of service, and they don't want to have to wait, because bad performance is really just the new downtime nowadays. And so there are really only two methodologies to actually monitor your real users, the actual end users themselves. You can use Real User Monitoring, which is the idea of real-time data captured from the actual visitors to your app or service. That's the actual user experience being reported back. And there's the idea of synthetic monitoring, where real-time data is captured from robotic users. The idea there is that these two methodologies can really complement each other. So you get a real user understanding of your system, and you get a really proactive, protective understanding of your system from this robotic traffic to your site or service, for example.

And so looking back at those different concepts. Availability. That's basically going to be internal synthetic monitoring. You want to have a robot that's in your data center or your cloud environment that's constantly pinging and probing all the things that you rely on from that area, so you can prove that that robot is up and telling you that things are available. And usually you actually want that robot just a little bit outside of the environment that you're hosted in, so you have out-of-band monitoring of availability. The idea of reachability, a lot of people tend to think of as external synthetic monitoring: can a robot, 24/7, from different areas around the globe, on different networks, Last Mile, backbone, wireless networks, actually reach different parts of my system?

Actually, there's some technology out there nowadays where you can use supporting RUM data to answer this idea of reachability as well. And then reliability and performance, those are both going to be datasets that you can get from synthetic, internal and external, and from real user technology as you get traffic visiting your site. So let's break it down, starting with availability. Availability is: is your application up, and can it respond to requests? It's a very simple thing. It's basically your app in the cloud or your data center asking, am I up? Am I available? Can I actually do what you've built me to do? So essentially that's just active synthetic probes, where you have some sort of robotic agent that's probing different parts of your system and saying, are you up or are you down? And again, it's from those data centers or cloud regions, this idea of availability, because if you're down there, you're going to be down everywhere.

So pure availability is mostly just about where you're hosted, where your application is living. And a quick example for a web app, a very popular example, is mostly just getting an HTTP 200 response code from a web server, maybe doing some sort of content validation. That's the majority of what people are doing, just simple availability checks that say, Hey, can this thing actually reply back in a healthy manner? But for your database, or for your load balancers, or other parts of your system, there may be different types of checks you would do, basically health checks of your application. And then the idea is that you do this both for the user-facing application, the thing that users are actually accessing, the whole package that you're delivering, but also for all your dependent services.
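
To make that concrete, here's a minimal sketch of the kind of availability check described here: an HTTP 200 check plus simple content validation, run from a scheduled synthetic agent. The endpoint and expected marker string are hypothetical placeholders, and this isn't any particular vendor's implementation, just an illustration of the idea.

```typescript
// Minimal availability probe: HTTP 200 check plus content validation.
// The endpoint and expected marker string below are hypothetical examples.
interface ProbeResult {
  up: boolean;
  status?: number;
  reason?: string;
  elapsedMs: number;
}

async function checkAvailability(url: string, expectedText: string): Promise<ProbeResult> {
  const start = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    const body = await res.text();
    const elapsedMs = Date.now() - start;
    if (res.status !== 200) {
      return { up: false, status: res.status, reason: "non-200 response", elapsedMs };
    }
    if (!body.includes(expectedText)) {
      return { up: false, status: res.status, reason: "content validation failed", elapsedMs };
    }
    return { up: true, status: res.status, elapsedMs };
  } catch (err) {
    return { up: false, reason: String(err), elapsedMs: Date.now() - start };
  }
}

// Run on a schedule from inside (and ideally just outside) your hosting environment.
checkAvailability("https://app.example.com/health", "ok").then(console.log);
```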

Because if those dependent services are down, the user will probably see part of the application being down, and to them, it's not going to be a good experience. So the system you built has dependent services; make sure you're monitoring the availability of all those dependent services. And if you're building SLOs, SLAs, and SLIs around the uptime of your services, really understand how those roll up to that user experience. This is where you get into SLA calculus. The idea is that if my database has to be up 99.95% of the time, my users may end up with a somewhat different SLA, at 99% for example. So the calculus there is really important, but for most SLAs, SLOs, and SLIs, it's purely based on this type of availability. So it's a lot of things that you can typically control in your environment.
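
As a rough illustration of that SLA calculus: when a user-facing request has to touch several services in series, the composite availability is at best the product of the individual availabilities, which is why the user-facing number ends up looser than any single dependency's SLA. A sketch with made-up figures:

```typescript
// Composite availability of a request that depends on several services in series.
// These figures are illustrative, not real SLAs.
const dependencies = {
  loadBalancer: 0.9999,
  appServer: 0.9995,
  database: 0.9995,
  authProvider: 0.999,
};

const composite = Object.values(dependencies).reduce((acc, a) => acc * a, 1);

console.log(`Composite availability: ${(composite * 100).toFixed(3)}%`);
// ~99.790% -- noticeably lower than any single dependency's availability,
// so the user-facing SLO has to account for the whole chain.
```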

And so a few things to consider: it's binary in nature. Either you're up or you're down, that's the general feel there. Each service and dependency will have its own availability, but the user is basically asking, Hey, is this service available to me? And that's a really critical thing. And then usually, for a lot of people, it's actually the easiest to monitor, understand, and conceptualize. You can really understand if your service is up or down and you can really get a sense of monitoring that. And so in regards to availability, synthetic traffic, that robotic traffic, is really the only way you can actually measure this uptime. If you rely on your users, what you're basically doing is finding out you're down when you get a bunch of messages on Twitter or a bunch of calls to your support channels. And you can't really rely on traffic volumes, because you have no idea what's causing those dips in traffic.

You only know that perhaps your site or service could be down, for example. And then remember, your app can still be 100% available, it can be completely up, but users can still complain about something not working quite right with the app. Hence the idea of reachability. Can a user from their home or office traverse the actual internet, their local network, talk to your DNS provider, talk to your CDN provider, any security providers you have, any third-party services you've embedded, and get through all those things to actually access the app you have running in the cloud or in your data center? And so this idea of reachability is a great big concept. A lot of people are trying to get their heads around how we can make the internet more reliable to guarantee reachability of our services for our customers.

Because that's really the end game. You want to have your app available and reachable to your actual users. And so a few things to consider when you talk about reachability: that's basically, is my origin reachable? Is my DNS server reachable? Is my CDN provider reachable? If none of those services are reachable on the internet, a lot of times you're going to be relying on data that's in your application or code to help you answer, is my system reachable? And then the idea is that for reachability, a lot of times network conditions matter. So can you actually see, are you reachable from different ISPs? For example, my reachability to Zoom or Facebook or Amazon from Boston on Comcast is completely different from someone else's reachability when they're coming from San Francisco or Seattle on Verizon or some other Last Mile provider.

So where you stand on the internet really changes your view of how things are reachable. And a general sense, too, is that you not only need to monitor the reachability of your code or the systems that you rely on, but third parties as well. So if you rely on data from Facebook, or rely on ad calls that make your service monetarily worth it for the business, or maybe you have APIs you rely on to pull data in: if they're not reachable to you, they're essentially down. So really understand that, because that type of stuff can cascade and impact your actual users. And then a lot of times people tend to think that reachability is something they can't control. I can't do anything about Boston and Comcast having an issue. But the idea is that 95% of your users don't really know that.

They try to get to your site, and they're going to complain to you that they can't reach it. They're going to say, Hey, why can't I get to your service? And they're going to hold you accountable. Because if you think about it, the next thing they do is go to google.com and say, Hey, I can get to Google. Everything's fine where I am, but for some reason, with your system, I can't actually reach it. You're basically being compared to the likes of Google, Amazon, Apple, and Facebook, for example. And these reachability issues, you can't always fix them, but you'll still be held accountable and have to communicate that, and you're going to be basically wasting time trying to figure out, is it us? Is it the network? Is it some sort of third party that's causing issues for our customers?

And usually, again, it's highly network dependent and infrastructure dependent. So think DNS resolution issues. Think BGP route leaks on the internet, poorly performing internet service providers, all these things that you're trying to work with. If you don't have a handle on or visibility into these types of components of your system, they can really cause some issues. But it's not just the network; it can be your application, your infrastructure, the environments you're hosted in, the firewalls you rely on, the load balancers, and things like that. If any of those components aren't working, essentially your app could be 100% running but users can't actually get to it. And another really key thing that I mentioned earlier with reachability is that a lot of people think it's only synthetic data that can answer this, because you have a robot trying 24/7 to reach my service.

But there's actually a nice set of work that Google is doing with network error logging, where you can actually tie in your domain and have Chrome report back anytime a real user who is trying to access your site or service or your domain gets a DNS error or a TCP timeout error. So there's actually been a lot of work recently in getting real user data coming into your site, your systems, your logging, whatever you're using, to say, can actual users get to my system? Is my DNS reachable, or is it failing for real users? Are TCP connections being dropped or aborted, things like that? We can talk more about this in general, but I do think there is some cool stuff you can do with real user traffic to figure out, am I reachable out there on the internet?
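
For reference, network error logging is enabled with response headers on your origin, so the browser knows where to send those failure reports. A minimal sketch on a Node server follows; the collector URL is a placeholder, and the policy values (max_age, failure_fraction) are just example settings.

```typescript
import { createServer } from "node:http";

// Sketch: enabling Network Error Logging (NEL) so Chrome can report DNS,
// TCP, and TLS failures that real users hit when trying to reach this origin.
// The collector URL below is a placeholder for wherever you ingest reports.
const server = createServer((req, res) => {
  res.setHeader(
    "Report-To",
    JSON.stringify({
      group: "network-errors",
      max_age: 86400,
      endpoints: [{ url: "https://reports.example.com/nel" }],
    })
  );
  res.setHeader(
    "NEL",
    JSON.stringify({ report_to: "network-errors", max_age: 86400, failure_fraction: 1.0 })
  );
  res.end("hello");
});

server.listen(8080);
```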

And so once you're available, once you're reachable, you want to answer, am I fast enough for my users? That's where performance comes in, one of those other pillars of end-user monitoring. And a general sense here is, first off, what does performance actually mean to your business? To your users? The metrics you use matter, because those metrics are what you're going to be using to explain to your user base what's going on. And so a lot of people have done work in their environments to say, Hey, what does it actually mean for performance to happen for our users? So you'll see metrics like first paint time, first meaningful paint time, time to interactive, the idea of what a user actually sees when they try to load your site, your service, your system. These kinds of components, these user-centric metrics, are really powerful for explaining and understanding what it actually means for your site or service to be fast for your users.

Basically, if you look at it this way, you can ask, Hey, is my service happening? Is it useful? Is it usable? And if you take a look at these high-level, user-centric KPIs, you can then understand, Hey, is it the network? Is it the code? Is it some sort of third-party issue? Is it a [inaudible 00:15:09] browser version that they're running on? All those kinds of underlying variables you can answer, as long as you have these KPIs to really understand that user journey first, rather than looking at just basic response times or latency on the network or things like that. And so the two concepts we're talking about here: real user data will be the actual performance of your users. You can't really argue with real user browsers nowadays reporting back the latency and performance and experience of your actual visitors.
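
Those user-centric rendering metrics are exposed by standard browser APIs, so a RUM script can pick them up directly. Below is a minimal browser-side sketch using PerformanceObserver for first paint, first contentful paint, and largest contentful paint (time to interactive typically needs extra library support, so it's omitted here).

```typescript
// Browser-side sketch: observe user-centric rendering metrics as they happen.
// 'paint' entries expose first-paint and first-contentful-paint;
// 'largest-contentful-paint' approximates when the main content becomes visible.
const paintObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(`${entry.name}: ${entry.startTime.toFixed(0)} ms`);
  }
});
paintObserver.observe({ type: "paint", buffered: true });

const lcpObserver = new PerformanceObserver((list) => {
  const entries = list.getEntries();
  const last = entries[entries.length - 1];
  if (last) {
    console.log(`largest-contentful-paint: ${last.startTime.toFixed(0)} ms`);
  }
});
lcpObserver.observe({ type: "largest-contentful-paint", buffered: true });
```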

It's all standardized, it's something everyone can agree on, and it's coming from actual visitors. Whereas synthetic data is going to be that proactive protection, that consistent 24/7 baseline of your performance that you can use to alert off of, that you can use to compare your real user traffic against, and that you can use to make continuous improvements and know exactly how your system or service is getting better. If you do synthetic in a way where you try to get as close as possible to real user data, you can really combine that synthetic robotic performance data to see the impact it would have on your real user traffic. And so performance has a lot of variables involved. That user journey that we talked about before is pretty complex.

You have your browser, the version of your browser, your OS, the continents, regions, networks, providers, the pages that people are visiting. There's a whole ton of variables to consider. So when you're looking at a tool to analyze that data, really make sure that you can ask those questions: how is my performance by this browser, by this version, by this network, by this region? And make sure you're not limited to a set of predefined views. So the idea of asking questions of that data, and making sure that data is accurate and of high quality, is really important. Because if you think about it, you're going to use this type of end-user data to make informed decisions on, should we invest in a new CDN in that area? Should we invest in maybe a backup DNS provider, because we're seeing DNS latency causing first paint and interactive issues on our site?

Or perhaps we need to roll back that code that we pushed out and thought we tested pretty well, and maybe refactor what we're thinking about in terms of delivering that new feature or function in the product. So having accurate data and quality data is really important; these are really important decisions you're making about the future of your application and system. But also, if you think about it, the analytics and visualization of that data is almost equally important. If you show someone a number that says, Hey, we load in five seconds on average, and then you show them a line chart, they can say, Hey, that's pretty useful information, but they may not know what it actually means to load, for example. So things like film strips and screenshots and other data visualizations really drive that point home, especially as you communicate to different stakeholders, different user groups, and things like that. Because they have to make their own decisions on how to allocate budget, how to prioritize different features, and how to change priorities when you're developing, things like that.

So making sure you have the right platform to analyze and visualize that data is also going to be really impactful for how you make use of this type of performance information. And then a really key thing about performance: averages can be very misleading. Let's go through a quick example here. We can all agree that each data set we have here is pretty unique. You have a nice dinosaur dataset, you have a nice circle, a bullseye, horizontal lines, stars, things like that. And you can imagine that these are data sets coming in from actual visitors to your site or your service. They won't ever look like a dinosaur, but the idea is that you can see at first glance that there's some sort of delta, some change, in this data. Then I tell you that each data set here has the same average in the X direction, the same average in the Y direction, the same standard deviation in the X direction, and the same standard deviation in the Y direction.

So if you're purely looking at numbers like means and standard deviations and things of that nature, you're kind of missing the point: underneath all that there is some sort of pattern in the data that you may not know about. Your star could be completely different from your dinosaur, so to speak. And so the idea is that in performance, especially in monitoring end-user performance, raw data really matters, because you can draw some really powerful conclusions that may be completely incorrect if you rely on just a calculation or some sort of high-level view of that data, unless you have that raw data to look at and really understand the underlying patterns. And so walking through a bit of an example here: if I said, "Hey, my site on average loads in nine seconds, for our industry that's pretty good."

We compare and contrast against our competitors. But you have no idea what that actually means to your users. So histograms are really important in performance analysis, where you can say, Hey, that average or that mean is basically right here, but we have a lot of traffic that's faster than that, and we have a whole lot of traffic that's slower than that, and that nine seconds does not actually mean anything to any users. In fact, with your average speed or page load time, you can have an example where no traffic is actually at that average and you have two very divergent, bimodal peaks in your data. And even things like time series can say, Hey, yeah, we have an average of our page load time, and we see a spike here. But you don't really know how many people are really slow and how many people are really fast, for example.
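
A tiny sketch of why the single average misleads: a bimodal set of page load times can produce a mean that almost no real user experienced, and only the distribution reveals it. The numbers below are made up.

```typescript
// Made-up page load times in seconds: one fast cluster and one slow cluster.
const loadTimes = [2, 2.5, 3, 3, 3.5, 14, 15, 15.5, 16, 16.5];

const mean = loadTimes.reduce((a, b) => a + b, 0) / loadTimes.length;
console.log(`mean: ${mean.toFixed(1)}s`); // 9.1s -- a value no user actually saw

// Bucket into a crude histogram to expose the two peaks.
const bucketSize = 4;
const histogram = new Map<string, number>();
for (const t of loadTimes) {
  const lo = Math.floor(t / bucketSize) * bucketSize;
  const label = `${lo}-${lo + bucketSize}s`;
  histogram.set(label, (histogram.get(label) ?? 0) + 1);
}
console.log(histogram); // 0-4s: 5, 12-16s: 3, 16-20s: 2
```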

And so underneath that, taking the histogram view and looking at it in terms of heat maps really tells you where a lot of your traffic tends to lie over time. When there's a spike, you can say, Hey, did all of our users actually have poor page load times, or perhaps only a certain subset over that time? So data visualization, having tools that can bucket this data and look not just at numbers or trends but at the actual underlying distribution of that performance data, can be really impactful and can really help you understand your performance, not just report on it and have a number for it, for example. And then there's the idea of reliability, which is the hardest part in my opinion: making sure it's consistent, your availability, your reachability, and your performance.

Because if you think about it, user reliability, the actual thing you're delivering to your user base, is basically an emergent property of many unrelated, independent systems. If I visit a site or service and there's a missing widget on the page, that's an unreliable service, even though everything else on the page can be completely reliable and available. So don't define your own reliability; your users have to define that reliability for you. And if you think about it, if the reliability you're analyzing has been determined by your user base, then you're basically delivering really good customer success, because you're making sure that the users accessing your service are happy, in a way that you can actually measure and monitor.

You're not doing it the other way around, where you say, Hey, it doesn't really matter, our site's available 99.9% of the time. But for you, our particular user, it was down two hours during the day, for example. So really listen to your user base, have them tell you what the site and service reliability should be, and then work backwards from there. And the idea here is that long-term trends also matter quite a bit, especially with the metrics that we've talked about and the dimensions of the data that we talked about. So to show you a quick example of that, let me show you this data set here. This is actually some data from us doing a Google search. The idea is, Hey, Google has been pretty consistent in how long it takes to deliver everything when you do a Google search, for example. You can say, Hey, actually they're pretty reliable. They're pretty consistent.

Now, if I were to tell you that this is basically a month's worth of data here, you'd think, well, Google is doing a really good job of keeping their search times pretty consistent. But then think about the long-term trend of this. Let's zoom out a bit and look at about a year's worth of data. What you start to see is that there's actually quite a bit of a delta here in terms of how long it takes that search to run in Google, for example. That dataset I was showing you before was just November, right here. When you actually look at it and say, Hey, show me this trend over a full year, you start to see that they actually got a bit slower in October and November, and then they improved and got better in January, February, and March, for example.

The idea is that long-term trends matter, and having the data quality behind it to really ask and see over this type of long timeframe will be really impactful. Because if you just look at very small windows of data as they come in, or let's say even over a month's time, you're going to miss those long-term reliability concerns that could actually impact your business and make it really hard for users to trust your system overall. And so the hard part here is that you need a monitoring methodology, and people and processes in place, that have the rigor of a data scientist. Take a long-term trend and separate it out by the things you know you want to separate it out by, maybe by region or country or ISP, but also be able to ask questions of that data that you may not know about yet as well.

So the idea of observability comes into play here, where as you get this data coming in from real users and synthetic, you have to be able to ask questions of that data. The ones that you can think of, and the ones that you can't think of yet. And so the idea here is, when you have a monitoring system in place, make sure that you can leverage it in ways that you can't think of, which is a real catch-22: how do I know what to think of? So when you ask questions of the data, think about data quality and data integrity, and make sure you can ask the right questions so you're not being misled by some sort of tool and making poor decisions off of it, for example. And so here's a quick example in action. At Catchpoint, we actually drink our own champagne, so to speak, and we monitor all of the SaaS tools that we leverage.

So we're kind of an O365 shop. And what you start to see here is that we have a trend that says, Hey, there's a bit of a dip in availability for some O365 tools like SharePoint, Outlook, and OneDrive, for example, and even a spike in performance during that timeframe. This is an actual example that impacted some of my colleagues. You can even see that this performance spike here was only happening for a certain subset of values. There were some performance ranges that were actually pretty consistent with where they had been over time. That's the idea of a heat map. And what's really important is that you want to ask questions of that data. And so what we did is say, "Hey, let's see if it's actually localized to a particular region or office or city."

And so again, you see we have that raw data, that scatterplot, and you can say, Hey, it actually looks like it's just happening from New York, for example, as opposed to where I'm based in Boston, or my colleagues in Bangalore. So there's some sort of very localized issue here accessing O365 from New York. You have some hints of a problem: it's happening in New York, and it's O365. But then you want to see what it actually means for this to be slow for my colleagues and the customers we work with. And what you start to see is, Hey, it actually means nothing's being displayed on the screen, there's a very long wait before things actually render on the page, and there are even things like high time to interactive. So you have a sense now that, yeah, it actually is impacting actual people.

This is actually how a user would feel that the site is slow, and there's some real issue going on here that we should probably take a look at, try to improve, and talk to Microsoft about. And so, again, the idea that analytics and visualization matter. We're a bit of a small company, about 250 employees, but we use maybe over 100 SaaS apps across engineering and product and sales and all these teams. So we actually monitor all those systems. And what you start to see is that, Hey, there's actually something happening where everything from the perspective of New York and our office there was seeing a spike in performance. And then you start to say, "Hey, it's something in New York, and it seems to be impacting a bunch of different tools that we're monitoring."

But what you start to see here is that, Hey, there's some sort of network component that's different between the failures and the successes. So in our New York office, we have a main ISP, Pilot, and we have a backup ISP provider, AT&T, for example. And you get a real sense here that, Hey, when traffic went through AT&T for just a quick moment, everything got poor performance, and all of a sudden there were some sort of issues going on. So you get a sense that there's some sort of network issue going on here that's impacting the users and the user experience. And as we drill down further, we say, Hey, network data is usually best visualized when you have some sort of path or flow analysis. You actually start to see that when traffic went through Pilot, on its way through other providers to reach Microsoft, for example, there was no packet loss.

But when it went through AT&T, you see packets being dropped and you see high round-trip times. So you have a sense of flow and of how the networks work that power all of the applications running on top of them, for example. And taking that even further: that timeframe where we saw an issue for users, can we confirm that during that exact time window we were always on AT&T? This is where things like time series, and again that raw data, come in. And yeah, actually, here on the third hop of the network path trying to reach Microsoft, you see a very sudden shift, where we were using Pilot as one provider with one performance value on the network side, and then we switched to AT&T and were seeing different performance.

Basically, the actual story is that we had to switch over because they were replacing the cable in our local office; they basically had to upgrade it. So we had to switch over to a backup provider, and this really impacted people during that timeframe. And as you start to ask, Hey, were actual people impacted by this? This is the real user component. What you start to see is that we actually have an understanding from each user, in our offices and on our endpoints, because we see actual traffic coming in to O365 from our offices. And you start to see that, during that same timeframe, all traffic actually died off trying to reach O365. So if you think about what that means, if you were relying purely on real user data to report the issue, you would have no traffic coming in, because you can see it just stops here and then basically starts right back up here.

So the idea is that synthetic data was the only way to actually catch this issue that could have impacted users, but you can also use real user data to answer network problems and things like that. So we were actually able to show that right at the time of the AT&T switchover, everyone had been going through Pilot, and then no traffic was coming in until we got through that quick upgrade we had to do. And you can take this further: let's say some individual was complaining or something like that. You can look down at their individual device, the actual applications, what pages and systems they're visiting, for example. The idea here is that when you're monitoring in production, sometimes you need that high-level view, everything overall, uptime and performance across your entire user base. And sometimes you have to actually go down and say, Hey, someone's complaining, let me take a look at their individual session, their individual data, and see exactly how their experience was.

And so that's kind of the idea of what this stuff does in action, combining real user and synthetic. And I hope you get a general sense here. I'm happy to answer any questions and talk through any different scenarios. I've been at Catchpoint for quite a while, so any tools or ideas you have around monitoring in general, I'm happy to help, and I appreciate your time. (silence)

Yoz Grahame: There we go. I was muted in place. Sorry, I've been rambling for ages. Thank you so much for that.

Zach Henderson: Oh, you're welcome.

Yoz Grahame: That was a great overview of loads of different kinds of monitoring. We're not done with you yet. I have some questions for you, and we'll also be taking questions from the viewers. So if any viewers have any questions, please post them in the chat. We have about 20 minutes left for taking questions. So the first one that springs to mind for me: you went through a whole load of very impressive monitoring and statistics. Actually, I see in the chat that my colleague Dawn is asking the same thing, which is that for somebody who is new to this, looking at these incredibly impressive graphs and all the different things you're monitoring, it seems incredibly daunting to set up for beginners. So if somebody's getting started with monitoring on their own systems, or trying to improve, where should they start? Where's the best value?

Zach Henderson: Yeah, that's a great point. Let me go back here, to those four pillars. So the easiest place to start is with availability. That's basically, as we were discussing, are you up and available? And a lot of times that's where you have these health checks, these very simple components that say, is this app up, running, and available? And then from there you can work towards reachability, which is basically driving either synthetic traffic or understanding that real user data we talked about. That can also be fairly simple, as long as you understand what you want to monitor. So a lot of times with external synthetics, you want to think about uptime, or reachability uptime: monitoring from as many places as possible, from as many locations as possible, as frequently as you can, and then letting that data define baselines for you and using that to improve your strategy over time.

What we tend to recommend a lot of times is that you start with availability and reachability, and then you use that information and build on it to say, okay, we have a good baseline of how users actually experience our uptime and reachability. Can we now use that data to talk about performance? And that's where a lot of people mature, and this idea of monitoring maturity comes into place, where you can start asking some really powerful questions: is my performance getting faster or slower than it has been the past day, the past week? What sort of KPIs should we look for that really indicate user performance and availability? And one really simple thing to do is to rely on RUM at first. RUM has matured a lot, and it's actually fairly simple to get some sort of instrumentation into your web apps and your clients. So that's a fairly easy way to get some data in. You have that traffic come in, and then you can use that information to inform what you should do from a synthetic angle, to say, Hey, where should we actually protect our performance and availability? So letting your users tell you what to do first, and then having the monitors behind that build off of your user base, is a really good way to start.

Yoz Grahame: For RUM, for setting up real user monitoring, is it the kind of thing where you basically include a script tag in your site somewhere, and the monitoring system will just start grabbing a whole ton of data from that script tag?

Zach Henderson: Yep, exactly. And the idea is that most modern browsers, actually all modern browsers, support the API
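
That script-tag style of instrumentation can be surprisingly small. As a flavor, here's a hand-rolled sketch that reports a few Navigation Timing metrics with sendBeacon; the collector endpoint is a placeholder, and a real RUM product's snippet captures far more than this.

```typescript
// Minimal RUM beacon: on page load, report a few navigation timings to a collector.
// The collector endpoint below is a placeholder.
window.addEventListener("load", () => {
  // Wait a tick so loadEventEnd is populated on the navigation entry.
  setTimeout(() => {
    const nav = performance.getEntriesByType("navigation")[0] as PerformanceNavigationTiming;
    if (!nav) return;

    const payload = {
      page: location.pathname,
      dnsMs: nav.domainLookupEnd - nav.domainLookupStart,
      connectMs: nav.connectEnd - nav.connectStart,
      ttfbMs: nav.responseStart - nav.requestStart,
      domContentLoadedMs: nav.domContentLoadedEventEnd - nav.startTime,
      loadMs: nav.loadEventEnd - nav.startTime,
      userAgent: navigator.userAgent,
    };

    navigator.sendBeacon("https://rum.example.com/collect", JSON.stringify(payload));
  }, 0);
});
```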