Matt Watson and Jason Taylor discuss the struggles they had as IT managers that led to the creation of Stackify. Matt is the Founder & CEO of Stackify. Jason is the CTO and joined the company about 6 months after it started.
Years ago most companies monitored the servers that ran their applications but not much else. Fast forward to today, and thanks to virtualization, cloud computing, containers, and serverless architectures, we don’t even care about servers anymore. Application monitoring has become even more critical.
They discuss how Stackify doesn’t even have an IT operations team. Their developers are in charge of monitoring their production systems on Azure and operate in much more of a DevOps mode. Even though companies are leveraging the cloud to easily deploy and manage their applications, performance problems, errors, and outages are still part of the daily life of developers and operations teams.
Application monitoring requires collecting a lot of data, including application exceptions, logs, metrics, and code-level performance analysis. Stackify helps developer teams accomplish this with Retrace.
Stackify also offers a free developer tool called Prefix that helps developers find bugs and performance problems in their code while they are writing it.
Matt Watson: | Hello, and welcome to the first episode of the Stackify Podcast. This is your host, Matt Watson. My cohost today is Jason Taylor. Jason, what do you do at Stackify? |
Jason Taylor: | Matt, I’m the Chief Technology Officer at Stackify, so I’m responsible for delivering the product that our awesome customers use every day. |
Matt Watson: | What does Stackify do? |
Jason Taylor: | Stackify helps people monitor the performance of their applications, and find, troubleshoot, and diagnose bottlenecks and performance issues. Ultimately, I think it helps people deliver their software faster and at a higher quality. |
Matt Watson: | So you mean people have problems with their software? |
Jason Taylor: | People do have problems with their software. We have problems with our software. |
Matt Watson: | It doesn’t work perfectly all the time. |
Jason Taylor: | Not all the time. |
Matt Watson: | Okay. Well, that brings up the question of why Stackify was started, the history of this. Why are we doing this? Stackify was started in the beginning of 2012. I, this is Matt Watson, started the company, after being the Chief Technology Officer and founder of a different company, really to solve a lot of the problems that you just mentioned. When you have a big development team and you’ve got lots of applications, you’ve got things moving all over the place, and it’s hard to track all those things, know if they work, the performance of them, troubleshooting problems. It’s kind of a giant nightmare. Wouldn’t you agree? |
Jason Taylor: | Yeah, absolutely. So that’s what drove me to Stackify as well. We’ve all had these problems before, and way back in 2012 there weren’t a whole lot of good solutions for this. |
Matt Watson: | The solutions that I had used before, and let me hear what you used before, was really a combination of a lot of things. We had different tools for application logging. My team spent over $100,000 to buy Splunk so we could process some log files I needed for one report. It was an extremely expensive thing; and nobody else on the team used it. I used it, literally, for one report; because it was the only way I could query log files across four different servers in real time. |
But then we had different tools for monitoring the servers themselves. We had different tools for application errors, which in our case, a lot of those were just going to the database, which was a black hole. Nobody ever looked at them until we ran out of database space and had to delete the stuff. We had several other tools. Things just to monitor our website. That was part of the problem. We had all these different tools that did all these isolated, different things. I had 40 developers that worked for me and none of them knew how to use any of those tools. The aha moment for me, the big problem was, myself and my other three, kind of most senior, lead developers spent all of our time putting production fires out. We were the only ones that understood how everything worked, where the bodies were buried. We were the only ones we trusted with access to production, to actually log in and do these things. What kind of challenges did you have in your background coming up to Stackify that were similar? | |
Jason Taylor: | Our challenges were very similar. I worked for a company that did software as a service, a company that scaled very rapidly, almost overnight. One of our first challenges was that a lot of our development was outsourced. In house, we only had a couple of people who were in charge of development or were involved with development. And, exactly what you’re saying, we spent, as a result, a lot of our time putting out fires, and dealing with production issues, and dealing with more of the DevOps side of the practice. |
We had the same death by a thousand cuts or death by a thousand tools that you had; and, unfortunately, there were also a lot of unknowns. When a lot of this development had been outsourced, the developers there had baked in error logging or error handling and logging and some of these other things, but hadn’t given much thought to it other than, well, we’ve got to put the data somewhere. And it would go into SQL or flat files or whatever; and, oftentimes, we didn’t even know what data we had available to us to do troubleshooting with. On top of that, we didn’t have very good monitoring overall of our application health. Surfacing performance metrics was a difficult thing to do. | |
Matt Watson: | So, let me ask you this. Were you primarily just monitoring the servers themselves? |
Jason Taylor: | Absolutely. And that was done largely by our colo data center provider, and they’re infrastructure people. They knew nothing about our applications. They could tell us, is your server up and running? Yes. Well, then your applications should be fine, but our application was not fine. Our customers were complaining. |
Matt Watson: | I think there’s a few things there we should talk about because ultimately the culture of how software development is done, how applications are deployed, all of this has changed significantly over the last five years, right? |
Jason Taylor: | Mm-hmm (affirmative). |
Matt Watson: | The tools we’re talking about, a lot of them were very focused on the server. How do we monitor the server? We used Nagios. I knew if we had high CPU or memory or the server was down, and some really basic stuff, but detail about my application was sort of a black hole. If you fast forward to today, Stackify, we’re hosted in Microsoft Azure. We literally don’t even care about our servers. It’s a whole different world. We go to Azure, and we say, “I want to deploy this app. I want to deploy it on 15 servers. Go.” And, it just happens. Our apps auto scale up and down. We don’t deal with things like servers having to go down for maintenance, or we have to install a new version of VMware, or storage, and all these things that would cause outages. |
Jason Taylor: | Security patches. |
Matt Watson: | Security patches. Windows updates, all this sort of stuff. It just happens, which is amazing compared to where we were five years ago. So all of the tools we were using five years ago were so focused on the servers, and the application side of it was sort of a black hole. Then you look at it today, and the servers don’t even matter, almost, anymore because so much of that is taken care of for us. But the application side of it is still really super critical. |
Jason Taylor: | And servers are going away. I think when we first started working in Azure, five, six years ago, the main service that you had available to you for deploying was, you could [inaudible 00:06:37] up traditional VMs. |
Matt Watson: | No. It was just Azure web roles and worker roles to start. |
Jason Taylor: | That’s true. That came a little later. So Azure cloud services. |
Matt Watson: | Yup. |
Jason Taylor: | Web roles and worker roles, which under the covers were VMs in a managed cluster and configuration that could scale together and automatically deploy your applications, and all of that good stuff. It was a little bit, when you think about it now, and look at today’s tool kit, it was a little bit of a lift and shift. You were already running on full Windows servers, you picked your application up, you put it in cloud services, which was running full VMs. You look where we are today with serverless and containerization, the servers have absolutely disappeared into the background, which is why it’s all the more important that developers know how to instrument their applications because they’re dealing very much with a black box, and it’s not what they’re used to. |
Matt Watson: | Well, that goes with that other new trend that people talk about all the time, which is DevOps. Part of that is IT and development have to be working together more as a holistic piece. Some people will argue that DevOps is a culture versus a role, or is that role release automation and infrastructure as code. There’s a lot of different angles that people kind of look at that; but take Stackify, for example. We manage data centers, four different data centers based on different environments. Our production environment has quite a lot of servers, a lot of apps. Jason, you’re the CTO, how many IT operations people work here? |
Jason Taylor: | Last time I checked, I’d have to run the numbers again, I think it was zero. |
Matt Watson: | So the answer is, it’s zero or everybody. |
Jason Taylor: | Right. There you go. |
Matt Watson: | It’s zero or everybody. |
Jason Taylor: | Right. |
Matt Watson: | And for your team we have about eight developers today, is that right? |
Jason Taylor: | Mm-hmm (affirmative). |
Matt Watson: | So how many of them rotate on call? |
Jason Taylor: | Every single one of them. Except for the new guy. And his time’s coming in a couple more weeks. |
Matt Watson: | Yeah, sorry new guy. So that’s the point, we have no IT operations. We have eight developers that own the system on a 24-hour basis that understand how it works, how it’s deployed, how to monitor it, all of those things; and a lot of the key to that is the monitoring. How do we know if the applications are working? We know Microsoft has done a good job for us of keeping the servers up, so I think this gets to the heart of what Stackify does. |
It’s all about the applications themselves. It’s one thing to know if your application is up or down. I mean, it’s easy enough to use some tool like Pingdom, or whatever. They can tell you https://stackify.com is down. That’s great, but mainly the problems that we deal with on a day to day basis are infinitely more complex than that. It’s bad SQL queries or slow SQL queries or Redis is down or Microsoft Azure has decided to crap the bed, or whatever the other issues are, and that’s really the focus of Stackify and Retrace, which is our application performance monitoring product. It’s getting to the why. It’s getting to the root cause. There are lots of things that can tell you that you have a problem. Actually getting down to the root cause of the problem is a way more difficult challenge. So, as a CTO of a company that has to solve that challenge, what is that like? | |
Jason Taylor: | That’s a very good question. I’ll talk to you a little bit about what we’re doing. So certainly, as you mentioned, everyone on our team is responsible for that 24/7 operational support of the SaaS product that we put out there. Everybody’s responsible for performance and the care and feeding of that. The one thing that we’ve started to trend towards is that, as we’ve grown, we definitely have a couple of different development channels. We’ve got new feature development, and the product that we work on every day; but there’s still this big support component, and at the center of that is using Retrace, which, fortunately for us, is our own tool that we- |
Matt Watson: | We actually use our own tool? |
Jason Taylor: | We use it every single day in everything that we do, and that really directs us to the hot spots in our code that need improvement. What we’re doing is, so you talk about the traditional operations, and you talk about a DevOps practice, and what we’re starting to do is we’re building our version of a DevOps practice. What it really is, and the term that we’ve been gravitating towards, is operational engineering. It’s developers. It’s not network people and infrastructure people. It’s developers, and their focus is to use all of this sort of data and insights that we have available to constantly improve our platform; because, as you mentioned, the underlying infrastructure is fairly well on autopilot. The parts of it that matter to us is really infrastructure as code built on top of this great cloud platform. So the focus for our operational engineering team is, what’s the data coming out of production? What’s the experience like for our users? And, having those resources dedicated to just constantly improving the performance of our code. |
Matt Watson: | So what are some of the things that the team can get out of Retrace? They log into Retrace. What do they use it for? What do they actually do when they log in? What is the information they’re looking for? What are the problems they’re trying to solve? What kind of information does it provide them? What is the value of that? |
Jason Taylor: | There’s a couple of different things, and I’ll tell you what works great for us. I talk to a lot of our customers, and sometimes their entry points are a little bit different than ours. But for us, our team every day, when they’re doing this operational engineering and support, is getting a pulse on overall application health. One of the biggest areas for that is errors and exceptions. So we ship all of our errors and exceptions to the Stackify appenders. We’ve got a great log4net appender. So log4net [inaudible 00:13:11] something that comes naturally for developers. It’s plugged into all of our applications. You just drop in the Stackify appender; and poof, magically, all of your data is in one central spot from all your servers, all your applications, aggregated nicely, graphed nicely. |
That’s where we always start. We’re looking for error rates that are trending up, for new errors that have introduced themselves that we haven’t seen before, because that’s usually the number one indication that something’s gone wrong, especially if you’re doing a lot of proper logging and error handling. If you’re having a SQL connection issue, we’re going to see that instantly. All of a sudden you’ve got this big spike in the graph, and you have thousands and thousands of SQL connection errors. | |
Matt Watson: | We recently had an issue with that, right? |
Jason Taylor: | Mm-hmm (affirmative). |
Matt Watson: | So we talked about we’re hosted in the cloud, and no matter if you’re using Azure or AWS, who you’re using, they help do so many things for us. We recently had some problems around software with SQL connections and provisioning. We were having some performance problems that were affecting a very small subset of our clients, and the way that we found it was by that, right? |
Jason Taylor: | Yeah. |
Matt Watson: | Like we’re getting SQL timeouts and random different sorts of SQL exceptions that we’re seeing, and a lot of times that error data is your first level of defense. It’s not necessarily always about performance, it’s also about errors. Some of the tools like ours, that exist, are really focused on performance; and performance is important. But, if you’re only paying attention to things that take longer than two seconds or five seconds or ten seconds, there are a whole lot of things that fail immediately. |
Jason Taylor: | There are. |
Matt Watson: | A rocket launch can explode within the first few hundred milliseconds. Your connection to SQL server can fail within the first 100 milliseconds. So I think that’s one of the benefits that exception handling and reporting provides developers. They are a kind of first level of defense. It’s their eyes and their ears. I would say it’s always the most important thing to watch out for while you’re doing a deployment, and just after a deployment. Because it almost never fails that the team does all this great work for a week, two weeks, four weeks, whatever it is, and they ship the code to production; and, almost immediately, they find some sort of problem. It’s inevitable. This is your first line of defense of recognizing and finding these problems, so that you can then do a hot fix as fast as you can. |
Jason Taylor: | Right. And aggregating all of these exceptions and errors into one central spot from all your applications can help reduce your chances of getting focused on a red herring, as well. So the example you just gave, a SQL server, and a lot of more traditional shops that aren’t using a tool like ours, like you said, there’s a bunch of point products. One of those might be some SQL monitoring that a DBA uses. |
We had a different incident a couple weeks ago. Again, a platform incident, that if you were only looking at the traditional tool that’s monitoring SQL, the DBA might sound the alarm and say, “SQL’s down. SQL’s down. We got a problem.” Then everybody’s scrambling to try to figure out why the SQL server’s down. However, what we saw, because we have all of this aggregated to one place, is we were having problems with SQL. We were having problems with Redis caches. We were having problems with accessing our storage and our service bus queues. And, being able to see all of this from all of the different sources, we were quickly able to say, “No, it’s not a SQL problem. It’s an underlying networking issue that’s impacting all services.” | |
Matt Watson: | You mean Azure has problems? |
Jason Taylor: | No, no I’d never say that. [crosstalk 00:16:58] I’d never say that. I would just refer you to their service dashboard status history and let you determine on your own. |
Matt Watson: | Right. |
Jason Taylor: | But certainly it was, that day, an Azure problem. We’ve got really great relationships with the operational team there. We were able to quickly raise the flag. They had already seen it. We just saw it almost as quickly as they did. Sure enough, there was an underlying network issue within an entire region that was impacting every other service. In my past roles, that’s the sort of thing where we would have spent probably hours looking for a problem in the wrong place. |
Matt Watson: | So, that gets to another key point of this. One of the best things that Retrace does, and any sort of application performance monitoring product can do, is help you understand the performance of your different application dependencies. So to your point, if you only would have been focused on SQL, that’s one side of the story; but we’re also talking to Redis, which is a whole different thing for caching, and then we’re talking to Azure storage because we’re accessing table storage and blobs. We’re using Azure service bus. We have multiple application dependencies. |
Jason Taylor: | Plus a number of third party APIs for different services. |
Matt Watson: | Right. So the key is that tools like this can help you understand the performance of every one of those dependencies. So the way that Retrace, and products like it, work is that they instrument your code at run time. So they track key methods in your code that know like, “Okay. I’m executing a SQL command, and how long did that take; or I’m executing a command against Redis, and how long did that take.” So by being able to instrument all of those different libraries, it can accurately tell you the performance of all of those dependencies. |
Many of our applications today have a huge array of dependencies. We’ve talked about several that we have, but a lot of them are also different sorts of web services. As people move to microservices, and stuff like that, really what you have is just a bigger and bigger spider web of dependencies. You’ve got, yeah, you don’t have a monolithic app, but you’ve got lots and lots of dependencies, lots of little pieces all over the place. If any one of those pieces breaks, it impacts everything that connects to it. That’s why it’s so important to have products like this that understand how all the pieces connect to each other. If there are performance problems, or just outages, exceptions, all of those things, you’ve got to be able to see that from one pane of glass and understand what is going on so you’re not chasing the random things, and you can get a clearer picture of what’s going on. | |
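The idea of timing every call to a dependency can be sketched by hand in a few lines. This is only an illustration and not how Retrace works internally: Retrace is described above as instrumenting code at run time, whereas this sketch assumes you wrap each call yourself. The names `track`, `timings`, and `summarize` are hypothetical, and `time.sleep` stands in for real SQL and Redis calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: dependency name -> list of call durations (seconds).
timings = defaultdict(list)

@contextmanager
def track(dependency):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[dependency].append(time.perf_counter() - start)

def summarize():
    """Return {dependency: (call_count, total_seconds)}."""
    return {name: (len(d), sum(d)) for name, d in timings.items()}

# Stand-ins for real dependency calls.
with track("sql"):
    time.sleep(0.02)
with track("redis"):
    time.sleep(0.001)
with track("sql"):
    time.sleep(0.02)

summary = summarize()
for name, (count, total) in sorted(summary.items()):
    print(f"{name}: {count} calls, {total * 1000:.1f} ms total")
```

Grouping by dependency name is what lets a slow service stand out from the rest instead of being buried in one overall request time.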
Jason Taylor: | That’s absolutely, absolutely true. That’s something that with Retrace we’ve spent a lot of time getting right from the first days we sat down and were looking at what needs to go into this product. What’s going to deliver the most value to me as a developer trying to solve a problem that I might be having with my application? The number one thing was identification of the service boundaries and how those are performing, because, very similar to exceptions, that’s a huge leading indicator. Oftentimes, the two will go together. If you have a service boundary where you’re having a connectivity issue or a performance problem, you’re probably seeing an increase in error rate around the same thing depending on how you are capturing all of those errors. But that’s huge for us, especially as we have all these interdependencies on different services, to quickly isolate and identify what that external cause is of your performance issues, and also being able to provide some definitive evidence, I don’t want to say proof, but evidence to that third party provider to help them remedy the issue as quickly as possible. |
We hear from our customers all the time how we’ve helped them isolate and identify a problem very quickly. Oftentimes, it’s a third party that in the past, before they had Retrace, it was always very difficult trying to convince this third party, “Hey, you’ve got a performance problem over here.” But now they’ve got the evidence clearly right in front of them. Because at the end of the day there’s always, if you kind of take your root cause assessment of these performance issues and outages, and boil it down to a couple main categories, you’ve got external dependencies that you have no control over, that are no longer working, or something you’ve done has changed. And obviously if you haven’t changed anything, if you haven’t deployed code, that sort of thing, you’re going to be looking for that smoking gun of, it’s something outside of my control- | |
Matt Watson: | It’s an environmental issue. |
Jason Taylor: | It’s an environmental issue. It’s a third party issue. Try to identify that as quickly as possible; because, as you know, it can be a needle in a haystack sometimes. |
Matt Watson: | Well, I think you bring up a good point. It’s kind of where we started the conversation, where in the old days you had IT operations, and they kind of owned the servers. They might have been on call, but when there was a problem with the application it was hard for them to know if it was a problem, all of these problems that we’re mentioning. They see the server’s up, CPU’s good, the service has started, IIS is running or the Java JVM is running, or whatever. Past that they’re like, “I don’t know.” |
The key to this is to eliminate the finger pointing. Maybe, even today, whoever’s doing on call, if that’s IT operations or someone in more of a DevOps role, whoever it is, the key is they can see, “Okay, the server’s up; but I can see there is a problem with this, this and this, and it’s an issue with Azure or it’s an issue with Redis, or it’s an issue with SQL.” Getting to the root cause of those issues is so much easier. | |
There’s one other thing that makes Stackify and Retrace really unique, in my opinion, and that’s how we also handle application logging for developers. So not only do we handle all the exceptions, but the first thing I want to see when I’m having a problem, is my logs. If you come to me and you say, “Hey, there’s a problem with our software. We need to figure out why.” If my application has logging, that’s the first thing I want to see. That is, hopefully, the single source of truth, if I have got good logging; and, that can help me debug problems. | |
A lot of tools like ours don’t have the application logging tied into it, which we’ve always felt was a huge shortcoming from an APM (application performance monitoring) perspective. The thing that has always boggled my mind is the things that really, really go together are your application errors and your application logs. They’re the same source of data that are both coming from log4net or log4j, or whatever logging appender framework you’re using for whatever programming language. Usually, those things come from the same place. | |
Jason Taylor: | Absolutely. |
Matt Watson: | But then developers use different tools for them, like the errors go to some error tool or nowhere and then logs either go to a text file, which ends up being a black hole, or they go into some logging system, which is great. But then those logging systems, like Splunk and stuff, don’t understand what an error is or what an exception is. De-duping those and identifying that this is the first time you’ve had this error or you had this exact error a thousand times an hour, to them, it’s just a bunch of text. They don’t understand it. That’s one of the things that has just always boggled my mind. |
The errors and the logs are so important to the developers, and they are the things that they want to see when they’re troubleshooting application problems. They really go together, and when you combine that with all the other sort of things that we do around code level performance, understanding the performance of dependencies, and all that stuff; all of those things together just really create that full picture. So we’ve been in application logging for two or three years now. It’s been a really core component of our system. What would you say are some of the best features of our logging that make it unique and different for developers? Why would you want to use Stackify, Retrace’s logging, versus using something else? | |
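The de-duping described above, telling a brand-new error apart from one that has fired a thousand times, comes down to grouping errors by a fingerprint rather than treating them as text. The following is a minimal sketch under stated assumptions, not Stackify's actual algorithm: the event tuples, the `known` set, and the choice of fingerprint (exception type plus code location) are all hypothetical.

```python
from collections import Counter

# Hypothetical error events: (exception type, message, code location).
events = [
    ("SqlTimeout", "Timeout after 30s for client 733", "OrderRepo.load"),
    ("SqlTimeout", "Timeout after 30s for client 12", "OrderRepo.load"),
    ("NullReference", "Object reference not set", "CartService.total"),
    ("SqlTimeout", "Timeout after 30s for client 733", "OrderRepo.load"),
]

# Fingerprints seen before this batch (assumed to come from stored history).
known = {("SqlTimeout", "OrderRepo.load")}

def fingerprint(event):
    """Group by type and location; the message varies per occurrence, so skip it."""
    exc_type, _message, location = event
    return (exc_type, location)

# Count occurrences per fingerprint, then flag the ones never seen before.
counts = Counter(fingerprint(e) for e in events)
new_errors = [fp for fp in counts if fp not in known]

for fp, n in counts.items():
    label = "NEW" if fp in new_errors else "known"
    print(f"{fp[0]} at {fp[1]}: {n} occurrence(s), {label}")
```

A plain text search would see four unrelated lines here; the fingerprint turns them into two errors, one of which has never been seen before.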
Jason Taylor: | Well, there’s a couple things. The first is really the aggregation from all sources. I don’t know if you’ve ever done this. I’m sure you have back before you’ve had a solution like this. Hopping on a server that’s having a production problem, opening up a text log file that’s 300 megabytes in notepad on the server, and then trying to search it and correlate that to the timeline of the problem that you were having, and finding the problem didn’t happen on this server. Got to go to the next one. And, being able to put all that together into a single timeline, so you know the time window of when the problem happened, you can narrow down to that time window, you can see the logs across all the servers, search it in an indexed way. It’s not just a flat file, Stackify and Retrace, we index the data that’s coming in- |
Matt Watson: | So what does that mean exactly? What’s the benefit to the developer when you say indexing? |
Jason Taylor: | One of the great things about that is oftentimes when developers are logging, whether it’s an error or informational, there’s some contextual data to go along with that. So if you’re passing that into our appender as structured data, if we can serialize it, we index each one of those properties and fields. So if I’m looking for a log statement with this object, with this property, that has this value, you can quickly get to that, and eliminate all of the noise, especially if it’s something that would text wise return a whole lot of hits. You can do that very quickly. |
Matt Watson: | So you can turn logs from being just text to being more- |
Jason Taylor: | Very structured data. |
Matt Watson: | Structured data that you can query. |
Jason Taylor: | Absolutely. |
Matt Watson: | And we do a good job internally of virtually, every logging statement we do, we tag the logging statement with the client number, right? |
Jason Taylor: | Right. |
Matt Watson: | So we can go into our logging system and search for a client number 733, and I can see every log message that was related to customer 733, right? |
Jason Taylor: | Right. |
Matt Watson: | So that makes it so much easier if you’re trying to troubleshoot problems that are specific to a client, or things like specific user, or specific transaction, or whatever it is, right, if you tag everything with that transaction number. |
That brings up another really good point. So many of our applications today are very sort of distributed or they’re … We’re doing something in the UI, but that ends up pushing some transaction that happens into a background process. So the user clicks, “Oh, I want to delete this.” Well, instead of deleting it, you’re actually writing it to a queue, and then something else is picking it up off the queue, and then it goes and does what could be some other big long complicated operation. So then trying to tie together what’s happening in the UI, to what’s happening in that other background service, can be very tricky. That’s where this sort of stuff comes in so handy to see all of your logging in one place where you can search by client or by transaction number or whatever it is, and just get that clearer picture across all the servers and all the apps in one place. | |
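Tagging every log statement with a client number and a transaction number, then searching across applications, can be illustrated with Python's standard `logging` module. This is a sketch under assumptions and not the Stackify appender: `ListHandler` is a hypothetical in-memory stand-in for a central log store, and the application names, client numbers, and transaction IDs are made up for the example.

```python
import logging

class ListHandler(logging.Handler):
    """Hypothetical stand-in for a central log store: keeps structured records in memory."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        # Fields passed via `extra=` become attributes on the log record.
        self.records.append({
            "app": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
            "client_id": getattr(record, "client_id", None),
            "transaction_id": getattr(record, "transaction_id", None),
        })

store = ListHandler()
for app in ("web-ui", "queue-worker"):
    logger = logging.getLogger(app)
    logger.setLevel(logging.INFO)
    logger.addHandler(store)
    logger.propagate = False

# The same context follows the transaction from the UI into the background worker.
ctx = {"client_id": 733, "transaction_id": "tx-1001"}
logging.getLogger("web-ui").info("Delete requested; message queued", extra=ctx)
logging.getLogger("queue-worker").info("Delete completed", extra=ctx)
logging.getLogger("web-ui").info(
    "Dashboard loaded", extra={"client_id": 12, "transaction_id": "tx-1002"}
)

def search(**fields):
    """Return records whose structured fields match every given value."""
    return [r for r in store.records if all(r.get(k) == v for k, v in fields.items())]

client_733 = search(client_id=733)
for r in client_733:
    print(r["app"], "-", r["message"])
```

Because the client number is a field rather than part of the message text, the search returns both the UI entry and the background worker entry without matching unrelated lines that happen to contain "733".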
Jason Taylor: | That’s definitely true. There’s, of course, some best practices in there that, if you have largely distributed systems that are processing a transaction across multiple hops through all these different services, we recommend to our customers and friends that they definitely take advantage of this type of logging, and have that contextual data follow it all along the way, so that you can basically hit replay, and see the journey that this data took through, and see what happened. And then the other thing that’s really great is how we’ve merged logging with our call stack level performance data. |
Matt Watson: | Oh, yeah. That’s my favorite. |
Jason Taylor: | It’s great because, let’s say, on the website, you’ve got a slow call. You can go, and you can see that request. You can see the call stack for it. You can see, going through some of your architecture and you get down to your database access layer, and all of a sudden it took 40 seconds to return; being able to see any logging that you’ve done in there might provide you with a little bit more of a context. Who is the customer? Who is the user? What was the action that was being performed? Anything that you’d want to track, you can see that in line with the call stack that was captured as part of that trace, and provide a whole lot more relevance to the problem that you’re chasing. |
Matt Watson: | Well, as you can imagine, the reason we actually call it Retrace is for that reason, right? |
Jason Taylor: | Right. |
Matt Watson: | So you can go back in and see what did my code do. And to me, this is the most powerful feature of our system as a developer. You are able to see request by request or transaction by transaction, that [inaudible 00:30:24] trace of what did my code do. It made these three Redis calls, it made these two SQL queries, it did a web service call. And, as you’re talking about, we also intertwine in there all of the application errors and logging; so you really get that full kind of bread crumbs view of what in the world just happened. And so, that brings something else up that I think we really want to talk about a little bit, which is Prefix. |
Jason Taylor: | Right. |
Matt Watson: | Prefix is a free tool that we started on about two years ago. It’s been out about 18 months now, and all Prefix does is exactly what we just described. It’s a 100% free tool that we’ve developed. The difference is it’s designed to work on the developer’s workstation. So you download it, and it works for both .NET and Java. While you are writing and testing your code, it becomes your fast, immediate feedback loop to answer that question of what in the world did my code just do. |
So it can help you identify SQL queries that are slow, or application errors that are happening, or bad patterns in your code. For example, you’re using Hibernate or something like that, and it runs 50 different SQL queries, and you didn’t know it was running 50 different SQL queries. This tool makes it immediately visible and even provides some smart suggestions. The adoption of Prefix has blown me away. We’ve had about 20,000 people download it, and we don’t even do any promotion of it anymore. It’s been really incredible. Jason, do you have anything to add to that about Prefix? | |
Jason Taylor: | I think you encapsulated that well. I will say it’s been pretty amazing to see the response to that. There were some things that really surprised me when we started getting feedback from our customers who are using it. I knew it would provide value. I didn’t grasp quite how quickly it would do that. We’ve had so many people tweet at us or send our customer support an email or whatever that says, “Hey, I installed Prefix. I opened up my solution. I hit F5. Wow! I just found this N+1 problem with my ORM. I’m making the same database call 40,000 times per page, and I was able to just refactor it and cut my page load times down by 90%, all within that first five minutes of using Prefix.” That’s just absolutely amazing. |
Personally, the few times I’ve used it, I’ve been surprised by some things that I see my application doing on startup and initialization that were expensive and added a lot of load. I was able to say, “Well, there’s no reason for this. I should be able to fix this.” And I can see that as I’m writing code, and I can fix it before it even goes out to QA or [inaudible 00:33:25] environment and becomes a problem for the entire team and possibly our customers. For a free product to do that, it’s pretty amazing. | |
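For readers who have not run into the N+1 problem mentioned above, here is a minimal Python sketch of it using the standard library's `sqlite3` module. It is an illustration only and says nothing about how Prefix detects the pattern: the schema, the `QueryCounter` wrapper, and the function names are all hypothetical. One version runs one query for the list plus one more per row, and the other gets the same answer in a single query.

```python
import sqlite3


class QueryCounter:
    """Wrap a connection and count every statement sent through it."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def execute(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params)


def make_db(customers=50):
    """Build an in-memory database with one order per customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    ids = range(1, customers + 1)
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(i, f"customer-{i}") for i in ids]
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)", [(i, i, 10.0 * i) for i in ids]
    )
    return conn


def totals_n_plus_one(db):
    """One query for the customers, then one more per customer (N+1)."""
    result = {}
    for cid, name in db.execute("SELECT id, name FROM customers").fetchall():
        row = db.execute(
            "SELECT SUM(total) FROM orders WHERE customer_id = ?", (cid,)
        ).fetchone()
        result[name] = row[0]
    return result


def totals_single_query(db):
    """Same result from a single JOIN with GROUP BY."""
    rows = db.execute(
        "SELECT c.name, SUM(o.total) FROM customers c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.id, c.name"
    ).fetchall()
    return dict(rows)


# Usage: compare how many statements each approach sends to the database.
slow_db = QueryCounter(make_db())
fast_db = QueryCounter(make_db())
same = totals_n_plus_one(slow_db) == totals_single_query(fast_db)
print("same results:", same)
print("N+1 version queries:", slow_db.count)
print("single-query version queries:", fast_db.count)
```

With an ORM the per-row queries are usually hidden behind lazy-loaded properties, which is why the pattern tends to go unnoticed until something shows the actual list of queries a request ran.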
Matt Watson: | Yeah. I would highly recommend it to anybody who is listening to this right now. If you have not played with Prefix, it’s free, and it is freaking amazing. We have two products. Prefix is designed for use while you are writing and testing your code. Retrace is our paid product. It’s not very expensive, though. It’s extremely affordable. It’s designed for your servers. But they both really help with that problem of you don’t know what you don’t know. |
So many developers, you talk to them and they’re like, “You know, I kind of think this page is slow, but I thought it was just me. I’m sure it runs perfectly fine for everybody else.” They have a little bit of those spidey senses, but the problem is they never follow up on them, and tools like ours can help them validate those. They can say, “Yup. It sucks. It’s slow. We’ve got to fix this.” Before that, it’s, “Yeah, I don’t know. Maybe. Kind of seems slow.” Those are the types of problems where people install the tool, and immediately they’re like, “Oh, man, I’ve thought for a while I need to fix that.” | |
Jason Taylor: | And how many times have you inherited somebody else’s code? |
Matt Watson: | Oh, boy. |
Jason Taylor: | And you open it up, and you’re like, “Oh, let’s see what this does.” You open it up, and it’s a rat’s nest. Let’s face it. There’s some ugly code out there. I’ve been guilty. You’ve been guilty. Everyone’s been guilty. But we’ve all seen somebody else’s code that’s uglier than our worst code. |
Matt Watson: | You bring up a great point, because no matter how complicated somebody’s code is, whether it’s very procedural, or object oriented, or has lots of layers, or however they’ve architected it, at the end of the day it still calls some SQL queries. It does some web service calls. It does whatever it does. And by being able to see the traces that we collect in Prefix or in Retrace, or both, you get down to what the code did that interacted with things. Even if you don’t understand the code itself, you can still see, “Okay, it does a whole bunch of weird crap. I don’t know what it is, but I know it runs this SQL query.” It helps you understand what the code does. |
Jason Taylor: | And it helps, you know, because rather than digging through the code to try to figure out what you need to do with it, you can look at the code at run time and let that guide you toward what needs the most attention. You’ve seen many times where somebody’s made an acquisition, or an employee who’s the sole owner of a particular application leaves the company, and somebody else has to pick up the pieces; and all they know is somebody’s telling them, “This is slow.” So, what do they do first? They fire up Prefix, and they try to find that big, glaring, nasty problem that should be obvious. |
Matt Watson: | Well, I think this has been a great first episode for our podcast. As we try to wrap this up here, I think one thing we should talk a little more about is the journey of this. We talked a lot about our products, and what we do, and the benefits of them. There may be some listeners who aren’t as familiar with our products, and hopefully this gives them some more detail. |
Obviously, we’ve been doing this for five years now, and it still boggles my mind that we have customers in over 50 different countries that use our product, from every industry and every size of company, publicly traded and private, from airlines to little startups. I think we have a case study on our website from Carbonite, which does online backup, and we’ve got some other case studies with clients we can talk about more openly. | |
It just blows my mind how cross-cutting these issues are. Every development team has these problems. These aren’t problems just at really large companies. They’re problems at companies of all shapes and sizes. What we’re seeing is that more and more is shifting from IT operations to the development team having more ownership. Part of that is the DevOps movement, and I would go so far as to say it’s sort of a no ops movement, as well. Stackify is very much sort of a no ops team. | |
Part of that is the cloud and containers and all these things really support that, where now it’s more about the dev team that creates the product can own it. They can deploy it, and they can use tools like ours to monitor it, and figure out how to optimize it, and improve it. That shift is really happening rapidly across the industry. | |
I talked to a friend of mine the other day who works at a really large insurance company, and I was telling him about what we do and how we’re hosted in Azure and stuff. I told him we manage over a thousand SQL databases on Azure. He was blown away when I was telling him, “Yeah, you can go to our website and you can sign up for a free trial.” Our system automatically provisions a new database for the client, and they can just instantly use it; and if we need more servers, our system auto-scales up and down and all this stuff. He was just blown away by that because he’s such an old dinosaur in terms of the way that they work as a company and the way that they do things. | |
We’re on a complete opposite edge. But what he told me was, “You know what? We have a new executive team, and this is what they are pushing for. They are pushing to get us to this ability, and things like DevOps, and being more agile.” They didn’t even do agile development, which blows me away. But there’s this huge movement this way. | |
Our products can help companies of all shapes and sizes, and it still just blows me away every day the random people that use our products. I mentioned that we have about 20,000 people who have downloaded Prefix, and it’s literally like every Fortune 500 company. It’s amazing. It’s pretty cool to work on a problem like this that helps other developers. I think one of the things that’s most exciting to me is that, as a developer, I get to help other developers. What do you think of that, Jason? | |
Jason Taylor: | Well, that’s our DNA. I remember the first time that we sat down five and a half years ago, we talked about this primarily. We talked about it in the context of developers. It was very obvious then that this is how the industry was going to start trending. We were seeing this happen in companies that we’d been at, or where I currently was. The big enterprise sales model, where you can’t try tools like this quickly and you go through months of training, configuration, procurement, and justifying the cost, is painful for developers. At the same time, they’re becoming responsible for more and more, and these tools are necessary. |
I think that’s interesting because that’s kind of been our journey that’s got us here. But if you want to talk about where we’re going, I think there’s so much more that we can do to help developers tell this story. For me, it’s about helping these developers. If you look at the market, there’s a lot being done around performance management and how that parlays into business intelligence and helping marketing teams and sales teams, and that’s all well and good, but there are still developers struggling and needing help every day. | |
Matt Watson: | They just want to see the log files. |
Jason Taylor: | They want to see the log files. But in the larger sense, developers want to deliver code faster, with more confidence that it’s going to be good code, and with less risk of problems that they may have introduced during the development process. Those are the things that we want to help developers do, because it’s in our DNA. It’s who we are. At the risk of getting all touchy-feely, we love developers. |
Matt Watson: | Sure. |
Jason Taylor: | This is our community. They are our people. Our problems are the same as theirs, and we just want to help them do their job better every day. |
Matt Watson: | Yep. Well, I think that’s a good way to sum up Stackify and what we do and kind of our culture. I think for some of the future episodes we’ll dig in a lot deeper to some of the specific problems we solve in helping people deploy faster with more confidence, all these things. These are definitely use cases of why people buy our software. I think we could talk a whole episode about that. We will definitely do that in the future. We’ll wrap this up. Thank you, everybody. Thank you, Jason, for being on the show today. |
Jason Taylor: | Thanks for having me on. |