Channel: Dataloop.IO Blog

Using Monitoring Dashboards to Change Behaviour


This is not your typical marketing blog-spam, although we do those sometimes. I have been nagged by various people for the best part of a year to write this, so here goes…

To set some context, I’ve been working with developers for the past 14 years inside software companies, usually on the operations side of the fence. I’ve watched first-hand how every one of those companies has tried to transition from creating on-premise software to running a SaaS service.

Every time it has been a total nightmare. I would actually classify it as torture. Companies born in the cloud have it easier, but SaaS is still several magnitudes harder than on-premise to be successful at. Back when I first started my career we used to get emails sent round about Sales Guy X who had sold a $Y million deal, and who knows if that software ever actually got set up or run in production. Probably not; IT projects are known to fail, or were back then. Nowadays someone enters an email address and a password, signs into your service and within 20 seconds has decided whether they want to spend any time on it or not. Often they log off and never return.

Since escaping the world of comfortable and gainful employment I’ve started my own SaaS startup. We don’t have any of the baggage that enterprises have, yet it’s still insanely hard. We’ve been absolutely prolific in our discussions with other online services, not only about monitoring, but about everything related to running an online service. At this point, if you’re running a large online service in London and you haven’t had me and David turn up to chat about monitoring, you’re in the minority. We’ve talked to hundreds of companies, we run the world’s largest DevOps meetup group, DevOps Exchange, and we’re starting to invade San Francisco. Everyone we speak to is having the same old issues.

I’ve wandered around a fair number of offices and the definition of what a dashboard is varies wildly. I’ll list a few of the types I’ve noticed so we can further define which one I’m concentrating on in this blog. In future I’ll probably blog on the rest too.

Analytics Dashboards

Used by a dev or ops person usually to help troubleshoot performance issues with the service. Often found on their 2nd monitor, or 3rd, or 4th depending on how nice they have been to IT. Example things you might find here are New Relic for Devs, Graphite / Grafana for Ops and a multitude of other tools like ELK (Kibana is exceptionally pretty and useful). Whenever we talk to anyone they immediately assume we’re talking about this type of dashboard. You’ll use this type of dashboard to correlate, aggregate, chain mathematical functions and try to coerce streams of data into discernible patterns with the hope you’ll uncover something that needs fixing. A lot of APM tools are getting really good at automating a lot of that stuff.

Grafana Screenshot

NOC Dashboards

I’m breaking these out because I generally see them being used as a kind of shared 2nd screen analytics dashboard. Their audience is generally a group of highly technical people sitting very close to them. Mostly these run <insert on-premise monitoring tool of choice here>: Nagios host group status pages, or in some cases something like SolarWinds. Generally these get rotated with some APM dashboards and even the live website to see if it’s alive. One very large news organisation we visit has their website up with automated click-throughs to news articles. Any big 404s are quite noticeable.

noc_dashboard

Team Dashboards

Most of the companies we talk to are trying to do ‘DevOps’. Often you’ll get various teams sitting with each other and there’s a general consensus among everyone we speak to that they want to take a more micro-services approach to running their service. In this setup you often get what I call Team Dashboards. Usually rotating browser tabs showing things that the team want to see to help them run a better online service. We see a lot of Dashing.js, and a bunch of different tools that give an idea of the number of users online, or marketing type data. It’s often one of the Devs on the team that knocks something up and it stays that way for months.

Dashing Example

Public Dashboards

These are the dashboards on TV screens in public places. You might have one in the kitchen or dotted around the office. These are generally very simplified, quite sanitised dashboards that rarely change.

ka-dashboard-photo

This blog post is about the latter two types of dashboards: the Team and the Public ones. While analytics is certainly very valuable, and companies spend a considerable amount of money on tooling to help with this, in my experience this isn’t actually the biggest problem. You need good analytics, and I’ll do a blog post about that some other time; right now it’s at least number 2 on the list of things I think are important for the majority of SaaS companies I talk to.

So what’s the biggest problem?

It’s people.

At a SaaS company you need to be working on the right thing. If you end up working on the wrong thing you may as well have gone on vacation. In fact, going on vacation and doing no work at all would have been better, since that would have resulted in less wrong code to support in production.

It is crazy easy to get detached from reality in SaaS. I’ve seen situations where teams of people have been patted on the back for successful feature deliveries when the service itself is dreadful. We’re talking 40 seconds to log in dreadful. Or other cases where so much process and red tape is put around deployments, yet when you look at the number of users it’s less than a handful. People have experiences, perceptions and a lot of assumptions they carry around with them. Often these things get in the way, especially when your goal is to move fast and do the important things, like actually making users happy.

Then you have different roles within a team. The developers are head down playing with technology and solving problems. The ops guys are trying to work out ways to keep the service stable and bulletproof by doing things like removing single points of failure. Ops are rewarded for uptime, devs are rewarded for features. Unfortunately product management usually only cares about features too. You end up with this weird divide, and as time moves on the various sides entrench and it’s really hard to undo.

You also have the opinion driven discussions. Everyone has an opinion, some are even valid. It’s extremely frustrating to be in a team where opinions differ on important things. You can end up in an extremely toxic state. Nobody ever wants to be wrong about something so you end up avoiding topics or going with the status quo, no matter how silly.

In all of these cases the solution is dashboards. But not the type I sometimes see when I wander round offices or referred to in the Dilbert strip below. If you have a dashboard that’s black, with tiny size 8 font and graphs that nobody even knows the meaning of then you might as well turn off the TV.

Dilbert Dashboards

The sort of dashboards I’m talking about are ones that change human behaviour. They exist, I have seen them work many times and they are awesome.

If your team and public dashboards aren’t designed to change human behaviour, if they aren’t simple and relevant to your audience, and if they aren’t being updated regularly to help shape the direction of your SaaS service, then my advice is to put the TV screens up for sale on eBay and use the money for something else.

Operations often do not realise the power of the real-time metrics coming out of production systems. You can literally control the product roadmap.

A Real-Life Example

I’ll give an example of how dashboards have changed a previous team.

At one company we suffered months of service neglect. The service was unstable, it was slow and quite frankly it was embarrassing to be associated with. Development were off building features, product management had zero interest in the service, all they cared about were their own lists. As an Ops team we had to beg for time to be spent on critical issues. Management reports from engineering were always glowing as they focussed on delivery dates and other intangible things. Since the on-premise business was keeping things afloat the company wasn’t about to die by releasing a rubbish service. But as you can probably imagine, the levels of frustration started to rise.

One day we wrote some brittle Ruby scripts that polled various services. They collated the metrics into a simple database, and we automated some email reports and built a dashboard showing key service metrics. We pinpointed issues that we wanted to show people. Things like the login times, how long it would take to search for certain keywords in the app, and how many users were actually using the service, along with costs and other interesting facts. We sent out the link to the dashboard at 9am on Monday morning, before the weekly management call.
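To give a flavour, here is a minimal sketch (in Python rather than the original Ruby) of the kind of polling those scripts did. The endpoint, credentials and database layout are made up for illustration:

import sqlite3
import time

import requests


def measure_login_seconds(base_url, user, password):
    """Time a login round-trip against the service."""
    start = time.time()
    resp = requests.post(base_url + "/login",
                         data={"user": user, "password": password},
                         timeout=60)
    resp.raise_for_status()
    return time.time() - start


def record(db_path, metric, value):
    """Append a metric reading to a simple local database."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS metrics (ts REAL, name TEXT, value REAL)")
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?)", (time.time(), metric, value))
    conn.commit()
    conn.close()


if __name__ == "__main__":
    seconds = measure_login_seconds("https://service.example.com", "probe-user", "secret")
    record("service_metrics.db", "login_seconds", seconds)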

Within 2 weeks most problems were addressed. It is very difficult to combat data, especially when it is laid out in an easy to understand way. Within 2 months we had a dedicated development team, a new development process and a system to prioritise service issues alongside features. The most important thing was the change in culture. Everyone had the same goal, everyone was rewarded for common metrics and that resulted in a much more fun atmosphere.

Other Real-Life Examples

Before this blog turns into one of those touchy-feely self-help articles it’s probably best to show a few examples. I’ll add more over time, as I get approval from their creators. Here are some starters:

1. Monitoring Deployments

We moved to a new code deployment model that involves each engineer releasing their own code in an automated way between environments, all the way to production. Before we shifted to devolving the responsibility of releases down to each engineer I was having to spend a lot of time chasing when things would be ready, coordinating deployments, guessing how much risk was involved and then testing code I had no clue about. Luckily, we started to get more US customers so I started travelling and handed all of that over to Colin who agreed it was a very silly thing to be doing. To ensure this was being followed and to help the team adapt we show this dashboard.

deployment_dashboard
https://nagios-public.dataloop.io/?dashboard=54e36685e6a8c1c8072424d9

During a change it is best to over communicate. This dashboard really helped us get a handle on which build was in which environment and how far apart each was from each other. It also charts the number of releases per day which has increased significantly. This dashboard took 5 minutes to create and really helped sanity check the process was being followed during the first few weeks until human behaviour was changed for good.

We even set up an alert rule so that if Staging gets more than 10 releases ahead of Production, the developers are notified that it’s probably time to do a new production release before things get too out of sync and the release risk gets too high.
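The check behind that rule is tiny. A rough Python sketch of the logic (the build numbers would come from your deployment tooling; the ones below are placeholders):

import sys


def check_release_drift(staging_build, production_build, threshold=10):
    """Return a Nagios-style (exit_code, message) pair for release drift."""
    drift = staging_build - production_build
    if drift > threshold:
        return 1, "WARNING: staging is %d releases ahead of production" % drift
    return 0, "OK: staging is %d releases ahead of production" % drift


if __name__ == "__main__":
    code, message = check_release_drift(staging_build=57, production_build=45)
    print(message)
    sys.exit(code)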

2. Broken Builds Stop the Floor

We use Jenkins, and specifically Jenkins with the Green Balls plugin. Even so, the status indicators are a bit small to throw up on a TV screen. What we really wanted to ensure was that nobody was checking in code that would break the build and deploy pipelines. It’s far easier to correct those issues immediately than it is to wait and find out some time later. At that point you may have a bunch of commits that came after and it becomes a horrible mess. If you can’t deploy code then you shouldn’t be writing code; everyone should be reassigned to fixing the pipeline if it breaks really horribly (usually it’s just the person who broke it who needs to jump in and fix it though). After throwing this dashboard up I’m not aware of a single time someone has come to deploy and things have been broken.

jenkins Dashboard
https://nagios-public.dataloop.io/?dashboard=54ea011c61178bf41fcf7a6a
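If you want to drive a similar screen yourself, the data is easy to get at. A hedged sketch using the standard Jenkins JSON API (the Jenkins URL and job names below are hypothetical):

import requests

JENKINS = "https://jenkins.example.com"
JOBS = ["build-pipeline", "deploy-pipeline"]


def job_status(job):
    """Return the last build result: SUCCESS, FAILURE, UNSTABLE or BUILDING."""
    resp = requests.get("%s/job/%s/lastBuild/api/json" % (JENKINS, job), timeout=10)
    resp.raise_for_status()
    data = resp.json()
    if data.get("building"):
        return "BUILDING"
    return data.get("result") or "UNKNOWN"


if __name__ == "__main__":
    for job in JOBS:
        print(job, job_status(job))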

3. Establishing Credibility & Building Relationships

Typically when work is agreed, people go off and no one has any visibility into the progress of that work until they get an email or talk at the next meeting. Dashboards give a real-time view on what’s happening right now, allowing different teams to see the progress of that work in between status updates. Being kept in the loop with such a high level of visibility is great for building relationships. We’ve seen countless examples of this working within an organisation.

One of our customers, DevOps Guys, provides DevOps consultancy to various online services. They wrote this blog on how they used our dashboards to provide visibility to one of their customers while patching a critical security vulnerability. In minutes, they had a dashboard they could share with the client, who could then see in real-time how many of their boxes were fixed.

A lot of the work I’ve done in the past has been about building credibility. Giving people extra visibility into how you do stuff can make you look competent and professional. Once you’ve established credibility, people are far more willing to listen and accept the things you say, and ultimately this leads to positive change.

ghost-after

4. All Hands on Deck!

This Star Trek dashboard is still the coolest dashboard I’ve ever seen. It doesn’t change behaviour (probably), but it is cool, and probably makes it harder for the team to ignore critical issues on the service, like they could an email, ensuring issues get attended to and resolved faster!

Summary

It’s not up to one person to decide what to show. It should be a collaborative group effort where individuals can self-serve these dashboards from a wealth of production data within seconds.

If you are running an online service and aren’t using data to drive how it is run then you are missing out. If you have any tales about how you’ve used dashboards to change behaviour within your group we’d love to hear from you.


Tagged: dashboards, DevOps, monitoring

#DOXLON DevOps Exchange (Mar 15) – DevOps for Windows, an Oxymoron?


This month’s meetup was for the significant number of you who live in the Windows world, and for those that are skeptical, a chance to see what life is like on the other side of the fence.

Most DevOps people out there think ‘DevOps for Windows’ is an oxymoron, but is it? Some of our largest online services in the UK like JustEat.com and ASOS run on Windows, and although Windows has been behind the curve at times, Microsoft is investing heavily to close the gap.

Our next meetup will be on NoSQL for DevOps on 30th April, you can sign up now!

CALL TO ACTION: We’ve put up a speaker registration form so if you’re interested in speaking at one of our future meetups please fill out the form here so you’re on our radar!

Boris Devouge from Microsoft – DevOps on Azure

Boris kicked off the meetup with Microsoft’s intro to the world of DevOps on Azure and how Microsoft is increasingly playing nice with the Open-Source world.

Steve Thair from DevOps Guys – DevOps for Windows in the Wild

Steve talked about DevOps Guys’ experience working with several Windows customers, how they did all the DevOps basics on Windows such as automation and deployments, and some best practices for those of you out there looking to implement DevOps on Windows yourselves!

Russell Seymour from ASOS – A brief introduction into POSHChef

A brief introduction into POSHChef, a new PowerShell based client for Chef. It has been designed to work with Chef but leverage technologies within Windows. The session will cover using POSHChef, compatibility with existing community cookbooks and how DSC is utilized.

Some of our Favorite Tweets From the Night


Tagged: DevOps, DevOps Exchange, DOXLON, microsoft, Slides, videos, Windows

Throwing the baby out with the bathwater


You may have heard the phrase “treat your servers like cattle and not like pets”.

A lot of people have embraced this mindset and the rise of configuration management tools has helped everyone to think about their servers being part of a specific environment and performing a particular role. We advise people to group their servers up into product, environment and role as this makes both deployment and monitoring vastly simpler. This way when you want some more capacity at peak times you can spin up a few more worker nodes in production for your service. If you need a new test environment, just click a button. Have a DR project to set up a cold standby in an alternative cloud? Easy.

So in a world where you supposedly don’t care about individual servers I’ve seen a few worrying trends start to emerge. There is a tendency, and this usually comes from groups with predominantly development-focused backgrounds, to think that you just need to throw a stream of time series data from a service at an endpoint. Then you can look for deviations in your graphs, build complicated functions, and by keeping an eye on all of this you’re fully monitored. Those of us who have been supporting production services for a while know this isn’t quite the full picture.

If we go back to basics, and ignore PaaS which has largely failed, everything has to run somewhere, you either have physical boxes or virtual machines (with or without containers on them). These servers are your building blocks. They determine the capacity available to you and even if you can rebuild them with the click of a button you need to know the current state of your infrastructure.

There are a few fundamental things you may want to know about your infrastructure. How many servers do you have? Are they all on? Has anyone wandered into the computer room, or logged into the admin console, and turned anything important off? These are questions that are easily answered by something like Nagios or the long list of tools traditionally used by operations teams. Usually you’d do a ping check or hit an SSH or agent port. There is a defined polling interval and something reliable that should always respond. It seems realistic that if central polling fails then a server might be hosed.
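The checks themselves are trivially simple, which is the point. A minimal sketch of the idea in Python (host names are placeholders):

import socket


def port_alive(host, port=22, timeout=5):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host in ["web-01.example.com", "db-01.example.com"]:
        print(host, "UP" if port_alive(host) else "DOWN")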

What happens when you simply monitor using streams? Clearly you just set up ‘no data’ alerting and everything works perfectly. But when you think about it, what does ‘no data’ actually mean? It simply means you didn’t get your metrics. Is that because there aren’t any metrics? What is the interval? Has your metrics sender died? (Yes, CollectD, I’m looking at you.) What happens when DNS dies and none of your metrics arrive?

Let’s assume that the metrics always get sent by a 100% reliable sender so that solves one problem but opens up a new one.

What about the alerting? I’ve stopped receiving data for a server, so I must have a server defined in my monitoring tool in order to create an alert rule to detect that. That’s easy, we’ll define a server so we can alert when metrics stop coming in from that host. We’re back to caring about servers again at this point, and by dealing only with streams of data (or no data) coming in we’re guessing if the server is down. We also have to do some complicated stuff when servers are supposed to scale up and down. It’s one thing to know when a server is alive or not, it’s another thing entirely to keep your monitoring tool up to date with rapidly changing environments and not receive a barrage of alerts.

At this stage, what if we agree we don’t need to know the current state of our environment? We don’t care how many servers we have, how many are up or down, and our monitoring map doesn’t need to fit the terrain. All we care about is whether the service is up and performing well for users. Well, that’s a nice idea, but what happens in real life, in large scale environments anyway, is you end up in a total mess. You hopefully spent some time load testing your app in the past and from those figures determined the capacity you’d need. You provision the boxes and then have no idea what’s really going on. The great thing about software is that it can be running fine one minute, then you look away and, while nothing changed, it stops working: you have hit some magical threshold and now you have issues. Having no clue what is going on and relying on ‘everything is working for users right now’ is a recipe for disaster.

While nobody likes staring at a status page full of green and red servers, or receiving the Host Down! emails, they are actually quite useful. Even in a world where servers are cattle it is nice to be able to actually take a quick look at what went wrong before you shoot it in the head. Or even know you have some servers left to shoot in the head.

We took a view a long time ago that being able to answer the fundamental questions was quite important. Providing an accurate map of your environments as they change is important. For that reason a lot of complexity has been pushed into the Dataloop agent and we have various mechanisms for determining whether a server is alive or not. We have a built in DNS client and numerous other ‘keep alive’ failsafes designed to ensure we know exactly what is going on and try to get as close as possible to knowing if something is up or down. Not only that, we have presence detection by holding a websocket connection open. So we know instantly if there is a problem. Regardless of whether you use Dataloop or not though, putting a bunch of resiliency into knowing the state of a server is pretty important.

All of the complexity of managing which servers are registered at any given point is also handled by the Dataloop agent. We did this because we got fed up with how complex everything gets when you try to wrap 15-year-old technology in a band-aid of config management. The complex automation that you would need to write side by side to keep server alerting up to date with purely stream-based monitoring is taken care of by running agent commands, so servers can register and de-register themselves as they are spun up and down. We also designed agent fingerprinting so you can reliably tie metrics to hosts between rebuilds.

Metric sending is actually only one very small piece of functionality when compared to presence detection, registration, de-registration and fingerprinting.

What people generally do in the real world is run something like Nagios alongside Graphite: separate systems joined together through custom scripts that attempt to layer graphs on top of boolean checks to visualise and alert on the streams of data. Unless you plan to do that with your stream processing tool, you’re missing half the picture.


How we scaled our monitoring platform


Monitoring at scale is a hard task so we often get asked by people what our architecture looks like. The reality is that it’s constantly changing over time. This blog aims to capture our current design based upon what we’ve learnt to date. It may all be different given another year. To provide some background we initially started Dataloop.IO just under 18 months ago. Before then we had all been involved in creating SaaS products at various companies where monitoring and deployments were always a large part of our job.

We had a fair idea about what we wanted to create from a product perspective and what would be needed in order to make it scale. Our aim was to build a platform from the ground up that would sit between New Relic (Application Performance Management) and Splunk (Log Management) to provide the same set of functionality provided by Nagios, Graphite, Dashing and complex configuration management tooling.

Customers would still need to write Nagios check scripts, configure 3rd party collectors to output Graphite metrics, and setup their own dashboards and rules. However, we would provide a highly available, massively scalable SaaS hosted solution that took away the hassle of running the server side piece entirely. Everything we do on the collection side is totally open source and standards based which means no lock in. Like many, we had got frustrated by the lack of customisation possible in most of the MaaS tools available at the time.

We would provide a fancy UI that simplifies and speeds up setup and helps gain adoption outside of the operations team in micro-services environments. We would also help anyone create dashing.js style dashboards to change human behaviour without any programming knowledge. Our current product direction is:

A customer signs up to Dataloop, downloads and installs an agent on their servers and magic happens.

Architecture Overview

Architecture

We knew from the beginning that we would likely end up with thousands, if not hundreds of thousands (or potentially millions) of agents. In our case an agent is a packaged piece of software that we provide to customers that performs the same job as the Nagios NRPE agent, or the Sensu agent. There is one subtle difference in our design in that we don’t do central polling in the traditional sense. Each agent has its own scheduler and is responsible for keeping a websocket connection alive to the agent exchange, as shown in the picture. This means we de-couple up/down and presence detection from metrics collection. The agent is passed down a set of configuration which includes which plugins to run. It then inserts those plugin jobs into its own scheduler and sends back the data. In the event of network failure the agent will buffer up to 50MB of metrics which can be replayed when connectivity is restored. We use Chef.IO’s awesome omnibus packaging system to bundle up our Python agent along with an embedded interpreter and all of the dependencies required by our out of the box Nagios check scripts.
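To make the model concrete, here is a heavily stripped-down illustration of that agent loop: a local scheduler runs plugin scripts on their own intervals and buffers results while the connection is down. This is a sketch of the idea, not the Dataloop agent itself.

import collections
import subprocess
import time


class MiniAgent(object):
    def __init__(self, plugins, send):
        self.plugins = plugins      # e.g. {"check_disk.py": 30} script -> interval in seconds
        self.send = send            # callable that pushes a result upstream, raises IOError if down
        self.next_run = {name: 0 for name in plugins}
        self.buffer = collections.deque(maxlen=10000)   # stand-in for the 50MB buffer

    def run_forever(self):
        while True:
            now = time.time()
            for plugin, interval in self.plugins.items():
                if now >= self.next_run[plugin]:
                    self.next_run[plugin] = now + interval
                    output = subprocess.check_output(["python", plugin])
                    self.buffer.append((plugin, now, output))
            # drain the buffer; if the connection is down, keep buffering and retry later
            while self.buffer:
                try:
                    self.send(self.buffer[0])
                except IOError:
                    break
                self.buffer.popleft()
            time.sleep(1)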

On the server side, everything in the green coloured boxes in the diagram above is a NodeJS micro-service. The backend is essentially a routing platform for metrics and Node is especially well suited to IO bound operations. From a deployment perspective we have kept it extremely simple. Each micro-service is packaged up into a .deb and split into app nodes (that run the services customers connect to) and worker nodes that run the workers. These currently run on large physical dedicated servers (4 app nodes and 6 worker nodes). We can add more hardware to each of these areas within a few minutes depending on what our current load looks like. Ultimately, we may end up moving towards Docker on Mesos with Marathon. For now we’re pretty happy with pressing a button in Jenkins which triggers Ansible to orchestrate the setup and Chef to configure.

Everything communicates via a global event bus which is currently in RabbitMQ. We have a single 2 node cluster for redundancy which is currently passing approximately 60,000 messages per second between various exchanges. Metrics per second varies as we stuff multiple metrics into a message (up to 10 metrics per message currently). As we add more load we will shard across multiple pairs of Rabbit boxes. Our basic load testing has shown we can scale to several million metrics per second on the current hardware. When we get close to those numbers we’ll add more hardware, load test, and tweak the design.
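As an illustration of the batching described above, this is roughly what publishing a batch of metrics onto a RabbitMQ exchange looks like with pika. The exchange and routing key names are invented, not our actual topology.

import json

import pika

BATCH_SIZE = 10  # up to 10 metrics per message, as above


def publish_batch(channel, metrics):
    """Publish one message containing a batch of metric dictionaries."""
    channel.basic_publish(exchange="metrics",
                          routing_key="metrics.raw",
                          body=json.dumps(metrics))


if __name__ == "__main__":
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange="metrics", exchange_type="topic", durable=True)

    batch = [{"host": "web-01", "name": "cpu.idle", "value": 82.0, "ts": 1438000000 + i}
             for i in range(BATCH_SIZE)]
    publish_batch(channel, batch)
    connection.close()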

Although it isn’t shown on the drawing we use hardware load balancers for external https traffic and Amazon Route53 with health checks for the graphite tcp and udp traffic. Internally we use HAProxy and Nginx.

We think about the various components in the following terms:

Sources: Agent, Graphite. We may support additional sources later depending on what becomes popular. For our agent we chose Python mostly because that’s what we write our Nagios check scripts in.

Collectors: Exchange, Graphite – these are the end points we host that the sources send to. Things we have thought about here are collectors for OpenTSDB, SNMP, Metrics V2.0, although none of those have significant traction to warrant addition. Our architecture allows us to add any arbitrary interface while keeping the core code stable.

Queues: RabbitMQ does everything. We’ve looked at Kafka but the complexity of running another technology doesn’t seem worth it.

Workers: All NodeJS but we could write these in any language. We already have 10 micro-service packages and I expect that to increase as time goes on.

Databases: Riak and Mongo. Why Mongo? Well, the library support is good and dev time with NodeJS apps is really quick. If you use it in the right way it’s actually very good. We keep Mongo away from the metrics processing pipeline as it becomes very hard to scale once you move beyond simple scenarios. What it is good at is document storage for web applications.

Why Riak? It’s awesome. We were a bit early for InfluxDB when we discussed using it with Paul Dix last year. Our servers are dedicated physical boxes with striped SSD RAID and it’s not inconceivable that we’ll lose a box at some point. We sacrificed a lot of development time building a time series layer on top of Riak in return for a truly awesome level of redundancy.

People always pop up and ask us ‘why not database X?’. Some we have tried and found too slow, others had issues in the event of node failure, and some we haven’t looked at. At this point we’re pretty happy with Riak and until we’re a bigger company we can’t dedicate a lot of time to moving technologies. Although, we are guilty of playing with new stuff in our spare time (currently looking at Druid).

Console: Single page app written in Backbone with Marionette. We wanted a single public API that we could give to customers and a single page app sitting on top that would give our real-time metrics processing an instant wow factor. Like every company that’s full of Javascript developers we’ve started to add a bunch of reactive stuff into the console.

Metrics Pipeline

When we talk about metrics what we mean currently is converting Nagios format metrics and Graphite format metrics into our own internal format for processing.

A Nagios Metric might be: “OK | cpu=15%;;;; memory=20%;;;;”

Nagios scripts also return an exit code of 0, 1, 2 or 3 depending on whether the result is OK, warning, critical or unknown. We process these in line with the performance data metrics.

A Graphite metric might be: “load.load.shortterm 4”
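To show what that conversion involves, here is a rough sketch of parsing both formats into a common structure. This is illustrative only and simpler than our real internal format.

import time


def parse_nagios(output, exit_code):
    """Parse 'OK | cpu=15%;;;; memory=20%;;;;' plus the script's exit code."""
    status, _, perfdata = output.partition("|")
    metrics = []
    for item in perfdata.split():
        name, _, rest = item.partition("=")
        value = rest.split(";")[0].rstrip("%")   # drop warn/crit thresholds and the % unit
        metrics.append({"name": name, "value": float(value), "ts": time.time()})
    return {"status": status.strip(), "exit_code": exit_code, "metrics": metrics}


def parse_graphite(line):
    """Parse 'load.load.shortterm 4' or 'name value timestamp'."""
    parts = line.split()
    name, value = parts[0], float(parts[1])
    ts = float(parts[2]) if len(parts) > 2 else time.time()
    return {"name": name, "value": value, "ts": ts}


print(parse_nagios("OK | cpu=15%;;;; memory=20%;;;;", 0))
print(parse_graphite("load.load.shortterm 4"))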

metrics

Each agent binds to a particular exchange which means chunks of data from the same source can be sent to the metrics workers for processing. The Graphite data is a little harder to process since the data could come from any host. To solve this we use a consistent hashing algorithm to direct messages to the correct metrics worker for processing. Data is then inserted into Riak sharded by time and series.
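The consistent hashing itself is a simple idea: hash each series name onto a ring of workers so the same series always lands on the same worker. A toy version (worker names invented; real rings usually add virtual nodes for better balance):

import bisect
import hashlib


class HashRing(object):
    def __init__(self, workers):
        self.ring = sorted((self._hash(w), w) for w in workers)
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def worker_for(self, series):
        """Pick the first worker clockwise from the hash of the series name."""
        idx = bisect.bisect(self.keys, self._hash(series)) % len(self.ring)
        return self.ring[idx][1]


ring = HashRing(["metrics-worker-1", "metrics-worker-2", "metrics-worker-3"])
print(ring.worker_for("web-01.load.load.shortterm"))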

In order to increase throughput we have implemented various bucket types in Riak. Metrics initially get written to an in-memory bucket and are then rolled up for persistence. Colin was invited to give a talk about event processing in Riak at last year’s Ricon.

Alerts Pipeline

alerts

The metrics workers split the pipeline into 3 parallel tasks. Time series storage happens as you would expect, with the data being written to Riak.

Live updates are sent directly to the browser to provide real time metrics. With the Graphite endpoint it is not unusual for customers to stream 1 second updates to us. We want to ensure that those updates appear instantly.

The alerts engine itself is a state machine that works on each metric to decide what action to take. In the future we intend to add additional exec actions so that customers can write scripts that we execute to automatically fix problems, or automate their run-books via our ‘if this, then that’ style rules.
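As a toy illustration of that state machine idea: track the last known state per series and only take an action when the state changes. The thresholds and the notify action here are invented.

OK, WARNING, CRITICAL = "ok", "warning", "critical"


class AlertStateMachine(object):
    def __init__(self, warn_at, crit_at):
        self.warn_at = warn_at
        self.crit_at = crit_at
        self.states = {}   # series name -> last known state

    def evaluate(self, series, value):
        """Return an action string on state change, or None if nothing to do."""
        if value >= self.crit_at:
            new_state = CRITICAL
        elif value >= self.warn_at:
            new_state = WARNING
        else:
            new_state = OK
        old_state = self.states.get(series, OK)
        self.states[series] = new_state
        if new_state != old_state:
            return "notify: %s went from %s to %s (value=%s)" % (series, old_state, new_state, value)
        return None


machine = AlertStateMachine(warn_at=70, crit_at=90)
for value in (50, 75, 95, 40):
    action = machine.evaluate("web-01.cpu.percent", value)
    if action:
        print(action)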

Challenges

From day one we’ve been working for paying customers who have helped drive our roadmap. Initially we started with a simple agent and Nagios check scripts that run every 30 seconds. Scaling that up isn’t terribly challenging as the volume of metrics is within the boundaries of what you can process on a single node with some failover. As you scale up you can just shard. It was obvious that we needed a ‘push gateway’ so that metrics could be streamed in at high resolution alongside the Nagios data. What wasn’t obvious was how quickly in the product roadmap this would be required. We chose Graphite for this as it’s incredibly popular and expands our out of the box collection significantly, however it also creates a big data problem and means you end up needing to become a distributed systems expert. We’ve hired some clever people but the technical challenges of processing at this volume means that everyone is on a steep learning curve all of the time.

Writing our own time series data store was also not something we wanted to do. If we had started later I believe we would have gone with InfluxDB and saved a bunch of time. Eventually they will add the features we desire and we’ll be sat watching from the sidelines knowing we’ll need to write those ourselves. Not much we can do about that now, other than make the product so awesome that enough people buy it that we can afford the migration time later.

We have in the past accidentally pushed some data required by the processing pipelines into Mongo. This was done for convenience and speed, but it came back to bite us. If we had spent the time to plan this out properly and designed for it in Mongo it would have been fine, but unfortunately we didn’t, so we’ve had a bit of a hard time extracting all of that into Riak while in flight.

NodeJS has worked out pretty well in general. The entire team have a high proficiency in Javascript and code is shared via npm packages across every component, including the front end. We have hit some problems with concurrency and memory leaks, but over time we’re getting better and those are being resolved quickly. We’ve also been hit with a few type-related bugs in the past where metric strings have been added to floats, causing outrageously wrong data. Wherever types are important we now add a lot of unit tests and code using safer parsing options.

The agent itself has also been a massive challenge. After the first few iterations we’re on version 3, which is an omnibus installer with an embedded Python interpreter. We had numerous issues with PyInstaller and various threading problems. Writing code that runs on somebody else’s machine is hard. You never know what crazy stuff might happen. Unfortunately, some companies spend a while rolling out our agent manually so it’s hard to ask them to update to newer versions with bug fixes in them. We’ve had to put a lot of time into testing and building stability and reconnect logic into the agent. Sometimes it feels like we’re working for NASA designing software to run on the Mars Rover when we code for the agent, as once it’s released you may never get the chance to fix it.

Wins

Overall going with a micro-services style architecture has been a huge help. We are able to split services across boxes and isolate issues very quickly. Being stateless means we can just spin up more. Hooking everything together over a global queue has also simplified things immensely, as we don’t need service discovery or any of the other complex tooling you’d expect to find when using REST.

Although we believe picking a different time series database might be a better option in the future we also gained a lot for free from Riak. We have lost nodes in the past and it hasn’t been a big deal. Adding more capacity is also a breeze, as is monitoring it. From an operations perspective Riak is amazing.

Summary

Usually these types of post start with statistics about X customers with Y agents and Z metrics per second. From a technology perspective those are all fairly irrelevant. YouTube probably processes more bits per hour than we will all year. What does matter is what you are doing with the data and why. Hopefully this blog has helped explain a little about our architecture, what we’re doing behind the scenes, and why.


#DOXLON DevOps Exchange (Apr 15) – NoSQL & DevOps


Did the appearance of NoSQL influence DevOps? Or was it vice versa? What’s sure is that NoSQL technologies are both used by many DevOps-friendly tools (e.g. Elasticsearch in ELK) and the target of much automation and management when they are part of distributed systems.

This meetup was for those who have the scars to show they know what they’re talking about, and for those who are just now dipping their toes in NoSQL-land and wonder how it changes their daily job as “DevOps people”.

Our next meetup will be on 26th May, you can sign up now!

CALL TO ACTION: We’ve put up a speaker registration form so if you’re interested in speaking at one of our future meetups please fill out the form here so you’re on our radar!

Joel Jacobson (Datastax) – Diagnosing Cassandra Problems in Production

This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course in basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster.

Richard Pijnenburg (Elastic) – Managing ElasticSearch Environments

Bring control to your development team by making it easier to maintain templates and scripts across your nodes.

James Tan (MongoDB) – Automate Production-Ready MongoDB Deployments

Getting from dev to a robust, performant, and scalable production environment takes a fair bit of work. Doing this manually is time-consuming and error-prone, so let’s look at the various ways to automate this with Vagrant + Chef, as well as MMS Automation (free up to 8 servers).


Tagged: DevOps, DevOps Exchange, DevOps Exchange London, DOXLON, Meetup, meetup group, nosql, Slides, Video

#DOXLON DevOps Exchange (May 15) – DevOps & DevOps


This month we are going back to the roots and the theme will be ‘Anything DevOps’, a chance for speakers to tackle their preferred topic rather than being constrained by a narrower theme.

Our next meetup will be on 24th May at the Cloud World Forum at Olympia, so come along to the show too to meet cool vendors (including us! :)) and catch DevOps talks during the day. You can sign up now!

CALL TO ACTION: We’ve put up a speaker registration form so if you’re interested in speaking at one of our future meetups please fill out the form here so you’re on our radar!

Michael Ducy (Chef) – The Goat and the Silo

We may know the Goat and Silo problem as a common calculus problem, but Goats (scapegoats) and Silos (organizational silos) also plague IT organizations. How can we turn Goats and Silos into assets that can help in implementing a culture supportive of Cloud, DevOps, and the next generation of IT paradigms? This talk will build on Organizational Management philosophies, as well as the philosophies of Lean and Agile.

Matthew Skelton (Skelton Thatcher) – Long Live the DevOps Team

What team configuration is right for DevOps to work? Devs doing Ops? Ops doing Dev? Everyone doing a bit of everything, or a special new silo doing Docker and Jenkins in the corner of the room? In this talk, Matthew Skelton joins speculation with practical in-the-trenches experience to arrive at some working ‘team topologies’ for effective DevOps.

James Brooks (Betfair) – Show me the Metrics

Time Series metrics can be an important part of a comprehensive monitoring solution. Betfair will present a talk on their experiences running OpenTSDB and a new open source tool called OpenTSP, designed to streamline the process of gathering and delivering system metrics quickly and reliably to multiple endpoints so that you can use any of your favourite tools to analyse the stream.


Automated Status pages with Status.io Plugin


When it comes to service status pages most of us feel they are more of a marketing gimmick than fact. For example, with Amazon Web Services the first time you are aware of a problem is not from the status page; it is when Twitter catches fire with people complaining about the poor service. The trend is alarming, and it is not just Amazon doing it; almost all service providers do the same thing. For some reason special authorisation is required to update the status page. Special people need to confirm that this is the right marketing move for the business. That’s not how we work.

People need to trust a service. People want to feel like they are getting the information as and when it happens, not 30 or 40 minutes later, if at all. That is where status.io comes in for us. We needed a way to communicate to our users how our service was doing, and we can do that through status.io. Using the Dataloop.IO platform, I wrote a statusio plugin that checks HTTP and TCP endpoints and reports back whether they’re working or not, every 30 seconds.

Let’s take a look at what V1 does. (You can find it in our GitHub plugin repo: here)

This is only the first version and it has some flaws. The main two being that it does not time the TCP connections yet, and it does not update the metrics on the status.io page yet. But it does work, just. All the config for this plugin is at the top:

#Config - change these bits, hope for best.
api_id = "your-api-id"
api_key = "your-api-key"
statuspage_id = "your-status-page-id"
checks = {
    'endpoint 1': {'id': "component-id", 'check_type': 'url', 'target': 'https://agent.dataloop.io'},
    'endpoint 2': {'id': "component-id", 'check_type': 'tcp', 'target': 'graphite.dataloop.io:2003'},
}

After that, the application goes off and determines the containers that each component belongs to. It will carry out the relevant HTTP or TCP check for each. If any one of the checks fails it will update that component (and its containers) on status.io, changing the status of the component to say there has been an issue. This in turn means that when you look at the status page (here), what you see is the current health of the platform. The statusio plugin also looks for maintenance mode. If maintenance mode is in effect it will not update the status until the maintenance has completed. With this feature we can add a planned maintenance via the API or the status.io website and the plugin will not override it.
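For context, the url and tcp check types in that config boil down to something like the functions below. This is a simplified sketch rather than the plugin’s actual code.

import socket

import requests


def check_url(target, timeout=10):
    """Return True if an HTTP GET to the target succeeds with a non-error status."""
    try:
        return requests.get(target, timeout=timeout).status_code < 400
    except requests.RequestException:
        return False


def check_tcp(target, timeout=10):
    """Return True if a TCP connection to 'host:port' can be opened."""
    host, _, port = target.partition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False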

One of the reasons we chose status.io to host our status page was that it was quick and easy. We were able to get something like this up and working in hours. Whenever we trigger an incident or the planned maintenance it takes care of notifying our users. The users are happy, and we’re happy, everyone is happy.

Now, to drive these checks we make use of our plugin distribution and scheduling system. After all, they are just scripts that run on an interval every 30 seconds. This does mean that some transient issues are currently lost, but you can then produce something like this:

public status page

There are a few issues in the current version so the plan is to do the following for the next version:

  • Add TCP timer
  • Add Metric update via status.io api
  • Add Incident Raise / Resolve

Once that is there and stable we will trial it on our own status page before updating the plugin in the plugin repo. There would be no point in writing all this and then keeping the value locked away, so share and share alike! All you will need to do is drag the statusio plugin onto a tag that is applied to a single server and fill in the config. Then you too will have a cool status page that will magically update.

Once it’s running you can then create pretty dashboards like this to show the response times of the end points.


Docker Monitoring


Battling mess is an ongoing struggle that has plagued most of my career. Docker presents an opportunity to explosively increase the chance of mess. You can of course reduce mess with a local registry, a proper build process and sane use of Dockerfiles. Unfortunately, if my pre-Docker-era experiences are anything to go by, things will not be done properly.

As a career SysAdmin I have mixed feelings about Docker. Why, I hear you ask? Because everything has a tendency to get into a mess. Usually when starting a new job I’ll spend time orienting myself, asking the basic questions like what servers do we have? what do they do? can we log into them all? what are the differences between environments? can I still manually create this stuff in the case of an emergency, or did we create an overly complicated monster that will one day leave us crying into our hands because we can no longer actually build stuff. Ultimately it ends up automated, but only by starting from a sane beginning.

So with the complaining now out of the way, let’s imagine you have your house 100% in order and have decided to use Docker properly. Awesome! You’re in the 1% – here’s how you could monitor those containers..

1. Run a container to scrape host and container metrics (CAdvisor)

Google provide a container that’s really easy to get running on your Docker hosts. Spin up a CAdvisor container on every Docker host you have and it will happily sit there in the background sucking out every metric from the host and every running container. It also presents a nice little web interface that updates in real time, which can be fun to look at.

Command to start CAdvisor:


sudo docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest

This starts the fancy local web interface on port 8080.

2. Run a container with a monitoring agent inside (dataloop-docker)

In the spirit of one container, one task, it makes a lot of sense to run your monitoring agent in a container too. You want to keep your Docker hosts clean and untainted by 3rd party software after all. You’ll need to link this container to your CAdvisor container. Docker links create network connections between containers and expose the remote endpoint addresses via environment variables, which is handy.

Command to start dataloop-docker and link it to the CAdvisor container:


API_KEY=<insert your key>
sudo docker run \
--volume=/var/run/docker.sock:/var/run/docker.sock \
--detach=true \
--name=dataloop-docker \
--hostname=$(hostname) \
-e API_KEY=$API_KEY \
--link cadvisor:cadvisor \
dataloop/dataloop-docker

The --link part is quite important here, as is setting the correct API key; otherwise the container won’t pop up like magic inside Dataloop.

Here they are running alongside a redis and postgres container.

Screen Shot 2015-06-30 at 22.39.47

 

3. Apply a plugin to the monitoring agent that scrapes CAdvisor over the Docker link

Screen Shot 2015-06-30 at 22.32.15

We provide a CAdvisor plugin that automatically collects every metric from the CAdvisor API. This includes the Docker host metrics as well as every running container. We send these back centrally so they can be aggregated on dashboards and in alerts across multiple Docker hosts. It’s just a standard Nagios check script and is open source if anyone wants to play with it.
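The gist of the plugin is just walking the CAdvisor REST API over that Docker link and flattening whatever comes back into metrics. A hedged sketch of the idea (simplified; not the actual plugin code, and the response handling glosses over CAdvisor’s full schema):

import os

import requests

# Docker links expose the linked container's address via environment variables
CADVISOR = os.environ.get("CADVISOR_PORT_8080_TCP_ADDR", "cadvisor")


def latest_container_stats():
    """Yield (container name, most recent stats sample) for each container CAdvisor reports."""
    resp = requests.get("http://%s:8080/api/v1.3/subcontainers" % CADVISOR, timeout=10)
    resp.raise_for_status()
    for container in resp.json():
        stats = container.get("stats") or []
        if stats:
            yield container.get("name", "unknown"), stats[-1]


if __name__ == "__main__":
    for name, stats in latest_container_stats():
        memory = stats.get("memory", {}).get("usage")
        print("%s memory.usage %s" % (name, memory))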

That’s pretty much it. On every Docker host you run two commands, and in Dataloop you apply a plugin to your hosts. You then get every metric possible. I counted the host, dataloop-docker and cadvisor metrics and got over 400 individual metrics. We may need to tune down the plugin one day but for now it’s quite fun to have so many.

Screen Shot 2015-06-30 at 22.32.58

Now the real question is how do you easily instrument service level metrics from your containers. We’ve tried a few different approaches but that’s a topic that should probably be covered by another blog post.



#DOXLON DevOps Exchange (June 15) – DevOps @ Cloud World Forum


This month we did our 2nd general DevOps meetup at Cloud World Forum in Olympia. Despite being on the other side of London from our usual locations we had a great turnout and some really good speakers, all talking about Cloud and Docker/containers.

DOXLON at the Cloud World Forum

Our next meetup will be on 30th July at ASOS and you can sign up here.

CALL TO ACTION: We’ve put up a speaker registration form so if you’re interested in speaking at one of our future meetups please fill out the form here so you’re on our radar!

Jon Topper (Scale Factory) – Migrating to the Cloud in 20 Minutes

A how-to on migrating to the Cloud, compressed into 20 minutes.

Craig Box (Google) – The road to Kubernetes 1.0

A review of Kubernetes history as the project moves towards a 1.0 release.

Anne Currie (Force12.io) – Game of Hosts: Containers vs VMs

There’s a lot of talk about Docker, Linux containers and sub-second container startup times but are they achievable and what might they mean for ops and dev? We’ve been wrestling with ECS and microscaling and we’re back to tell you all about it.


Tagged: DevOps, DevOps Exchange, DevOps Exchange London, DOXLON, Meetup, meetup group, Slides, Video

Real-Time Monitoring


The words ‘Real-Time’ can be found on the glossy websites of many monitoring products. Rarely do you find any context behind those words. Does it really mean ‘Real-Time’? Or is there some noticeable lag between metrics being collected, sent and displayed? The truth will vary quite a bit between products, and yet the marketing words remain the same, which is very confusing.

The difference between Real-Time that is almost instant versus Real-Time measured in minutes can mean the difference between useful and useless depending on the scenario.

I started off thinking that perhaps this blog should be about benchmarking the top monitoring systems, but then decided that would be inflammatory for a vendor to do directly. There are many factors to consider when looking at a monitoring system and focusing on a single table of timings could come across badly. However, if somebody else wants to do a benchmark I’d happily link to it from this blog.

I’ll try to instead describe the various considerations that need to be made when looking at real-time metrics. Most of these problems are not immediately obvious but soon present themselves when you start to operate at any kind of scale.

Starting Simple (Historic)

The most basic setup is to send metrics into a central database which is then queried by a front end. A good example of this setup would be InfluxDB and Grafana. Once you get to a certain scale you have two problems to solve: write performance and read performance. Unfortunately, if you design for high writes you will often lose proper Real-Time characteristics, which is the case with solutions based on Hadoop like OpenTSDB. Due to the nature of eventual consistency your metrics may take a few minutes to appear.

graphite

Starting Simple (Realtime)

Let’s say you really only care about what’s happening right now. This becomes quite a simple problem to solve. Send the metrics to a central stream processing system and pump them directly to the browser as they arrive. I believe this is what the Riemann dashboard does, and you get really low latency metrics as a result.

riemann

Historic and Realtime

So it’s clear from the above that you need a stream processing system like Riemann or Heka that can split your metrics, sending some to long term storage for historic viewing and also sending metrics in real time to your screen for low latency visualisation. Problem solved! Or not quite.

What happens now is when the graphs render they will pull from your historic metrics database which could be a few minutes behind due to processing (batching, calculations, eventual consistency etc). Then live updates get pumped onto the end of the graph directly from the stream. You now have a graph that displays good data up to a few minutes ago, then a gap of nothing, and then whatever data has been pushed to the browser from the stream since you opened the page.

Solving the Gap

At this point you can probably think up lots of ways around the gap problem. Storing data in the browser to mitigate some of it, perhaps putting in some caches. The reality is that it becomes very complicated. Fortunately, others have solved this problem already and documented a pattern that seems to work pretty well for this type of problem. They call it the Lambda Architecture.

la-overview_small

http://lambda-architecture.net/

We have already written a blog about how we scale our monitoring platform but didn’t go into too much detail about how we present data by seamlessly joining historic and Real-Time with low latency (milliseconds) and no gaps.

Internally we use Riak as our master dataset and Redis in our serving layer. Metrics get split into three streams; storage, alerts, and live updates that get sent directly to open dashboards.

For storage we split that into Short Term Storage (STS) and Long Term Storage (LTS). Metrics are pulled off durable queues and pushed initially into STS (Redis) where after a short duration they are then rolled up into LTS (Riak). When queries are made on dashboards our API pulls from STS, LTS and keeps an SSE connection open for the live updates.
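A very rough sketch of that read path: pull recent points from the Redis hot window, older points from long term storage, and stitch them together for the query. The key layout and the LTS call are invented for illustration.

import json
import time

import redis

STS_WINDOW = 600            # seconds of recent data held in Redis before roll-up into LTS
STS = redis.StrictRedis()   # the short term store (hot window)


def read_series(name, start, end, lts_query):
    """Stitch a series together from long term storage plus the short term store."""
    boundary = time.time() - STS_WINDOW
    points = []
    if start < boundary:
        points.extend(lts_query(name, start, min(end, boundary)))
    if end > boundary:
        raw = STS.zrangebyscore("sts:%s" % name, max(start, boundary), end)
        points.extend(json.loads(p) for p in raw)
    return sorted(points, key=lambda p: p["ts"])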

Although our last blog post talked about our micro-service workers all being written in NodeJS, we have unfortunately had to rewrite a couple of them due to problems with parallelism and fault tolerance. So we have a few Erlang workers now in testing that will eventually replace our V1 Node workers (not all of them, just the ones in the metrics and alerts pipelines). I’ll do a blog in future about why we switched from Node to Erlang for some workers.

So with all of this work, what do we get? Mostly, users have no idea that any of this is happening. They fire their 1 second granularity metrics into Dataloop and they appear in their dashboards instantly; they can change time ranges and it all just works transparently, with no gaps and a perfect representation of the data. Same deal with alerts.

As with everything we have made various trade-off decisions which are mostly around immutability of data and could probably be another blog topic on their own. Currently we are working towards the goal of keeping all data at raw resolution forever. Whether we achieve that I don’t know, but we’ll definitely get close :)


New Feature: Hosted Docker Containers


As of this week all accounts will get provisioned with their own ‘dataloop’ docker container!

Traditionally we have only provided internal monitoring. You can install an agent or configure third party tools to send data into our Graphite port. Metrics collection has always been performed by servers that customers run. Our SaaS platform is really just an open standards compliant backend to the same collection you would typically set up if you had Nagios or Graphite on-premise. We provide a very slick UI that helps non-ops people set up collection, dashboards and alerts, as well as take care of all of the scaling issues that present themselves as SaaS companies start to scale up.

The new docker containers are the first time where we perform the collection on our servers. I guess this now also makes Dataloop the first monitoring focussed PaaS :D

While working with our current customers we have often suggested they setup a central polling server running a Dataloop agent that collects metrics from remote systems where installing an agent isn’t possible. When those metrics come from servers behind the firewall like self hosted Jenkins or Jira this will still be the case. However, there are a whole bunch of scenarios where having an external agent makes a lot of sense.

Uses for the Dataloop container

The docker container does two things:

1. You can use it to run Nagios check scripts on Dataloop infrastructure

2. It runs a StatsD server so you can connect StatsD clients directly to it

Nagios Scripts

We are the only monitoring tool that lets you write custom plugins in the browser, test and then deploy them directly to agents. So providing a safe, containerised docker environment to test out those features instantly was one of our goals. We also wanted to provide a quick way for people to check their internet facing endpoints and scrape their 3rd party services (e.g. Google Analytics, Splunk, New Relic or even Twitter). Stuff like Pingdom is cool for checking latency from around the world but they don’t give you the ability to run proper check scripts.

Here’s an example of a quick twitter script I created running in a new Dataloop docker container.

Screen Shot 2015-07-24 at 13.58.33

For the first iteration of the docker container we’re focusing support on our ‘built-in python’ shell. This means you can write scripts in any language as long as that language is Python :) We will add more languages to the container in future releases.

What does the 'built-in python' shell give you? Quite a lot actually. We bundle an embedded Python 2.7 runtime and lots of libraries for you to use to create your own plugins. So you get the full power of the Python standard library, as well as libraries that let you scrape APIs (requests, Beautiful Soup, xmltodict) or even test websites with robobrowser. There's quite a lot in there and we're happy to add more. To see exactly what is available we list all of the libraries here:

https://support.dataloop.io/hc/en-gb/articles/205618375-Built-In-Python
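
To give a feel for it, here's a rough sketch of the kind of Nagios-format check you could write in that shell using the bundled requests library. The URL and threshold are made up for illustration; a Nagios plugin just prints a status line (optionally with perfdata after a pipe) and exits 0, 1 or 2.

#!/usr/bin/env python
# Hypothetical endpoint and threshold; swap in whatever you actually want to check
import sys
import requests

URL = "https://example.com/health"
WARN_SECONDS = 1.0

try:
    response = requests.get(URL, timeout=10)
except requests.RequestException as error:
    print "CRITICAL - request failed: %s" % error
    sys.exit(2)

elapsed = response.elapsed.total_seconds()
# Nagios convention: human readable message, then optional perfdata after the pipe
output = "response time %.2fs | response_time=%.2fs" % (elapsed, elapsed)

if response.status_code != 200:
    print "CRITICAL - HTTP %d, %s" % (response.status_code, output)
    sys.exit(2)
elif elapsed > WARN_SECONDS:
    print "WARNING - %s" % output
    sys.exit(1)
else:
    print "OK - %s" % output
    sys.exit(0)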

We’ve already written quite a few plugins that you can try:

https://github.com/dataloop/plugins

You could already set all of this up before the docker container existed by using a local agent on your own server, but now you can get from signup to sending in data from all over the place in a few minutes.

Sign up, have a play around, and if you end up creating an awesome script and dashboard please feel free to submit a pull request back to our Github plugin and dashboard repos so other people can try them.

Here's a quick dashboard I made for our Twitter account the other day.

[Screenshot: Twitter metrics dashboard]

We need a bit more social activity to make the graphs look cooler. The fact that we now have Twitter data in the app, sitting alongside dev and ops metrics as well as data from a range of other places, means you can create dashboards and alerts that are useful for a wide variety of teams. You might even want everything together on one summary dashboard.

If you are interested in what sort of dashboards you should be creating we did a blog that touches on that subject here:

http://blog.dataloop.io/2015/03/18/using-monitoring-dashboards-to-change-behaviour/

I'm also talking at Operability.IO in September about this, hopefully with a few more examples of how SaaS companies are using data to improve their businesses.

Hosted StatsD

Developers love StatsD metrics. They are probably the single greatest tool a developer could ever use for monitoring their app: extremely quick to set up and immensely powerful. You can get production to emit non-blocking UDP packets containing metrics about any piece of live data you wish, from timing functions and classes to tracking shopping basket contents, or even how many times a feature was used.

Until now customers were expected to set up their own StatsD servers. This is still a supported option; however, you can now get from signup to accepting StatsD metrics in minutes without setting up any additional servers. For developers with a strict time budget hopefully this will help out.

Some details on how you get up and running with that are here:

https://support.dataloop.io/hc/en-gb/articles/204924459-Hosted-StatsD
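
If you just want to see data flowing before wiring up a proper client library, StatsD is simple enough to drive with a raw UDP socket. Here's a minimal Python sketch; the hostname is a placeholder for the per-account address shown in your settings.

import socket

# Placeholder address; use the hosted StatsD endpoint from your own account
STATSD_HOST = "your-fingerprint.statsd.dataloop.io"
STATSD_PORT = 8125

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send(packet):
    # Fire-and-forget UDP, so a dropped packet never blocks the application
    sock.sendto(packet, (STATSD_HOST, STATSD_PORT))

send("checkout.completed:1|c")     # counter
send("basket.items:3|g")           # gauge
send("api.login.duration:52|ms")   # timer in milliseconds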

I've already had a bit of fun profiling our Jenkins JVM. With the addition of a javaagent option to the Jenkins startup command in the init script and a service restart, I now get all of the memory and CPU profiling directly into my docker StatsD port.

[Screenshot: Jenkins JVM memory and CPU profiling dashboard]

Not bad for less than 1 minute of time invested to setup.

Request for feedback

The docker container is a new feature and as such we’re always very interested in feedback. Is it a crazy idea? Could it be made even better? Ping us an email with any suggestions you might have.


Monitoring Java apps with Nagios, Graphite and StatsD


I've always found it strange that Java is the most popular programming language on the planet, yet actually getting stats out of it is extremely frustrating. I'll go through the ways I've tried, from worst to best.

Nagios Scripts

There are many check scripts available on the internet and sadly they all work the same way. To collect the metrics they need to talk to the JMX interface of the application server. That in itself isn't too bad, although it usually requires a bit of fiddling to enable JMX. The bad bit is that to talk to the JMX interface you need more Java. For check scripts that poll every 30 seconds this means spinning up an entirely new Java process each time you want to grab some data.

The main plugin I've used is check_jmx. And, although it works, it's probably one of the less friendly ways to get metrics. You need to manually specify each individual metric you want to collect and run multiple commands. So that's multiple:

check script -> java -jar jmxquery.jar -> jmx

There are other scripts around that use friendlier Java clients like cmdline-jmxclient. This makes check creation slightly easier, but it doesn't solve the underlying problem: repeatedly starting Java, which was designed for long-running processes, for a few seconds at a time, with all of the startup overhead that entails.

Unless you have massive servers, very few metrics to collect, and enjoy browsing jconsole in one window while hand-crafting check script arguments in another, I'd probably steer clear of any of these scripts.

Update: festive_mongoose on Reddit suggested enabling SNMP, something that I’ve never seen done or tried. But it seems like a reasonably sane approach. Here’s a blog describing how:

https://www.badllama.com/content/monitor-java-snmp

I’m not a massive fan of SNMP after writing a bunch of scripts to monitor Cisco devices. So my recommendation is still to go with the easier options in the next sections. Although that is just a personal peeve so you may want to give it a try.

JMX-HTTP Bridge

This is probably the entry point for JVM monitoring. Wouldn't it be cool if your JVM started up and presented a REST interface that was easy to browse and query, like a lot of more modern software does for its metrics? Well, it doesn't by default, but you can make it.

The two things I’ve tried are Jolokia and mx4j. Both worked well, but I prefer Jolokia so we’ll discuss that one.

To get Jolokia starting along with your JVM just pass in a link to the Jolokia Jar as part of your JAVA_OPTS.

-javaagent:/path/to/jolokia-jvm-<version>-agent.jar

That's pretty much it. You now have a REST interface to JMX that you can query on /jolokia. At this point you can write simple bash scripts that curl for metrics, or if you want to do something a bit more complicated I tend to write Python scripts using the awesome Requests library.

As far as I can tell, hitting the REST interface has a negligible effect on performance. Certainly, if you are just pulling out the common stuff like memory (heap, non-heap and garbage collection) then it's a good way to do it.
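
For example, a quick Python sketch that pulls heap usage out of Jolokia with Requests might look like this, assuming the JVM agent is listening on its default port (8778) on localhost:

import requests

# Jolokia exposes JMX attributes as JSON under /jolokia/read/<mbean>/<attribute>
url = "http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage"
heap = requests.get(url, timeout=5).json()["value"]

used_mb = heap["used"] / (1024.0 * 1024.0)
max_mb = heap["max"] / (1024.0 * 1024.0)
print "heap used: %.1fMB of %.1fMB" % (used_mb, max_mb)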

JMX to Graphite

Polling for metrics via a REST endpoint is good if you're just checking you haven't blown your memory or aren't constantly garbage collecting every 30 seconds. But what if you want to stream real-time metrics out of the JVM into graphs? Other software exists that makes this possible by holding open a long-running connection to the JMX port and using it to make more rapid requests for metrics.

The main piece of software I’ve used for this is JMXTrans. On Ubuntu it’s as simple as installing the package and putting some JSON config files into a directory. Some more detailed setup instructions can be found here:

https://github.com/dataloop/java-metrics

Although it's easy to set up JMXTrans, I think the JSON config files are pretty horrible and quite hard to work with. Also, guiding people through how to set it up is a bit of a pain, as invariably they need to spend a bit of time getting it working on a single box and then writing config management to roll it out everywhere. It isn't exactly plug and play and is prone to errors that are hard to debug.

I believe you can also use CollectD with the Java plugin to construct scripts that poll the JMX API more frequently without incurring the Java startup penalties. However, you have to stop somewhere and I've not had time to try it yet.

JMX Profiling

More recently I started playing around with Java monitoring again and found the awesome StatsD JVM Profiler from Etsy:

https://github.com/etsy/statsd-jvm-profiler

Again, it's another case of adding a javaagent option when you start the JVM. But that's it: there are a few options you can pass in, but no horrible config files!

For my first set of testing I decided to victimise our Jenkins server. It's probably the wrong way to do it, but I just hacked the following into /etc/init.d/jenkins

STATSD_ARGS="-javaagent:/usr/lib/statsd-jvm-profiler-0.8.2.jar=\
             server=fingerprint.statsd.dataloop.io,port=8125"

Then modified the line in the do_start() init function to load it.

$SU -l $JENKINS_USER --shell=/bin/bash -c \
    "$DAEMON $DAEMON_ARGS -- $JAVA $JAVA_ARGS $STATSD_ARGS -jar $JENKINS_WAR $JENKINS_ARGS" \
    || return 2

The Jenkins guys are probably turning in their graves at this point. But it was a quick test that I thought I could revert very quickly if the metrics were rubbish.

After spending those 2 minutes wantonly breaking config management in the quest for monitoring satisfaction I got a massive surprise. It just worked, and not only that, the metrics sent back are awesome.

Here's a quick dashboard I knocked up, with a list of some of the metrics available in the right-hand metrics sidebar. You will probably need to click the image to expand it to see those.

[Screenshot: Jenkins JVM profiler dashboard with the metrics sidebar]

With this alongside Jolokia, a few dashboards and some alert rules you’d have quite good monitoring coverage.

I haven’t done much with the CPU metrics simply because we don’t have a cool enough widget to display them yet. The blog post that piqued my interest about this project shows a flame chart being used for those. We could probably add a widget for that at some point.

That blog also talks about scaling problems so you may want to be careful with how many metrics you send back, unless you’re using Dataloop, in which case we don’t mind :)

Summary

I would personally avoid the Nagios check scripts that connect to JMX via a jar file.

You should set up Jolokia regardless, as this gives you a very simple way to collect any metric by polling the REST interface. Create or use Nagios check scripts that hit the /jolokia endpoint.

If you want real-time generic metrics then the StatsD JVM Profiler is the way to go. For a line or two of config you get some amazing results.

If you really do need real-time custom metrics from your JVM then I believe JMXTrans is a workable piece of software; it's just really horrible to set up the JSON files. Luckily, with the two options above you probably don't need this.

For applications that you are building yourself in Java you should definitely setup a StatsD client in your code. You can then send custom metrics from your application as it runs in production. You may also want to investigate DropWizard.

The final piece is probably some kind of APM tool. Both New Relic and App Dynamics will do a good job of monitoring the performance of your application.

Some benefits to using Dataloop (skip this section if you aren’t using Dataloop)

We recommend that you install Jolokia on every JVM and tag the Dataloop agent on those servers with something like Jolokia or Java, or whatever tag you can use to differentiate those servers from the others.

Once you have them tagged you can immediately start to construct scripts in the browser to collect metrics in Nagios format. These will be polled every 30 seconds and will be immediately available to use in Dashboards and Alerts, with the benefit that you can create different views on the same data by combining tags for your environments and services.

The other major benefit is you can go from deciding what additional metric to grab from Jolokia to being able to graph and alert on it in seconds. Once you have Jolokia and the Dataloop agent on a server you never need to do anything other than create and edit scripts in the browser. You have a platform where everything is available and a central place where you can edit what to collect in real-time.

To get the Etsy JVM Profiler working you don't need to set up anything other than the -javaagent option on the server you wish to profile. Simply send the StatsD traffic into the Docker container in your account.

-javaagent:/usr/lib/statsd-jvm-profiler-0.8.2.jar=\
           server=fingerprint.statsd.dataloop.io,port=8125

In the above case I downloaded the jar file into /usr/lib/ and then set the fingerprint to my 'dataloop' container address. Some more details can be found here:

https://support.dataloop.io/hc/en-gb/articles/204924459-Hosted-StatsD

So to confirm, the only real effort you guys need to go to is putting a couple of jar files into directories and adding a couple of -javaagent options to your JVMs: one so we can poll from Nagios scripts in Dataloop, and one so the JVM emits StatsD metrics directly to us. This should all only take a few minutes to set up and we're available on Slack to help as always.


Super duper startup deal!


At Dataloop we love startups. Being one ourselves we know intimately the financial, technical and time pressures involved with building an online service from scratch.

As of today we are offering unlimited usage of our service for $199 per month to startups. Install as many agents as you want. Send as many metrics in as you desire. Don’t worry about the cost until you launch!


We know the bar keeps rising every year from a technical perspective. While you're out pitching, you don't want to be worrying about your service being down.

If you are interested in this offer then sign up to Dataloop and email info@dataloop.io. Send us a quick description of what you’re building and we’ll flip you over onto the super duper startup deal.

Benefits

Launch with a stable and fast service


You’ve spent too long building something awesome for outages and performance issues to steal the show after launch. Having a rock solid MVP is crucial if you want to start getting relevant, feature focused customer feedback. Use Dataloop to get complete visibility into your stack so you can solve all of those issues before you go live.

Quick Setup


Time is at a premium in startups and we believe we can save you weeks of setup time with no vendor lock-in. You get all of the benefits of having something like Nagios, Graphite and StatsD immediately. We have configuration management repos for Chef, Puppet, Ansible and Salt that make agent rollout painless, plus plugins and dashboards for most common open source software. Have a problem? Talk to us on Slack for instant help.

Server Monitoring


Want to create a dashboard or get alerted if any of your stack breaks? Pinpoint immediately if it’s a problem with your code or something environmental. Things change rapidly in a startup so having monitoring watching your back and catching breakages as they happen means less time spent troubleshooting.

Service Monitoring


Use Elasticsearch? Cool, we have a plugin for that, and a dashboard. We have almost 100 plugins for common open source services, all in Nagios format in our public Github repos, maintained by a team of DevOps people who can be reached in our public Slack channel if you need help with changes.

Code Metrics


Get visibility into what your services are doing. We provide a hosted StatsD server per account that you can use immediately to start instrumenting your code. Track the performance of your code. Viewing these metrics alongside your server metrics as they update in real-time on our dashboards is invaluable when performance testing.

Business Metrics

[Screenshot: Twitter business metrics dashboard]

We have plugins for Google Analytics, Twitter and a range of other services. Wouldn't it be cool to have a dashboard set up for launch where you can track signups and other interesting numbers in one place alongside your infrastructure and code metrics?

Get started today by signing up and dropping us an email.

PS. If you aren’t a startup we love you too. We offer some great volume discounts for over 200 agents!


#DOXLON DevOps Exchange (July 15) – DevOps @ ASOS


This month we held our 3rd general DevOps meetup at ASOS's great offices in London. We had three more great speakers and a strong attendee turnout, and it was good to see lots of regulars, as well as many new faces, for beers and pizza afterwards.


Our next meetup will be on 27th August and you can sign up here.

CALL TO ACTION: We’ve put up a speaker registration form so if you’re interested in speaking at one of our future meetups please fill out the form here so you’re on our radar!

Charlie de Courcy (Rackspace) – Deploying applications through ChatOps

A talk on how Rackspace is trying to simplify DevOps for their customers: deploying infrastructure through code via Slack, how ChatOps has benefited the customer experience, using New Relic to drive autoscaling, and blue/green deployments.

Vic van Gool (Cloud 66) – Building a multi-cloud high availability web application with Docker

In this talk Vic shows what it takes to build a high availability, disaster proof web service hosted on multiple cloud providers and data centers.

Sean Reilly (Equal Experts) – The most important users – agile development with operations in mind

Operations staff are arguably the most important users of your product. They’re the ones who turn it on.

Agile development has taken over the modern software development world, but it's important to properly handle the way it interfaces with production operations. In this talk we take a look at one effective way of managing the interaction of operational concerns with development, and as an example show how a traditional feature (logging) gets designed and built completely differently when you look at it from this perspective.


Tagged: DevOps, DevOps Exchange, DevOps Exchange London, DOXLON, Meetup, Slides, Video

Dashboard Examples: Background


Everyone loves example dashboards! So we’ve decided to do a series to highlight a few of the coolest ones. If you’re easily bored skip to the bottom set of links to get your dopamine fix of screenshots. For those with a longer attention span hopefully this post helps explain some context.

Firstly, some background, for those who know nothing about Dataloop. You can think of our dashboards as a mix of Grafana and Dashing (if you come from the open source world). They contain analytics and business dashboard features all wrapped up in a UI that’s designed to promote adoption outside of operations. We work primarily with companies who are launching an online service, most of which are on public cloud infrastructure and who encourage a DevOps culture.

Our focus is on custom monitoring which I often compare to how Lego is promoted. We provide some guidance and templates to automatically monitor common services (like Lego box sets, or in our case ‘packs’) but our real focus is on helping people to create their own unique stuff. Dataloop makes it incredibly simple to get all of your custom metrics into a central place and then encourages your team mates to express their creativity in the form of point and click dashboards that can be shared around the office. We also do some complex stream processing stuff for alerting but that isn’t nearly as fun to blog about.

We often get asked to provide some inspiration for what to show on dashboards. Previously, we have written about using dashboards to change behaviour but didn’t provide many concrete examples. This series of blogs will hopefully provide some specifics that everyone can draw from.

So what happens when you provide a tool like Dataloop to teams outside Ops? We’ve been working extremely closely with a number of online services to find out. We sit every day in our public Slack chat room talking about exactly this type of stuff and sharing ideas, along with Plugins and Dashboard YAML files in our public Github repos. The results have been generally quite surprising.

The examples are grouped by genre as shown by the post titles. Everything loosely fits under these so far, but over time we may think up a better way to group them.

Posts:

Dashboard Examples: Ops Dashboards

Dashboard Examples: Dev Dashboards

Dashboard Examples: Business Dashboards

Dashboard Examples: Status Dashboards

Dashboard Examples: Capacity Dashboards

Dashboard Examples: DevOps Dashboards

Caveat: We have obtained permission to show some customer dashboard screenshots as they are far more interesting than mockups. However, parts of those dashboards may be obscured for privacy reasons.

If you have any questions please reply to the blog or join us in our Slack chat room. Or simply sign up and join us. We’d be happy to help you dream up and create your own set of shiny screens to impress your friends.



Dashboard Examples: Ops Dashboards


Backups

<insert dashboard screenshots here>

This was an unexpected example but quite a few customers have created dashboards to monitor their backups. Back in the old days I remember logging into the Backup Exec remote console to see what was going on. Nowadays with the move to cloud everyone has little scripts here and there to push stuff to S3 or similar. There isn’t that central console any more and yet people still want to glance at something that immediately lets them know all backups are happening and the numbers look right.

Databases

<insert dashboard screenshots here>

The trend towards polyglot databases is real and happening at an alarming rate. Nowadays people have MySQL, Elasticsearch, Redis, Riak, Mongo... the list is endless. There are obviously a lot of benefits to using the right tool for the job when it comes to managing different types of data. However, for the Ops and DBA teams this means a lot of work. Most of the dashboards created for the databases are designed to minimise the time spent troubleshooting issues, and in a lot of cases to rule out the database as the problem so that effort can be focussed on getting the developers to fix the app instead.

Big Data

<insert dashboard screenshots here>

Hadoop and all of its component services are an absolute monster to operate. I’d put this on par with OpenStack in terms of complexity to run. I don’t have any examples of OpenStack dashboards but I know some people are monitoring it using Dataloop. Again these dashboards are typically used for troubleshooting issues and they only tend to get opened when something goes wrong.

Queues, Caches, Load balancers, Web Servers

<insert dashboard screenshots here>

We're bundling all of these together as they are all related to passing data around between services. Tracking Nginx and Apache status codes is a favourite, as is alerting off spikes of 5xx codes. What we see in general with these is that the dashboards provide context to the alerts: usually an alert gets triggered and you want a dashboard to look at the trend and see the big picture.

Adhoc Dashboards

<insert dashboard screenshots here>

Another unexpected use, like the backups example, was seeing dashboards for 'weekend hard drive migration' and similar. I guess it makes sense to create a quick dashboard showing data transfer progress when swapping some disks out. These dashboards tend to be reasonably short lived so I don't have too many screenshots. But it just goes to show that people are quite visual and have started to think about using dashboards in a fairly disposable way as part of tasks and mini projects.

Summary

In summary, the Ops guys are generally creating dashboards to help troubleshoot issues when something has failed. They are also creating dashboards for views into things they may want to check quickly, whether that's an ongoing task like backups or a short lived task like swapping disks.

Posts:

Dashboard Examples: Background

Dashboard Examples: Dev Dashboards

Dashboard Examples: Business Dashboards

Dashboard Examples: Status Dashboards

Dashboard Examples: Capacity Dashboards

Dashboard Examples: DevOps Dashboards


Dashboard Examples: Dev Dashboards


Code Level Metrics

[Screenshot: Erlang code-level metrics dashboard]

Developers are primarily interested in how their code is performing, whether that's on their laptop, in production, or anywhere in-between. Seeing that alongside system resource metrics and business metrics is quite powerful.

Dataloop supports StatsD metrics, which means developers can add an open source library and then instrument their code by adding lines in a similar way to how they would add logging. This is extremely lightweight, with the metric data being sent to Dataloop over UDP, so there's no noticeable performance penalty.

The most common metrics to send back are usually counters and gauges. These can be used to track things like the performance of an API, the throughput of a service, error rates and even how often a feature has been used (product managers find it hard to argue with graphs). The benefit is that with more data a developer should make better decisions. The cool thing about this compared to APM tools like New Relic and AppDynamics is that you get to specify exactly what metrics to watch. There aren’t really any limits – if you can do it in code then it can be tracked and graphed.
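
As a rough sketch of what that instrumentation looks like in practice, here's the idea using the open source Python statsd client. The metric names, the StatsD address and the functions being instrumented are all illustrative.

from statsd import StatsClient

# Point this at your StatsD server (hosted or self-run)
statsd = StatsClient("localhost", 8125, prefix="myapp")

@statsd.timer("api.search.duration")
def search(query):
    # every call is timed and shipped as a StatsD timer
    return [item for item in ["red shoes", "blue shoes"] if query in item]

def add_to_basket(basket, item):
    basket.append(item)
    statsd.incr("features.add_to_basket")    # counter: how often the feature is used
    statsd.gauge("basket.size", len(basket)) # gauge: the current basket size

search("shoes")
add_to_basket([], "red shoes")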

[Screenshots: API Performance, Worker Capacity and Worker Throughput dashboards]

Unlike the Ops dashboards that only get looked at when something is broken, these dashboards are typically used to confirm things are working as expected. Having sat with our own developers I know how powerful this stuff is for optimising code and finding issues. The only downside is that these developers create some messy dashboards. They throw on StatsD metrics in a haphazard manner and I'm constantly having to go in and clear up :)

Micro Services

[Screenshot: microservices dashboard]

Looking at the performance of individual pieces of code only gets you so far in complex environments. You might have dozens of services that all communicate as part of a complicated distributed system. It’s pretty hard to work in this kind of environment without good monitoring.

Infrastructure Metrics

[Screenshot: RabbitMQ infrastructure dashboard]

Developers increasingly rely on an assortment of services as we saw in the previous blog about Ops Dashboards. In some cases the dashboards can be shared as both Dev and Ops might be concerned about the same metrics. Often developers might go off and create their own dashboards for the infrastructure services that focus more on performance and scale.

Posts:

Dashboard Examples: Background

Dashboard Examples: Ops Dashboards

Dashboard Examples: Business Dashboards

Dashboard Examples: Status Dashboards

Dashboard Examples: Capacity Dashboards

Dashboard Examples: DevOps Dashboards


Dashboard Examples: DevOps Dashboards


Prioritise service over features


There is a constant tug of war going on in most SaaS companies. With finite resources you have to make tough decisions on whether to push forward with the feature roadmap or slow down and reduce technical debt that may be contributing to slowness or instability of your app. New features can mean more customers while bad service means losing customers.

Product managers have a natural tendency to prioritise shiny new features over work that needs to be done to help run a successful service. This problem is magnified a thousand times if you work in a company transitioning from product-centric development to service-centric development. Even when everyone is working in the same team, unless you have data to back up your requests, good luck getting priority for fixing the non-visible stuff.

One way to level the playing field is with data. We've seen several examples of operations teams making performance metrics easily visible to other teams, and especially to senior management. It's surprising how quickly login times or search result performance get fixed when you start sending out graphs showing how bad things are.

Track Deployments

[Screenshot: deployment tracking dashboard showing package versions across three environments]

Ever wondered what build number was in each environment? Moving to microservices makes that question even harder.

In our example above we have 3 environments. The first column is our internal monitoring server, which we affectionately call Nagios even though it runs a slightly older copy of Dataloop. Then we have Staging, and finally Production. We have simple Nagios scripts that run dpkg -s <package name> and output the build number on each server.

Plotting the minimum version across a group of hosts means we catch problems with servers that don’t update. Getting this level of visibility into which package version is installed where is usually a 5 minute job. We reference this dashboard multiple times a day when pressing the deploy buttons in Jenkins just to make sure everything is as we expect.

We also alert off drift between versions in each environment. If staging gets out of step with production by more than 10 builds we trigger an alert that there needs to be a release. The greater the difference between environments, the greater the risk of deployment due to the size of the change. We like to keep our releases small and frequent so that if there is a problem it's easier to diagnose and fix.
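
The checks themselves don't need to be anything clever. Here's a sketch of the idea: the post mentions dpkg -s, and dpkg-query with a format string is a close equivalent; the package name and version scheme below are hypothetical.

#!/usr/bin/env python
import subprocess
import sys

PACKAGE = "my-service"  # hypothetical package name

try:
    version = subprocess.check_output(
        ["dpkg-query", "-W", "-f=${Version}", PACKAGE]).strip()
except subprocess.CalledProcessError:
    print "CRITICAL - %s is not installed" % PACKAGE
    sys.exit(2)

# Assumes versions like 1.0.123 where the last component is the build number
build = version.split(".")[-1]
print "OK - %s version %s | build=%s" % (PACKAGE, version, build)
sys.exit(0)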

Stop the floor on build errors

[Screenshot: build radiator dashboard showing build status and build times]

What's the point of continuing to develop software if you can't release it? As with most modern companies we release updates to production multiple times per day. Our ability to release is paramount, so we keep our build radiator dashboard green at all times. Our example dashboard above also graphs build times, so you can detect if a developer has accidentally imported half of the packages on Github into the project.

Continuous Testing

Different parts of an online service get built and released at different rates and with different levels of risk. When dealing with an important data store you may want to do more upfront testing before changes hit production. Other parts of the service may be able to move at a much faster rate, and in those cases you may want to focus on agility, with a quick build and deploy time, and push more of the testing into monitoring in production. This is when you may want to create dashboards that show the current state of your smoke tests and alert on issues quickly so you can roll forwards with fixes.

Summary

Many of our dashboards are being used to bring development and operations together, whether that's to align on features versus operability improvements, or simply to save time on errors that happen when you aren't sure which version of software is deployed where. Overall, the point of these dashboards is to improve the service as a whole: not just from an infrastructure or code perspective as shown in our previous two blog posts, but at the level of what's truly important, which is what customers see when they use your product.

A small efficient team working on the right things can run rings around a giant team that isn’t focussed on what’s important.

Posts:

Dashboard Examples: Background

Dashboard Examples: Ops Dashboards

Dashboard Examples: Dev Dashboards

Dashboard Examples: Business Dashboards

Dashboard Examples: Status Dashboards

Dashboard Examples: Capacity Dashboards


#DOXLON DevOps Exchange (August 15) – DevOps Exchange August


Last month we held our 4th general DevOps meetup at DigitasLBi's great offices in London. We had three more great speakers and a strong attendee turnout, and it was good to see lots of regulars, as well as many new faces, for beers and pizza afterwards.


Our next meetup will be on 7th October at Europe's largest Sysadmin/DevOps trade show, IPExpo. You don't need to attend the show during the day if you can't make it, but we highly recommend coming early, not only to check out our very own Dataloop.IO stand, but also to enjoy the free Oktoberfest beers and pretzels beforehand! If you want to come to the show then sign up now, and to register for the meetup please sign up here.

CALL TO ACTION: We’ve put up a speaker registration form so if you’re interested in speaking at one of our future meetups please fill out the form here so you’re on our radar!

Vik Bhatti (Beamly) – Service Discovery for DevOps

What is service discovery? In the modern world of containers and highly distributed systems, service discovery is mentioned time and time again, but details on how and why it should be implemented are pretty thin. In this talk I will walk through the trials and tribulations of how Beamly has approached the problem, and how the solutions have changed over the past 3 years as the architecture has grown.

Rich Harvey (Ngineered) – A look at GKE and Kubernetes

A practical look at Google’s new container platform GKE and its integration with Kubernetes. Covering the basics of how to get started with a live demo launching Nginx containers and connecting them to load balancers, using kubectl.

Chris Jackson (Pearson) – DevOps Ground Zero

In the beginning there was nothing... No, really, nothing. This talk is a snapshot of the first six months of working in a 150-year-old enterprise and my drive to establish new working practices, seed the thought processes, organisational structure and beliefs that support DevOps, and get senior leadership to buy into an approach that gives you the opportunity to prove you are right and break traditional decision-making cycles. This talk will be an honest, sometimes humorous, sometimes terrifying account of what it is like to build DevOps from zero, highlighting good suggestions and patterns for people in similar situations, but also the huge achievement associated with progress on this kind of work.


Tagged: DevOps Exchange, DevOps Exchange London, DOXLON, Meetup, Slides, videos

Enhanced Docker Monitoring


Since our last blog about Docker we've been working on making setup even simpler. Here's a quick step-by-step guide in our new UI, which we'll be releasing in a couple of weeks (until then everything works the same, but things may look cosmetically different).

1. Sign up at www.dataloop.io for a 2 week trial

2. Click Install Agent at the bottom of the first page

[Screenshot: the Install Agent button on the intro page]

3. Click the Docker link for the command to copy and paste onto your Docker host

[Screenshot: the Docker install command]

That’s it! Magic should now happen..

By default our Dataloop/Dataloop-Docker container will stay running on each of your Docker hosts and keep your containers in sync automatically. As you spin up and down new containers they will appear and disappear as if by magic. We’ve even given them little whale icons.

[Screenshot: agent list showing Docker containers with whale icons]

Clicking on one of the containers shows a bunch of details about when it last connected and some basic performance data.

[Screenshot: container summary view]

If you click the details link at the top you'll also get a bunch of info about what's running in your container as well as some network details. One other cool thing we do is tag your containers with a bunch of stuff like their container name, image name and even some environment variables.

[Screenshot: container details view with tags]

Clicking the analytics tab will give you some historic graphs showing you what this particular container has been up to. A good tip for this page is you can click on the heading numbers to overlay avg, min, max etc.

[Screenshot: container analytics graphs]

And given that we tag everything automatically you can of course browse aggregated data about your containers by looking in the tags. This one shows all the containers running our agent_statsite:latest Docker image – about 149 in total currently.

[Screenshot: tag view summarising containers]

We even sum up the count of your running processes within a tag and let you view graphs across a bunch of containers with a few clicks.

[Screenshot: tag details with process counts]

Because everything is tagged you can treat your containers like you would normal agents in Dataloop for pretty dashboards. Here we have an example showing our 149 StatSite containers and a Graylog container with some overall host metrics. A common scenario is Devs wanting to create dashboards showing resource utilisation of their containers alongside code level metrics coming from StatsD.

[Screenshot: dashboard combining container and host metrics]

Finally you can also setup alerts on the container metrics just like anything else in Dataloop.

[Screenshot: alert setup on container metrics]

We hope you enjoy our next level of Docker monitoring. In theory you should be able to get all of the containers on a host monitored in seconds. Combine this with our hosted StatsD and you could be doing some advanced stuff in a few hours.

We're working behind the scenes on some cool things around automatically monitoring popular off-the-shelf open source software, like MySQL and Elasticsearch, via inter-container auto-discovery. We'd love to get your feedback on the current stuff shown in this blog and any other ideas you might have for the future.

Feel free to contact us on info@dataloop.io with any questions. Happy Dockering!

