DevOps & Ops for Devs – Top Takeaways

Sylvana Lewin | Monday, October 23rd, 2017

Last Thursday, MEST Incubator Fellow Anton Hägerstrand held a workshop on DevOps and Ops for Developers. Anton comes to MEST from Meltwater where he works as a software developer. Read on for his top takeaways!

DevOps is about people, not tools.

DevOps is about culture. The idea is that the people and teams who build products should own these products in production environments. This includes doing deployments, handling incidents, and performance profiling.

The idea is that DevOps will bring learnings from the way software behaves in a production setting into the team building the software. This means that the pain of running the software more directly affects how the software is built. This should force the builders of the software to make it more ready for a production setting, leading to fewer incidents.

DevOps emerged from a culture where developers released software into the hands of operations/system administrators. Often tensions arose when software did not behave well in production. The developers did not have a deep understanding of production requirements, while sysadmins did not have enough understanding, or permission, to change the software so it was more production-ready.

It is important to note that DevOps is only one answer to this problem. Some larger companies, e.g. Google, do not do all-out DevOps. Instead they rely on more traditional setups, and avoid the blame game between developers and operations through culture and communication.

More reading: https://martinfowler.com/bliki/DevOpsCulture.html

Deployment options

When deploying applications, there is a large set of options. The first option to consider is where you want to deploy your application. The (reverse) hierarchy goes something like this:

  • Heroku/Google App Engine etc. – These are opinionated (in terms of e.g. which programming language and framework to use) platform-as-a-service (PaaS) providers. They are a good fit for the majority of web applications, but offer little flexibility. Costs are generally low to start with, but become expensive if/when you need to grow.
  • Amazon Web Services/Google Cloud/Azure etc. – These are also PaaS providers, but with a richer feature set. This gets you more flexibility, but also makes them harder to use. The risk of messing up is higher, but the cost is generally lower than for the previous option.
  • DigitalOcean/Linode etc. – These services provide you with access to (virtual) machines and basic networking, but without the extra functionality that a PaaS would offer. If you want to run a database, you need to manage it (and its backups) yourself. They are often (much) cheaper than the previous options, but require more operational knowledge on your part.
  • Colocated Data center (COLO) – You rent space in a datacenter and buy (or lease) the machines you place there. This puts some of the burden of managing machines on your company, but can be cheaper if you invest in people to manage the infrastructure. It gives you less flexibility than cloud solutions when it comes to growing your infrastructure.
  • Build your own datacenter – This is only cost-effective for very large companies, e.g. Google, Amazon and Facebook.

Generally, you start near the top of this list and move downwards as your startup grows. The more you spend on infrastructure, the more attractive moving downward should become.

Where you deploy your stack influences how you package it. For Heroku, git is used. For other options, Linux containers (e.g. Docker) are becoming more popular, replacing .rpm/.deb deployments. Many PaaS providers have supporting infrastructure for orchestrating deployments of container-based images. If you use non-PaaS offerings, you often have to maintain this infrastructure yourself. Example open-source solutions are Mesos+Marathon and Kubernetes.

In larger deployments, you might also want to manage configuration centrally. Options for doing this include Puppet, Chef and Ansible.

Logging

It seems, to me, that logging is undervalued in some application stacks. Logs are very valuable when diagnosing problems, but can also be overwhelming.

I advise everyone to invest in using a proper logging framework. There are good built-in options for both Ruby and Python. For JVM-based applications, I recommend slf4j. For Node.js, all options seem inadequate, mostly because there is no real standard. A logging framework allows you to enable different logs in different environments. This lets you write log statements all over the place, and disable them in production if they are not needed.
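As a rough illustration of enabling different logs in different environments, here is a minimal sketch using Python's built-in logging module. The logger name "myapp", the APP_ENV variable and the policy of "warnings only in production" are assumptions made for this example, not something prescribed in the workshop.

```python
import logging
import os


def configure_logging(environment: str) -> logging.Logger:
    """Return a logger whose verbosity depends on the environment name."""
    # Assumed policy: everything in development, warnings and above in production.
    level = logging.WARNING if environment == "production" else logging.DEBUG
    logger = logging.getLogger("myapp")  # "myapp" is a placeholder name
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


# APP_ENV is a hypothetical environment variable used only in this sketch.
logger = configure_logging(os.environ.get("APP_ENV", "development"))
logger.debug("cache miss for key %s", "user:42")    # hidden in production
logger.warning("payment provider responded slowly")  # shown everywhere
```

The debug statement stays in the code, but only produces output where the configured level allows it.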

Starting out, SSHing into machines and looking at logs is sufficient. When you get a larger fleet of machines, you should consider log aggregation solutions. These include the Elastic Stack (formerly ELK), Graylog, Riemann and Splunk. All but Splunk are open source. Generally, you have the option of hosting them yourself or paying for a hosted solution. Evaluate whether these costs are worth it for you. If you choose to use a service like this, consider logging as JSON instead of just a message. This will make handling the logs easier.
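To make "logging as JSON" concrete, here is a small sketch built on Python's standard logging module. The JsonFormatter class, the "myapp.json" logger name and the convention of passing structured data through extra={"fields": ...} are all assumptions for this example; aggregation tools each have their own preferred field names.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: structured data arrives via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
json_logger = logging.getLogger("myapp.json")  # placeholder logger name
json_logger.addHandler(handler)
json_logger.setLevel(logging.INFO)
json_logger.propagate = False  # keep the example output to a single line

json_logger.info(
    "request finished",
    extra={"fields": {"path": "/login", "duration_ms": 87}},
)
```

Because every field is a separate key, the aggregation system can filter on e.g. duration_ms without parsing free-form message text.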

Metrics

Metrics are measurements of some value at a certain time. You can measure how your system performs using metrics such as:

  • requests_per_second – How many requests are executed per second?
  • request_took – How long did a request take to execute?
  • cpu_user – How utilized is the CPU in user space?
  • gc_pause_time – How much time does your application spend in Garbage Collection?

There are hundreds of similar metrics that can be gathered. In general, these metrics are rolled up into averages, medians and percentiles.
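The rollups mentioned above can be sketched in a few lines of Python. The sample values are made up for illustration, and the percentile helper uses the simple nearest-rank method; real metrics systems may compute percentiles differently.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up request_took samples in milliseconds, including two slow outliers.
request_took_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]

summary = {
    "avg": statistics.mean(request_took_ms),
    "median": statistics.median(request_took_ms),
    "p90": percentile(request_took_ms, 90),
    "p99": percentile(request_took_ms, 99),
}
print(summary)
```

For this sample the median is 14.5 ms while the average is 126.1 ms, which shows why looking at more than one rollup is useful: two slow requests dominate the average, but the typical request is fast.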

Metrics should be accessible from one central interface, where metrics from many machines and applications can be accessed. Solutions for achieving this include StatsD+Graphite, Prometheus (both open source) and Datadog (commercial). There are many other options, which might be better.

Metrics are used for two primary purposes: Diagnostic Dashboards and Alerting. Dashboards show how metrics change over time, and can be useful to e.g. correlate increased request time with high memory usage. Alerting is a bigger subject.

Alerting

An alert (also called a page) is when some system notifies a person that something has happened. For example, if your website is down you would want a system to tell you that. Other examples of when you want an alert are when the average request time has been over 10s for 5 minutes, or when a disk is about to run out of space.
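A rule like "average request time over 10s for 5 minutes" could be sketched as follows. This is only one possible interpretation (every one of the last five per-minute averages must exceed the threshold), and the class and method names are invented for this example; in practice you would express such a rule in your metrics or alerting tool.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when every per-minute average in the window exceeds the threshold."""

    def __init__(self, threshold_s=10.0, window_minutes=5):
        self.threshold_s = threshold_s
        self.window = deque(maxlen=window_minutes)  # keeps only the latest N values

    def observe(self, minute_avg_s):
        """Record one per-minute average; return True if the alert should fire."""
        self.window.append(minute_avg_s)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(v > self.threshold_s for v in self.window)


rule = SustainedThresholdAlert()
for minute, avg in enumerate([12.0, 12.5, 11.0, 13.0, 14.0, 3.0], start=1):
    if rule.observe(avg):
        print(f"minute {minute}: request time above 10s for 5 minutes, alerting")
```

Requiring the condition to hold for the whole window means that a single slow minute does not wake anyone up.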

There are tools to support you with this, ranging from simpler ones such as Uptime Robot to more advanced ones such as OpsGenie and PagerDuty. The more advanced ones integrate with metrics systems. These tools support sending notifications to people or teams based on some kind of rules engine.

When setting up rules for when to get alerted, keep in mind that the goal is to get alerted only when you need to take action. If an alert will resolve itself, or if you can’t do anything about it, you might have been woken up at 3am unnecessarily. This eats into your productivity the next day, so it’s bad both for your personal life and for the business. If you get a spike in request times which resolves itself within one minute, you should not get alerted. It might just have been Heroku having issues.

For issues that resolve themselves, consider having the system send you an email instead of alerting you directly. This means that you get to know about them, but in your own time.

When you get alerted and have to take some kind of action, you should consider creating a permanent fix for the underlying issue. Do this on the first working day after you received the alert. If it was a software bug, fix the bug. If you notice that your database connection sometimes dies, diagnose why. Not getting woken up by the same issue again will make you more productive.

Lastly, always try to alert on symptoms, not causes. If you own a stack where the end user interacts with a web page, alert on things that the user would see. The user does not care about how your stack looks; she cares about the web page not loading. By putting alerts higher in the stack, you capture more causes. For example, catching 500s in the web layer will capture both database connection failures (which you did monitor) and corrupted TCP packets from memcached (which you did not monitor).

The one exception to this rule is a cause that will eventually lead to user symptoms. The best example is disk space: the user won’t notice the disks filling up, but once they are full, all hell will break loose. Create alerts on disks filling up (e.g. above 80%).
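A disk-space check along these lines might look like the sketch below, using Python's standard shutil.disk_usage. The function names and the choice to check "/" are assumptions for illustration; a real setup would normally feed this value into the metrics system and let the alerting tool apply the 80% rule.

```python
import shutil


def disk_usage_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    return usage.used / usage.total if usage.total else 0.0


def should_alert(used_fraction, threshold=0.80):
    """Return True when disk usage is above the threshold (80% by default)."""
    return used_fraction > threshold


if should_alert(disk_usage_fraction("/")):
    print("disk usage above 80%: free up space or grow the volume")
```

Alerting at 80% rather than 100% leaves time to act before users see any symptoms.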

Most of the ideas on alerting come from this excellent article: My Philosophy on Alerting