<p><strong>A side quest in API development, observability, Kubernetes and cloud with a hint of database</strong><br />
Stian Øvrevåge, 2021-03-06, https://stian.tech/a-side-quest-in-api-dev-operations-cloud-and-database</p>
<p>Quite often people ask me what I actually do. I have a hard time giving a short answer, even to colleagues and friends in the industry.</p>
<p>Here I will try to show and tell how I spent an evening digging around in a system I helped build for a client.</p>
<!--more-->
<p><br /></p>
<hr />
<p><br /></p>
<p><strong>Table of contents</strong></p>
<ul>
<li><a href="#Background">Background</a></li>
<li><a href="#TheProblem">The (initial) problem</a>
<ul>
<li><a href="#fixing-the-initial-problem">Fixing the (initial) problem</a></li>
<li><a href="#verifying-the-initial-fix">Verifying the (initial) fix</a>
<ul>
<li><a href="#baseline-simple-request---http1-1-connections-20000-requests">Baseline simple request - HTTP1 1 connections, 20000 requests</a></li>
<li><a href="#baseline-complex-request---http1-1-connections-20000-requests">Baseline complex request - HTTP1 1 connections, 20000 requests</a></li>
</ul>
</li>
<li><a href="#verifying-the-fix-for-assumed-workload">Verifying the fix for assumed workload</a>
<ul>
<li><a href="#complex-request---http1-6-connections-500-requests">Complex request - HTTP1 6 connections, 500 requests</a></li>
<li><a href="#complex-request---http2-500-connections-500-requests">Complex request - HTTP2 500 “connections”, 500 requests</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#side-quest-database-optimizations">Side quest: Database optimizations</a></li>
<li><a href="#determining-the-next-bottleneck">Determining the next bottleneck</a></li>
<li><a href="#side-quest-cluster-resources-and-burstable-vms">Side quest: Cluster resources and burstable VMs</a></li>
<li><a href="#Conclusion">Conclusion</a></li>
</ul>
<p><a id="Background"></a></p>
<h1 id="background">Background</h1>
<p>I’m a consultant doing development, DevOps and cloud infrastructure.</p>
<p>For this specific client I mainly develop APIs using Golang to support new products and features as well as various exporting, importing and processing of data in the background.</p>
<p>I’m also the “ops” guy handling everything in AWS, setting up and maintaining databases, and making sure the “DevOps” works so the frontend and analytics people can do their work with little friction.
99% of the time things work just fine. No data is lost. The systems very rarely have unforeseen downtime, and the users can access the data they want with acceptable latency, rarely exceeding 500ms.</p>
<p>A couple of times a year I assess the status of the architecture and set up new environments from scratch and update any documentation that has drifted. This is also a good time to do changes and add or remove constraints in anticipation of future business needs.</p>
<p>In short, the current tech stack that has evolved over a couple of years is:</p>
<ul>
<li>Everything hosted on Amazon Web Services (AWS).</li>
<li>AWS managed Elastic Kubernetes Service (EKS) currently on K8s 1.18.</li>
<li>GitHub Actions for building Docker images for frontends, backends and other systems.</li>
<li>AWS Elastic Container Registry for storing Docker images.</li>
<li>Deployment of each system defined as a Helm chart alongside source code.</li>
<li>Actual environment configuration (Helm values) stored in the repo alongside the source code. Updated by GitHub Actions.</li>
<li>ArgoCD in the cluster to manage the status of all environments and deployments. Development environments are usually deployed automatically on change. Push a button to deploy to Production.</li>
<li>Prometheus for storing metrics from the cluster and the nodes themselves as well as custom metrics for our own systems.</li>
<li>Loki for storing logs. Makes it easier to retrieve logs from past Pods and aggregate across multiple Pods.</li>
<li>Elastic APM server for tracing.</li>
<li>Pyroscope for live CPU profiling/tracing of Go applications.</li>
<li>Betteruptime.com for tracking uptime and hosting status pages.</li>
</ul>
<p>I might write up a longer post about the details if anyone is interested.</p>
<p><a id="TheProblem"></a></p>
<h1 id="the-initial-problem">The (initial) problem</h1>
<p>A week ago I upgraded our API from version 1, which was deployed in January, to version 2 with new features and better architecture.</p>
<p>One of the endpoints of the API returns an analysis of an object we track. I have previously reduced the number of database queries by 90% but it still requires about 50 database calls to three different databases.
Getting and analyzing the data usually completes in about 300-400 milliseconds, returning an 11,000-line JSON document.</p>
<p>It’s also possible to just call <code class="language-plaintext highlighter-rouge">/objects/analysis</code> to get the analysis for all the 500 objects we are tracking. It takes 20 seconds but is meant for exports to other processes and not interactive use, so not a problem.</p>
<p>Since the product is under very active development the frontend guys just download the whole analysis for an object to show certain relevant information to users. It’s too early to decide on which information is needed more often and how to optimize for that. Not a problem.</p>
<p>So we need an overview of some fields from multiple objects in a dashboard / list. We can easily pull the analysis for 20 objects without any noticeable delay.</p>
<p>But what if we want to show more? 50? 200? 500? The frontend already has the IDs for all the objects and fetches them from <code class="language-plaintext highlighter-rouge">/objects/id/analysis</code>. So they loop over the IDs and fire off requests simultaneously.</p>
<p>Analyzing the network waterfall in Chrome DevTools indicated that the requests now took 20-30 seconds to complete! But looking closer, most of that time they were actually queued up in the browser. This is because Chrome only allows 6 concurrent TCP connections to the same origin when using HTTP1 (<a href="https://developers.google.com/web/tools/chrome-devtools/network/understanding-resource-timing">https://developers.google.com/web/tools/chrome-devtools/network/understanding-resource-timing</a>).</p>
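<p>A rough back-of-the-envelope sketch of why the browser queue alone can explain those 20-30 seconds. It assumes every request takes about 350 ms (the midpoint of the 300-400 ms mentioned above) and that requests are served in full “waves” of 6; the function name and the wave model are illustrative simplifications, not measurements.</p>

```python
import math


def queued_wall_time(total_requests: int, max_connections: int, seconds_per_request: float) -> float:
    """Estimate total wall time when requests run in waves of max_connections."""
    waves = math.ceil(total_requests / max_connections)
    return waves * seconds_per_request


# 500 objects, 6 connections per origin (Chrome, HTTP1), ~350 ms per request (assumed)
estimate = queued_wall_time(total_requests=500, max_connections=6, seconds_per_request=0.35)
print(f"{estimate:.1f} s")
```

<p>With these assumed numbers the estimate lands close to the upper end of what the waterfall showed, even though each individual request is fast.</p>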
<p><a id="TheProblem"></a></p>
<h2 id="fixing-the-initial-problem">Fixing the (initial) problem</h2>
<p>HTTP2 should fix this problem easily. By default HTTP2 is disabled in nginx-ingress. I add a couple of lines enabling it and update the Helm deployment of the ingress controller.</p>
<p><a id="TheProblem"></a></p>
<h2 id="verifying-the-initial-fix">Verifying the (initial) fix</h2>
<p>Some common development tools, such as Postman, don’t support HTTP2. So I found <code class="language-plaintext highlighter-rouge">h2load</code>, which lets me both verify that HTTP2 is working and measure the improvement. Nice!</p>
<blockquote>
<p>Note that I’m not using the analysis endpoint since I want to measure the change from HTTP1 to HTTP2 and it will become apparent later that there are other bottlenecks preventing us from a linear performance increase when just changing from HTTP1 to HTTP2.</p>
</blockquote>
<blockquote>
<p>Also note that this is somewhat naive since it requests the same URL over and over which can give false results due to any caching. But fortunately we don’t do any caching yet.</p>
</blockquote>
<p><a id="TheProblem"></a></p>
<h3 id="baseline-simple-request---http1-1-connections-20000-requests">Baseline simple request - HTTP1 1 connections, 20000 requests</h3>
<p>Using 1 concurrent stream, 1 client and HTTP1 I get an estimate of performance pre-HTTP2:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>h2load --h1 --requests=20000 --clients=1 --max-concurrent-streams=1 https://api.x.com/api/v1/objects/1
</code></pre></div></div>
<p>The results are as expected:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>finished in 1138.99s, 17.56 req/s, 18.41KB/s
requests: 20000 total, 20000 started, 20000 done, 19995 succeeded, 5 failed, 0 errored, 0 timeout
</code></pre></div></div>
<div style="text-align:center;">
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-apm.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-apm.png" />
</a>
<p><em>Overview from Elastic APM. Duration is very acceptable at around 20ms. No errors. And about 25% of the time spent doing database queries.</em></p>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-cpu.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-cpu.png" />
</a>
<p><em>Container CPU usage. Nothing special.</em></p>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-db-latency.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-db-latency.png" />
</a>
<p><em>Database query latency. The vast majority under 5ms. Acceptable.</em></p>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-db-queries.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-db-queries.png" />
</a>
<p><em>Number of DB queries per second.</em></p>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-http-latency.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-http-latency.png" />
</a>
<p><em>HTTP response latency.</em></p>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-http-requests.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/0-baseline-http1-1-concurrent-http-requests.png" />
</a>
<p><em>Number of HTTP requests per second. Unsurprisingly the number of database queries is identical to the number of HTTP requests. Latency of HTTP requests also tracks the latency of the (single) database query.</em></p>
</div>
<p>For HTTP2 we set the maximum number of concurrent streams equal to the number of requests:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>h2load --requests=200 --clients=1 --max-concurrent-streams=200 https://api.x.com/api/v1/objects/1
</code></pre></div></div>
<p>Which results in almost half the latency:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>finished in 1.23s, 162.65 req/s, 158.06KB/s
requests: 200 total, 200 started, 200 done, 200 succeeded, 0 failed, 0 errored, 0 timeout
</code></pre></div></div>
<p>So HTTP2 is working and providing significant latency improvements. Success!</p>
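<p>To put the two <code class="language-plaintext highlighter-rouge">h2load</code> summaries side by side, here is a small sketch that derives the amortized wall time per request and the throughput ratio from the numbers printed above. Note that this is total time divided by request count, not the latency of an individual request, and that the two runs use different request counts (20000 vs 200), so treat it as a rough comparison only.</p>

```python
def amortized_ms(total_seconds: float, requests: int) -> float:
    """Wall time divided by number of requests, in milliseconds."""
    return total_seconds / requests * 1000


# Numbers copied from the two h2load summaries above
http1_ms = amortized_ms(1138.99, 20000)  # 1 connection, 1 stream
http2_ms = amortized_ms(1.23, 200)       # 1 connection, 200 concurrent streams
throughput_ratio = 162.65 / 17.56        # req/s with HTTP2 vs HTTP1

print(f"HTTP1: {http1_ms:.1f} ms/request")
print(f"HTTP2: {http2_ms:.2f} ms/request")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
```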
<p><a id="TheProblem"></a></p>
<h3 id="baseline-complex-request---http1-1-connections-20000-requests">Baseline complex request - HTTP1 1 connections, 20000 requests</h3>
<p>We start by establishing a baseline with 1 connection querying over and over.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>h2load --h1 --requests=20000 --clients=1 --max-concurrent-streams=1
</code></pre></div></div>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-apm.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-apm.png" />
</a></p>
<p><em>Latency increases as much more computation is done and more data is returned. But the latency is consistent, which is good. We also see that the database is becoming the bottleneck, as that is where most of the time is spent.</em></p>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-cpu.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-cpu.png" />
</a></p>
<p><em>CPU usage increased to 15%. Lower increase than expected considering the complexity involved in serving the requests.</em></p>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-db-latency.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-db-latency.png" />
</a></p>
<p><em>Database query latency still mostly under 5ms.</em></p>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-db-queries.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-db-queries.png" />
</a></p>
<p><em>Number of database queries increases by a factor of 10 compared to HTTP requests.</em></p>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-http-latency.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-http-latency.png" />
</a></p>
<p><em>HTTP latency.</em></p>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-http-requests.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/1-baseline-http1-1-concurrent-analysis-http-requests.png" />
</a></p>
<p><em>HTTP requests per second.</em></p>
<p><a id="TheProblem"></a></p>
<h2 id="verifying-the-fix-for-assumed-workload">Verifying the fix for assumed workload</h2>
<p>So we verified that HTTP2 gives us a performance boost. But what happens when we fire away 500 requests to the much heavier <code class="language-plaintext highlighter-rouge">/analysis</code> endpoint?</p>
<blockquote>
<p>These graphs are not as pretty as the ones above. This is mainly due to the sampling interval of the metrics and the fact that we need several datapoints to accurately determine the rate() of a counter.</p>
</blockquote>
<p><a id="TheProblem"></a></p>
<h3 id="complex-request---http1-6-connections-500-requests">Complex request - HTTP1 6 connections, 500 requests</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>finished in 32.25s, 14.88 req/s, 2.29MB/s
requests: 500 total, 500 started, 500 done, 500 succeeded, 0 failed, 0 errored, 0 timeout
</code></pre></div></div>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-apm.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-apm.png" />
</a>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-cpu.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-cpu.png" />
</a>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-db-latency.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-db-latency.png" />
</a>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-db-queries.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-db-queries.png" />
</a>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-http-latency.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-http-latency.png" />
</a>
<a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-http-requests.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/2-burst-http1-6-concurrent-analysis-http-requests.png" />
</a></p>
<p>In summary it so far seems to scale linearly with load. Most of the time is spent fetching data from the database. Still very predictable low latency on database queries and the resulting HTTP response.</p>
<p><a id="TheProblem"></a></p>
<h3 id="complex-request---http2-500-connections-500-requests">Complex request - HTTP2 500 “connections”, 500 requests</h3>
<p><em>So now we unleash the beast. Firing all 500 requests at the same time.</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>finished in 16.66s, 30.02 req/s, 3.55MB/s
requests: 500 total, 500 started, 500 done, 500 succeeded, 0 failed, 0 errored, 0 timeout
</code></pre></div></div>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-cpu.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-cpu.png" />
</a></p>
<p><em>CPU on API still doing good. A slight hint of CPU throttling due to CFS, which is used when you set CPU limits in Kubernetes.</em></p>
<blockquote>
<p>Important about Kubernetes and CPU limits<br />
Even with CPU limits set to 1 (100% of one CPU), your container can still be throttled at much lower CPU usage. Check out <a href="https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718">this article</a> for more information.</p>
</blockquote>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-db-latency.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-db-latency.png" />
</a></p>
<p><em>Whoopsie. The average database query latency has increased drastically, and we have a long tail of very slow queries. Looks like we are starting to see signs of bottlenecks on the database. This might also be affected by our maximum of 60 concurrent connections to the database, resulting in queries having to wait their turn before executing.</em></p>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-db-queries.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-db-queries.png" />
</a></p>
<p><em>It’s hard to judge the peak rate of database queries due to limited sampling of the metrics.</em></p>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-http-latency.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-http-latency.png" />
</a></p>
<p><em>Now individual HTTP requests are much slower due to waiting for the database.</em></p>
<p><a href="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-apm-trace.png">
<img src="../images/2021-03-06-a-side-quest-in-api-dev-operations-cloud-and-database/3-burst-http2-500-concurrent-analysis-apm-trace.png" />
</a></p>
<p><em>Here is just a random trace from Elastic APM to see if the increased database latency is concentrated in specific queries or tables, or is just general saturation. Indeed there is a single query responsible for half the time taken for the entire request! We better get back to that in a bit and dig further.</em></p>
<p>In an ideal world all 500 requests should start and complete in 200-300 ms regardless. Since that is not happening it’s an indication that we are now hitting some other bottleneck.</p>
<p>Looking at the graphs it seems we are starting to saturate the database. The latency for every request is now largely dependent on the slowest of the 10-12 database queries it depends on. And as we are stressing the database the probability of slow queries increases. The latency for the whole process of fetching 500 requests is again largely dependent on the slowest requests.</p>
<p>So this optimization gives on average better performance, but more variability of the individual requests, when the system is under heavy load.</p>
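<p>The “slowest query decides” effect can be illustrated with a little probability. This sketch assumes each query is slow independently with the same probability, which is a simplification (under saturation slow queries tend to cluster), and the probabilities used are made-up illustrative values, not measurements from our system.</p>

```python
def p_any_slow(p_slow_query: float, queries_per_request: int) -> float:
    """Probability that at least one of the queries in a request is slow,
    assuming queries are slow independently of each other."""
    return 1 - (1 - p_slow_query) ** queries_per_request


# Hypothetical: 1% of queries are slow when idle, 5% when the database is stressed
idle = p_any_slow(0.01, 12)
stressed = p_any_slow(0.05, 12)
print(f"idle: {idle:.1%} of requests hit a slow query")
print(f"stressed: {stressed:.1%} of requests hit a slow query")
```

<p>Even a small increase in the share of slow queries translates into a large share of slow requests when each request fans out to a dozen queries.</p>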
<p><a id="TheProblem"></a></p>
<h1 id="side-quest-database-optimizations">Side quest: Database optimizations</h1>
<p>It seems we are saturating the database. Before throwing more money at the problem (by increasing database size) I like to know what the bottlenecks are. Looking at the traces from APM
I see one query that is consistently taking 10x longer than the rest. I also confirm this in AWS RDS Performance Insights, which shows the top SQL queries by load.</p>
<p>When designing the database schema I came up with the idea of having immutability for certain data types. So instead of overwriting the row with ID 1, we add a row with ID 1 Revision 2. Now we have the history of who did what to the data and can easily track changes and roll back if needed. The most common use case is just fetching the last revision. So for simplicity I created a PostgreSQL view that only shows the last revision. That way clients don’t have to worry about the existence of revisions at all. That is now just an implementation detail.</p>
<p>When it comes to performance that turns out to be an important implementation detail. The view uses <code class="language-plaintext highlighter-rouge">SELECT DISTINCT ON (id) ... ORDER BY id, revision DESC</code>. However, many of the queries against the view order the returned data by time, and expect the data returned from the database to already be ordered chronologically. Running <code class="language-plaintext highlighter-rouge">EXPLAIN ANALYZE</code> on these queries shows that this always results in a full table scan instead of using indexes, which is what makes this specific query slow. Without going into details, it seems there is no simple and efficient way of having a view with the last revision and querying it for a subset of rows ordered again by time.</p>
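<p>To make the two-step nature of this more concrete, here is a plain Python sketch of the logical steps: first pick the latest revision per ID across all rows, then re-sort the survivors by time. This is only an illustration of the semantics, not how PostgreSQL executes the view, and the <code class="language-plaintext highlighter-rouge">Row</code> type with its <code class="language-plaintext highlighter-rouge">updated_at</code> and <code class="language-plaintext highlighter-rouge">value</code> columns is hypothetical.</p>

```python
from typing import NamedTuple


class Row(NamedTuple):
    id: int
    revision: int
    updated_at: str  # hypothetical timestamp column (ISO date string)
    value: str       # hypothetical payload column


def latest_revisions(rows: list[Row]) -> list[Row]:
    """Keep only the highest revision per id. Every row has to be inspected."""
    latest: dict[int, Row] = {}
    for row in rows:
        current = latest.get(row.id)
        if current is None or row.revision > current.revision:
            latest[row.id] = row
    return list(latest.values())


rows = [
    Row(1, 1, "2021-01-01", "a"),
    Row(1, 2, "2021-03-01", "b"),
    Row(2, 1, "2021-02-01", "c"),
]

view = latest_revisions(rows)                        # step 1: what the view does
by_time = sorted(view, key=lambda r: r.updated_at)   # step 2: what the caller asks for
print([(r.id, r.revision) for r in by_time])
```

<p>The second ordering can only be applied after the first step has looked at every revision, which mirrors why the query ends up touching the whole table.</p>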
<p>For the foreseeable future this does not actually impact real world usage. It’s only apparent under artificially large loads under the worst conditions. But now we know where we need to refactor things if performance actually becomes a problem.</p>
<p><a id="TheProblem"></a></p>
<h1 id="determining-the-next-bottleneck">Determining the next bottleneck</h1>
<p>Whenever I fix one problem I like to know where, how and when the next problem or limit is likely to appear. When increasing the number of requests and streams I expected to see increasing latency. But instead I see errors appear like a cliff:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>finished in 27.33s, 36.59 req/s, 5.64MB/s
requests: 5000 total, 1002 started, 1002 done, 998 succeeded, 4002 failed, 4000 errored, 0 timeout
</code></pre></div></div>
<p>Consulting the logs for both the nginx load balancer and the API, there are no records of failing requests. Since nginx does not pass the HTTP2 connection directly to the API, but instead “unbundles” it into HTTP1 requests, I suspect there might be issues with connection limits or even available ports from nginx to the API. But maybe it’s a configuration issue. By default nginx does <a href="http://nginx.org/en/docs/http/ngx_http_upstream_module.html#server">not limit the number of connections to a backend</a> (our API). But there is actually a <a href="https://nginx.org/en/docs/http/ngx_http_v2_module.html#http2_max_requests">default limit to the number of HTTP2 requests that can be served over a single connection</a>, and it happens to be 1000.</p>
<p>I leave it at that. It’s very unlikely we’ll be hitting these limits any time soon.</p>
<p><a id="TheProblem"></a></p>
<h1 id="side-quest-cluster-resources-and-burstable-vms">Side quest: Cluster resources and burstable VMs</h1>
<p>When load testing the first time around sometimes Grafana would also become unresponsive. That’s usually a bad sign. It might indicate that the underlying infrastructure is also reaching saturation. That is not good since it can impact what should be independent services.</p>
<p>Our Kubernetes cluster is composed of 2x t3a.medium on-demand nodes and 2x t3a.medium spot nodes. These VM types are burstable. You can use 20% per vCPU sustained over time without problems. If you exceed those 20% you start consuming CPU credits faster than they are granted, and once you run out of CPU credits processes will be forcibly throttled.</p>
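<p>A small sketch of the credit arithmetic, assuming 2 vCPUs per node, the 20% baseline per vCPU mentioned above, and that one CPU credit equals one vCPU running at 100% for one minute. The starting balance of 576 credits is my assumption for a node with a full credit bucket; check the AWS documentation for the exact figures for your instance type.</p>

```python
VCPUS = 2                   # vCPUs per node (assumed for t3a.medium)
BASELINE_PER_VCPU = 0.20    # sustained utilization covered by earned credits
CREDITS_PER_VCPU_HOUR = 60  # one credit = one vCPU at 100% for one minute


def net_credits_per_hour(utilization_per_vcpu: float) -> float:
    """Credits earned minus credits spent per hour at a given utilization."""
    earned = VCPUS * BASELINE_PER_VCPU * CREDITS_PER_VCPU_HOUR
    spent = VCPUS * utilization_per_vcpu * CREDITS_PER_VCPU_HOUR
    return earned - spent


def hours_until_throttled(balance: float, utilization_per_vcpu: float) -> float:
    """Hours until the credit balance is empty; infinite at or below baseline."""
    net = net_credits_per_hour(utilization_per_vcpu)
    if net >= 0:
        return float("inf")
    return balance / -net


# Assumed starting balance of 576 credits, both vCPUs at 50%
print(net_credits_per_hour(0.50))
print(hours_until_throttled(576, 0.50))
```

<p>So a node that looks only half busy from the point of view of Kubernetes is, under these assumptions, steadily draining its credits and will eventually be throttled.</p>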
<p>Of course Kubernetes does not know about this and expects 1 CPU to actually be 1 CPU. In addition Kubernetes will decide where to place workloads based on their stated resource requirements and limits, and not their actual resource usage.</p>
<p>When looking at the actual metrics two of our nodes are indeed out of CPU credits and being throttled. The sum of factors leading to this is:</p>
<ul>
<li>We have not yet set resource requests and limits making it harder for Kubernetes to intelligently place workloads</li>
<li>Using burstable nodes having some additional constraints not visible to Kubernetes</li>
<li>Old deployments lying around consuming unnecessary resources</li>
<li>Adding costly features without assessing the overall impact</li>
</ul>
<p>I have not touched on the last point yet. I started adding <a href="https://pyroscope.io/">Pyroscope</a> to our systems since I simply love monitoring All The Things. The documentation does not go into specifics but emphasizes that it’s “low overhead”. Remember that our budget for CPU usage is actually 40% per node, not 200%. The Pyroscope server itself consumes 10-15% CPU, which seems fair. But investigating further, the Pyroscope agent also consumes 5-6% CPU per instance. This graph shows the CPU usage of a single Pod before and after turning off Pyroscope profiling.</p>
<p>5-6% CPU overhead on a highly utilized service is probably worth it. But when the baseline CPU usage is 0% CPU and we have multiple services and deployments in different environments we are suddenly using 40-60% CPU on profiling and less than 1% on actual work!</p>
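<p>The way a small per-agent overhead adds up can be shown with simple arithmetic. The instance counts below are hypothetical, chosen only to show how 5-6% per agent reaches the 40-60% range mentioned above; the 40% budget is the sustained baseline of a 2 vCPU burstable node at 20% per vCPU.</p>

```python
NODE_BUDGET_PCT = 2 * 20  # sustained CPU budget per node: 2 vCPUs at 20% each


def profiling_cpu_pct(instances: int, agent_pct: float) -> float:
    """Total CPU (in percent of one vCPU) spent by profiling agents."""
    return instances * agent_pct


# Hypothetical instance counts across services and environments
low = profiling_cpu_pct(8, 5.0)
high = profiling_cpu_pct(10, 6.0)
print(f"profiling: {low:.0f}-{high:.0f}% CPU, sustained budget per node: {NODE_BUDGET_PCT}%")
```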
<p>The outcome of this is that we need to separate burstable and stable load deployments. Monitoring and supporting systems are usually more stable resource-wise, while the actual business systems are much more variable and suitable for burst nodes. In practice we add a node pool of non-burst VMs and use NodeAffinity to stick Prometheus, Pyroscope etc. to those nodes. Another benefit of this is that the supporting systems needed to troubleshoot problems are now less likely to be impacted by the problem itself, making troubleshooting much easier.</p>
<p><a id="Conclusion"></a></p>
<h1 id="conclusion">Conclusion</h1>
<p>This whole adventure only took a few hours but resulted in some specific and immediate performance gains. It also highlighted the weakest links in our application, database and infrastructure architecture.</p>
<hr />
<p><strong>End of 2020 rough database landscape</strong><br />
2020-11-27, https://stian.tech/end-of-2020-rough-database-landscape</p>
<p>There seems to exist a database for every niche, mood or emotion. And they seem to change just as fast.</p>
<p>How do you balance the urge for the new and shiny but without risking too much headache down the road?</p>
<p>This post is an attempt to lay out the rough landscape of databases that you might encounter or consider as of late 2020.</p>
<p>There will be broad generalizations for brevity.</p>
<p>The goal is not to be exhaustive or take all possible precautions. Consider it a starting point for further research and planning.</p>
<!--more-->
<p><br /></p>
<hr />
<p><br /></p>
<p>TLDR: Scroll to the <a href="#Landscape">diagrams</a> or view the <a href="../images/2020-11-27-end-of-2020-rough-database-landscape/map-complete.png">big picture</a>.</p>
<p><br /></p>
<p><strong>Table of contents</strong></p>
<ul>
<li><a href="#Background">Background</a>
<ul>
<li><a href="#ProjectPhase">Project phase overview</a></li>
</ul>
</li>
<li><a href="#Planning">Planning</a>
<ul>
<li><a href="#DatabaseCategories">Database categories</a>
<ul>
<li><a href="#SQL">SQL</a></li>
<li><a href="#NoSQL">NoSQL</a></li>
<li><a href="#KeyValue">KeyValue</a></li>
<li><a href="#Timeseries">Timeseries</a></li>
<li><a href="#Graph">Graph</a></li>
<li><a href="#OtherNiceThings">Other nice things</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#Landscape">The Landscape</a>
<ul>
<li><a href="#SQLMap">SQL</a></li>
<li><a href="#NoSQLMap">NoSQL</a></li>
<li><a href="#KeyValueMap">KeyValue</a></li>
<li><a href="#TimeseriesMap">Timeseries</a></li>
<li><a href="#GraphMap">Graph</a></li>
</ul>
</li>
<li><a href="#FurtherReading">Further reading</a></li>
<li><a href="#Conclusion">Conclusion</a></li>
</ul>
<p><a id="Background"></a></p>
<h1 id="background">Background</h1>
<p>I’m a consultant doing development, DevOps and cloud infrastructure. I also have the occasional side project trying out the Tech Flavor of the Month.</p>
<p><a id="ProjectPhase"></a></p>
<h2 id="project-phase-overview">Project phase overview</h2>
<p>The typical phases in projects I’m involved in follow no scientific or trademarked methodology, so YMMV:</p>
<h3 id="starting-out">Starting out</h3>
<p>Get something working as fast as possible. Take all the shortcuts. Use some opinionated framework or platform.</p>
<h3 id="moving-from-development-to-production">Moving from development to production</h3>
<p>People like it, people use it. Move the thing from a single “pet server” to a more robust cloud environment.</p>
<h3 id="scaling-production">Scaling production</h3>
<p>Bottlenecks and scaling problems start to emerge. Refactor or replace some pieces to remove the bottlenecks.</p>
<h3 id="challenges">Challenges</h3>
<p>Moving between these phases might be a major PITA if the wrong shortcuts were taken in the previous phases.</p>
<p><em>This of course applies to all technology choices and not just databases. But we have to start somewhere, right?</em></p>
<p><a id="Planning"></a></p>
<h1 id="planning">Planning</h1>
<p>When starting out I try to envision all the phases of the project and which directions it may take in the future.</p>
<p>First I want the technology or software I choose to be instantly usable. A Docker image. Great. An <code class="language-plaintext highlighter-rouge">apt-get install</code>. Sweet. <code class="language-plaintext highlighter-rouge">npm install</code>. Sure, why not. Downloading a tarball. Installing some C dependencies. Setting some flags. Compiling. Symlinking and fixing permissions. Creating some configuration from scratch. Making my own systemd service definitions. Going back and doing every step again because it failed. <em>Mkay, no thanks, I’m out.</em></p>
<p>At least for me it’s a plus if it’s easy to deploy on Kubernetes since I use it for everything already. I always have a cluster or three lying around so I can get a prototype or five up and running quickly before later spending money on cloud hosting.</p>
<p>Does the thing have momentum and a community? If it does it probably has high quality tooling either by the vendor or the open source community (preferably both). It probably also has lots of common questions answered on blogs, Stack Overflow and GitHub issues.</p>
<p><strong>So we managed to build something and the audience likes it.</strong></p>
<p>How easy is it to move it from a development environment into something stable and low-maintenance? For databases that would typically involve using a managed service for hosting it. You do not want to be responsible for operating your own databases. Is it common enough that there are competitors in the marketplace offering it as a managed service? If there is only a single option, expect prices to be very steep. Preferably there is also a managed service by one of the big known cloud platforms. They are usually cheaper. They are less likely to vanish. It might make integration with other systems easier later.</p>
<p><strong>We hit some problems either because of raw scale or some type of usage we did not anticipate in the beginning.</strong></p>
<p>Are there compatible implementations that might solve some common problems? Typically these exist because every implementation has to decide on its trade-offs. For a database system this is usually around the CAP theorem. A database system (or anything that keeps state) can be:</p>
<ul>
<li><em>Partition Tolerant</em> - The system still works if a node or the network between nodes fails.</li>
<li><em>Available</em> - All requests receive a response.</li>
<li><em>Consistent</em> - The data we read is the current data and not an earlier state.</li>
</ul>
<p>But you can only have two at the same time. And distributed systems tend to need to be partition tolerant. So we are stuck choosing between consistency and availability.</p>
<p>It might be good to have an idea of the CAP trade-offs an implementation has made, and whether there are compatible implementations with different trade-offs that can be used if we later find out we need to tweak our trade-offs for speed and/or scale.</p>
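<p>To make the trade-off concrete, here is a toy sketch (not how any real database is implemented) of a single replica that has lost contact with its peers. The <code class="language-plaintext highlighter-rouge">Replica</code> class and its <code class="language-plaintext highlighter-rouge">mode</code> values are made up for illustration: a consistency-first replica refuses to answer during the partition, while an availability-first replica answers with whatever it last saw.</p>

```python
class Replica:
    """Hypothetical replica that favors consistency ("CP") or availability ("AP")."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = "v1"         # last value replicated before the partition
        self.partitioned = False  # True while the peers are unreachable

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot prove the value is current, so refuse rather than risk stale data.
            raise RuntimeError("unavailable during partition")
        # AP mode (or a healthy network): always answer, possibly with stale data.
        return self.value


cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True

ap_answer = ap.read()  # answers, but the value may be out of date
cp_error = None
try:
    cp.read()
except RuntimeError as err:
    cp_error = str(err)

print(ap_answer, "|", cp_error)
```

<p>Real systems sit somewhere on a spectrum and often let you tune this per query, so treat the two modes as the extremes rather than as a description of any specific product.</p>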
<blockquote>
<p><em>More information about CAP theorem <a href="https://en.wikipedia.org/wiki/CAP_theorem">here</a> and <a href="https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e">here</a>. Jepsen have also <a href="https://jepsen.io/analyses">extensively tested</a> many popular databases to see how they break and if they are true to their stated trade-offs.</em></p>
</blockquote>
<p><a id="DatabaseCategories"></a></p>
<h2 id="database-categories">Database categories</h2>
<p>Databases can be roughly sorted into categories. I’ll keep it simple and use the everyday lingo and not go into details about semantics and definitions (forgive me).</p>
<p><a href="https://www.prisma.io/dataguide/intro/comparing-database-types">https://www.prisma.io/dataguide/intro/comparing-database-types</a></p>
<p><a id="Planning"></a></p>
<h3 id="sql">SQL</h3>
<p>The oldest category is the relational database, also known as SQL based on the typical interface used to access these databases.</p>
<p>In general these databases have tables with names, a set of pre-defined columns and an arbitrary number of rows. You should have an idea of the data types to be stored in each column (such as text or numbers).</p>
<p>The downside of this is that you have to start with a rough model of the data you want to store and work with. The benefit of this is that later you know something about the model of the data you are working with. Most of the time I’ll happily do this in the database rather than handle all the potential inconsistencies in all systems that use that database.</p>
<p><em>Main contenders: PostgreSQL. MySQL & MariaDB.</em></p>
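<p>A small sketch of what a pre-defined model buys you, using Python’s built-in <code class="language-plaintext highlighter-rouge">sqlite3</code> module as a stand-in for PostgreSQL or MySQL. The <code class="language-plaintext highlighter-rouge">users</code> table and its columns are invented for the example. SQLite is more relaxed about column types than PostgreSQL, so the example leans on explicit constraints to show the database rejecting inconsistent rows.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age  INTEGER CHECK (age >= 0)
    )
    """
)
conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Ada", 36))

# Rows that violate the model are stopped by the database itself,
# so every application using it can rely on the same guarantees.
rejected = []
for row in [(None, 20), ("Bob", -1)]:
    try:
        conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

stored = conn.execute("SELECT name, age FROM users").fetchall()
print("stored:", stored)
print("rejected:", rejected)
```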
<p><a id="NoSQL"></a></p>
<h3 id="nosql">NoSQL</h3>
<p>All the rage the last decade. You put data in, you get data out. The data is structured but not necessarily predefined. Think JSON objects with values, arrays and nested objects.</p>
<p>The benefit is productivity when developing. The drawback is that you may pay a price for those shortcuts later if you’re not careful.</p>
<p><em>Main contender: MongoDB.</em></p>
<p><a id="KeyValue"></a></p>
<h3 id="keyvalue">KeyValue</h3>
<p>Technically a sub-category of NoSQL, and should probably be called caches. But I feel it deserves its own category.</p>
<p>A hyper-fast hyper-simple type of database. It has two columns. A key (ID) and value. The value can be anything, a string, a number, an entire JSON object or a blob containing binary data.</p>
<p>These are typically used in combination with another type of database. Either by storing very commonly used data for even quicker access. Or for certain types of simple data that requires insane speed or throughput and you don’t want to overload the main database.</p>
<p><em>Main contender: Redis.</em></p>
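<p>The “in combination with another database” usage is often called cache-aside. A minimal sketch, assuming a plain Python dict in place of Redis and a hypothetical <code class="language-plaintext highlighter-rouge">slow_lookup</code> function in place of a query against the main database:</p>

```python
class CacheAside:
    """Cache-aside: check the cache first, fall back to the main database on a miss."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # function standing in for a slow SQL query
        self._cache = {}                   # dict standing in for a key-value store
        self.db_reads = 0                  # how often we had to hit the main database

    def get(self, key):
        if key not in self._cache:
            self.db_reads += 1
            self._cache[key] = self._load_from_db(key)
        return self._cache[key]


def slow_lookup(user_id):
    # Hypothetical expensive query against the main database.
    return {"id": user_id, "name": f"user-{user_id}"}


users = CacheAside(slow_lookup)
first = users.get(42)
second = users.get(42)  # served from the cache, no second database read
print(users.db_reads)
```

<p>A real setup also needs expiry and invalidation so the cache does not serve outdated values forever; that part is left out here.</p>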
<p><a id="Timeseries"></a></p>
<h3 id="timeseries">Timeseries</h3>
<p>A lesser known type of database optimized for storing a time series. A time series is a specific data type where the index is typically the time of a measurement. And the measurement is a number.</p>
<p>A time series is almost never changed after the fact. So these databases can be optimized for writing huge amounts of new data and reading and calculating on existing data. At the cost of performance for deleting or updating old data which is sloooow. Since the values are always numbers that tend to change somewhat predictably compression and deduplication can save us massive amounts of storage.</p>
<p><em>Main contenders: Prometheus, InfluxDB, TimescaleDB (plugin for PostgreSQL).</em></p>
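<p>Why do predictable numbers compress well? One simple idea is delta encoding: store the first value and then only the difference to the previous value. The sketch below is an illustration of that idea with made-up readings, not the actual storage format of any of the databases mentioned, which use more elaborate schemes.</p>

```python
def delta_encode(values):
    """Keep the first value, then store only the change from the previous value."""
    if not values:
        return []
    deltas = [values[0]]
    for previous, current in zip(values, values[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original series by accumulating the deltas."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


readings = [2150, 2151, 2151, 2153, 2152]  # hypothetical sensor values
encoded = delta_encode(readings)
print(encoded)
```

<p>The encoded series consists mostly of tiny numbers, which need far fewer bits than the raw readings. Updating an old value, on the other hand, changes every delta after it, which hints at why updates are the slow path.</p>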
<p><a id="Graph"></a></p>
<h3 id="graph">Graph</h3>
<p>Graph databases are cool. In a graph database the relationship between objects is a primary feature. Whereas in SQL you need to join an element from one table with another object in another table with some kind of common identifier.</p>
<p>For most simple use cases a regular SQL database will do fine. But when the number of objects stored (rows) and the number of intermediary tables (joins) become large it gets slow, or expensive, or both.</p>
<p>I don’t have much experience with graph databases but I suspect they are less suited to general tasks and should be reserved for solving specific problems.</p>
<p><em>Main contenders: Neo4j. Redis + RedisGraph.</em></p>
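<p>To illustrate why relationships as a primary feature matter, here is a sketch of a “who is within N hops” question on an in-memory adjacency list. The <code class="language-plaintext highlighter-rouge">follows</code> data is invented. In SQL each extra hop typically means one more self-join, while a graph traversal just keeps walking edges.</p>

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Breadth-first traversal returning every node reachable in at most max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen - {start}


# Hypothetical "follows" relationships.
follows = {"alice": ["bob"], "bob": ["carol"], "carol": ["dave"], "dave": []}
print(sorted(within_hops(follows, "alice", 2)))
```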
<blockquote>
<p>PS: Graph databases and GraphQL are completely separate things.</p>
</blockquote>
<p><a id="OtherNiceThings"></a></p>
<h3 id="other-nice-things">Other nice things</h3>
<p>When researching this post I’ve come across things that look promising but are hard to categorize or fall in their own very niche categories.</p>
<ul>
<li><a href="https://dgraph.io">Dgraph</a> - A GraphQL API and database backend in one.</li>
<li><a href="https://prestodb.io">PrestoDB</a> - An SQL interface on top of whatever database or storage you want to connect.</li>
<li><a href="https://rethinkdb.com">RethinkDB</a> - A NoSQL database focused on real-time streaming/updating clients.</li>
<li><a href="https://www.foundationdb.org">FoundationDB</a> - A transactional key-value store by Apple.</li>
<li><a href="https://clickhouse.tech/">ClickHouse</a> - An SQL database that stores data (on disk) in columns instead of rows. Makes for blazingly fast analytical and aggregation queries.</li>
<li><a href="https://aws.amazon.com/qldb/">Amazon Quantum Ledger Database</a> - A managed distributed ledger database (aka blockchain).</li>
<li><a href="https://www.enterprisedb.com/products/edb-postgres-advanced-server-secure-ha-oracle-compatible">EDB Postgres Advanced Server</a> - An Oracle compatible PostgreSQL variant.</li>
</ul>
<p><a id="Landscape"></a></p>
<h1 id="the-landscape">The Landscape</h1>
<p><em>How to use these maps:</em></p>
<p>Version compatibility is noted in parentheses. I have not mapped every version or how breaking each one is compared to previous versions, but I have included notes where I know there might be issues.</p>
<p><em>API/Protocol/Interface</em> - This is decided by the framework, tool or driver you want to use. Sometimes it might be easier to choose the framework first and then a fitting database protocol. Or you might be lucky to choose the database features you need first and then select frameworks, tools and drivers that support it.</p>
<blockquote>
<p>I think interfaces are really important when creating and choosing technology. I had a <a href="https://speakerdeck.com/stianovrevage/avoiding-lock-in-without-avoiding-managed-services">presentation</a> about it a while ago and I think it’s still relevant.</p>
</blockquote>
<p><em>Engine</em> - Database implementations that are independent but try to be compatible. If there are alternatives to the “original” implementation they might have done different tradeoffs with regards to the CAP theorem or solve other specific problems.</p>
<p><em>Big three managed</em> - Available managed services by the big three clouds, Amazon (AWS), Google (GCP) or Microsoft (Azure). Hosting with one of the big three is most likely the cheapest option, and it gives access to a variety of other managed services for building a complete system in a single cloud.</p>
<p><em>Vendor managed</em> - Whether the database vendor or backing company offers an official managed service. These are usually hosted on the big three. Potentially a large cost premium over the raw compute power.</p>
<p><em>Self-hosted</em> - Implementations you can run on your own computer or server.</p>
<div>
<br />
<table style="text-align:center;">
<tr><td colspan="2"><strong>Legend</strong></td></tr>
<tr><td style="vertical-align: middle;" width="50px"><img src="../images/2020-11-27-end-of-2020-rough-database-landscape/icon-checklist.png" /></td><td>The checklist icon marks potential compatibility issues. For most use cases not a problem.<br /><strong>PS:</strong> The absence of this icon does not automatically mean compatibility.</td></tr><tr><td> </td><td></td></tr>
<tr><td style="vertical-align: middle;"><img src="../images/2020-11-27-end-of-2020-rough-database-landscape/icon-operator.png" /></td><td>I put the lightning icon on the self-hosted implementations that have what seems to be stable Kubernetes operators available. In short, a Kubernetes operator makes running a stateful system, such as a database, on Kubernetes much easier. It might allow for longer time before migrating to a managed service.</td></tr></table></div>
<p><a id="SQLMap"></a></p>
<h2 id="sql-1">SQL</h2>
<div style="text-align:center;">
<a href="../images/2020-11-27-end-of-2020-rough-database-landscape/map-sql.png">
<img src="../images/2020-11-27-end-of-2020-rough-database-landscape/map-sql.png" />
</a>
</div>
<p><br /></p>
<blockquote>
<p>Compatibility:</p>
<ul>
<li><a href="https://blog.yugabyte.com/postgresql-compatibility-in-yugabyte-db-2-0/">PostgreSQL - Yugabyte</a></li>
<li><a href="https://www.cockroachlabs.com/docs/stable/postgresql-compatibility.html">PostgreSQL - CockroachDB</a></li>
<li><a href="https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/">MySQL - MariaDB</a></li>
</ul>
</blockquote>
<blockquote>
<p>Kubernetes Operators:</p>
<ul>
<li><a href="https://github.com/CrunchyData/postgres-operator">PostgreSQL (CrunchyData)</a></li>
<li><a href="https://github.com/zalando/postgres-operator">PostgreSQL (Zalando)</a></li>
<li><a href="https://docs.yugabyte.com/latest/deploy/kubernetes/single-zone/oss/yugabyte-operator/">Yugabyte</a></li>
<li><a href="https://github.com/cockroachdb/cockroach-operator">CockroachDB</a></li>
<li><a href="https://www.percona.com/software/percona-kubernetes-operators">Percona Kubernetes Operators (PostgreSQL, MySQL and XtraDB)</a></li>
</ul>
</blockquote>
<p><a id="NoSQLMap"></a></p>
<h2 id="nosql-1">NoSQL</h2>
<div style="text-align:center;">
<a href="../images/2020-11-27-end-of-2020-rough-database-landscape/map-nosql.png">
<img src="../images/2020-11-27-end-of-2020-rough-database-landscape/map-nosql.png" />
</a>
</div>
<p><br /></p>
<blockquote>
<p>PS: There are some <a href="https://docs.mongodb.com/manual/release-notes/4.0-compatibility/">breaking changes</a> from MongoDB 3.6 to 4 so make sure the tools you intend to use are compatible with the database version you intend to use.</p>
</blockquote>
<blockquote>
<p>Kubernetes Operators:</p>
<ul>
<li><a href="https://github.com/mongodb/mongodb-kubernetes-operator">MongoDB</a></li>
<li><a href="https://www.percona.com/doc/kubernetes-operator-for-psmongodb/index.html">Percona Distribution for MongoDB</a></li>
<li><a href="https://github.com/scylladb/scylla-operator">ScyllaDB</a></li>
<li><a href="https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-overview.html">Elastic Stack</a></li>
</ul>
</blockquote>
<p><a id="KeyValueMap"></a></p>
<h2 id="keyvalue-1">KeyValue</h2>
<div style="text-align:center;">
<a href="../images/2020-11-27-end-of-2020-rough-database-landscape/map-keyvalue.png">
<img src="../images/2020-11-27-end-of-2020-rough-database-landscape/map-keyvalue.png" />
</a>
</div>
<p><br /></p>
<blockquote>
<p>Kubernetes Operators:</p>
<ul>
<li><a href="https://github.com/spotahome/redis-operator">Redis (Spotahome)</a></li>
</ul>
</blockquote>
<p><a id="TimeseriesMap"></a></p>
<h2 id="timeseries-1">Timeseries</h2>
<div style="text-align:center;">
<a href="../images/2020-11-27-end-of-2020-rough-database-landscape/map-timeseries.png">
<img src="../images/2020-11-27-end-of-2020-rough-database-landscape/map-timeseries.png" />
</a>
</div>
<p><br /></p>
<blockquote>
<p>Kubernetes Operators:</p>
<ul>
<li><a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack">Prometheus-Stack</a></li>
<li><a href="https://github.com/VictoriaMetrics/operator">VictoriaMetrics</a></li>
</ul>
</blockquote>
<p><a id="GraphMap"></a></p>
<h2 id="graph-1">Graph</h2>
<div style="text-align:center;">
<a href="../images/2020-11-27-end-of-2020-rough-database-landscape/map-graph.png">
<img src="../images/2020-11-27-end-of-2020-rough-database-landscape/map-graph.png" />
</a>
</div>
<p><br /></p>
<blockquote>
<p>Kubernetes Operators:</p>
<ul>
<li><a href="https://www.arangodb.com/docs/stable/deployment-kubernetes-usage.html">ArangoDB</a></li>
</ul>
</blockquote>
<p><a id="FurtherReading"></a></p>
<h1 id="further-reading">Further reading</h1>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems">Wikipedia on RDBMS</a></li>
<li><a href="https://db-engines.com/en/">DB-engines.com</a> - Lots of statistics and comparisons between DB engines</li>
<li><a href="https://landscape.cncf.io/">CNCF Landscape</a> - What’s moving in the cloud native landscape, including databases.</li>
</ul>
<p><a id="Conclusion"></a></p>
<h1 id="conclusion">Conclusion</h1>
<p>Congratulations if you made it this far!</p>
<p>I did this research primarily to reduce my own analysis paralysis on various projects so I can get-back-to-building. If you learned something as well, great stuff!</p>
<p>And if you want my advice, just use PostgreSQL unless you really know about some special requirements that necessitate using something else :-)</p>
<p>Stian Øvrevåge</p>
<p>There seems to exist a database for every niche, mood or emotion. And they seem to change just as fast. How do you balance the urge for the new and shiny without risking too much headache down the road? This post is an attempt to lay out the rough landscape of databases that you might encounter or consider as of late 2020. There will be broad generalizations for brevity. The goal is not to be exhaustive or take all possible precautions. Consider it a starting point for further research and planning.</p>
<hr />
<p><strong>Mini-post: Down-scaling Azure Kubernetes Service (AKS)</strong></p>
<p>2019-06-04 - <a href="https://stian.tech/downscaling-aks">https://stian.tech/downscaling-aks</a></p>
<p>We discovered today that some implicit assumptions we had about AKS at smaller scales were incorrect.</p>
<p>Suddenly new workloads and jobs in our Radix CI/CD could not start due to insufficient resources (CPU & memory).</p>
<p>Even though it only caused problems in development environments with smaller node sizes it still surprised some of our developers, since we expected the development clusters to have enough resources.</p>
<p>I thought it would be a good chance to go a bit deeper, verify some of our assumptions and learn more about various components that usually “just work” and aren’t really given much thought.</p>
<!--more-->
<p>First I do a <code class="language-plaintext highlighter-rouge">kubectl describe node <node></code> on 2-3 of the nodes to get an idea of how things are looking:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Resource Requests Limits
<span class="nt">--------</span> <span class="nt">--------</span> <span class="nt">------</span>
cpu 930m <span class="o">(</span>98%<span class="o">)</span> 5500m <span class="o">(</span>585%<span class="o">)</span>
memory 1659939584 <span class="o">(</span>89%<span class="o">)</span> 4250M <span class="o">(</span>228%<span class="o">)</span>
</code></pre></div></div>
<p>So we are obviously hitting the roof when it comes to resources. But why?</p>
<h2 id="node-overhead">Node overhead</h2>
<p>We use <code class="language-plaintext highlighter-rouge">Standard DS1 v2</code> instances as AKS nodes and they have 1 CPU core and 3.5 GiB memory.</p>
<p>The output of <code class="language-plaintext highlighter-rouge">kubectl describe node</code> also gives us info on the Capacity (total node size) and Allocatable (resources available to run Pods).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Capacity:
cpu: 1
memory: 3500452Ki
Allocatable:
cpu: 940m
memory: 1814948Ki
</code></pre></div></div>
<p>So we have lost <strong>60 millicores / 6%</strong> of CPU and <strong>1646MiB (1685504Ki) / 48%</strong> of memory. The next question is whether this overhead grows linearly with node size (the percentage of resources lost is the same regardless of node size), is fixed (always 60 millicores and 1646Mi of memory), or is a combination of the two.</p>
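<p>The quantities printed by <code class="language-plaintext highlighter-rouge">kubectl describe node</code> are in Ki, so it is easy to slip a unit when doing this in your head. A quick sketch of the arithmetic, using the Capacity and Allocatable values above (the helper name is just for illustration):</p>

```python
KI_PER_MI = 1024


def reserved_memory(capacity_ki, allocatable_ki):
    """Return (gap in Ki, gap in Mi, gap as a percentage of capacity)."""
    gap_ki = capacity_ki - allocatable_ki
    return gap_ki, gap_ki / KI_PER_MI, 100 * gap_ki / capacity_ki


gap_ki, gap_mi, gap_pct = reserved_memory(3500452, 1814948)
print(f"{gap_ki}Ki = {gap_mi:.0f}Mi ({gap_pct:.0f}% of capacity)")
```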
<p>I connect to another cluster that has double the node size (<code class="language-plaintext highlighter-rouge">Standard DS2 v2</code>) and compare:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Capacity:
cpu: 2
memory: 7113160Ki
Allocatable:
cpu: 1931m
memory: 4667848Ki
</code></pre></div></div>
<p>So for this node size the loss is <strong>69 millicores / 3.5%</strong> of CPU and <strong>2388MiB (2445312Ki) / 34%</strong> of memory.</p>
<p>So CPU reservations are close to fixed regardless of node size while memory reservations are influenced by node size but luckily not linearly.</p>
<p>What causes this “waste”? Reading up on <a href="https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/">kubernetes.io</a> gives a few clues. Kubelet will reserve CPU and memory resources for itself and other Kubernetes processes. It will also reserve a portion of memory to act as a buffer whenever a Pod goes beyond its memory limits, to avoid risking a system OOM that could make the whole node unstable.</p>
<p>To figure out what these are configured to we log in to an actual AKS node’s console and run <code class="language-plaintext highlighter-rouge">ps ax|grep kube</code> and the output looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/local/bin/kubelet --enable-server --node-labels=node-role.kubernetes.io/agent=,kubernetes.io/role=agent,agentpool=nodepool1,storageprofile=managed,storagetier=Premium_LRS,kubernetes.azure.com/cluster=MC_clusters_weekly-22_northeurope --v=2 --volume-plugin-dir=/etc/kubernetes/volumeplugins --address=0.0.0.0 --allow-privileged=true --anonymous-auth=false --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cgroups-per-qos=true --client-ca-file=/etc/kubernetes/certs/ca.crt --cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.2.0.10 --cluster-domain=cluster.local --enforce-node-allocatable=pods --event-qps=0 --eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-gates=PodPriority=true,RotateKubeletServerCertificate=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --image-pull-progress-deadline=30m --keep-terminated-pod-volumes=false --kube-reserved=cpu=60m,memory=896Mi --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=110 --network-plugin=cni --node-status-update-frequency=10s --non-masquerade-cidr=0.0.0.0/0 --pod-infra-container-image=k8s.gcr.io/pause-amd64:3.1 --pod-manifest-path=/etc/kubernetes/manifests --pod-max-pids=-1 --rotate-certificates=false --streaming-connection-idle-timeout=5m
</code></pre></div></div>
<blockquote>
<p>To log in to the console of a node, go to the MC_resourcegroup_clustername_region resource-group and select the VM. Then go to <code class="language-plaintext highlighter-rouge">Boot diagnostics</code> and enable it. Go to <code class="language-plaintext highlighter-rouge">Reset password</code> to create yourself a user and then <code class="language-plaintext highlighter-rouge">Serial console</code> to log in and execute commands.</p>
</blockquote>
<p>We can see <code class="language-plaintext highlighter-rouge">--kube-reserved=cpu=60m,memory=896Mi</code> and <code class="language-plaintext highlighter-rouge">--eviction-hard=memory.available<750Mi</code> which add up to <code class="language-plaintext highlighter-rouge">1646Mi</code>, which is exactly the gap between Capacity and Allocatable (<code class="language-plaintext highlighter-rouge">1685504Ki</code>).</p>
<p>We also do this on a <code class="language-plaintext highlighter-rouge">Standard DS2 v2</code> node and get <code class="language-plaintext highlighter-rouge">--kube-reserved=cpu=69m,memory=1638Mi</code> and <code class="language-plaintext highlighter-rouge">--eviction-hard=memory.available<750Mi</code>.</p>
<p>So we can see that the memory of <code class="language-plaintext highlighter-rouge">kube-reserved</code> grows almost linearly with node size and stays at roughly a quarter of node memory (24-26%), while the CPU reservation barely changes. The memory eviction buffer is fixed at <code class="language-plaintext highlighter-rouge">750Mi</code>, which means proportionally more waste as nodes get smaller.</p>
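<p>The same comparison can be sketched in code. The values are copied from the node output and kubelet flags above; the function names and the dictionary layout are only for this example, and the quantity parser handles just the two-letter binary suffixes that appear here.</p>

```python
UNIT_BYTES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def to_mib(quantity):
    """Convert a quantity such as '896Mi' or '3500452Ki' to MiB."""
    number, unit = quantity[:-2], quantity[-2:]
    return int(number) * UNIT_BYTES[unit] / UNIT_BYTES["Mi"]


# Capacity from `kubectl describe node`, reservations from the kubelet flags.
NODES = {
    "Standard DS1 v2": {"capacity": "3500452Ki", "kube_reserved": "896Mi", "eviction_hard": "750Mi"},
    "Standard DS2 v2": {"capacity": "7113160Ki", "kube_reserved": "1638Mi", "eviction_hard": "750Mi"},
}


def memory_overhead(node):
    """Return (reserved MiB, kube-reserved share in %, eviction buffer share in %)."""
    capacity = to_mib(node["capacity"])
    kube = to_mib(node["kube_reserved"])
    eviction = to_mib(node["eviction_hard"])
    return kube + eviction, 100 * kube / capacity, 100 * eviction / capacity


for name, node in NODES.items():
    total, kube_pct, eviction_pct = memory_overhead(node)
    print(f"{name}: {total:.0f}Mi reserved, "
          f"kube-reserved {kube_pct:.0f}%, eviction buffer {eviction_pct:.0f}%")
```

<p>The fixed eviction buffer costs about 22% of memory on the small node but only about 11% on the larger one, which is where most of the difference between the two sizes comes from.</p>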
<h4 id="cpu">CPU</h4>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: right">Standard DS1 v2</th>
<th style="text-align: right">Standard DS2 v2</th>
</tr>
</thead>
<tbody>
<tr>
<td>VM capacity</td>
<td style="text-align: right">1.000m</td>
<td style="text-align: right">2.000m</td>
</tr>
<tr>
<td>kube-reserved</td>
<td style="text-align: right">-60m</td>
<td style="text-align: right">-69m</td>
</tr>
<tr>
<td>Allocatable</td>
<td style="text-align: right">940m</td>
<td style="text-align: right">1.931m</td>
</tr>
<tr>
<td>Allocatable %</td>
<td style="text-align: right">94%</td>
<td style="text-align: right">96.5%</td>
</tr>
</tbody>
</table>
<h4 id="memory">Memory</h4>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: right">Standard DS1 v2</th>
<th style="text-align: right">Standard DS2 v2</th>
</tr>
</thead>
<tbody>
<tr>
<td>VM capacity</td>
<td style="text-align: right">3.500Mi</td>
<td style="text-align: right">7.113Mi</td>
</tr>
<tr>
<td>kube-reserved</td>
<td style="text-align: right">-896Mi</td>
<td style="text-align: right">-1.638Mi</td>
</tr>
<tr>
<td>Eviction buf</td>
<td style="text-align: right">-750Mi</td>
<td style="text-align: right">-750Mi</td>
</tr>
<tr>
<td>Allocatable</td>
<td style="text-align: right">1.814Mi</td>
<td style="text-align: right">4.667Mi</td>
</tr>
<tr>
<td>Allocatable %</td>
<td style="text-align: right">52%</td>
<td style="text-align: right">66%</td>
</tr>
</tbody>
</table>
<h2 id="node-pods-daemonsets">Node pods (DaemonSets)</h2>
<p>We have some Pods that run on every node, and they are installed by default by AKS. We get the resource requests of these by describing either the pods or the daemonsets.</p>
<h4 id="cpu-1">CPU</h4>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: right">Standard DS1 v2</th>
<th style="text-align: right">Standard DS2 v2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Allocatable</td>
<td style="text-align: right">940m</td>
<td style="text-align: right">1.931m</td>
</tr>
<tr>
<td>kube-system/calico-node</td>
<td style="text-align: right">-250m</td>
<td style="text-align: right">-250m</td>
</tr>
<tr>
<td>kube-system/kube-proxy</td>
<td style="text-align: right">-100m</td>
<td style="text-align: right">-100m</td>
</tr>
<tr>
<td>kube-system/kube-svc-redirect</td>
<td style="text-align: right">-5m</td>
<td style="text-align: right">-5m</td>
</tr>
<tr>
<td>Available</td>
<td style="text-align: right">585m</td>
<td style="text-align: right">1.576m</td>
</tr>
<tr>
<td>Available %</td>
<td style="text-align: right">58%</td>
<td style="text-align: right">79%</td>
</tr>
</tbody>
</table>
<h4 id="memory-1">Memory</h4>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: right">Standard DS1 v2</th>
<th style="text-align: right">Standard DS2 v2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Allocatable</td>
<td style="text-align: right">1.814Mi</td>
<td style="text-align: right">4.667Mi</td>
</tr>
<tr>
<td>kube-system/kube-svc-redirect</td>
<td style="text-align: right">-32Mi</td>
<td style="text-align: right">-32Mi</td>
</tr>
<tr>
<td>Available</td>
<td style="text-align: right">1.782Mi</td>
<td style="text-align: right">4.635Mi</td>
</tr>
<tr>
<td>Available %</td>
<td style="text-align: right">50%</td>
<td style="text-align: right">65%</td>
</tr>
</tbody>
</table>
<p>So for <code class="language-plaintext highlighter-rouge">Standard DS1 v2</code> nodes we have about 0.5 CPU and 1.7GiB memory per node for pods. And for <code class="language-plaintext highlighter-rouge">Standard DS2 v2</code> nodes it’s about 1.5 CPU and 4.6GiB memory.</p>
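<p>A short sketch of the CPU side of this calculation, with the DaemonSet requests from the table above and the percentage measured against the full VM capacity for both node sizes. The function name is illustrative only.</p>

```python
# CPU requests (millicores) of the DaemonSet pods that AKS runs on every node.
DAEMONSET_CPU = {"calico-node": 250, "kube-proxy": 100, "kube-svc-redirect": 5}


def cpu_left_for_pods(capacity_m, allocatable_m):
    """Return (millicores left for our pods, share of VM capacity in percent)."""
    left = allocatable_m - sum(DAEMONSET_CPU.values())
    return left, 100 * left / capacity_m


for name, capacity, allocatable in [("DS1 v2", 1000, 940), ("DS2 v2", 2000, 1931)]:
    left, pct = cpu_left_for_pods(capacity, allocatable)
    print(f"{name}: {left}m left ({pct:.1f}% of the VM)")
```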
<h2 id="kube-system-pods">kube-system pods</h2>
<p>Now let’s add some standard Kubernetes pods we need to run. As far as I know these are pretty much fixed for a cluster and not related to node size or count.</p>
<table>
<thead>
<tr>
<th>Deployment</th>
<th style="text-align: right">CPU</th>
<th style="text-align: right">Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>kube-system/kubernetes-dashboard</td>
<td style="text-align: right">100m</td>
<td style="text-align: right">50Mi</td>
</tr>
<tr>
<td>kube-system/tunnelfront</td>
<td style="text-align: right">10m</td>
<td style="text-align: right">64Mi</td>
</tr>
<tr>
<td>kube-system/coredns (x2)</td>
<td style="text-align: right">200m</td>
<td style="text-align: right">140Mi</td>
</tr>
<tr>
<td>kube-system/coredns-autoscaler</td>
<td style="text-align: right">20m</td>
<td style="text-align: right">10Mi</td>
</tr>
<tr>
<td>kube-system/heapster</td>
<td style="text-align: right">130m</td>
<td style="text-align: right">230Mi</td>
</tr>
<tr>
<td>Sum</td>
<td style="text-align: right">460m</td>
<td style="text-align: right">494Mi</td>
</tr>
</tbody>
</table>
<h2 id="third-party-pods">Third party pods</h2>
<table>
<thead>
<tr>
<th>Deployment</th>
<th style="text-align: right">CPU</th>
<th style="text-align: right">Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>grafana</td>
<td style="text-align: right">200m</td>
<td style="text-align: right">500Mi</td>
</tr>
<tr>
<td>prometheus-operator</td>
<td style="text-align: right">500m</td>
<td style="text-align: right">1.000Mi</td>
</tr>
<tr>
<td>prometheus-alertmanager</td>
<td style="text-align: right">100m</td>
<td style="text-align: right">225Mi</td>
</tr>
<tr>
<td>flux</td>
<td style="text-align: right">50m</td>
<td style="text-align: right">64Mi</td>
</tr>
<tr>
<td>flux-helm-operator</td>
<td style="text-align: right">50m</td>
<td style="text-align: right">64Mi</td>
</tr>
<tr>
<td>Sum</td>
<td style="text-align: right">900m</td>
<td style="text-align: right">1.853Mi</td>
</tr>
</tbody>
</table>
<h2 id="radix-platform-pods">Radix platform pods</h2>
<table>
<thead>
<tr>
<th>Deployment</th>
<th style="text-align: right">CPU</th>
<th style="text-align: right">Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>radix-api-prod/server (x2)</td>
<td style="text-align: right">200m</td>
<td style="text-align: right">400Mi</td>
</tr>
<tr>
<td>radix-api-qa/server (x2)</td>
<td style="text-align: right">100m</td>
<td style="text-align: right">200Mi</td>
</tr>
<tr>
<td>radix-canary-golang-dev/www</td>
<td style="text-align: right">40m</td>
<td style="text-align: right">500Mi</td>
</tr>
<tr>
<td>radix-canary-golang-prod/www</td>
<td style="text-align: right">40m</td>
<td style="text-align: right">500Mi</td>
</tr>
<tr>
<td>radix-platform-prod/public-site</td>
<td style="text-align: right">5m</td>
<td style="text-align: right">10Mi</td>
</tr>
<tr>
<td>radix-web-console-prod/web</td>
<td style="text-align: right">10m</td>
<td style="text-align: right">42Mi</td>
</tr>
<tr>
<td>radix-web-console-qa/web</td>
<td style="text-align: right">5m</td>
<td style="text-align: right">21Mi</td>
</tr>
<tr>
<td>radix-github-webhook-prod/webhook</td>
<td style="text-align: right">10m</td>
<td style="text-align: right">30Mi</td>
</tr>
<tr>
<td>radix-github-webhook-qa/webhook</td>
<td style="text-align: right">5m</td>
<td style="text-align: right">15Mi</td>
</tr>
<tr>
<td>Sum</td>
<td style="text-align: right">415m</td>
<td style="text-align: right">1.718Mi</td>
</tr>
</tbody>
</table>
<p>If we add up the resource requests of these groups of workloads and subtract them from the total available resources on our 4-node Standard DS1 v2 clusters, we are left with 0.56 CPU cores (14%) and 3GB of memory (22%):</p>
<table>
<thead>
<tr>
<th>Workload</th>
<th style="text-align: right">CPU</th>
<th style="text-align: right">Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>kube-system</td>
<td style="text-align: right">460m</td>
<td style="text-align: right">494Mi</td>
</tr>
<tr>
<td>third-party</td>
<td style="text-align: right">900m</td>
<td style="text-align: right">1.853Mi</td>
</tr>
<tr>
<td>radix-platform</td>
<td style="text-align: right">415m</td>
<td style="text-align: right">1.718Mi</td>
</tr>
<tr>
<td>Sum</td>
<td style="text-align: right">1.775m</td>
<td style="text-align: right">4.065Mi</td>
</tr>
<tr>
<td>Available on 4x DS1</td>
<td style="text-align: right">2.340m</td>
<td style="text-align: right">7.128Mi</td>
</tr>
<tr>
<td>Available for workloads</td>
<td style="text-align: right">565m</td>
<td style="text-align: right">3.063Mi</td>
</tr>
</tbody>
</table>
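<p>The table can be checked with a few lines of code. The numbers are the group sums from the tables above and the per-node figures for <code class="language-plaintext highlighter-rouge">Standard DS1 v2</code>; the names are only for this sketch.</p>

```python
# (CPU millicores, memory Mi) requested by each group of platform workloads.
PLATFORM = {
    "kube-system": (460, 494),
    "third-party": (900, 1853),
    "radix-platform": (415, 1718),
}

# Left per DS1 v2 node after node overhead and DaemonSets.
PER_NODE_CPU, PER_NODE_MEMORY = 585, 1782


def headroom(node_count):
    """Return (CPU millicores, memory Mi) left for customer workloads."""
    cpu_used = sum(cpu for cpu, _ in PLATFORM.values())
    memory_used = sum(memory for _, memory in PLATFORM.values())
    return node_count * PER_NODE_CPU - cpu_used, node_count * PER_NODE_MEMORY - memory_used


print(headroom(4))
```

<p>Changing the node count shows how quickly the fixed platform cost eats into a small cluster: with three nodes the CPU headroom is already negative.</p>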
<p>Though it is surprising that we lose this many resources before deploying any actual customer applications, there should still be a bit of headroom.</p>
<p>Going further I checked the resource requests on 8 customer pods deployed in 4 environments (namespaces). Even though none of them had a resource configuration in their <code class="language-plaintext highlighter-rouge">radixconfig.yaml</code> files they still had resource requests and limits. Not surprising since we use LimitRange to set default resource requests and limits. The surprise was that half of them had 50Mi of memory and the other half 500Mi, seemingly at random.</p>
<p>It turns out that we updated the LimitRange values a few days ago, but that only applies to new Pods. Depending on whether the Pods were re-created for any reason, they may or may not still have the old request of 500Mi, which on our small clusters quickly drains the available resources.</p>
<blockquote>
<p>Read more about LimitRange here: <a href="https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/">kubernetes.io</a> , and here is the commit that eventually trickled down to reduce memory usage: <a href="https://github.com/equinor/radix-operator/commit/f022fcde993efdf6cbcafb2c6632707a823a2a27">github.com</a></p>
</blockquote>
<h2 id="pod-scheduling">Pod scheduling</h2>
<p>Depending on the weight between CPU and memory requests and how often things get destroyed and re-created you may find yourself in a situation where you have enough resources in your cluster but new workloads are still Pending. This can happen when one resource type (e.g. CPU) is filled before another (e.g. memory), leading one or more resources to be stranded and unlikely to be utilized.</p>
<p>Imagine for example a cluster that is already utilized like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>CPU</th>
<th>Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>node0</td>
<td>94%</td>
<td>86%</td>
</tr>
<tr>
<td>node1</td>
<td>80%</td>
<td>89%</td>
</tr>
<tr>
<td>node2</td>
<td>98%</td>
<td>60%</td>
</tr>
</tbody>
</table>
<p>Scheduling a workload that requests 15% CPU and 20% memory cannot be scheduled since there are no nodes fulfilling both requirements. In theory there is probably a CPU intensive Pod on node2 that could be moved to node1 but Kubernetes does not do re-scheduling to optimize utilization. It can do re-scheduling based on Pod priority (<a href="https://medium.com/@dominik.tornow/the-kubernetes-scheduler-cd429abac02f">medium.com</a>) and there is an incubator project (<a href="https://akomljen.com/meet-a-kubernetes-descheduler/">akomljen.com</a>) that can try to drain nodes with low utilization.</p>
<p>So for the foreseeable future, keep in mind that resources can get stranded and that comparing the sum of cluster resources with the sum of resource demand can be misleading.</p>
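<p>To make the stranded-resource effect concrete, here is a minimal shell sketch using the utilization numbers from the table above. The function names <code class="language-plaintext highlighter-rouge">fits</code> and <code class="language-plaintext highlighter-rouge">schedule</code> are made up for this illustration, and the check is far simpler than what the Kubernetes scheduler actually does: it only tests whether both requests fit on a single node.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Node utilization from the table above: name, CPU %, memory %
nodes="node0 94 86
node1 80 89
node2 98 60"

# fits CPU_USED MEM_USED CPU_REQ MEM_REQ
# Succeeds only if both requests fit in the free capacity of one node.
fits() {
  [ "$3" -le $((100 - $1)) ] || return 1
  [ "$4" -le $((100 - $2)) ] || return 1
}

# schedule CPU_REQ MEM_REQ
# Prints every node that can host the Pod, or "Pending" if none can.
schedule() {
  candidates=$(echo "$nodes" | while read -r name cpu mem; do
    if fits "$cpu" "$mem" "$1" "$2"; then echo "$name"; fi
  done)
  if [ -z "$candidates" ]; then echo "Pending"; else echo "$candidates"; fi
}

schedule 15 20
</code></pre></div></div>
<p>The call prints <code class="language-plaintext highlighter-rouge">Pending</code>: in total 28% CPU and 65% memory are free across the three nodes, but never enough of both on the same node.</p>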
<h2 id="calico-node">calico-node</h2>
<p>The biggest source of waste on our small clusters is <code class="language-plaintext highlighter-rouge">calico-node</code> which is installed on every node and requests 25% of a CPU core while only using 2.5-3% CPU:</p>
<p><img src="/images/2019-06-04-downscaling-aks/calico-node-cpu.png" alt="calico-node cpu usage" title="calico-node cpu usage" /></p>
<p>The request is originally set here <a href="https://github.com/Azure/aks-engine/blob/master/parts/k8s/containeraddons/kubernetesmasteraddons-calico-daemonset.yaml">github.com</a> but I have not looked into why that number was chosen. Next steps would be to do some benchmarking of <code class="language-plaintext highlighter-rouge">calico-node</code> to smoke out its performance characteristics and see if it would be safe to lower the resource requests, but that is out of scope for now.</p>
<h1 id="conclusion">Conclusion</h1>
<ul>
<li>By increasing node size from <code class="language-plaintext highlighter-rouge">Standard DS1 v2</code> to <code class="language-plaintext highlighter-rouge">Standard DS2 v2</code> we also increase the available CPU from 58% per node to 81% per node. Available memory increases from 50% to 61% per node.</li>
<li>With a total platform requirement of 3-4GB of memory and 4.6GB available on <code class="language-plaintext highlighter-rouge">Standard DS2 v2</code> we might have more resources for actual workloads on a 1-node <code class="language-plaintext highlighter-rouge">Standard DS2 v2</code> cluster than a 3-node <code class="language-plaintext highlighter-rouge">Standard DS1 v2</code> cluster!</li>
<li>Beware of stranded resources limiting the utilization you can achieve across a cluster.</li>
</ul>
<p>Stian Øvrevåge</p>
<p>We discovered today that some implicit assumptions we had about AKS at smaller scales were incorrect. Suddenly new workloads and jobs in our Radix CI/CD could not start due to insufficient resources (CPU and memory). Even though it only caused problems in development environments with smaller node sizes, it still surprised some of our developers, since we expected the development clusters to have enough resources. I thought it would be a good chance to go a bit deeper, verify some of our assumptions and learn more about various components that usually “just work” and aren’t really given much thought.</p>
<h1>Disk performance on Azure Kubernetes Service (AKS) - Part 1: Benchmarking</h1>
<p>2019-02-23, https://stian.tech/disk-performance-on-aks-part-1</p>
<p>Understanding the disk performance characteristics of a platform might be more important than you think. If disk resources are not correctly matched to your workload, performance will suffer, and you might incorrectly diagnose the problem as being related to CPU or memory.</p>
<p>The defaults might also not give you the performance you expect.</p>
<p>In this first post on troubleshooting some disk performance issues on Azure Kubernetes Service (AKS) we will benchmark Azure Premium SSD to find out how workloads affect performance and which metrics to monitor when troubleshooting potential disk issues.</p>
<!--more-->
<p>TLDR:</p>
<ul>
<li>Disable Azure cache for workloads with a high number of random writes.</li>
<li>Use a P15 (256GB) or larger Premium SSD even though you might only need a fraction of it.</li>
</ul>
<p>Table of contents</p>
<ul>
<li><a href="#Background">Background</a>
<ul>
<li><a href="#MetricsMethodologies">Metric Methodologies</a></li>
<li><a href="#StorageBackground">Storage Background</a></li>
</ul>
</li>
<li><a href="#WhatToMeasure">What to measure?</a></li>
<li><a href="#HowToMeasureDisk">How to measure disk</a></li>
<li><a href="#HowToMeasureDiskOnAKS">How to measure disk on Azure Kubernetes Service</a></li>
<li><a href="#Tests">Test results</a>
<ul>
<li><a href="#Test1">Test 1 - Learning to dislike Azure Cache</a></li>
<li><a href="#Test2">Test 2 - Disable Azure Cache - enable OS cache</a></li>
<li><a href="#Test3">Test 3 - Disable OS cache</a></li>
<li><a href="#Test4">Test 4 - Increase IO depth</a></li>
<li><a href="#Test5">Test 5 - Larger block size, smaller IO depth</a></li>
<li><a href="#Test6">Test 6 - Enable OS cache</a></li>
<li><a href="#Test7">Test 7 - Random writes, small block size</a></li>
<li><a href="#Test8">Test 8 - Large block size</a></li>
</ul>
</li>
<li><a href="#Conclusion">Conclusion</a></li>
</ul>
<h4 id="microsoft-azure">Microsoft Azure</h4>
<blockquote>
<p><a href="https://azure.microsoft.com/en-us/free/">If you don’t have an Azure subscription already you can try services with $200 in credit for 30 days.</a> The VM size <strong>Standard_B2s</strong> is Burstable, has 2 vCPU, 4GB RAM, 8GB temp storage and costs roughly $38 / month. For $200 you can have a cluster of 3-4 B2s nodes plus traffic, load balancers and other additional costs.</p>
</blockquote>
<blockquote>
<p>See my blog post <a href="2017-12-23-managed-kubernetes-on-azure.md">Managed Kubernetes on Microsoft Azure (English)</a> for information on how to get up and running with Kubernetes on Azure.</p>
</blockquote>
<blockquote>
<p><em>I have no affiliation with Microsoft Azure except using them through work.</em></p>
</blockquote>
<h2 id="corrections">Corrections</h2>
<p><strong>February 2020</strong>: Some of my previous knowledge and assumptions were not correct when applied to a cloud + Docker environment, as <a href="https://github.com/jnoller/kubernaughty/issues/46">explained by
AKS PM Jesse Noller on GitHub</a>.</p>
<p>One of the issues is that even accessing a “data disk” will incur IOPS on the OS disk, and throttling of the OS disk will also constrain IOPS on the data disks.</p>
<p><a id="Background"></a></p>
<h2 id="background">Background</h2>
<p>I’m part of a team at Equinor building an internal PaaS based on Kubernetes running on AKS (Azure managed Kubernetes). We use Prometheus for monitoring each cluster as well as InfluxDB for collecting metrics from k6io, which runs continuous tests on our public endpoints.</p>
<p>A couple of weeks ago we discovered some potential problems with both Prometheus and InfluxDB with memory usage and restarts. High CPU usage of type <code class="language-plaintext highlighter-rouge">iowait</code> suggested that there might be some disk issues contributing to the problems.</p>
<blockquote>
<p>iowait: “Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.” (<a href="https://support.hpe.com/hpsc/doc/public/display?docId=c02783994">hpe.com</a>). You can see <code class="language-plaintext highlighter-rouge">iowait</code> on your Linux system by running <code class="language-plaintext highlighter-rouge">top</code> and looking at the <code class="language-plaintext highlighter-rouge">wa</code> percentage.</p>
<p>PS: You can have a disk IO bottleneck even with low <code class="language-plaintext highlighter-rouge">iowait</code>, and a high <code class="language-plaintext highlighter-rouge">iowait</code> does not always indicate a disk IO bottleneck (<a href="https://www.ibm.com/developerworks/community/blogs/AIXDownUnder/entry/iowait_a_misleading_indicator_of_i_o_performance54?lang=en">ibm.com</a>).</p>
</blockquote>
<p>First off we need to benchmark the underlying disk to get an understanding of its performance limits and characteristics. That is what we will cover in this post.</p>
<p><a id="MetricsMethodologies"></a></p>
<h3 id="metric-methodologies">Metric Methodologies</h3>
<p>There are two helpful methodologies when monitoring information systems. The first one is Utilization, Saturation and Errors (USE) from <a href="http://www.brendangregg.com/usemethod.html">Brendan Gregg</a> and the second one is Rate, Errors, Duration (RED) from <a href="https://www.slideshare.net/weaveworks/monitoring-microservices">Tom Wilkie</a>. RED is best suited when observing workloads and transactions while USE is best suited for observing resources.</p>
<p>I’ll be using the USE method here. USE can be summarised as:</p>
<ul>
<li><strong>For every resource, check utilization, saturation, and errors.</strong>
<ul>
<li><strong>resource</strong>: all physical server functional components (CPUs, disks, busses, …)</li>
<li><strong>utilization</strong>: the average time that the resource was busy servicing work</li>
<li><strong>saturation</strong>: the degree to which the resource has extra work which it can’t service, often queued</li>
<li><strong>errors</strong>: the count of error events</li>
</ul>
</li>
</ul>
<p><a id="StorageBackground"></a></p>
<h3 id="storage-background">Storage Background</h3>
<p>Disk usage has two dimensions, throughput/bandwidth (BW) and operations per second (IOPS). The underlying storage system has upper limits on how much data it can receive per second (BW) and on the number of operations it can perform per second (IOPS).</p>
<blockquote>
<p><strong>Background - hard drive types</strong>: Hard drives come in two types, Solid State Drives (SSD) and spinning hard disk drives (HDD). An SSD is a microchip capable of permanently storing data, while an HDD uses spinning platters to store data. HDDs have a fixed rate of rotation (RPM), typically 5,400 or 7,200 RPM for lower cost drives for home use and 10,000 or 15,000 RPM for higher cost drives for server use. Over the last 20 years the storage density of HDDs has increased, but the RPM has largely stayed the same. A disk with twice the density (500GB to 1TB for example) can read twice as much data in a single rotation and thus increase the bandwidth significantly. However, reading or writing a random block still requires waiting for the disk to spin far enough to reach the relevant sector. So IOPS has not increased much for HDDs and is still a low 125-150 IOPS for a 10,000 RPM enterprise disk. An SSD has no moving parts, so it is able to reach MUCH higher IOPS. A low end Samsung 960 EVO with 500GB capacity costs $150 and can achieve a whopping 330,000 IOPS! (<a href="https://en.wikipedia.org/wiki/IOPS">wikipedia.org</a>)</p>
</blockquote>
<blockquote>
<p><strong>Background - access patterns</strong>: The way a program uses storage also has a huge impact on the performance one can achieve. Sequential access is when we read or write a large file. When this happens the operating system and hard drive can optimize and “merge” operations so that we can read or write a much bigger chunk of data at a time. If we can read 1MB at a time 150 times per second we get 150MB/s of bandwidth. However, with fully random access, where the smallest chunk we read or write is a 4KB block, the same 150 IOPS would only give a bandwidth of 0.6MB/s!</p>
</blockquote>
<blockquote>
<p><strong>Background - cloud vs physical</strong>: Now we know that HDDs are limited to low IOPS, and that low IOPS combined with a random access pattern gives low overall bandwidth. There is a huge gotcha here when it comes to cloud. On Azure, when using Premium Managed SSD, the IOPS you are given is a function of the disk size you provision (<a href="https://azure.microsoft.com/en-us/pricing/details/managed-disks/">microsoft.com</a>). A 512GB disk is limited to 2,300 IOPS and 150MB/s. With 100% random access of 4KB blocks that only gives about 9MB/s of bandwidth!</p>
</blockquote>
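<p>The arithmetic behind these numbers is simply IOPS multiplied by block size. A small sketch, assuming 1MB = 1024KB and using a helper name (<code class="language-plaintext highlighter-rouge">bandwidth_mb</code>) invented for this example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># bandwidth_mb IOPS BLOCK_KB: resulting bandwidth in MB/s
bandwidth_mb() {
  awk -v iops="$1" -v kb="$2" 'BEGIN { printf "%.1f\n", iops * kb / 1024 }'
}

bandwidth_mb 150 1024   # HDD reading 1MB at a time
bandwidth_mb 150 4      # same HDD, random 4KB blocks
bandwidth_mb 2300 4     # 512GB Premium SSD, random 4KB blocks
</code></pre></div></div>
<p>The three calls print 150.0, 0.6 and 9.0, which shows how much the block size, and not only the IOPS limit, decides the bandwidth you actually get.</p>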
<blockquote>
<p><strong>Background - OS caching</strong>: To overcome some of the limitations of the underlying disk (mostly IOPS) there are potentially several layers of caching involved. Linux file systems can have <code class="language-plaintext highlighter-rouge">writeback</code> enabled, which causes Linux to temporarily store data that is going to be written to disk in memory. This can give a big performance increase when there are sudden spikes of writes exceeding the performance of the underlying disk. It also increases the chance that operations can be <code class="language-plaintext highlighter-rouge">merged</code>, where several write operations to nearby areas of the disk are executed as one. This caching works best for sudden peaks and will not necessarily be enough if there are continuous random writes to disk. This caching also means that even though an application thinks it has saved some data to disk, it can be lost in the case of a power outage or other failure. Applications can also explicitly request <code class="language-plaintext highlighter-rouge">direct</code> access, where every operation is persisted to disk before the application receives a confirmation. This is a trade-off between performance and durability that needs to be decided based on the application itself and the environment.</p>
</blockquote>
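<p>On a Linux node you can watch this write cache at work. A minimal sketch, assuming a standard <code class="language-plaintext highlighter-rouge">/proc</code> filesystem; the helper name <code class="language-plaintext highlighter-rouge">dirty_kb</code> is made up for this example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Thresholds, in percent of available memory, at which the kernel starts
# background writeback and at which writing processes are forced to wait
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio

# dirty_kb: amount of data (in kB) held in memory waiting to be written to disk
dirty_kb() {
  awk '/^Dirty:/ { print $2 }' /proc/meminfo
}

dirty_kb
</code></pre></div></div>
<p>Running <code class="language-plaintext highlighter-rouge">dirty_kb</code> repeatedly during a buffered write should show the value climb while the application writes faster than the disk can keep up.</p>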
<blockquote>
<p><strong>Background - Azure caching</strong>: Azure also provides read and write cache for its <code class="language-plaintext highlighter-rouge">disks</code> which is enabled by default. As we will see soon for our use case it’s not a good idea to use.</p>
</blockquote>
<p><a id="WhatToMeasure"></a></p>
<h2 id="what-to-measure">What to measure?</h2>
<blockquote>
<p>These metrics are collected by the Prometheus <code class="language-plaintext highlighter-rouge">node-exporter</code> and follow its naming. I’ve also created a dashboard that is available on <a href="https://grafana.com/dashboards/9852">Grafana.com</a>.</p>
</blockquote>
<p>With the USE methodology as a guideline and the two separate but related “resources”, bandwidth and IOPS we can look for some useful metrics.</p>
<p>Utilization:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">rate(node_disk_written_bytes_total)</code> - Write bandwidth. The maximum is given by Azure and is 25MB/s for our disk size.</li>
<li><code class="language-plaintext highlighter-rouge">rate(node_disk_writes_completed_total)</code> - Write operations. The maximum is given by Azure and is 120 IOPS for our disk size.</li>
<li><code class="language-plaintext highlighter-rouge">rate(node_disk_io_time_seconds_total)</code> - Disk active time in percent. The time the disk was busy servicing requests. 100% means fully utilized.</li>
</ul>
<p>Saturation:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">rate(node_cpu_seconds_total{mode="iowait"})</code> - CPU iowait. The percentage of time a CPU core is blocked from doing useful work because it’s waiting for an IO operation to complete (typically disk, but can also be network).</li>
</ul>
<p>Useful calculated metrics:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">rate(node_disk_write_time_seconds_total) / rate(node_disk_writes_completed_total)</code> - Write latency. The time from when a write is requested until it is completed.</li>
<li><code class="language-plaintext highlighter-rouge">rate(node_disk_written_bytes_total) / rate(node_disk_writes_completed_total)</code> - Write size. How big the <strong>average</strong> write operation is. 4KB is minimum and indicates 100% random access while 512KB is maximum and indicates sequential access.</li>
</ul>
<p><a id="HowToMeasureDisk"></a></p>
<h2 id="how-to-measure-disk">How to measure disk</h2>
<p>The best tool for measuring disk performance is <code class="language-plaintext highlighter-rouge">fio</code>, even though it might seem a bit intimidating at first due to its insane number of options.</p>
<p>Installing <code class="language-plaintext highlighter-rouge">fio</code> on Ubuntu:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get install fio
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">fio</code> executes <code class="language-plaintext highlighter-rouge">jobs</code> described in a file. Here is the top of our jobs file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[global]
ioengine=libaio # sync|libaio|mmap
group_reporting
thread
size=10g # Size of test file
cpus_allowed=1 # Only use this CPU core
runtime=300s # Run test for 5 minutes
[test1]
filename=/tmp/fio-test-file
direct=1 # If value is true, use non-buffered I/O. Non-buffered I/O usually means O_DIRECT
readwrite=write # read|write|randread|randwrite|readwrite|randrw
iodepth=1 # How many operations to queue to the disk
blocksize=4k
</code></pre></div></div>
<p>The fields we will be changing for the various tests are <code class="language-plaintext highlighter-rouge">direct</code>, <code class="language-plaintext highlighter-rouge">readwrite</code>, <code class="language-plaintext highlighter-rouge">iodepth</code> and <code class="language-plaintext highlighter-rouge">blocksize</code>. Save the contents in a file named <code class="language-plaintext highlighter-rouge">jobs.fio</code>, run a test with <code class="language-plaintext highlighter-rouge">fio --section=test1 jobs.fio</code> and wait until it completes.</p>
<blockquote>
<p>PS: To run these tests on higher performance hardware with better caching you might want to set <code class="language-plaintext highlighter-rouge">runtime</code> to <code class="language-plaintext highlighter-rouge">0</code> to have the test run continuously, and monitor the metrics until performance reaches a steady state.</p>
</blockquote>
<p><a id="HowToMeasureDiskOnAKS"></a></p>
<h2 id="how-to-measure-disk-on-azure-kubernetes-service">How to measure disk on Azure Kubernetes Service</h2>
<p>For this testing we use a standard Prometheus installation collecting data from <code class="language-plaintext highlighter-rouge">node-exporter</code> and visualizing data in Grafana. The dashboard I created for the testing can be found here: <a href="https://grafana.com/dashboards/9852">https://grafana.com/dashboards/9852</a>.</p>
<p>By default Kubernetes will schedule a Pod to any node that has enough memory and CPU for our workload. Since one of the tests we are going to run is on the OS disk, we do not want the Pod to run on the same node as any other disk-intensive application, such as Prometheus.</p>
<p>Look at which Pods are running with <code class="language-plaintext highlighter-rouge">kubectl get pods -o wide</code> and look for a node that does not have any disk-intensive application.</p>
<p>Then we tag that node with <code class="language-plaintext highlighter-rouge">kubectl label nodes aks-nodepool1-37707184-2 tag=disktest</code>. This allows us later to specify that we want to run our testing Pod on that specific node.</p>
<hr />
<p>A StorageClass in Kubernetes is a specification of an underlying disk that Pods can request usage of through <code class="language-plaintext highlighter-rouge">volumeClaimTemplates</code>. AKS comes with a default StorageClass <code class="language-plaintext highlighter-rouge">managed-premium</code> that has caching enabled. Most of these tests require the Azure cache to be disabled, so create a new StorageClass <code class="language-plaintext highlighter-rouge">managed-premium-retain-nocache</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: managed-premium-retain-nocache
provisioner: kubernetes.io/azure-disk
reclaimPolicy: Retain
parameters:
  storageaccounttype: Premium_LRS
  kind: Managed
  cachingmode: None
</code></pre></div></div>
<p>You can add it to your cluster with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl apply -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2019-02-23-disk-performance-on-aks-part-1/storageclass.yaml
</code></pre></div></div>
<hr />
<p>Next we create a StatefulSet that uses a <code class="language-plaintext highlighter-rouge">volumeClaimTemplate</code> to request a 250GB Azure disk. This provisions a P15 Azure Premium SSD with 125MB/s bandwidth and 1100 IOPS:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl apply -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2019-02-23-disk-performance-on-aks-part-1/ubuntu-statefulset.yaml
</code></pre></div></div>
<p>Follow the progress of the Pod creation with <code class="language-plaintext highlighter-rouge">kubectl get pods -w</code> and wait until it is <code class="language-plaintext highlighter-rouge">Running</code>.</p>
<hr />
<p>When the Pod is <code class="language-plaintext highlighter-rouge">Running</code> we can start a shell on it with <code class="language-plaintext highlighter-rouge">kubectl exec -it disk-test-0 -- bash</code>.</p>
<p>Once inside <code class="language-plaintext highlighter-rouge">bash</code> on the Pod, we install <code class="language-plaintext highlighter-rouge">fio</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get update && apt-get install -y fio wget
</code></pre></div></div>
<p>And download <code class="language-plaintext highlighter-rouge">jobs.fio</code> into the Pod:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2019-02-23-disk-performance-on-aks-part-1/jobs.fio
</code></pre></div></div>
<p>Now we can run the different test sections one by one. <strong>PS: If you don’t specify a section <code class="language-plaintext highlighter-rouge">fio</code> will run all the tests <em>simultaneously</em>, which is not what we want.</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fio --section=test1 jobs.fio
fio --section=test2 jobs.fio
fio --section=test3 jobs.fio
fio --section=test4 jobs.fio
fio --section=test5 jobs.fio
fio --section=test6 jobs.fio
fio --section=test7 jobs.fio
fio --section=test8 jobs.fio
fio --section=test9 jobs.fio
</code></pre></div></div>
<p><a id="Tests"></a></p>
<h2 id="test-results">Test results</h2>
<p><a id="Test1"></a></p>
<h3 id="test-1---learning-to-dislike-azure-cache">Test 1 - Learning to dislike Azure Cache</h3>
<p><em>Sequential write, 4K block size, Azure Cache enabled, OS cache disabled. See <a href="/images/2019-02-23-disk-performance-on-aks-part-1/test1.md">full fio test results</a>.</em></p>
<p>I run the first tests on the OS disk of a Kubernetes node. The OS disks have Azure caching enabled.</p>
<p><img src="/images/2019-02-23-disk-performance-on-aks-part-1/test1.png" alt="graph" title="graph" /></p>
<p>For the first 1-2 minutes of the test I get very good performance of 45MB/s and ~11,500 IOPS, but that drops to 0 very quickly once the cache is full and busy writing things to the underlying disk. When that happens everything freezes and I cannot even execute shell commands. After stopping the test the system still hangs for a bit while the cache empties.</p>
<p>The maximum latency measured by <code class="language-plaintext highlighter-rouge">fio</code> was 108751k usec, or about 109 seconds!</p>
<blockquote>
<p>For the first try of these tests a 20-30 second period of very fast writes (250MB/s) caused a 7-8 minute hang while the cache emptied. Trying again produced a different pattern, with lower peak performance and shorter hangs in between. Very unpredictable.
I’m not sure what to make of this. It’s not acceptable that a Kubernetes node becomes unresponsive for many minutes following a short burst of writing. There are scattered recommendations online to disable caching for write-heavy applications. I have not found any way to measure the Azure cache itself, the results are unpredictable and potentially very impactful, and the cache makes it very hard to use the metrics we do have to evaluate application and storage behaviour. I’ve therefore concluded that it’s best to use data disks with caching disabled for our workloads (you cannot disable caching on an AKS node OS disk).</p>
</blockquote>
<p><a id="Test2"></a></p>
<h3 id="test-2---disable-azure-cache---enable-os-cache">Test 2 - Disable Azure Cache - enable OS cache</h3>
<p><em>Sequential write, 4K block size. <strong>Change: Azure cache disabled, OS caching enabled.</strong> See <a href="/images/2019-02-23-disk-performance-on-aks-part-1/test1.md">full fio test results</a>.</em></p>
<p><img src="/images/2019-02-23-disk-performance-on-aks-part-1/test2.png" alt="graph" title="graph" /></p>
<p>If we swap the Azure cache for the Linux OS cache we see that <code class="language-plaintext highlighter-rouge">iowait</code> increases while the writing occurs. The application sees high write performance until the number of <code class="language-plaintext highlighter-rouge">Dirty bytes</code> reaches a threshold of about 3.7GB of memory. The performance of the underlying disk is 125MB/s and 250 IOPS. Here we are throttled by the 125MB/s limit of the Azure P15 Premium SSD.</p>
<p>Also notice that on sequential writes of 4K with OS caching the actual blocks written to disk are 512K, which saves us a lot of IOPS. This will become important later.</p>
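<p>The IOPS saving from merging can be checked with a quick calculation. This sketch assumes 1MB = 1024KB; <code class="language-plaintext highlighter-rouge">iops_needed</code> is a helper name invented for the example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># iops_needed BANDWIDTH_MB BLOCK_KB: IOPS required to sustain the given bandwidth
iops_needed() {
  awk -v mb="$1" -v kb="$2" 'BEGIN { printf "%.0f\n", mb * 1024 / kb }'
}

iops_needed 125 512   # merged 512K writes
iops_needed 125 4     # unmerged 4K writes
</code></pre></div></div>
<p>The first call prints 250, matching what the disk reports. The second prints 32000, far above the 1100 IOPS limit of the P15 disk, so without merging the disk would be IOPS-bound long before reaching 125MB/s.</p>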
<p><a id="Test3"></a></p>
<h3 id="test-3---disable-os-cache">Test 3 - Disable OS cache</h3>
<p><em>Sequential write, 4K block size. <strong>Change: OS caching disabled.</strong> See <a href="/images/2019-02-23-disk-performance-on-aks-part-1/test1.md">full fio test results</a>.</em></p>
<p><img src="/images/2019-02-23-disk-performance-on-aks-part-1/test3.png" alt="graph" title="graph" /></p>
<blockquote>
<p>By disabling the OS cache (<code class="language-plaintext highlighter-rouge">direct=1</code>) the results are consistent and predictable. There is no <code class="language-plaintext highlighter-rouge">iowait</code> since the application does not have multiple writes pending at the same time. Because of the 2-3ms latency of the disks we are not able to get more than about 400 IOPS. This gives us a meager 1.5MB/s even though the disk is limited to 1100 IOPS and 125MB/s. To reach that we need multiple simultaneous writes or a bigger IO depth (queue). <code class="language-plaintext highlighter-rouge">Disk active time</code> is also 0% which indicates that the disk is not saturated.</p>
</blockquote>
<p><a id="Test4"></a></p>
<h3 id="test-4---increase-io-depth">Test 4 - Increase IO depth</h3>
<p><em>Sequential write, 4K block size, OS caching disabled. <strong>Change: IO depth 16.</strong> See <a href="/images/2019-02-23-disk-performance-on-aks-part-1/test4.md">full fio test results</a>.</em></p>
<p><img src="/images/2019-02-23-disk-performance-on-aks-part-1/test4.png" alt="graph" title="graph" /></p>
<blockquote>
<p>For this test we only increase the IO depth from 1 to 16. IO depth is the number of write operations <code class="language-plaintext highlighter-rouge">fio</code> will execute simultaneously. Since we are using <code class="language-plaintext highlighter-rouge">direct</code> these will be queued by the OS for writing. We are now able to hit the performance limit of 1100 IOPS. <code class="language-plaintext highlighter-rouge">Disk active time</code> is now steady at 100% indicating that we have saturated the disk.</p>
</blockquote>
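<p>The difference between Test 3 and Test 4 follows from a simple relationship: with a fixed number of operations in flight, the achievable IOPS is roughly the IO depth divided by the latency per operation. A rough sketch, assuming a constant 2.5ms latency (the middle of the 2-3ms seen in Test 3, which in reality will not stay constant as the queue grows) and a made-up helper name <code class="language-plaintext highlighter-rouge">max_iops</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># max_iops IODEPTH LATENCY_MS: upper bound on IOPS with IODEPTH operations in flight
max_iops() {
  awk -v depth="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", depth * 1000 / ms }'
}

max_iops 1 2.5    # Test 3: one write at a time
max_iops 16 2.5   # Test 4: sixteen writes in flight
</code></pre></div></div>
<p>This prints 400 and 6400. The first matches what we measured in Test 3. The second is well above the 1100 IOPS the P15 disk allows, which is why Test 4 is capped by the disk limit rather than by latency.</p>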
<p><a id="Test5"></a></p>
<h3 id="test-5---larger-block-size-smaller-io-depth">Test 5 - Larger block size, smaller IO depth</h3>
<p><em>Sequential write, OS caching disabled. <strong>Change: 128K block size, IO depth 1.</strong> See <a href="/images/2019-02-23-disk-performance-on-aks-part-1/test5.md">full fio test results</a>.</em></p>
<p><img src="/images/2019-02-23-disk-performance-on-aks-part-1/test5.png" alt="graph" title="graph" /></p>
<blockquote>
<p>We increase the block size to 128KB and reduce the IO depth to 1 again. The write latency for larger blocks increases to ~5ms, which gives us 200 IOPS and 28MB/s. The disk is not saturated.</p>
</blockquote>
<p><a id="Test6"></a></p>
<h3 id="test-6---enable-os-cache">Test 6 - Enable OS cache</h3>
<p><em>Sequential write, 256K block size, IO depth 1. <strong>Change: OS caching enabled.</strong> See <a href="/images/2019-02-23-disk-performance-on-aks-part-1/test6.md">full fio test results</a>.</em></p>
<p><img src="/images/2019-02-23-disk-performance-on-aks-part-1/test6.png" alt="graph" title="graph" /></p>
<blockquote>
<p>We have now enabled the OS cache/buffer (<code class="language-plaintext highlighter-rouge">direct=0</code>). We can see that the writes hitting the disk are now merged to 512KB blocks. We are hitting the 125MB/s limit with about 250 IOPS. Enabling the cache also has other effects: CPU suddenly shows significant IO wait and the write latency shoots through the roof. Also note that the writing continued for 30-40 seconds after the test was done. <strong>This also means that the bandwidth and IOPS that <code class="language-plaintext highlighter-rouge">fio</code> sees and reports are higher than what is actually hitting the disk.</strong></p>
</blockquote>
<p><a id="Test7"></a></p>
<h3 id="test-7---random-writes-small-block-size">Test 7 - Random writes, small block size</h3>
<p><em>IO depth 1, OS caching enabled. <strong>Change: Random write, 4K block size.</strong> See <a href="/images/2019-02-23-disk-performance-on-aks-part-1/test1.md">full fio test results</a>.</em></p>
<p><img src="/images/2019-02-23-disk-performance-on-aks-part-1/test7.png" alt="graph" title="graph" /></p>
<blockquote>
<p>Here we go from sequential writes to random writes. We are limited by IOPS. The average size of the blocks actually written to disks, and the IOPS required to hit the bandwidth limit is actually varying a bit throughout the test. The time taken to empty the cache is about as long as I ran the test (4-5 minutes).</p>
</blockquote>
<p><a id="Test8"></a></p>
<h3 id="test-8---large-block-size">Test 8 - Large block size</h3>
<p><em>Random write, OS caching enabled. <strong>Change: 256K block size, IO depth 16.</strong> See <a href="/images/2019-02-23-disk-performance-on-aks-part-1/test8.md">full fio test results</a>.</em></p>
<p><img src="/images/2019-02-23-disk-performance-on-aks-part-1/test8.png" alt="graph" title="graph" /></p>
<blockquote>
<p>Increasing the block size to 256K makes us bandwidth limited to 125MB/s.</p>
</blockquote>
<p><a id="Conclusion"></a></p>
<h2 id="conclusion">Conclusion</h2>
<p>Access patterns and block sizes have a tremendous impact on the amount of data we are able to write to disk.</p>
<p>Stian Øvrevåge</p>
<h1>Managed Kubernetes on Microsoft Azure (English)</h1>
<p>2017-12-29, https://stian.tech/managed-kubernetes-on-azure-eng</p>
<p><em>A few days ago I wrote a walkthrough of <a href="/2017/12/25/managed-kubernetes-on-azure.html">setting up Azure Container Service (AKS) in Norwegian</a>. Someone asked me for an English version of that, and here it is.</em></p>
<p>Kubernetes (K8s) is becoming the de facto standard for deploying container-based applications and workloads. Microsoft currently has its managed Kubernetes offering (Azure Kubernetes Service, AKS) in preview. It makes it easy to create a Kubernetes cluster and deploy workloads without the skill and time required to manage the day-to-day operations of a Kubernetes cluster, which today can be complex and time consuming.</p>
<p>In this post we will set up a Kubernetes cluster from scratch using Azure CLI.</p>
<!--more-->
<p>Table of contents</p>
<ul>
<li><a href="#Background">Background</a>
<ul>
<li><a href="#Dockercontainers">Docker containers</a></li>
<li><a href="#Containerorchestration">Container orchestration</a></li>
</ul>
</li>
<li><a href="#QuickstartAKS">Getting started with Azure Kubernetes - AKS</a>
<ul>
<li><a href="#Caveats">Caveats</a></li>
<li><a href="#Preparations">Preparations</a></li>
<li><a href="#AzureLogin">Azure login</a></li>
<li><a href="#ActivateContainerService">Activate ContainerService</a></li>
<li><a href="#CreateResourceGroup">Create a resource group</a></li>
<li><a href="#CreateK8sCluster">Create a Kubernetes cluster</a></li>
<li><a href="#InstallKubectl">Install kubectl</a></li>
<li><a href="#InspectCluster">Inspect cluster</a></li>
<li><a href="#StartNginx">Start some nginx containers</a></li>
<li><a href="#NginxService">Making nginx available with a service</a></li>
<li><a href="#ScaleCluster">Scale cluster</a></li>
<li><a href="#DeleteCluster">Delete cluster</a></li>
</ul>
</li>
<li><a href="#Bonusmaterial">Bonus material</a>
<ul>
<li><a href="#HelmIntro">Deploying services with Helm</a>
<ul>
<li><a href="#HelmMinecraft">Deploy MineCraft with Helm</a></li>
</ul>
</li>
<li><a href="#KubernetesDashboard">Kubernetes Dashboard</a></li>
</ul>
</li>
<li><a href="#Conclusion">Conclusion</a></li>
</ul>
<h4 id="microsoft-azure">Microsoft Azure</h4>
<blockquote>
<p><a href="https://azure.microsoft.com/en-us/free/">If you don’t have an Azure subscription already you can try services for $200 for 30 days.</a> The VM size <strong>Standard_B2s</strong> is burstable, has 2 vCPUs, 4GB RAM, 8GB temp storage and costs roughly $38 / month. For $200 you can have a cluster of 3-4 B2s nodes plus traffic, load balancers and other additional costs.</p>
</blockquote>
<blockquote>
<p><em>We have no affiliation with Microsoft Azure except their sponsorship of our startup <a href="http://www.datadynamics.no/">DataDynamics</a> with cloud services for 24 months in their <a href="https://bizspark.microsoft.com/">BizSpark program</a>.</em></p>
</blockquote>
<p><a id="Background"></a></p>
<h2 id="background">Background</h2>
<p><a id="Dockercontainers"></a></p>
<h3 id="docker-containers">Docker containers</h3>
<p><em>We will not do a deep dive on Docker containers in this post, but here is a summary for those who are not familiar with it.</em></p>
<p>Docker is a way to package software so that it can run on the most popular platforms without worrying about installation, dependencies and to a certain degree, configuration.</p>
<p>In addition, a Docker container uses the operating system of the host machine when it runs. Because of this it’s possible to run many more containers on the same host machine compared to running virtual machines.</p>
<p>Here is an incomplete and rough comparison between a Docker container and a virtual machine:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Virtual machine</th>
<th>Docker container</th>
</tr>
</thead>
<tbody>
<tr>
<td>Image size</td>
<td>from 200MB to many GB</td>
<td>from 10MB to 300-400MB</td>
</tr>
<tr>
<td>Startup time</td>
<td>60 seconds +</td>
<td>1-10 seconds</td>
</tr>
<tr>
<td>Memory usage</td>
<td>256MB-512MB-1GB +</td>
<td>2MB +</td>
</tr>
<tr>
<td>Security</td>
<td>Good isolation between VMs</td>
<td>Not as good isolation between containers</td>
</tr>
<tr>
<td>Building image</td>
<td>Minutes</td>
<td>Seconds</td>
</tr>
</tbody>
</table>
<blockquote>
<p><strong>PS</strong> The numbers for virtual machines are taken from memory. I tried starting a MySQL virtual appliance on my laptop but VMware Player refuses to run because of Windows Hyper-V incompatibility. VMware Workstation refuses to run because of license issues and Oracle VirtualBox repeatedly gives me a nasty bluescreen. Hooray!</p>
</blockquote>
<blockquote>
<p><strong>Protip</strong> The smallest and fastest Docker images are built on Alpine Linux. For the webserver Nginx the Alpine-based image is 15MB compared to 108MB for the normal Debian-based image. PostgreSQL:Alpine is 38MB compared to 287MB with a “full” OS. The latest version of MySQL is 343MB, but version 8 is expected to support Alpine Linux as well.</p>
</blockquote>
<p>To recap, some of the advantages of Docker containers are:</p>
<ul>
<li>Compatibility across platforms, Linux, Windows, MacOS.</li>
<li>10-100x smaller size. Faster to download, build and upload.</li>
<li>Memory usage only for application and not base OS.
<ul>
<li>Advantage when developing. Ability to run 10-20-30 containers on a development laptop.</li>
<li>Advantage in production. Can reduce hardware/cloud costs considerably.</li>
</ul>
</li>
<li>Near instant startup. Makes dynamic scaling of applications easier.</li>
</ul>
<p><a href="https://store.docker.com/editions/community/docker-ce-desktop-windows">Download Docker for Windows here.</a></p>
<p>To start a MySQL database container from Windows CMD or Powershell:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run --name mysql -p 3306:3306 -e MYSQL_RANDOM_ROOT_PASSWORD=true mysql
</code></pre></div></div>
<p>Stop the container with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker kill mysql
</code></pre></div></div>
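<p>With <code class="language-plaintext highlighter-rouge">MYSQL_RANDOM_ROOT_PASSWORD</code> set, the official MySQL image generates a root password and writes it to the container log on first startup. Here is a minimal sketch of how you might look it up and clean up afterwards, assuming the container is still named <code class="language-plaintext highlighter-rouge">mysql</code> as above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Show the container log and look for the line starting with GENERATED ROOT PASSWORD
docker logs mysql

# Stop the container gracefully (docker kill terminates it immediately instead)
docker stop mysql

# Remove the stopped container so the name mysql can be reused
docker rm mysql
</code></pre></div></div>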
<p>You can search for already built Docker images on <a href="https://hub.docker.com/">Docker Hub</a>. It’s also possible to create private Docker repositories for your own software that you don’t want to be publicly available.</p>
<p><a id="Containerorchestration"></a></p>
<h4 id="container-orchestration">Container orchestration</h4>
<p>Now that Docker container images have become the preferred way to package and distribute software on the Linux platform, a need has emerged for systems to coordinate running and deploying these containers, similar to the ecosystem of products VMware has built up around development and operation of virtual machines.</p>
<p>Container orchestration systems are responsible for:</p>
<ul>
<li>Load balancing.</li>
<li>Service discovery.</li>
<li>Health checks.</li>
<li>Automatic scaling and restarting of host nodes and containers.</li>
<li>Zero downtime upgrades (rolling deploys).</li>
</ul>
<p>Until recently the ecosystem around container orchestration has been fragmented, and the most popular alternatives have been:</p>
<ul>
<li><a href="https://kubernetes.io/">Kubernetes</a> (Originally from Google, now managed by CNCF, the Cloud Native Computing Foundation)</li>
<li><a href="https://docs.docker.com/engine/swarm/">Swarm</a> (From the maker of Docker)</li>
<li><a href="http://mesos.apache.org/">Mesos</a> (From Apache Software Foundation)</li>
<li><a href="https://github.com/coreos/fleet">Fleet</a> (From CoreOS)</li>
</ul>
<p>But over the last year there has been a convergence towards Kubernetes as the preferred solution.</p>
<ul>
<li>7 February
<ul>
<li><a href="https://coreos.com/blog/migrating-from-fleet-to-kubernetes.html">CoreOS announces that they are removing Fleet from Container Linux and recommends Kubernetes</a></li>
</ul>
</li>
<li>27 July
<ul>
<li><a href="https://azure.microsoft.com/en-us/blog/announcing-cncf/">Microsoft joins the CNCF</a></li>
</ul>
</li>
<li>9 August
<ul>
<li><a href="https://techcrunch.com/2017/08/09/aws-joins-the-cloud-native-computing-foundation/">Amazon Web Services joins the CNCF</a></li>
</ul>
</li>
<li>29 August
<ul>
<li><a href="https://www.geekwire.com/2017/now-vmware-pivotal-cncf-becoming-hub-enterprise-tech/">VMware and Pivotal join the CNCF</a></li>
</ul>
</li>
<li>13 September
<ul>
<li><a href="https://techcrunch.com/2017/09/13/oracle-joins-the-cloud-native-computing-foundation-as-a-platinum-member/">Oracle joins the CNCF</a></li>
</ul>
</li>
<li>17 October
<ul>
<li><a href="https://www.theregister.co.uk/2017/10/17/docker_ee_kubernetes_support/">Docker announces native support for Kubernetes in addition to its own Swarm product</a></li>
</ul>
</li>
<li>24 October
<ul>
<li><a href="https://azure.microsoft.com/en-us/blog/introducing-azure-container-service-aks-managed-kubernetes-and-azure-container-registry-geo-replication/">Microsoft Azure announces the managed Kubernetes service AKS</a></li>
</ul>
</li>
<li>29 November
<ul>
<li><a href="https://aws.amazon.com/blogs/aws/amazon-elastic-container-service-for-kubernetes/">Amazon Web Services announces the managed Kubernetes service EKS</a></li>
</ul>
</li>
</ul>
<p>The last two news items are especially important. Deploying and running your own Kubernetes installation requires time and skills (<a href="https://stripe.com/blog/operating-kubernetes">read how Stripe spent 5 months before they trusted running Kubernetes in production, just for batch jobs</a>).</p>
<p>Until now the choice has been between running your own Kubernetes cluster and using Google Container Engine, which has been <a href="https://cloudplatform.googleblog.com/2014/11/unleashing-containers-and-kubernetes-with-google-compute-engine.html">using Kubernetes since 2014</a>. Many of us feel a certain discomfort with locking ourselves to one provider. This is now changing, since you can develop infrastructure on Kubernetes and choose between the 3 large cloud providers, in addition to running your own cluster if you want to. <strong>*</strong></p>
<p><strong>*</strong> Kubernetes is a fast moving project, and features might be available on the different platforms on different timelines.</p>
<p><a id="QuickstartAKS"></a></p>
<h2 id="getting-started-with-azure-kubernetes---aks">Getting started with Azure Kubernetes - AKS</h2>
<p><a id="Caveats"></a></p>
<h3 id="caveats">Caveats</h3>
<blockquote>
<p>This guide is based on the documentation on <a href="https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough">Microsoft.com</a>. Setting up an Azure Kubernetes cluster did not work in the beginning of December, but today, 23 December, it seems to work fairly well. Upgrading the cluster, for example from Kubernetes 1.7 to 1.8, does NOT work, however.</p>
<p>AKS is in Preview and Azure is working continuously to make AKS stable and to support as many Kubernetes features as possible. Amazon Web Services currently has a similar closed, invite-only Preview while working on stability and features.</p>
<p>Both Azure and AWS say they expect their Kubernetes offerings to be ready for production in 2018.</p>
</blockquote>
<p><a id="Preparations"></a></p>
<h3 id="preparations">Preparations</h3>
<p>You need Azure-CLI (version 2.0.21 or newer) to execute the <code class="language-plaintext highlighter-rouge">az</code> commands:</p>
<ul>
<li><a href="https://aka.ms/InstallAzureCliWindows">Download Azure-CLI here</a></li>
<li><a href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest">Information about Azure-CLI on MacOS and Linux here</a></li>
</ul>
<p>All commands in this guide were executed in Windows PowerShell.</p>
<p><a id="AzureLogin"></a></p>
<h3 id="azure-login">Azure login</h3>
<p>Log on to Azure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az login
</code></pre></div></div>
<p>You will get a link to open in your browser together with an authentication code. Enter the code on the webpage and <code class="language-plaintext highlighter-rouge">az login</code> will save the login information so that you will not have to authenticate again on the same machine.</p>
<blockquote>
<p><strong>PS</strong> The login information gets saved in <code class="language-plaintext highlighter-rouge">C:\Users\Username\.azure\</code>. Make sure nobody else can access these files, since anyone who can will have full access to your Azure account.</p>
</blockquote>
<p><a id="ActivateContainerService"></a></p>
<h3 id="activate-containerservice">Activate ContainerService</h3>
<p>Since AKS is in Preview/Beta, you explicitly have to activate it in your subscription to get access to the <code class="language-plaintext highlighter-rouge">aks</code> subcommands.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az provider register -n Microsoft.ContainerService
az provider show -n Microsoft.ContainerService
</code></pre></div></div>
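<p>Registration can take a few minutes. As a small sketch, the global <code class="language-plaintext highlighter-rouge">--query</code> and <code class="language-plaintext highlighter-rouge">--output</code> options of Azure CLI can be used to print only the registration state instead of the full JSON document:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Should print Registered once the provider is ready to use
az provider show -n Microsoft.ContainerService --query registrationState --output tsv
</code></pre></div></div>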
<p><a id="CreateResourceGroup"></a></p>
<h3 id="create-a-resource-group">Create a resource group</h3>
<p>Here we create a resource group named “my_aks_rg” in Azure region West Europe.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az group create --name my_aks_rg --location westeurope
</code></pre></div></div>
<blockquote>
<p><strong>Protip</strong>
To see a list of all available Azure regions, use the command <code class="language-plaintext highlighter-rouge">az account list-locations --output table</code>. <strong>PS</strong> AKS might not be available in all regions yet!</p>
</blockquote>
<p><a id="CreateK8sCluster"></a></p>
<h3 id="create-kubernetes-cluster">Create Kubernetes cluster</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks create --resource-group my_aks_rg --name my_cluster --node-count 3 --generate-ssh-keys --node-vm-size Standard_B2s --node-osdisk-size 128 --kubernetes-version 1.8.2
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">--node-count</code>
<ul>
<li>Number of agent(host) nodes available to run containers</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--generate-ssh-keys</code>
<ul>
<li>Creates and prints an SSH key which can be used for SSHing directly to the agent nodes.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--node-vm-size</code>
<ul>
<li>The Azure VM size to use for the agent nodes. To see available sizes use <code class="language-plaintext highlighter-rouge">az vm list-sizes -l westeurope --output table</code> and <a href="https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes">Microsoft’s web pages</a>.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--node-osdisk-size</code>
<ul>
<li>Disk size of the agent nodes in GB. <strong>PS</strong> Containers can be stopped and moved to another host if Kubernetes finds it necessary or if an agent node disappears. All data saved locally in the container will then be gone. To store data permanently, use Kubernetes PersistentVolumes and not the local agent node or container disks.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--kubernetes-version</code>
<ul>
<li>Which Kubernetes version to install. Azure does NOT necessarily install the latest version by default, and currently upgrading with <code class="language-plaintext highlighter-rouge">az aks upgrade</code> does not work. The latest version available right now is 1.8.2. It’s recommended to use the latest available version since there are a lot of changes from version to version. The documentation is also much better for newer versions.</li>
</ul>
</li>
</ul>
<p>Save the output of the command in a file in a secure location. It contains keys that can be used to connect to the cluster with SSH, even though that should in theory not be necessary.</p>
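<p>Creating the cluster can take several minutes. Here is a rough sketch of how you might check on it and see which Kubernetes versions are offered, assuming the resource group and cluster names used above. The <code class="language-plaintext highlighter-rouge">get-versions</code> flags shown are those of current Azure CLI releases and may differ in the version you have installed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Prints Succeeded once the cluster is fully provisioned
az aks show --resource-group my_aks_rg --name my_cluster --query provisioningState --output tsv

# List the Kubernetes versions AKS offers in a region
az aks get-versions --location westeurope --output table
</code></pre></div></div>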
<p><a id="InstallKubectl"></a></p>
<h3 id="install-kubectl">Install kubectl</h3>
<p><code class="language-plaintext highlighter-rouge">kubectl</code> is the client which performs all operations against your Kubernetes cluster. Azure CLI can install <code class="language-plaintext highlighter-rouge">kubectl</code> for you:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks install-cli
</code></pre></div></div>
<p>After <code class="language-plaintext highlighter-rouge">kubectl</code> is installed we need to get login information so that <code class="language-plaintext highlighter-rouge">kubectl</code> can communicate with the Kubernetes cluster.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks get-credentials --resource-group my_aks_rg --name my_cluster
</code></pre></div></div>
<p>The login information is saved in <code class="language-plaintext highlighter-rouge">C:\Users\Username\.kube\config</code>. Keep these files secure as well.</p>
<blockquote>
<p><strong>Protip</strong> When you have several Kubernetes clusters you can list the available contexts with <code class="language-plaintext highlighter-rouge">kubectl config get-contexts</code> and change which one <code class="language-plaintext highlighter-rouge">kubectl</code> talks to with <code class="language-plaintext highlighter-rouge">kubectl config use-context my_cluster</code>.</p>
</blockquote>
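<p>Switching between clusters is done with <code class="language-plaintext highlighter-rouge">kubectl config use-context</code>. A short sketch, assuming the context is named <code class="language-plaintext highlighter-rouge">my_cluster</code> after the cluster whose credentials were fetched above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List all contexts known to kubectl; the active one is marked with *
kubectl config get-contexts

# Print only the name of the active context
kubectl config current-context

# Make my_cluster the active context for subsequent kubectl commands
kubectl config use-context my_cluster
</code></pre></div></div>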
<p><a id="InspectCluster"></a></p>
<h3 id="inspect-cluster">Inspect cluster</h3>
<p>To check that the cluster and <code class="language-plaintext highlighter-rouge">kubectl</code> work we start with a couple of commands.</p>
<p>See all agent nodes and status:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get nodes
NAME                       STATUS    AGE       VERSION
aks-nodepool1-16970026-0   Ready     15m       v1.8.2
aks-nodepool1-16970026-1   Ready     15m       v1.8.2
aks-nodepool1-16970026-2   Ready     15m       v1.8.2
</code></pre></div></div>
<p>See all services, pods and deployments:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get all --all-namespaces
NAMESPACE     NAME                                       READY     STATUS    RESTARTS   AGE
kube-system   po/kubernetes-dashboard-6fc8cf9586-frpkn   1/1       Running   0          3d
NAMESPACE     NAME                       CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kube-system   svc/kubernetes-dashboard   10.0.161.132   <none>        80/TCP    3d
NAMESPACE     NAME                          DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deploy/kubernetes-dashboard   1         1         1            1           3d
NAMESPACE     NAME                                 DESIRED   CURRENT   READY     AGE
kube-system   rs/kubernetes-dashboard-6fc8cf9586   1         1         1         3d
</code></pre></div></div>
<p>This is just some of the output from this command. You do not have to know what the resources in the <code class="language-plaintext highlighter-rouge">kube-system</code> namespace do. That is part of the point of letting Microsoft manage the cluster for us.</p>
<blockquote>
<p><strong>Namespaces</strong>
In Kubernetes there is something called Namespaces. Resources in one namespace do not have automatic access to resources in another namespace. The services that run Kubernetes itself use the namespace <code class="language-plaintext highlighter-rouge">kube-system</code>. The <code class="language-plaintext highlighter-rouge">kubectl</code> command by default only shows you resources in the <code class="language-plaintext highlighter-rouge">default</code> namespace, unless you specify <code class="language-plaintext highlighter-rouge">--all-namespaces</code> or <code class="language-plaintext highlighter-rouge">--namespace=xx</code>.</p>
</blockquote>
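<p>To make the namespace flags concrete, here is a small sketch of how the same <code class="language-plaintext highlighter-rouge">get pods</code> command behaves with different scopes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List the namespaces that exist in the cluster
kubectl get namespaces

# Only Pods in the default namespace (empty on a fresh cluster)
kubectl get pods

# Only Pods in the kube-system namespace
kubectl get pods --namespace=kube-system

# Pods in every namespace
kubectl get pods --all-namespaces
</code></pre></div></div>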
<p><a id="StartNginx"></a></p>
<h3 id="start-some-nginx-containers">Start some nginx containers</h3>
<blockquote>
<p>The smallest deployable unit in Kubernetes is called a <strong>Pod</strong>. A Pod holds one or more running containers; in this guide each Pod runs a single container.</p>
</blockquote>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">nginx</code> is a fast and flexible web server.</p>
</blockquote>
<p>Now that the cluster is up we can start rolling out services and deployments on it.</p>
<p>Let’s start by creating a Deployment consisting of 3 containers, all running the <code class="language-plaintext highlighter-rouge">nginx:mainline-alpine</code> image from <a href="https://hub.docker.com/r/_/nginx/">Docker Hub</a>.</p>
<p><strong>nginx-dep.yaml</strong> looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:mainline-alpine
        ports:
        - containerPort: 80
</code></pre></div></div>
<p>Load this into the cluster with <code class="language-plaintext highlighter-rouge">kubectl create</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2017-12-23-managed-kubernetes-on-azure/nginx-dep.yaml
</code></pre></div></div>
<p>This command creates the resources described in the file. <code class="language-plaintext highlighter-rouge">kubectl</code> can read files either from your local disk or from a web URL.</p>
<blockquote>
<p>After making changes to a resource definition (<code class="language-plaintext highlighter-rouge">.yaml</code> file), you can update the resources in the cluster with <code class="language-plaintext highlighter-rouge">kubectl replace -f resource.yaml</code>.</p>
</blockquote>
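<p>As an alternative sketch, <code class="language-plaintext highlighter-rouge">kubectl apply</code> creates the resources if they are missing and updates them otherwise, and <code class="language-plaintext highlighter-rouge">kubectl rollout status</code> waits until the change has been rolled out. This assumes you have saved the manifest locally as <code class="language-plaintext highlighter-rouge">nginx-dep.yaml</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create or update the Deployment from the local file
kubectl apply -f nginx-dep.yaml

# Block until all replicas of the Deployment are updated and available
kubectl rollout status deployment/nginx-deployment
</code></pre></div></div>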
<p>We can verify that the Deployment is ready:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get deploy
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3         3         3            3           10m
</code></pre></div></div>
<p>We can also get the actual Pods that are running:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-569477d6d8-dqwx5   1/1       Running   0          10m
nginx-deployment-569477d6d8-xwzpw   1/1       Running   0          10m
nginx-deployment-569477d6d8-z5tfk   1/1       Running   0          10m
</code></pre></div></div>
<blockquote>
<p><strong>Logs</strong> We can view logs from one Pod with <code class="language-plaintext highlighter-rouge">kubectl logs nginx-deployment-569477d6d8-xwzpw</code>. But since we don’t know which Pod will end up receiving an incoming request, we can view logs from all the Pods which have the <code class="language-plaintext highlighter-rouge">app=nginx</code> label: <code class="language-plaintext highlighter-rouge">kubectl logs -lapp=nginx</code>. The <code class="language-plaintext highlighter-rouge">app=nginx</code> label is the one we chose in <code class="language-plaintext highlighter-rouge">nginx-dep.yaml</code> when we configured <code class="language-plaintext highlighter-rouge">spec.template.metadata.labels: app: nginx</code>.</p>
</blockquote>
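<p>A few related commands, sketched under the assumption that the Pods carry the <code class="language-plaintext highlighter-rouge">app=nginx</code> label and have names like the ones listed above (your Pod names will differ):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List the Pods matching the label, including which node each one runs on
kubectl get pods -l app=nginx -o wide

# Show only the last 20 log lines from each matching Pod
kubectl logs -l app=nginx --tail=20

# Follow the log of a single Pod as new requests arrive
kubectl logs -f nginx-deployment-569477d6d8-xwzpw
</code></pre></div></div>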
<p><a id="NginxService"></a></p>
<h3 id="making-nginx-available-with-a-service">Making nginx available with a service</h3>
<p>To send traffic to our new Pods we need to create a <strong>Service</strong>. A service consists of one or more Pods which are chosen based on different criteria, for example which labels they have and whether the Pods are Running and Ready.</p>
<p>Let’s create a service which forwards traffic to all Pods that have the label <code class="language-plaintext highlighter-rouge">app: nginx</code> and listen on port 80. In addition we make the service available via a LoadBalancer:</p>
<p><strong>nginx-svc.yaml</strong> looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  type: LoadBalancer
  ports:
  - port: 80
    name: http
    targetPort: 80
  selector:
    app: nginx
</code></pre></div></div>
<p>We tell Kubernetes to create our service with <code class="language-plaintext highlighter-rouge">kubectl create</code> as usual:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2017-12-23-managed-kubernetes-on-azure/nginx-svc.yaml
</code></pre></div></div>
<p>We can then wait and see which IP-address Azure assigns our service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get svc -w
NAME      CLUSTER-IP   EXTERNAL-IP     PORT(S)        AGE
nginx     10.0.24.11   13.95.173.255   80:31522/TCP   15m
</code></pre></div></div>
<blockquote>
<p><strong>PS</strong> It can take a few minutes for Azure to allocate and assign a Public IP for us. In the meantime <code class="language-plaintext highlighter-rouge"><pending></code> will appear under EXTERNAL-IP.</p>
</blockquote>
<p>A simple <strong>Welcome to nginx</strong> webpage should now be available on http://13.95.173.255 (<em>remember to replace with your own External-IP</em>).</p>
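<p>If the page does not load, it can help to check what the service actually points at. A minimal sketch, assuming the service is named <code class="language-plaintext highlighter-rouge">nginx</code> as in the manifest above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Print only the external IP assigned by the load balancer (empty while pending)
kubectl get svc nginx -o jsonpath="{.status.loadBalancer.ingress[0].ip}"

# List the Pod IPs and ports the service forwards traffic to
kubectl get endpoints nginx

# Show the selector, ports and recent events for the service
kubectl describe svc nginx
</code></pre></div></div>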
<p>We can also delete the service and deployment afterwards:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl delete svc nginx
kubectl delete deploy nginx-deployment
</code></pre></div></div>
<p><a id="ScaleCluster"></a></p>
<h3 id="scaling-the-cluster">Scaling the cluster</h3>
<p>If we want to change the number of agent nodes running Pods we can do that via Azure-CLI:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks scale --name my_cluster --resource-group my_aks_rg --node-count 5
</code></pre></div></div>
<blockquote>
<p>Currently all nodes will be created with the same size as when we created the cluster. AKS will probably get support for <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/node-pools"><strong>node-pools</strong></a> next year. That will allow for creating different groups of nodes with different sizes and operating systems, both Linux and Windows.</p>
</blockquote>
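<p>To confirm that scaling took effect, one option is to compare what Azure reports with what Kubernetes sees. This is a sketch that assumes the cluster has a single agent pool, which is what the create command above produces:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Node count of the first agent pool as reported by Azure
az aks show --resource-group my_aks_rg --name my_cluster --query "agentPoolProfiles[0].count"

# Nodes as seen by Kubernetes; new nodes show up as Ready after a few minutes
kubectl get nodes
</code></pre></div></div>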
<p><a id="DeleteCluster"></a></p>
<h3 id="delete-cluster">Delete cluster</h3>
<p>You can delete the whole cluster like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks delete --name my_cluster --resource-group my_aks_rg --yes
</code></pre></div></div>
<p><a id="Bonusmaterial"></a></p>
<h2 id="bonus-material">Bonus material</h2>
<p>Here is some bonus material if you want to go a bit further with Kubernetes.</p>
<p><a id="HelmIntro"></a></p>
<h3 id="deploying-services-with-helm">Deploying services with Helm</h3>
<p><a href="https://helm.sh/">Helm</a> is a package manager and library of software that is ready to be deployed on a Kubernetes cluster.</p>
<p>Start by downloading the <a href="https://github.com/kubernetes/helm/releases">Helm-client</a>. It will read login information etc. from the same location as <code class="language-plaintext highlighter-rouge">kubectl</code> automatically.</p>
<p>Install the Helm-server (<strong>Tiller</strong>) on the Kubernetes cluster and update the package library:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm init
helm repo update
</code></pre></div></div>
<p>See available packages (<strong>Charts</strong>) with <code class="language-plaintext highlighter-rouge">helm search</code>.</p>
<p><a id="HelmMinecraft"></a></p>
<h4 id="deploy-minecraft-with-helm">Deploy MineCraft with Helm</h4>
<p>Let’s deploy a MineCraft server installation on our cluster, just because we can :-)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm install --name stians --set minecraftServer.eula=true stable/minecraft
</code></pre></div></div>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">--set</code> overrides one or more of the standard values configured in the package. The MineCraft package is made in a way where it does not start without accepting the user license agreement by setting the variable <code class="language-plaintext highlighter-rouge">minecraftServer.eula</code>. All the variables that can be set in the MineCraft package are <a href="https://github.com/kubernetes/charts/blob/master/stable/minecraft/values.yaml">documented here</a>.</p>
</blockquote>
<p>Then we wait for Azure to assign us a Public IP:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get svc -w
stians-minecraft 10.0.237.0 13.95.172.192 25565:30356/TCP 3m
</code></pre></div></div>
<p>Now we can connect to our MineCraft server on <code class="language-plaintext highlighter-rouge">13.95.172.192:25565</code>!</p>
<p><img src="/images/2017-12-23-managed-kubernetes-on-azure/minecraft-k8s.png" alt="Kubernetes in MineCraft on Kubernetes" title="Kubernetes in MineCraft on Kubernetes" /></p>
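<p>A few more Helm commands for inspecting and removing the release, sketched in the same Tiller-based (Helm 2) command style used above and assuming the release name <code class="language-plaintext highlighter-rouge">stians</code>. Newer Helm releases have changed parts of this syntax:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Show the default values of the chart that can be overridden with --set
helm inspect values stable/minecraft

# List the releases installed in the cluster
helm ls

# Show the resources and notes belonging to the release
helm status stians

# Remove the release and free its name for reuse
helm delete --purge stians
</code></pre></div></div>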
<p><a id="KubernetesDashboard"></a></p>
<h3 id="kubernetes-dashboard">Kubernetes Dashboard</h3>
<p>Kubernetes also has a graphical web user interface which makes it a bit easier to see which resources are in the cluster, view logs and even open a remote shell inside a running Pod, among other things.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl proxy
Starting to serve on 127.0.0.1:8001
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">kubectl</code> encrypts and tunnels the traffic to the Kubernetes API servers. The dashboard is available on <a href="http://127.0.0.1:8001/ui/">http://127.0.0.1:8001/ui/</a>.</p>
<p><img src="/images/2017-12-23-managed-kubernetes-on-azure/k8s-dash.png" alt="Kubernetes Dashboard" title="Kubernetes Dashboard" /></p>
<p><a id="Conclusion"></a></p>
<h2 id="conclusion">Conclusion</h2>
<p>I hope you enjoy Kubernetes as much as I have. The learning curve can be a bit steep in the beginning, but it does not take long before you are productive.</p>
<p>Look at the <a href="https://v1-8.docs.kubernetes.io/docs/tutorials/">official guides on Kubernetes.io</a> to learn more about defining different types of resources and services to run on Kubernetes. <strong>PS: There are big changes from version to version so make sure you use the documentation for the correct version!</strong></p>
<p>Kubernetes also has a very active Slack community on <a href="http://slack.k8s.io/">kubernetes.slack.com</a> that is worth checking out.</p>Stian ØvrevågeA few days ago I wrote a walkthrough of setting up Azure Container Service (AKS) in Norwegian. Someone asked me for an English version of that, and here it is. Kubernetes(K8s) is becoming the de-facto standard for deploying container-based applications and workloads. Microsoft is currently in preview of their managed Kubernetes offering (Azure Kubernetes Service, AKS) which makes it easy to create a Kubernetes cluster and deploy workloads without the skill and time required to manage day-to-day operations of a Kubernetes-cluster, which today can be complex and time consuming. In this post we will set up a Kubernetes cluster from scratch using Azure CLI.Managed Kubernetes på Microsoft Azure (Norwegian)2017-12-25T00:00:00+00:002017-12-25T00:00:00+00:00https://stian.tech/managed-kubernetes-on-azure<p><em>Update 29. Dec: There is an <a href="/2017/12/29/managed-kubernetes-on-azure-eng.html">English version of this post here.</a></em></p>
<p>Kubernetes (K8s) is becoming the de facto standard for deploying container-based applications. Microsoft now has a preview of its managed Kubernetes service (Azure Kubernetes Service, AKS), which makes it easy to create a Kubernetes cluster and roll out services without needing the skills and time for the day-to-day operation of the Kubernetes cluster itself, which today can be relatively complicated and time consuming.</p>
<p>In this post we set up a Kubernetes cluster from scratch using Azure CLI.</p>
<!--more-->
<p>Table of contents</p>
<ul>
<li><a href="#Bakgrunn">Background</a>
<ul>
<li><a href="#Dockercontainers">Docker containers</a></li>
<li><a href="#Containerorchestration">Container orchestration</a></li>
</ul>
</li>
<li><a href="#OppretteAKS">Getting started with Azure Kubernetes - AKS</a>
<ul>
<li><a href="#Forbehold">Caveats</a></li>
<li><a href="#Forberedelser">Preparations</a></li>
<li><a href="#Azureinnlogging">Azure login</a></li>
<li><a href="#AktiverContainerService">Activate ContainerService</a></li>
<li><a href="#OpprettResourceGroup">Create a resource group</a></li>
<li><a href="#OpprettK8sCluster">Create a Kubernetes cluster</a></li>
<li><a href="#InstallerKubectl">Install kubectl</a></li>
<li><a href="#InspiserCluster">Inspect cluster</a></li>
<li><a href="#StarteNginx">Start some nginx containers</a></li>
<li><a href="#NginxService">Making nginx available with a service</a></li>
<li><a href="#ScaleCluster">Scale cluster</a></li>
<li><a href="#DeleteCluster">Delete cluster</a></li>
</ul>
</li>
<li><a href="#Bonusmateriale">Bonusmateriale</a>
<ul>
<li><a href="#HelmIntro">Rulle ut tjenester med Helm pakker</a>
<ul>
<li><a href="#HelmMinecraft">MineCraft server med Helm</a></li>
</ul>
</li>
<li><a href="#KubernetesDashboard">Kubernetes Dashboard</a></li>
</ul>
</li>
<li><a href="#Konklusjon">Konklusjon</a></li>
</ul>
<h4 id="microsoft-azure">Microsoft Azure</h4>
<blockquote>
<p><a href="https://azure.microsoft.com/en-us/free/">If you don't already have Azure, you can try its services with $200 of credit for 30 days.</a> The VM type <strong>Standard_B2s</strong> is burstable, has 2 vCPUs, 4 GB RAM and 8 GB of temp storage, and costs ~$38 per month. For $200 you can run a cluster of 3-4 B2s nodes, plus the cost of traffic, load balancers and other required services.</p>
</blockquote>
<blockquote>
<p><em>We have no affiliation with Microsoft other than that they sponsor our startup <a href="http://www.datadynamics.no/">DataDynamics</a> with cloud services for 24 months through their <a href="https://bizspark.microsoft.com/">BizSpark program</a>.</em></p>
</blockquote>
<p><a id="Bakgrunn"></a></p>
<h2 id="bakgrunn">Background</h2>
<p><a id="Dockercontainers"></a></p>
<h3 id="docker-containers">Docker containers</h3>
<p><em>We won't cover Docker containers in depth in this post, but here is a short summary for those who are unfamiliar with the technology.</em></p>
<p>Docker is a way of packaging software so that it can run on all popular platforms without spending a lot of time on dependencies, setup and configuration.</p>
<p>In addition, a running Docker container shares the operating system kernel of the host machine instead of booting its own. This makes it possible to run many more containers on the same host compared with virtual machines.</p>
<p>Here is a rough and incomplete comparison between a Docker container and a virtual machine:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Virtual machine</th>
<th>Docker container</th>
</tr>
</thead>
<tbody>
<tr>
<td>Image size</td>
<td>from 200 MB to many GB</td>
<td>from 10 MB to 300-400 MB</td>
</tr>
<tr>
<td>Startup time</td>
<td>60+ seconds</td>
<td>1-10 seconds</td>
</tr>
<tr>
<td>Memory usage</td>
<td>256 MB, 512 MB, 1 GB or more</td>
<td>2 MB or more</td>
</tr>
<tr>
<td>Security</td>
<td>Good isolation between VMs</td>
<td>Weaker isolation between containers</td>
</tr>
<tr>
<td>Building an image</td>
<td>Minutes</td>
<td>Seconds</td>
</tr>
</tbody>
</table>
<blockquote>
<p><strong>PS</strong> The numbers for virtual machines are from memory. I tried to start a MySQL virtual appliance on my laptop, but VMware Player refuses to run because of an incompatibility with Windows Hyper-V, VMware Workstation refuses to run because of an expired license, and Oracle VirtualBox gives a nasty bluescreen time after time. Hooray!</p>
</blockquote>
<blockquote>
<p><strong>Protip</strong> The smallest and fastest Docker images are built on Alpine Linux. For the Nginx web server, the Alpine-based image is 15 MB versus 108 MB for the Debian-based image. PostgreSQL:Alpine is 38 MB versus 287 MB. The latest version of MySQL is 343 MB, but version 8 will support Alpine Linux as well.</p>
</blockquote>
<p>So some of the benefits of Docker containers are:</p>
<ul>
<li>Compatibility across platforms: Linux, Windows and macOS.</li>
<li>10-100x smaller size. Faster to download, faster to build, faster to upload.</li>
<li>Memory is used only by the application, not by a separate OS.
<ul>
<li>A benefit during development: you can run 10, 20 or 30 Docker containers at the same time on a laptop.</li>
<li>A benefit in production: it can reduce hardware costs considerably.</li>
</ul>
</li>
<li>Startup in a few seconds. This makes dynamic scaling of applications much easier.</li>
</ul>
<p><a href="https://store.docker.com/editions/community/docker-ce-desktop-windows">Download Docker for Windows here.</a></p>
<p>Then start a MySQL database from Windows CMD or PowerShell:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run --name mysql -p 3306:3306 -e MYSQL_RANDOM_ROOT_PASSWORD=true mysql
</code></pre></div></div>
<p>Stop the container with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker kill mysql
</code></pre></div></div>
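<p>Because <code class="language-plaintext highlighter-rouge">MYSQL_RANDOM_ROOT_PASSWORD</code> makes the image generate a root password on first start, you have to read it from the container output. A minimal sketch, assuming the container name <code class="language-plaintext highlighter-rouge">mysql</code> from the command above; the exact wording of the log line may differ between image versions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Print the container log and look for the line containing "GENERATED ROOT PASSWORD"
docker logs mysql
# After "docker kill mysql", remove the stopped container so the name can be reused
docker rm mysql
</code></pre></div></div>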
<p>You can search for ready-made Docker images on <a href="https://hub.docker.com/">Docker Hub</a>. It is also possible to create private Docker repositories for your own software that should not be available to the outside world.</p>
<p><a id="Containerorchestration"></a></p>
<h4 id="container-orchestration">Container orchestration</h4>
<p>As Docker containers have become the preferred way to package and distribute software on the Linux platform over the last couple of years, a need has emerged for systems that can coordinate the operation and rollout of these containers. This is not unlike the ecosystem of products VMware has built around the development and operation of virtual machines.</p>
<p>The job of a container orchestration system is to take care of:</p>
<ul>
<li>Load balancing.</li>
<li>Service discovery.</li>
<li>Health checks.</li>
<li>Automatic scaling and restarting of host machines and containers.</li>
<li>Upgrades without downtime (rolling deploy).</li>
</ul>
<p>Until recently the ecosystem around container orchestration has been fragmented, and the most popular alternatives have been:</p>
<ul>
<li><a href="https://kubernetes.io/">Kubernetes</a> (Originally from Google, now governed by the CNCF, Cloud Native Computing Foundation)</li>
<li><a href="https://docs.docker.com/engine/swarm/">Swarm</a> (From the company behind Docker)</li>
<li><a href="http://mesos.apache.org/">Mesos</a> (From the Apache Software Foundation)</li>
<li><a href="https://github.com/coreos/fleet">Fleet</a> (From CoreOS)</li>
</ul>
<p>But over the past year there has been a convergence towards Kubernetes as the preferred solution.</p>
<ul>
<li>7 February
<ul>
<li><a href="https://coreos.com/blog/migrating-from-fleet-to-kubernetes.html">CoreOS announces that it is removing Fleet from Container Linux and recommends Kubernetes</a></li>
</ul>
</li>
<li>27 July
<ul>
<li><a href="https://azure.microsoft.com/en-us/blog/announcing-cncf/">Microsoft joins the CNCF</a></li>
</ul>
</li>
<li>9 August
<ul>
<li><a href="https://techcrunch.com/2017/08/09/aws-joins-the-cloud-native-computing-foundation/">Amazon Web Services joins the CNCF</a></li>
</ul>
</li>
<li>29 August
<ul>
<li><a href="https://www.geekwire.com/2017/now-vmware-pivotal-cncf-becoming-hub-enterprise-tech/">VMware and Pivotal join the CNCF</a></li>
</ul>
</li>
<li>13 September
<ul>
<li><a href="https://techcrunch.com/2017/09/13/oracle-joins-the-cloud-native-computing-foundation-as-a-platinum-member/">Oracle joins the CNCF</a></li>
</ul>
</li>
<li>17 October
<ul>
<li><a href="https://www.theregister.co.uk/2017/10/17/docker_ee_kubernetes_support/">Docker announces native support for Kubernetes alongside its own Swarm product</a></li>
</ul>
</li>
<li>24 October
<ul>
<li><a href="https://azure.microsoft.com/en-us/blog/introducing-azure-container-service-aks-managed-kubernetes-and-azure-container-registry-geo-replication/">Microsoft Azure announces managed Kubernetes with the AKS service</a></li>
</ul>
</li>
<li>29 November
<ul>
<li><a href="https://aws.amazon.com/blogs/aws/amazon-elastic-container-service-for-kubernetes/">Amazon Web Services announces managed Kubernetes with the EKS service</a></li>
</ul>
</li>
</ul>
<p>The last two announcements are especially important. Running your own Kubernetes installation takes time and expertise. (<a href="https://stripe.com/blog/operating-kubernetes">Read how Stripe spent 5 months getting comfortable operating its own Kubernetes cluster, just for batch jobs.</a>)</p>
<p>Until now the choice has been between running your own Kubernetes cluster and using Google Container Engine, which has <a href="https://cloudplatform.googleblog.com/2014/11/unleashing-containers-and-kubernetes-with-google-compute-engine.html">used Kubernetes since 2014</a>. Many of us feel a certain unease about locking ourselves to a single provider. But this is different now that you can build infrastructure on Kubernetes and choose almost freely <strong>*</strong> between the three big cloud providers, in addition to running it yourself if you wish.</p>
<p><strong>*</strong> Kubernetes is evolving quickly, and functionality often does not become available on the different platforms at the same time.</p>
<p><a id="OppretteAKS"></a></p>
<h2 id="opprette-azure-kubernetes-cluster">Creating an Azure Kubernetes cluster</h2>
<p><a id="Forbehold"></a></p>
<h3 id="forbehold">Caveats</h3>
<blockquote>
<p>This walkthrough is based on the documentation at <a href="https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough">Microsoft.com</a>. Setting up an Azure Kubernetes cluster did not work in early December, but as of today, 23 December, it seems to work relatively well. However, upgrading a cluster from Kubernetes 1.7 to 1.8, for example, does NOT work.</p>
<p>AKS is in preview, and Azure is working continuously to make AKS stable and to support as many Kubernetes features as possible. Amazon Web Services similarly has a closed, invite-only preview as of today while they also work on stability and functionality.</p>
<p>Both Azure and AWS say they expect their Kubernetes services to be ready for production environments during 2018.</p>
</blockquote>
<p><a id="Forberedelser"></a></p>
<h3 id="forberedelser">Preparations</h3>
<p>You need the Azure CLI (version 2.0.21 or newer) to run the commands:</p>
<ul>
<li><a href="https://aka.ms/InstallAzureCliWindows">Download the Azure CLI here</a></li>
<li><a href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest">Information about the Azure CLI on macOS and Linux can be found here</a></li>
</ul>
<p>All commands are run in Windows PowerShell.</p>
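<p>To check which version you have installed before continuing, you can ask the CLI itself. This is only a quick sanity check against the 2.0.21 minimum mentioned above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Prints the version of the Azure CLI and its components
az --version
</code></pre></div></div>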
<p><a id="Azureinnlogging"></a></p>
<h3 id="azure-innlogging">Azure login</h3>
<p>Log in to Azure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az login
</code></pre></div></div>
<p>You get a link to open in your browser along with an authentication code. Enter the code on the web page and <code class="language-plaintext highlighter-rouge">az login</code> saves the login information so that you don't have to authenticate again on the same machine.</p>
<blockquote>
<p><strong>PS</strong> The login information is stored in <code class="language-plaintext highlighter-rouge">C:\Users\Username\.azure\</code>. You must make sure yourself that nobody copies these files. Anyone who does gets full access to your Azure account.</p>
</blockquote>
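<p>If your account has access to more than one subscription, it is worth checking which one the following commands will use. A small sketch, where <code class="language-plaintext highlighter-rouge">My Subscription</code> is a hypothetical subscription name that you would replace with your own:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Show the subscription that is currently active
az account show --output table
# List all subscriptions available to the logged-in account
az account list --output table
# Switch to another subscription ("My Subscription" is a placeholder)
az account set --subscription "My Subscription"
</code></pre></div></div>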
<p><a id="AktiverContainerService"></a></p>
<h3 id="aktiver-containerservice">Enable ContainerService</h3>
<p>Since AKS is in preview/beta you must explicitly enable it to get access to the <code class="language-plaintext highlighter-rouge">aks</code> commands.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az provider register -n Microsoft.ContainerService
az provider show -n Microsoft.ContainerService
</code></pre></div></div>
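<p>Registration can take a few minutes. One way to check only the status, assuming the provider reports it in the <code class="language-plaintext highlighter-rouge">registrationState</code> field of the output above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Prints only the registration state; repeat until it reports "Registered"
az provider show -n Microsoft.ContainerService --query registrationState --output tsv
</code></pre></div></div>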
<p><a id="OpprettResourceGroup"></a></p>
<h3 id="opprett-en-resource-group">Create a resource group</h3>
<p>Here we create a resource group named “min_aks_rg” in the Azure region West Europe.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az group create --name min_aks_rg --location westeurope
</code></pre></div></div>
<blockquote>
<p><strong>Protip</strong>
To see a list of available Azure regions, use the command <code class="language-plaintext highlighter-rouge">az account list-locations --output table</code>. <strong>PS</strong> AKS may not be available in all regions yet.</p>
</blockquote>
<p><a id="OpprettK8sCluster"></a></p>
<h3 id="opprette-kubernetes-cluster">Create the Kubernetes cluster</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks create --resource-group min_aks_rg --name mitt_cluster --node-count 3 --generate-ssh-keys --node-vm-size Standard_B2s --node-osdisk-size 256 --kubernetes-version 1.8.2
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">--node-count</code>
<ul>
<li>The number of host machines available to run containers.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--generate-ssh-keys</code>
<ul>
<li>Creates and outputs an SSH key that can be used to SSH directly to the host machines.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--node-vm-size</code>
<ul>
<li>Which type of Azure VM the cluster should consist of. To see the available sizes, use <code class="language-plaintext highlighter-rouge">az vm list-sizes -l westeurope --output table</code> and <a href="https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes">Microsoft's website.</a></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--node-osdisk-size</code>
<ul>
<li>Disk size of the host machines in GB. <strong>PS</strong> Containers can be stopped and moved to another host when needed or if a host machine disappears. All data stored locally in the container is then lost. If you need to store things permanently you must use PersistentVolumes and not the local disk of the host machine.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--kubernetes-version</code>
<ul>
<li>Which Kubernetes version to install. Azure does NOT install the latest version by default, and as of today <code class="language-plaintext highlighter-rouge">az aks upgrade</code> does not work adequately. The latest available version as of today is 1.8.2. It is an advantage to use the latest version since there are big improvements in Kubernetes from version to version. The documentation is also much better for newer versions.</li>
</ul>
</li>
</ul>
<p>Save the text that the command prints to a file in a safe place. It contains keys that can be used to connect to the cluster with SSH, even though in theory that should not be necessary.</p>
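<p>Creating the cluster takes a while. Once the command returns, you can confirm that Azure sees the cluster. A short sketch using the resource group and cluster name chosen above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List all AKS clusters in the active subscription
az aks list --output table
# Show details for the cluster we just created
az aks show --resource-group min_aks_rg --name mitt_cluster --output table
</code></pre></div></div>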
<p><a id="InstallerKubectl"></a></p>
<h3 id="installer-kubectl">Install kubectl</h3>
<p><code class="language-plaintext highlighter-rouge">kubectl</code> is the client that performs all operations against your Kubernetes cluster. The Azure CLI can install <code class="language-plaintext highlighter-rouge">kubectl</code> for you:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks install-cli
</code></pre></div></div>
<p>Once <code class="language-plaintext highlighter-rouge">kubectl</code> is installed we need to fetch the credentials so that <code class="language-plaintext highlighter-rouge">kubectl</code> can communicate with the Kubernetes cluster.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks get-credentials --resource-group min_aks_rg --name mitt_cluster
</code></pre></div></div>
<p>The credentials are stored in <code class="language-plaintext highlighter-rouge">C:\Users\Username\.kube\config</code>. Keep these files secret as well.</p>
<blockquote>
<p><strong>Protip</strong> When you have several Kubernetes clusters you can list the available contexts with <code class="language-plaintext highlighter-rouge">kubectl config get-contexts</code> and switch which cluster <code class="language-plaintext highlighter-rouge">kubectl</code> talks to with <code class="language-plaintext highlighter-rouge">kubectl config use-context mitt_cluster</code>.</p>
</blockquote>
<p><a id="InspiserCluster"></a></p>
<h3 id="inspiser-cluster">Inspect the cluster</h3>
<p>To check that the cluster and <code class="language-plaintext highlighter-rouge">kubectl</code> work, we start with a few commands.</p>
<p>See all host machines and their status:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get nodes
NAME                       STATUS    AGE       VERSION
aks-nodepool1-16970026-0   Ready     15m       v1.8.2
aks-nodepool1-16970026-1   Ready     15m       v1.8.2
aks-nodepool1-16970026-2   Ready     15m       v1.8.2
</code></pre></div></div>
<p>See all services, pods and deployments:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get all --all-namespaces
NAMESPACE     NAME                                       READY     STATUS    RESTARTS   AGE
kube-system   po/kubernetes-dashboard-6fc8cf9586-frpkn   1/1       Running   0          3d

NAMESPACE     NAME                       CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kube-system   svc/kubernetes-dashboard   10.0.161.132   <none>        80/TCP    3d

NAMESPACE     NAME                          DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deploy/kubernetes-dashboard   1         1         1            1           3d

NAMESPACE     NAME                                 DESIRED   CURRENT   READY     AGE
kube-system   rs/kubernetes-dashboard-6fc8cf9586   1         1         1         3d
</code></pre></div></div>
<p>I have only included a small excerpt of the output from this command. You don't need to understand what all the resources in the <code class="language-plaintext highlighter-rouge">kube-system</code> namespace do. The whole point is that you shouldn't have to when Microsoft manages the cluster itself.</p>
<blockquote>
<p><strong>Namespaces</strong>
Kubernetes has a concept called Namespaces. Resources in one namespace do not automatically have access to resources in another namespace. The services that Kubernetes itself uses are installed in the <code class="language-plaintext highlighter-rouge">kube-system</code> namespace. The <code class="language-plaintext highlighter-rouge">kubectl</code> command normally only shows you resources in the <code class="language-plaintext highlighter-rouge">default</code> namespace unless you specify <code class="language-plaintext highlighter-rouge">--all-namespaces</code> or <code class="language-plaintext highlighter-rouge">--namespace=xx</code>.</p>
</blockquote>
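<p>As a short illustration of the namespace flags, the following standard <code class="language-plaintext highlighter-rouge">kubectl</code> commands list the namespaces in the cluster and then only the Pods in <code class="language-plaintext highlighter-rouge">kube-system</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List the namespaces that exist in the cluster
kubectl get namespaces
# List only the Pods in the kube-system namespace
kubectl get pods --namespace=kube-system
</code></pre></div></div>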
<p><a id="StarteNginx"></a></p>
<h3 id="starte-noen-nginx-containere">Start some nginx containers</h3>
<blockquote>
<p>An instance of a running container is called a <strong>Pod</strong> in Kubernetes (a Pod can also hold several closely related containers).</p>
</blockquote>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">nginx</code> is a fast and flexible web server.</p>
</blockquote>
<p>Now that the cluster is up and running we can start rolling out services and deployments on it.</p>
<p>We start by creating a Deployment consisting of 3 containers that all run the <code class="language-plaintext highlighter-rouge">nginx:mainline-alpine</code> image from <a href="https://hub.docker.com/r/_/nginx/">Docker Hub</a>.</p>
<p><strong>nginx-dep.yaml</strong> looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:mainline-alpine
        ports:
        - containerPort: 80
</code></pre></div></div>
<p>Load this into the cluster with <code class="language-plaintext highlighter-rouge">kubectl create</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2017-12-23-managed-kubernetes-on-azure/nginx-dep.yaml
</code></pre></div></div>
<p>This command creates the resources described in the file. <code class="language-plaintext highlighter-rouge">kubectl</code> can read files either locally from your machine or from a URL.</p>
<blockquote>
<p>After making changes to a resource definition (<code class="language-plaintext highlighter-rouge">.yaml</code> file) you can update the resources in the cluster with <code class="language-plaintext highlighter-rouge">kubectl replace -f resource.yaml</code></p>
</blockquote>
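<p>An alternative to <code class="language-plaintext highlighter-rouge">kubectl replace</code> is <code class="language-plaintext highlighter-rouge">kubectl apply</code>, which creates the resources if they do not exist yet and updates them otherwise. A minimal sketch, assuming you have saved the manifest above locally as <code class="language-plaintext highlighter-rouge">nginx-dep.yaml</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create or update the Deployment from the local file
kubectl apply -f nginx-dep.yaml
# Wait until the rollout of the Deployment has finished
kubectl rollout status deployment/nginx-deployment
</code></pre></div></div>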
<p>We can verify that the Deployment is ready:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get deploy
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3         3         3            3           10m
</code></pre></div></div>
<p>We can also list the actual Pods that have been started:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-569477d6d8-dqwx5   1/1       Running   0          10m
nginx-deployment-569477d6d8-xwzpw   1/1       Running   0          10m
nginx-deployment-569477d6d8-z5tfk   1/1       Running   0          10m
</code></pre></div></div>
<blockquote>
<p><strong>Logs</strong> We can see logs from a single Pod with <code class="language-plaintext highlighter-rouge">kubectl logs nginx-deployment-569477d6d8-xwzpw</code>. But since in this case we don't know which Pod ends up receiving incoming requests, we can see logs from all Pods that have the <code class="language-plaintext highlighter-rouge">app=nginx</code> label: <code class="language-plaintext highlighter-rouge">kubectl logs -lapp=nginx</code>. We chose <code class="language-plaintext highlighter-rouge">app=nginx</code> ourselves in <code class="language-plaintext highlighter-rouge">nginx-dep.yaml</code> when we set <code class="language-plaintext highlighter-rouge">spec.template.metadata.labels: app: nginx</code>.</p>
</blockquote>
<p><a id="NginxService"></a></p>
<h3 id="gjøre-nginx-tilgjengelig-med-en-tjeneste">Make nginx available with a service</h3>
<p>To communicate with our new Pods we need to create a service (<strong>Service</strong>). A service consists of one or more Pods that are selected based on various criteria, including which labels they have and whether the Pods in question are Running and Ready.</p>
<p>Now we create a service that routes traffic to all Pods that have the label <code class="language-plaintext highlighter-rouge">app: nginx</code> and that listen on port 80. In addition we make the service available through a LoadBalancer:</p>
<p><strong>nginx-svc.yaml</strong> looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  type: LoadBalancer
  ports:
  - port: 80
    name: http
    targetPort: 80
  selector:
    app: nginx
</code></pre></div></div>
<p>We ask Kubernetes to create our service with <code class="language-plaintext highlighter-rouge">kubectl create</code> as usual:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2017-12-23-managed-kubernetes-on-azure/nginx-svc.yaml
</code></pre></div></div>
<p>Then we can see which IP address Azure has assigned to our service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get svc -w
NAME      CLUSTER-IP   EXTERNAL-IP     PORT(S)        AGE
nginx     10.0.24.11   13.95.173.255   80:31522/TCP   15m
</code></pre></div></div>
<blockquote>
<p><strong>PS</strong> It can take a couple of minutes for Azure to assign our service a public IP; in the meantime EXTERNAL-IP will show <code class="language-plaintext highlighter-rouge"><pending></code>.</p>
</blockquote>
<p>A simple <strong>Welcome to nginx</strong> web page should now be available at http://13.95.173.255 (<em>remember to replace it with your own External-IP</em>).</p>
<p><strong>We now have a load-balanced <code class="language-plaintext highlighter-rouge">nginx</code> service with 3 servers ready to receive traffic.</strong></p>
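<p>The number of nginx Pods is independent of the number of nodes in the cluster. As a sketch, the Deployment created above could be scaled from 3 to 5 replicas like this; the Service picks up the new Pods automatically through the <code class="language-plaintext highlighter-rouge">app: nginx</code> label:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Change the desired number of Pods in the Deployment
kubectl scale deployment nginx-deployment --replicas=5
# List the Pods carrying the app=nginx label to see the new ones start
kubectl get pods -l app=nginx
</code></pre></div></div>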
<p>For good measure we can delete the service and deployment afterwards:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl delete svc nginx
kubectl delete deploy nginx-deployment
</code></pre></div></div>
<p><a id="ScaleCluster"></a></p>
<h3 id="skalere-cluster">Scale the cluster</h3>
<p>If you want to change the number of host machines/nodes that run Pods you can do so via the Azure CLI:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks scale --name mitt_cluster --resource-group min_aks_rg --node-count 5
</code></pre></div></div>
<blockquote>
<p>At the moment all nodes are created with the same size as when the cluster was created. AKS will probably get support for <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/node-pools"><strong>node pools</strong></a> during the next year. You can then create groups of nodes with different sizes and operating systems, both Linux and Windows.</p>
</blockquote>
<p><a id="DeleteCluster"></a></p>
<h3 id="slette-cluster">Delete the cluster</h3>
<p>You can delete the entire cluster like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks delete --name mitt_cluster --resource-group min_aks_rg --yes
</code></pre></div></div>
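<p>Deleting the cluster does not remove the resource group we created earlier. A minimal sketch for removing it as well, assuming nothing else you want to keep lives in <code class="language-plaintext highlighter-rouge">min_aks_rg</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Delete the resource group and everything in it without prompting or waiting
az group delete --name min_aks_rg --yes --no-wait
</code></pre></div></div>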
<p><a id="Bonusmateriale"></a></p>
<h2 id="bonusmateriale">Bonus material</h2>
<p>Here is some bonus material if you want to go a little further with Kubernetes.</p>
<p><a id="HelmIntro"></a></p>
<h3 id="rulle-ut-tjenester-med-helm">Roll out services with Helm</h3>
<p><a href="https://helm.sh/">Helm</a> is a package manager and a library of software that is ready to be rolled out in a Kubernetes cluster.</p>
<p>Start by downloading the <a href="https://github.com/kubernetes/helm/releases">Helm client</a>. It automatically picks up credentials and so on from the same place as <code class="language-plaintext highlighter-rouge">kubectl</code>.</p>
<p>Install the Helm server (Tiller) on the Kubernetes cluster and update the package library:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm init
helm repo update
</code></pre></div></div>
<p>See the available packages (<strong>Charts</strong>) with: <code class="language-plaintext highlighter-rouge">helm search</code>.</p>
<p><a id="HelmMinecraft"></a></p>
<h4 id="rulle-ut-minecraft-med-helm">Roll out Minecraft with Helm</h4>
<p>Let's roll out a Minecraft installation on our cluster, because we can :-)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm install --name stian-sin --set minecraftServer.eula=true stable/minecraft
</code></pre></div></div>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">--set</code> overrides one or more of the default values set in the package. The Minecraft package is made so that it will not start unless you have agreed to the terms of use in the variable <code class="language-plaintext highlighter-rouge">minecraftServer.eula</code>. All the variables that can be overridden in the Minecraft package are <a href="https://github.com/kubernetes/charts/blob/master/stable/minecraft/values.yaml">documented here</a>.</p>
</blockquote>
<p>Then we wait a bit for Azure to assign a public IP:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl get svc -w
stian-sin-minecraft 10.0.237.0 13.95.172.192 25565:30356/TCP 3m
</code></pre></div></div>
<p>And just like that we can connect to Minecraft on <code class="language-plaintext highlighter-rouge">13.95.172.192:25565</code>.</p>
<p><img src="/images/2017-12-23-managed-kubernetes-on-azure/minecraft-k8s.png" alt="Kubernetes in MineCraft on Kubernetes" title="Kubernetes in MineCraft on Kubernetes" /></p>
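<p>When you are done playing, the release can be inspected and removed again. A short sketch using the Helm 2 client syntax that matches the <code class="language-plaintext highlighter-rouge">helm init</code> setup above; later Helm versions use different commands:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List the releases installed in the cluster
helm ls
# Show the status and resources of the Minecraft release
helm status stian-sin
# Remove the release and free up its name
helm delete --purge stian-sin
</code></pre></div></div>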
<p><a id="KubernetesDashboard"></a></p>
<h3 id="kubernetes-dashboard">Kubernetes Dashboard</h3>
<p>Kubernetes also has a graphical web interface that makes it a bit easier to see which resources are in the cluster, view logs and open a remote shell inside a running Pod, among other things.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> kubectl proxy
Starting to serve on 127.0.0.1:8001
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">kubectl</code> encrypts and tunnels the traffic to the Kubernetes API servers. The dashboard is available at <a href="http://127.0.0.1:8001/ui/">http://127.0.0.1:8001/ui/</a>.</p>
<p><img src="/images/2017-12-23-managed-kubernetes-on-azure/k8s-dash.png" alt="Kubernetes Dashboard" title="Kubernetes Dashboard" /></p>
<p><a id="Konklusjon"></a></p>
<h2 id="konklusjon">Conclusion</h2>
<p>I hope this has given you a taste for Kubernetes. The learning curve can be a bit steep at first, but it doesn't take very long before you are productive.</p>
<p>Take a look at the <a href="https://v1-8.docs.kubernetes.io/docs/tutorials/">official guides on Kubernetes.io</a> to learn more about how to define different types of resources and services to run on Kubernetes. <strong>PS: Big changes are made from version to version, so make sure you use the documentation for the right version!</strong></p>
<p>Kubernetes also has a very active Slack community on <a href="http://slack.k8s.io/">kubernetes.slack.com</a>. There is also a channel there for Norwegian Kubernetes users: <strong>#norw-users</strong>.</p>
Stian Øvrevåge
Next generation monitoring with OpenTSDB
2014-06-02T19:56:40+00:00
https://stian.tech/next-generation-monitoring-using-opentsdb
<p>In this paper we will provide a step-by-step guide on how to install a single instance of <strong>OpenTSDB</strong> using the latest versions of the underlying technology, <strong>Hadoop</strong> and <strong>HBase</strong>. We will also provide some background on the state of existing monitoring solutions.</p>
<!--more-->
<p><a id="Abstract"></a></p>
<p>Table of contents</p>
<ul>
<li><a href="#Abstract">Abstract</a></li>
<li><a href="#Background">Background</a>
<ul>
<li><a href="#Performanceproblems">Performance problems - Welcome to I/O-hell</a></li>
<li><a href="#Scaling">Scaling problems</a></li>
<li><a href="#Loss">Loss of detail</a></li>
<li><a href="#flexibility">Lack of flexibility</a></li>
</ul>
</li>
<li><a href="#revolution">The monitoring revolution</a></li>
<li><a href="#Debian">Setting up a single node OpenTSDB instance on Debian 7 Wheezy</a>
<ul>
<li><a href="#Hardware">Hardware requirements</a></li>
<li><a href="#Operating">Operating system requirements</a></li>
<li><a href="#preparations">Pre-setup preparations</a></li>
<li><a href="#java">Installing java from packages</a></li>
<li><a href="#HBase">Installing HBase</a>
<ul>
<li><a href="#snappy">Install snappy</a></li>
<li><a href="#native">Building native libhadoop and libsnappy</a></li>
<li><a href="#ConfiguringHBase">Configuring HBase</a></li>
</ul>
</li>
<li><a href="#compression">Testing HBase and compression</a></li>
<li><a href="#StartingHBase">Starting HBase</a></li>
</ul>
</li>
<li><a href="#InstallingOpenTSDB">Installing OpenTSDB</a>
<ul>
<li><a href="#ConfiguringOpenTSDB">Configuring OpenTSDB</a></li>
<li><a href="#HBasetables">Creating HBase tables</a></li>
<li><a href="#StartingOpenTSDB">Starting OpenTSDB</a></li>
</ul>
</li>
<li><a href="#Feeding">Feeding data into OpenTSDB</a>
<ul>
<li><a href="#tcollector">tcollector</a></li>
<li><a href="#peritus-tc-tools">peritus-tc-tools</a></li>
<li><a href="#collectd-opentsdb">collectd-opentsdb</a></li>
<li><a href="#MonitoringOpenTSDB">Monitoring OpenTSDB</a></li>
</ul>
</li>
<li><a href="#Performancecomparison">Performance comparison</a>
<ul>
<li><a href="#Collection">Collection</a></li>
<li><a href="#Storage">Storage</a></li>
<li><a href="#Conclusion">Conclusion</a></li>
</ul>
</li>
</ul>
<p><a id="Background"></a></p>
<h2 id="background">Background</h2>
<p>Since its inception in 1999 <a href="http://oss.oetiker.ch/rrdtool/"><strong>rrdtool</strong></a> (the underlying storage mechanism of the once universal <strong>MRTG</strong>) has been the basis of many popular monitoring solutions: <strong>Cacti</strong>, <strong>collectd</strong>, <strong>Ganglia</strong>, <strong>Munin</strong>, <strong>Observium</strong>, <strong>OpenNMS</strong> and <strong>Zenoss</strong>, to name a few.</p>
<p>There are a number of problems with the current approach and we will highlight some of these here.</p>
<p>Please note that this includes <strong>Graphite</strong> and its backend <strong>Whisper</strong>, which is based on the <a href="http://graphite.readthedocs.org/en/0.9.10/whisper.html">same basic design as rrdtool</a> and has <a href="http://dieter.plaetinck.be/on-graphite-whisper-and-influxdb.html">some of the same limitations</a>.</p>
<p><a id="Performanceproblems"></a></p>
<h3 id="performance-problems---welcome-to-io-hell">Performance problems - Welcome to I/O-hell</h3>
<p>When MRTG and rrdtool were created, preserving disk space was more important than preserving disk operations, and the default collection interval was 5 minutes (which many still use). By design, rrdtool requires quite a few random reads and writes per datapoint. It also re-reads old data, computes averages, and writes the result back according to the defined RRA rules, which causes additional I/O load. In 2014 memory is cheap, disk storage is cheap and CPU is fairly cheap. Disk I/O operations per second (IOPS), however, are still very expensive in terms of hardware. The recent maturing of SSDs provides extreme amounts of IOPS for a reasonable price, but SSD capacities are a fraction of those of spindle drives. The result is that in order to scale you currently need either many small SSDs to get the required space, or many low-IOPS spindle drives to get the required IOPS:</p>
<p><a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16820147251">Samsung EVO 840 1TB SSD</a> - 98.000 IOPS - 470 USD</p>
<p><a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16822148844">Seagate Barracuda 3TB</a> - 240 IOPS - 110 USD</p>
<p>You would need $44.880 (408 drives) worth of spindle drives in order to match a single SSD drive in terms of I/O-performance. On the other hand a $2.000 array of spindle drives would get you a net ~54 TB of space. The cost of SSD to reach the same volume would be $25.380. Not to mention the cost of servers, power, provisioning, etc.</p>
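<p>The arithmetic behind these figures can be reproduced with a short script. This is only a sketch in Python (our choice for illustration); the dictionary and variable names are made up here, and the prices are the 2014 list prices quoted above.</p>

```python
# Drive figures quoted in the text (2014 list prices); names are illustrative.
SSD = {"iops": 98000, "size_tb": 1, "price_usd": 470}
HDD = {"iops": 240, "size_tb": 3, "price_usd": 110}

# Spindle drives needed to match the IOPS of one SSD (rounded down).
hdd_count = SSD["iops"] // HDD["iops"]
hdd_cost = hdd_count * HDD["price_usd"]

# Capacity that a 2000 USD budget of spindle drives buys.
budget_drives = 2000 // HDD["price_usd"]
hdd_capacity_tb = budget_drives * HDD["size_tb"]

# Cost of reaching the same capacity with SSDs.
ssd_cost = (hdd_capacity_tb // SSD["size_tb"]) * SSD["price_usd"]

print("spindle drives to match one SSD:", hdd_count, "costing", hdd_cost, "USD")
print("2000 USD of spindle drives:", hdd_capacity_tb, "TB")
print("same capacity on SSD:", ssd_cost, "USD")
```

<p>Swapping in current prices and IOPS figures shows how the trade-off has moved since this was written.</p>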
<p><strong>Note: These are the cheapest available bulk consumer drives. Comparable OEM drives (<a href="http://h30094.www3.hp.com/product/sku/10350615/mfg_partno/632494-B21">SSD</a>, <a href="http://h30094.www3.hp.com/product/sku/10389145/mfg_partno/628061-B21">spindle</a>) for an HP server will be 6 to 30 times more expensive.</strong></p>
<p>In rrdtool version 1.4, released in 2009, <strong>rrdcached</strong> was introduced as a caching daemon for buffering multiple data updates and reducing the number of random I/O operations by writing several related datapoints in sequence. It took a couple of years before this new feature was implemented in most of the common open source monitoring solutions.</p>
<p>For a good introduction to the internals of rrdtool/rrdcached updates and the problems with I/O scaling, see the presentation by Sebastian Harl, <a href="http://www.netways.de/index.php?id=2815">How to Escape the I/O Hell</a>.</p>
<p><a id="Scaling"></a></p>
<h3 id="scaling-problems">Scaling problems</h3>
<p>Most of today’s monitoring systems do not easily scale out. Scaling out, or scaling horizontally, means adding new nodes in response to increased load. Scaling up by replacing existing hardware with state-of-the-art hardware is expensive and only buys you limited time before the next, even more expensive, hardware upgrade. Many systems offer distributed polling but none offer the option of spreading out the disk load. For example, you can <a href="http://community.zenoss.org/docs/DOC-2485">scale Zenoss for High Availability</a> but not for performance.</p>
<p><a id="Loss"></a></p>
<h3 id="loss-of-detail">Loss of detail</h3>
<p>Current RRD-based systems aggregate old data into averages in order to save storage space. Most technicians do not have the in-depth knowledge needed to tune the aggregation rules and leave the default values as is. Using Cacti as an example and looking at the <a href="http://docs.cacti.net/manual:088:8_rrdtool#rrd_files">cacti documentation</a>, we see that in a very short time, 2 months, data is averaged to a single data point PER DAY. For systems such as Internet backbones, where traffic varies a lot during a day from bottom (30% utilization, for example) to peak (90% utilization, for example), only the average of 60% is shown in the graphs. This makes troubleshooting by comparing old data difficult. It makes trending based on peaks and bottoms impossible, and it may also lead to wrong or delayed strategic decisions on where to invest in added capacity.</p>
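<p>To make the loss of detail concrete, here is a small Python sketch of averaging consolidation. It is not rrdtool's actual code; the function name and the simplified traffic pattern (half the day at the 30% floor, half at the 90% peak, one sample every 5 minutes) are assumptions for illustration.</p>

```python
def consolidate(samples, points_per_bucket):
    """Average consecutive samples into buckets, like an AVERAGE archive."""
    buckets = []
    for start in range(0, len(samples), points_per_bucket):
        chunk = samples[start:start + points_per_bucket]
        buckets.append(sum(chunk) / len(chunk))
    return buckets

# Hypothetical day of link utilization in percent: 288 five-minute samples.
day = [30.0] * 144 + [90.0] * 144

# After consolidation to one point per day only the average survives.
daily_average = consolidate(day, 288)

print("peak in raw data:", max(day))
print("bottom in raw data:", min(day))
print("stored after consolidation:", daily_average)
```

<p>The consolidated series contains a single value of 60%, so neither the 90% peak nor the 30% bottom can be recovered from it.</p>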
<p><a id="flexibility"></a></p>
<h3 id="lack-of-flexibility">Lack of flexibility</h3>
<p>In order to collect, store and graph new kinds of metrics an operator needs a certain level of programming skill and experience with the internals of the monitoring system. Adding new metrics can take anywhere from hours to weeks depending on the skill and experience of the operator. Creating new graphs based on existing metrics is also very difficult on most systems and not within reach of the average operator.</p>
<p><a id="revolution"></a></p>
<h2 id="the-monitoring-revolution">The monitoring revolution</h2>
<p>We are currently at the beginning of a monitoring revolution. The advent of cloud computing and big data has created a need for measuring lots of metrics for thousands of machines at small intervals. This has sparked the creation of completely new monitoring components. One area where we now have improved alternatives is efficient metric storage.</p>
<p>The first is <strong><a href="http://opentsdb.net/">OpenTSDB</a></strong>, a “Scalable, Distributed, Time Series Database” that began development at <a href="https://www.stumbleupon.com/">StumbleUpon</a> in 2011 and aims to solve some of the problems with existing monitoring systems. OpenTSDB is built on top of Apache HBase, a scalable and performant database that in turn builds on Apache Hadoop. Hadoop is a series of tools for building large and scalable distributed systems. Back in 2010 Facebook already had <a href="http://hadoopblog.blogspot.no/2010/05/facebook-has-worlds-largest-hadoop.html">2000 machines in a Hadoop cluster</a> with 21PB (that is 21.000.000 GB) of combined storage.</p>
<p>The second is an interesting newcomer, <a href="http://influxdb.com/"><strong>InfluxDB</strong></a>, that began development in 2013 and has the goal of offering scalability and performance without the requirements of HBase/Hadoop.</p>
<p>In addition to advances in performance these alternatives also decouple storage of metrics and display of graphs and abstract the interaction in simple and well-defined APIs. This makes it easy for developers to create improved frontends rapidly and this has already resulted in several very attractive open-source frontends such as <strong><a href="https://github.com/Ticketmaster/Metrilyx-2.0">Metrilyx</a></strong> (OpenTSDB), <strong><a href="http://grafana.org/">Grafana</a></strong> (InfluxDB, Graphite, <a href="https://github.com/grafana/grafana/pull/211">soon OpenTSDB</a>), <strong><a href="http://www.statuswolf.com/">StatusWolf</a></strong> (OpenTSDB), <strong><a href="https://github.com/hakobera/influga">Influga</a></strong> (InfluxDB).</p>
<p><a id="Debian"></a></p>
<h2 id="setting-up-a-single-node-opentsdb-instance-on-debian-7-wheezy">Setting up a single node OpenTSDB instance on Debian 7 Wheezy</h2>
<p>In the rest of this paper we will set up a single node OpenTSDB instance. OpenTSDB builds on top of HBase and Hadoop and scales to very large setups easily. But it also delivers substantial performance on a single node, which can be deployed in <strong>less than an hour</strong>. There are plenty of guides on installing a Hadoop cluster, but here we will focus on the natural first step of getting a single node running using <strong>recent releases</strong> of the relevant software:</p>
<ul>
<li>OpenTSDB 2.0.0 - Released 2014-05-05</li>
<li>HBase 0.98.2 - Released 2014-05-01</li>
<li>Hadoop 2.4.0 - Released 2014-04-07</li>
</ul>
<blockquote>
<p>If you later need to deploy a larger cluster, consider using a distribution such as <a href="http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html"><strong>Cloudera CDH</strong></a> or <a href="http://hortonworks.com/hdp/"><strong>Hortonworks HDP</strong></a>. These are open-source platforms that package Apache Hadoop components and provide a fully tested environment and easy-to-use graphical frontends for configuration and management. It is <a href="http://opentsdb.net/setup-hbase.html">recommended to have at least 5 machines</a> in an HBase cluster supporting OpenTSDB.</p>
</blockquote>
<hr />
<blockquote>
<p>This guide assumes you are somewhat familiar with using a Linux shell/command prompt.</p>
</blockquote>
<p><a id="Hardware"></a></p>
<h4 id="hardware-requirements">Hardware requirements</h4>
<ul>
<li>CPU cores: Max (Limit to 50% of your available CPU resources)</li>
<li>RAM: Min 16 GB</li>
<li>Disk 1 - OS: 10 GB - Thin provisioned</li>
<li>Disk 2 - Data: 100 GB - Thin provisioned</li>
</ul>
<p><a id="Operating"></a></p>
<h4 id="operating-system-requirements">Operating system requirements</h4>
<p>This guide is based on a recently installed Debian 7 Wheezy <strong>64bit</strong> installed without any extra packages. See the <a href="https://www.debian.org/releases/stable/amd64/">official documentation</a> for more information.</p>
<p>All commands are entered as <strong>root</strong> user unless otherwise noted.</p>
<p><a id="preparations"></a></p>
<h4 id="pre-setup-preparations">Pre-setup preparations</h4>
<p>We start by installing a few tools that we will need later.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get install wget make gcc g++ cmake maven
</code></pre></div></div>
<p>Create a new ext3 partition on the data disk <strong>/dev/sdb</strong>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(echo "n"; echo "p"; echo ""; echo ""; echo ""; echo "t"; echo "83"; echo "w") | fdisk /dev/sdb
mkfs.ext3 /dev/sdb1
</code></pre></div></div>
<blockquote>
<p>ext3 is the <a href="https://wiki.apache.org/hadoop/DiskSetup">recommended filesystem for Hadoop</a>.</p>
</blockquote>
<p>Create a mountpoint <strong>/mnt/data1</strong> and add it to the file system table and mount the disk:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir /mnt/data1
echo "/dev/sdb1 /mnt/data1 ext3 auto,noexec,noatime,nodiratime 0 1" | tee -a /etc/fstab
mount /mnt/data1
</code></pre></div></div>
<blockquote>
<p>Using <strong>noexec</strong> for the data partition will increase security as nothing on the data partition will be allowed to ever execute.
<br />
Using <strong>noatime</strong> and <strong>nodiratime</strong> increases performance since the read access timestamps are not updated on every file access.</p>
</blockquote>
<p><a id="java"></a></p>
<h4 id="installing-java-from-packages">Installing java from packages</h4>
<p>Installing Java on Linux can be quite challenging due to licensing issues, but thanks to the people over at <a href="https://launchpad.net/">Launchpad.net</a>, who provide a repository with a custom Java package, this can now be done quite easily.</p>
<p>We start by adding the launchpad java repository to our <strong><em>/etc/apt/sources.list</em></strong> file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list
echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list
</code></pre></div></div>
<p>Add the signing key and download information from the new repository:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
apt-get update
</code></pre></div></div>
<p>Run the java installer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get install oracle-java7-installer
</code></pre></div></div>
<p>Follow the instructions on screen to complete the Java 7 installation.</p>
<p><a id="HBase"></a></p>
<h3 id="installing-hbase">Installing HBase</h3>
<p>OpenTSDB has its own HBase installation tutorial <a href="http://opentsdb.net/setup-hbase.html">here</a>. It is very brief and does not use the latest versions or snappy compression.</p>
<p>Download and unpack HBase:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd /opt
wget http://apache.vianett.no/hbase/hbase-0.98.2/hbase-0.98.2-hadoop2-bin.tar.gz
tar xvfz hbase-0.98.2-hadoop2-bin.tar.gz
export HBASEDIR=`pwd`/hbase-0.98.2-hadoop2/
</code></pre></div></div>
<p>Increase the system-wide limits on open files and processes from the default of about 1000 to roughly 32000 by adding a few lines to <strong><em>/etc/security/limits.conf</em></strong>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo "root - nofile 32768" | tee -a /etc/security/limits.conf
echo "root - nproc 32000" | tee -a /etc/security/limits.conf
echo "* - nofile 32768" | tee -a /etc/security/limits.conf
echo "* - nproc 32000" | tee -a /etc/security/limits.conf
</code></pre></div></div>
<p>The settings above will only take effect if we also add a line to <strong><em>/etc/pam.d/common-session</em></strong>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo "session required pam_limits.so" | tee -a /etc/pam.d/common-session
</code></pre></div></div>
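<p>Limits set this way are applied when a session starts, so log out and back in before checking them. As a rough way of verifying the result, the following Python sketch reads the limits of the current process through the standard <strong>resource</strong> module. The function name is ours, and a value of -1 means unlimited.</p>

```python
import resource

def current_limits():
    """Return the (soft, hard) limits of this process for files and processes."""
    return {
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE),
        "nproc": resource.getrlimit(resource.RLIMIT_NPROC),
    }

for name, (soft, hard) in sorted(current_limits().items()):
    print("{0}: soft={1} hard={2}".format(name, soft, hard))
```

<p>Run from a fresh root shell, both limits should report the values written to limits.conf above.</p>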
<p><a id="snappy"></a></p>
<h4 id="install-snappy">Install snappy</h4>
<p><a href="https://code.google.com/p/snappy/">Snappy</a> is a compression algorithm that values speed over compression ratio, which makes it a good choice for high throughput applications such as Hadoop/HBase. Due to licensing issues Snappy does not ship with HBase and needs to be installed on top.</p>
<p>The installation process is a bit complicated and has caused headaches for many people (me included). Here we will show a method of installing snappy and getting it to work with the latest versions of HBase and Hadoop.</p>
<blockquote>
<p><strong>Compression algorithms in HBase</strong>
Compression is the method of reducing the size of a file or text without losing any of the contents. There are many compression algorithms available. Some focus on creating the smallest possible compressed file at the cost of time and CPU usage, while others achieve a <em>reasonable</em> compression ratio while being very fast.
<br /> <br />
HBase supports gz (gzip/zlib), snappy and lzo. Due to licensing issues only gz is included out of the box.
Unfortunately gz is a slow and costly algorithm compared to snappy and lzo. In a test performed by Yahoo (see <a href="http://www.slideshare.net/Hadoop_Summit/singh-kamat-june27425pmroom210c">slides here</a>, page 8) gz achieves 64% compression in 32 seconds, lzo 47% in 4.8 seconds and snappy 42% in 4.0 seconds. lz4 is another algorithm <a href="http://search-hadoop.com/m/KFLWV1PFVhp1">considered for inclusion</a> that is even faster (2.4 seconds) but requires much more memory.
<br /> <br />
<em>For more information look at the <a href="https://hbase.apache.org/book/compression.html">Apache HBase Handbook - Appendix C - Compression</a></em></p>
</blockquote>
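<p>The speed versus ratio trade-off can be observed with nothing but the Python standard library. Snappy and lzo are not part of it, so the sketch below uses zlib at its fastest and slowest levels as a stand-in; the sample data and function name are made up, and the timings will differ from the Yahoo test quoted above.</p>

```python
import time
import zlib

def measure(data, level):
    """Compress data at the given zlib level; return (size ratio, seconds)."""
    started = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - started
    # Compression must be lossless: decompressing gives the original back.
    assert zlib.decompress(packed) == data
    return len(packed) / len(data), elapsed

# Hypothetical sample: repetitive text resembling metric lines.
sample = "".join(
    "df.bytes.free {0} {1} host=server{2}\n".format(1400000000 + i, i * 37 % 1000, i % 10)
    for i in range(50000)
).encode("ascii")

for level in (1, 9):
    ratio, seconds = measure(sample, level)
    print("level {0}: {1:.1f}% of original size in {2:.3f} s".format(level, ratio * 100, seconds))
```

<p>On most inputs the higher level produces a smaller result but takes noticeably longer, which is the same kind of trade-off as between gz and snappy.</p>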
<p><a id="native"></a></p>
<h4 id="building-native-libhadoop-and-libsnappy">Building native libhadoop and libsnappy</h4>
<p>In order to use compression we need the common Hadoop library, libhadoop.so, and the snappy library, libsnappy.so. HBase ships without libhadoop.so, and the libhadoop.so that ships in the Hadoop package is built for 32-bit systems only, so we need to compile these files ourselves.</p>
<p>Start by downloading and installing ProtoBuf. Hadoop requires version 2.5+, which unfortunately is not available as a Debian package.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget --no-check-certificate https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar zxvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure; make; make install
export LD_LIBRARY_PATH=/usr/local/lib/
</code></pre></div></div>
<p>Download and compile Hadoop:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get install zlib1g-dev libsnappy-dev
wget http://apache.uib.no/hadoop/common/hadoop-2.4.0/hadoop-2.4.0-src.tar.gz
tar zxvf hadoop-2.4.0-src.tar.gz
cd hadoop-2.4.0-src/hadoop-common-project/
mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy
</code></pre></div></div>
<p>Copy the newly compiled native libhadoop library into /usr/local/lib, then create the folder in which HBase looks for it and create a symbolic link from there to /usr/local/lib/libhadoop.so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cp hadoop-common/target/native/target/usr/local/lib/libhadoop.* /usr/local/lib
mkdir -p $HBASEDIR/lib/native/Linux-amd64-64/
cd $HBASEDIR/lib/native/Linux-amd64-64/
ln -s /usr/local/lib/libhadoop.so* .
</code></pre></div></div>
<p>Install snappy from Debian packages:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get install libsnappy-dev
</code></pre></div></div>
<p><a id="ConfiguringHBase"></a></p>
<h4 id="configuring-hbase">Configuring HBase</h4>
<p>Now we need to do some basic configuration before we can start HBase. The configuration files are in $HBASEDIR/conf/.</p>
<p><a id="hbase-env.sh"></a></p>
<h4 id="confhbase-envsh"><strong>conf/hbase-env.sh</strong></h4>
<p>A shell script setting various environment variables related to how HBase and Java should behave. The file contains a lot of options and they are all documented by comments so feel free to look around in it.</p>
<p>Start by setting the JAVA_HOME, which points to where Java is installed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export JAVA_HOME=/usr/lib/jvm/java-7-oracle/
</code></pre></div></div>
<p>Then increase the size of the <a href="http://pubs.vmware.com/vfabric52/index.jsp?topic=/com.vmware.vfabric.em4j.1.2/em4j/conf-heap-management.html">Java Heap</a> from the default of 1000 MB, which is a bit low:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export HBASE_HEAPSIZE=8000
</code></pre></div></div>
<p><a id="hbase-site.xml"></a></p>
<h4 id="confhbase-sitexml"><strong>conf/hbase-site.xml</strong></h4>
<p>An XML file containing HBase specific configuration parameters.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><configuration>
<property>
<name>hbase.rootdir</name>
<value>/mnt/data1/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/mnt/data1/zookeeper</value>
</property>
</configuration>
</code></pre></div></div>
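<p>A typo in this file is easy to miss, so it can be worth reading it back before starting HBase. The following Python sketch parses a Hadoop-style configuration with the standard library. The function name is ours and the sample string simply repeats the configuration above; in practice you would read the file from $HBASEDIR/conf/hbase-site.xml instead.</p>

```python
import xml.etree.ElementTree as ET

def read_properties(xml_text):
    """Return {name: value} for every property element in the configuration."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }

# Same content as the hbase-site.xml shown above.
SAMPLE = """<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>/mnt/data1/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/mnt/data1/zookeeper</value>
  </property>
</configuration>"""

for name, value in sorted(read_properties(SAMPLE).items()):
    print("{0} = {1}".format(name, value))
```

<p>If the XML is malformed, <strong>ET.fromstring</strong> raises a ParseError that points at the offending line.</p>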
<p><a id="compression"></a></p>
<h4 id="testing-hbase-and-compression">Testing HBase and compression</h4>
<p>Now that we have installed snappy and configured HBase we can verify that HBase is working and that the compression is loaded by doing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$HBASEDIR/bin/hbase org.apache.hadoop.hbase.util.CompressionTest /tmp/test.txt snappy
</code></pre></div></div>
<p>This should output some lines with information and end with <strong>SUCCESS</strong>.</p>
<p><a id="StartingHBase"></a></p>
<h4 id="starting-hbase">Starting HBase</h4>
<p>HBase ships with scripts for starting and stopping it, namely start-hbase.sh and stop-hbase.sh. You start HBase with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$HBASEDIR/bin/start-hbase.sh
</code></pre></div></div>
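<p>Then look at the log to ensure it has started without any serious errors. The log file name includes the user and the hostname, here <strong>root</strong> and <strong>opentsdb</strong>:</p>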
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tail -fn100 $HBASEDIR/bin/../logs/hbase-root-master-opentsdb.log
</code></pre></div></div>
<p>If you want HBase to start automatically on boot you can use a process management tool such as <a href="http://mmonit.com/monit/">Monit</a> or simply put it in <strong><em>/etc/rc.local</em></strong>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/opt/hbase-0.98.2-hadoop2/bin/start-hbase.sh
</code></pre></div></div>
<p><a id="InstallingOpenTSDB"></a></p>
<h3 id="installing-opentsdb">Installing OpenTSDB</h3>
<p>Start by installing gnuplot, which is used by the native webui to draw graphs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get install gnuplot
</code></pre></div></div>
<p>Then download and install OpenTSDB:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://github.com/OpenTSDB/opentsdb/releases/download/v2.0.0/opentsdb-2.0.0_all.deb
dpkg -i opentsdb-2.0.0_all.deb
</code></pre></div></div>
<p><a id="ConfiguringOpenTSDB"></a></p>
<h4 id="configuring-opentsdb">Configuring OpenTSDB</h4>
<p>The configuration file is <strong><em>/etc/opentsdb/opentsdb.conf</em></strong>. It has some of the basic configuration parameters but not nearly all of them. <a href="http://opentsdb.net/docs/build/html/user_guide/configuration.html">Here is the official documentation with all configuration parameters</a>.</p>
<p>The defaults are reasonable but we need to make a few tweaks. The first is to add this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tsd.core.auto_create_metrics = true
</code></pre></div></div>
<p>This will make OpenTSDB accept previously unseen metrics and add them to the database. This is very useful in the beginning when feeding data into OpenTSDB. Without this you will have to run the <strong><em>mkmetric</em></strong> command for each metric you want to store, and you will get errors that can be hard to trace if the metric you created does not match what is actually sent.</p>
<p>Then we will add support for chunked requests via the HTTP API:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tsd.http.request.enable_chunked = true
tsd.http.request.max_chunk = 16000
</code></pre></div></div>
<p>Some tools and plugins (such as our own <a href="https://github.com/PeritusConsulting/collectd-opentsdb">improved collectd to OpenTSDB plugin</a>) send multiple data points in a single HTTP request for increased efficiency and require this setting to be enabled.</p>
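<p>As an illustration of what such a batched request looks like, the Python sketch below builds a JSON body for the OpenTSDB 2.0 HTTP endpoint <strong>/api/put</strong>, which accepts an array of data points. The function name and the sample values are ours, and the size check against <strong>tsd.http.request.max_chunk</strong> is a client-side precaution we assume is useful, not something OpenTSDB requires.</p>

```python
import json

MAX_CHUNK = 16000  # mirrors tsd.http.request.max_chunk from the config above

def build_put_payload(points):
    """Serialise (metric, timestamp, value, tags) tuples for POST /api/put."""
    body = json.dumps([
        {"metric": metric, "timestamp": timestamp, "value": value, "tags": tags}
        for metric, timestamp, value, tags in points
    ])
    if len(body.encode("utf-8")) > MAX_CHUNK:
        raise ValueError("payload larger than max_chunk, split it into batches")
    return body

# Hypothetical data points, five seconds apart.
points = [
    ("df.bytes.free", 1401739000, 1024, {"host": "server1", "mount": "/mnt/data1"}),
    ("df.bytes.free", 1401739005, 1000, {"host": "server1", "mount": "/mnt/data1"}),
]
body = build_put_payload(points)
print(len(body), "bytes for", len(points), "data points")
```

<p>The resulting string can be sent with any HTTP client as the body of a POST request to http://hostname:4242/api/put.</p>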
<p><a id="HBasetables"></a></p>
<h4 id="creating-hbase-tables">Creating HBase tables</h4>
<p>Before we start OpenTSDB we need to create the necessary tables in HBase:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>env COMPRESSION=SNAPPY HBASE_HOME=$HBASEDIR /usr/share/opentsdb/tools/create_table.sh
</code></pre></div></div>
<p><a id="StartingOpenTSDB"></a></p>
<h4 id="starting-opentsdb">Starting OpenTSDB</h4>
<p>Since version 2.0.0 OpenTSDB ships as a Debian package and includes SysV init scripts. To start OpenTSDB as a daemon running in the background we run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>service opentsdb start
</code></pre></div></div>
<p>And then check the logs for any errors or other relevant information:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tail -f /var/log/opentsdb/opentsdb.log
</code></pre></div></div>
<p>If the server is started successfully the last line of the log should say:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>13:42:30.900 INFO [TSDMain.main] - Ready to serve on /0.0.0.0:4242
</code></pre></div></div>
<p>You can now open your new OpenTSDB instance in a browser at http://hostname:4242.</p>
<p><a id="Feeding"></a></p>
<h3 id="feeding-data-into-opentsdb">Feeding data into OpenTSDB</h3>
<p>It is not within the scope of this paper to go into details about how to feed data into OpenTSDB but we will give a quick introduction here to get you started.</p>
<blockquote>
<p><strong>A note on metric naming in OpenTSDB</strong>
<br /> <br />
Each datapoint has a metric name such as <strong><em>df.bytes.free</em></strong> and one or more tags such as <strong><em>host=server1</em></strong> and <strong><em>mount=/mnt/data1</em></strong>. This is closer to the proposed <a href="http://metrics20.org/">Metrics 2.0</a> standard for naming metrics than the traditional naming of <strong><em>df.bytes.free.server1.mnt-data</em></strong>. This makes it possible to create aggregates across tags and combine data easily using the tags.
<br /> <br />
OpenTSDB stores all datapoints for a given metric and tag combination in one HBase row per hour. But due to an HBase issue it still has to scan every row that matches the metric, regardless of tags, even though it only returns the data that also matches the tags. This results in a lot of data being read, and reads will be very slow if there is a large number of data points for a given metric. The default for the collectd-opentsdb plugin is to use the read plugin name as the metric and other values as tags. In my case this results in 72.000.000 datapoints per hour for this metric. When generating a graph all of this data has to be read and evaluated first. 24 hours of data is over 1.7 billion datapoints for this single metric and results in read times of 5-15 <strong>minutes</strong> for a simple graph.
<br /> <br />
A solution to this is to use <em>shift-to-metric</em>, as <a href="http://opentsdb.net/docs/build/html/user_guide/writing.html">mentioned in the OpenTSDB user guide</a>. Shift-to-metric simply means moving one or more data identifiers from the tags into the metric name in order to reduce the cardinality (number of values) of a metric, and hence the time required to read out the data we want. We have modified the collectd-opentsdb java plugin to shift the tags to metrics, and this improves read performance by ~1000x, down to 10-100ms. Read the section about collectd below for more information on our modified plugin.</p>
</blockquote>
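<p>The idea behind shift-to-metric fits in a few lines. The Python sketch below is not the code of our modified plugin; the function name and the choice of which tag to shift are assumptions made for illustration.</p>

```python
def shift_to_metric(metric, tags, shift_keys):
    """Move the values of the tags named in shift_keys into the metric name."""
    shifted = ".".join([metric] + [tags[key] for key in shift_keys])
    remaining = dict((k, v) for k, v in tags.items() if k not in shift_keys)
    return shifted, remaining

# Before: one metric shared by every host, so every host's rows get scanned.
metric = "df.bytes.free"
tags = {"host": "server1", "mount": "/mnt/data1"}

# After: the host is part of the metric name, so only its own rows match.
name, rest = shift_to_metric(metric, tags, ["host"])
print(name, rest)
```

<p>With the host in the metric name, a query for one server only scans that server's rows instead of the rows of every server reporting the same metric.</p>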
<p><a id="tcollector"></a></p>
<h4 id="tcollector">tcollector</h4>
<p><a href="http://opentsdb.net/docs/build/html/user_guide/utilities/tcollector.html">tcollector</a> is the default agent for collecting and sending data from a Linux server to an OpenTSDB server. It is based on Python, and plugins / addons can be written in any language. It ships with the most common plugins to collect information about disk usage and performance, CPU and memory statistics, and also for some specific systems such as mysql, mongodb, riak, varnish, postgresql and others. tcollector is very lightweight and features advanced de-duplication in order to reduce unneeded network traffic.</p>
<p>The commands for installing dependencies and downloading tcollector are:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get install git python
cd /opt
git clone https://github.com/OpenTSDB/tcollector.git
</code></pre></div></div>
<p>Configuration is done in the startup script <strong><em>tcollector/startstop</em></strong>. You will need to uncomment TSD_HOST and set it to point to your OpenTSDB server.</p>
<p>To start it run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/opt/tcollector/startstop start
</code></pre></div></div>
<p>This is also the command you want to add to <strong><em>/etc/rc.local</em></strong> in order to have the agent automatically start at boot. Logfiles are saved in <strong><em>/var/log/tcollector.log</em></strong> and they are rotated automatically.</p>
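<p>Writing your own collector is straightforward: a tcollector plugin is a program that prints lines of the form <em>metric timestamp value tag=value</em> to standard output. The Python sketch below reports the system load average. The metric name <strong>example.loadavg</strong>, the <strong>window</strong> tag and the function names are made up for this example.</p>

```python
import os
import sys
import time

def format_line(metric, value, tags=None, now=None):
    """Build one 'metric timestamp value tag=value ...' line."""
    timestamp = int(time.time() if now is None else now)
    tag_text = "".join(
        " {0}={1}".format(key, val) for key, val in sorted((tags or {}).items())
    )
    return "{0} {1} {2}{3}".format(metric, timestamp, value, tag_text)

def main():
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    windows = zip(("1m", "5m", "15m"), os.getloadavg())
    for window, load in windows:
        print(format_line("example.loadavg", load, {"window": window}))
    sys.stdout.flush()

if __name__ == "__main__":
    main()
```

<p>Placed as an executable file in a numbered directory under <strong><em>collectors/</em></strong>, where the directory name is the run interval in seconds, a script like this is picked up by tcollector, which adds the host tag itself.</p>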
<p><a id="peritus-tc-tools"></a></p>
<h4 id="peritus-tc-tools">peritus-tc-tools</h4>
<p>We have developed a set of <strong>tcollector</strong> plugins for collecting statistics from:</p>
<ul>
<li><strong><a href="https://www.isc.org/downloads/dhcp/">ISC DHCPd server</a></strong>, about number of DHCP events and DHCP pool sizes</li>
<li><strong><a href="http://www.opensips.org/">OpenSIPS</a></strong>, total number of subscribers and registered user agents</li>
<li><strong><a href="http://atmail.com/">Atmail</a></strong>, number of users, admins, sent and received emails, logins and errors</li>
</ul>
<p>We have also developed a high performance replacement for <strong><a href="http://oss.oetiker.ch/smokeping/">smokeping</a></strong> called <strong>tc-ping</strong>.</p>
<p>These plugins are available for download from our <strong><a href="https://github.com/PeritusConsulting/peritus-tc-tools">GitHub page</a></strong>.</p>
<p><a id="collectd-opentsdb"></a></p>
<h4 id="collectd-opentsdb">collectd-opentsdb</h4>
<p><a href="http://collectd.org/">collectd</a> is the <em>system statistics collection daemon</em> and is a widely used system for collecting metrics from various sources. There are several options for sending data from collectd to OpenTSDB but one way that works well is to use the <a href="https://github.com/auxesis/collectd-opentsdb">collectd-opentsdb java write plugin</a>.</p>
<p>Since collectd is a generic metric collection tool the original collectd-opentsdb plugin will use the plugin name (such as <strong>snmp</strong>) as the metric, and use tags such as <strong>host=servername</strong>, <strong>plugin_instance=ifHCInOctets</strong> and <strong>type_instance=FastEthernet0/1</strong>.</p>
<p>As mentioned in the <strong><em>note on metric naming in OpenTSDB</em></strong>, this can be very inefficient when data needs to be read back, resulting in read performance potentially thousands of times slower than optimal (under 100 ms). To alleviate this we have modified the original collectd-opentsdb plugin to store all metadata as part of the metric. This gives metric names such as ifHCInBroadcastPkts.sw01.GigabitEthernet0 and very good read performance.</p>
<p>The modified collectd-opentsdb plugin can be downloaded from our <a href="https://github.com/PeritusConsulting/collectd-opentsdb">GitHub repository</a>.</p>
<p><a id="MonitoringOpenTSDB"></a></p>
<h4 id="monitoring-opentsdb">Monitoring OpenTSDB</h4>
<p>To monitor OpenTSDB itself, install tcollector as described above on the OpenTSDB server and set <strong><em>TSD_HOST</em></strong> to <strong><em>localhost</em></strong> in <strong><em>/opt/tcollector/startstop</em></strong>.</p>
<p>You can then go to http://opentsdb-server:4242/#start=1h-ago&end=1s-ago&m=sum:rate:tsd.rpc.received%7Btype=*%7D&o=&yrange=%5B0:%5D&wxh=1200x600 to view a graph of the amount of data received in the last hour.</p>
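<p>If you want to generate such graph links for other metrics, the URL can be assembled programmatically. The sketch below uses only the Python standard library; the function name and its parameters are hypothetical, and the parameter set simply mirrors the URL shown above.</p>

```python
from urllib.parse import quote


def graph_url(host, metric, tags, start="1h-ago", end="1s-ago", port=4242):
    """Build an OpenTSDB web UI link that mirrors the URL shown above."""
    tag_expr = "{" + ",".join(f"{k}={v}" for k, v in tags.items()) + "}"
    params = [
        ("start", start),
        ("end", end),
        ("m", f"sum:rate:{metric}{tag_expr}"),  # aggregator:rate:metric{tags}
        ("o", ""),
        ("yrange", "[0:]"),      # y axis starts at zero
        ("wxh", "1200x600"),     # graph size in pixels
    ]
    # Keep ':', '*', '=' and ',' readable; braces and brackets get escaped.
    fragment = "&".join(f"{k}={quote(v, safe=':*=,')}" for k, v in params)
    return f"http://{host}:{port}/#{fragment}"


if __name__ == "__main__":
    print(graph_url("opentsdb-server", "tsd.rpc.received", {"type": "*"}))
```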
<p><a id="Performancecomparison"></a></p>
<h3 id="performance-comparison">Performance comparison</h3>
<p>Lastly, we include a small performance comparison between the latest version of OpenTSDB+HBase+Hadoop, a previous version of OpenTSDB+HBase+Hadoop that we have used for a while, and rrdcached, which ran in production for 4 years at a client.</p>
<p>The workload is gathering and storing metrics from 150 Cisco switches with 8200 ports/interfaces every 5 seconds. This equals about 15,000 data points per second.</p>
<div style="text-align:center;"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/Figure1.png" /> </div>
<p><em>Figure 1 - Data received by OpenTSDB per second</em></p>
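<p>The workload figures can be cross-checked with a few lines of arithmetic. The number of metrics collected per port is not stated in the text; the value below is only what the quoted rate implies.</p>

```python
ports = 8200                 # interfaces across the 150 switches
poll_interval_s = 5          # seconds between polls
points_per_second = 15_000   # approximate rate quoted in the text

# Data points written in one polling cycle.
points_per_poll = points_per_second * poll_interval_s

# Implied number of metrics per port (derived, not stated in the text).
metrics_per_port = points_per_poll / ports

# Data points accumulated over 24 hours at this rate.
points_per_day = points_per_second * 86_400

print(points_per_poll)
print(round(metrics_per_port, 1))
print(points_per_day)
```

<p>At this rate the system stores roughly 1.3 billion data points per day, which gives a feeling for the volume behind the storage figures that follow.</p>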
<p><a id="Collection"></a></p>
<h4 id="collection">Collection</h4>
<p>Even though it is not the primary focus, we include some data about collection performance for completeness. Collection is done using the latest version of <a href="http://collectd.org/">collectd</a> and the built-in SNMP plugin.</p>
<blockquote>
<p><strong>NB #1:</strong> There is a <a href="https://github.com/collectd/collectd/issues/610">memory leak</a> in the way collectd’s SNMP plugin uses the underlying libsnmp library. When handling large workloads you might need to schedule a regular restart of the collectd service as a workaround.</p>
</blockquote>
<blockquote>
<p><strong>NB #2:</strong> Due to <a href="http://comments.gmane.org/gmane.comp.monitoring.collectd/5061">limitations in the libnetsnmp library</a> you will run into problems if polling many (1000+) devices with a single collectd instance. A workaround is to run multiple collectd instances with fewer hosts.</p>
</blockquote>
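<p>The workaround in NB #2 amounts to partitioning the device list so that no single collectd instance polls too many hosts. A minimal sketch, assuming a plain list of host names; the function name and the limit of 500 hosts per instance are hypothetical and should be adjusted to what works in your environment:</p>

```python
def split_hosts(hosts, max_per_instance):
    """Split a host list into chunks, one chunk per collectd instance."""
    if max_per_instance < 1:
        raise ValueError("max_per_instance must be at least 1")
    return [
        hosts[i:i + max_per_instance]
        for i in range(0, len(hosts), max_per_instance)
    ]


if __name__ == "__main__":
    hosts = [f"sw{i:04d}" for i in range(1200)]  # made-up host names
    for n, group in enumerate(split_hosts(hosts, 500), start=1):
        print(f"collectd instance {n}: {len(group)} hosts")
```

<p>Each group would then get its own collectd configuration file and service instance.</p>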
<p>Figure 2 shows that collection through SNMP polling consumes about 2200 MHz of CPU. We optimized some of the data types and definitions in collectd when moving to OpenTSDB and achieved a 20% performance increase in the polling, as seen in Figure 3.</p>
<div style="text-align:center;"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/Figure2.png" /> </div>
<p><em>Figure 2 - CPU Usage - SNMP polling and writing to RRDcached</em></p>
<div style="text-align:center;"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/Figure3.png" /> </div>
<p><em>Figure 3 - CPU Usage - SNMP polling and sending to OpenTSDB</em></p>
<p>Writing with the native rrdcached write plugin consumes 1300 MHz, while our modified collectd-opentsdb plugin consumes 1450 MHz. A much more efficient write plugin could probably be created with more advanced knowledge of concurrency and a lower-level language such as C.</p>
<p><a id="Storage"></a></p>
<h4 id="storage">Storage</h4>
<p>When considering storage performance we will look at CPU usage and disk IOPS since these are the primary drivers of cost in today’s datacenters.</p>
<h4 id="collectd--rrdcached">collectd + rrdcached</h4>
<p>CPU usage - 1300 MHz, see Figure 2 above.
<br /></p>
<div style="text-align:center;"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/Figure4.png" /> </div>
<p><em>Figure 4 - Disk write IOPS - Fluctuating between 10 and 170 IOPS during the 1 hour flush period.</em></p>
<h4 id="opentsdb--hbase-096--hadoop-1">OpenTSDB + HBase 0.96 + Hadoop 1</h4>
<div style="text-align:center;"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/Figure5.png" /> </div>
<p><em>Figure 5 - CPU usage - 1700 MHz baseline with peaks of 7000 MHz during <a href="http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html">Java Garbage Collection (GC)</a> (untuned).</em>
<br /></p>
<div style="text-align:center;"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/Figure6.png" /> </div>
<p><em>Figure 6 - Disk write IOPS - 5 IOPS average with peaks of 25 IOPS during Java GC. Disk read IOPS are much higher; this is due to regular compaction of the database, which can be tuned. Reads in general can be reduced by increasing caching with more RAM if necessary.</em></p>
<h4 id="opentsdb--hbase-098--hadoop-2">OpenTSDB + HBase 0.98 + Hadoop 2</h4>
<div style="text-align:center;"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/Figure7.png" /> </div>
<p><em>Figure 7 - CPU usage - 1200 MHz baseline with peaks of 5000-6000 MHz during Java GC (untuned).</em>
<br /></p>
<div style="text-align:center;"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/Figure8.png" /> </div>
<p><em>Figure 8 - Disk write IOPS - below 5 IOPS average with peaks of 25 IOPS during Java GC. Far fewer read IOPS during compaction compared to HBase 0.96.</em></p>
<p><a id="Conclusion"></a></p>
<h4 id="conclusion">Conclusion</h4>
<p>Even without tuning, a single-instance OpenTSDB installation is able to handle significant amounts of data before running into IO problems. This comes at a cost in CPU: currently OpenTSDB consumes more than 300% of the CPU cycles that rrdcached uses for storage. This is offset by an 85-95% reduction in disk load. In absolute terms, for our particular setup (one 2-year-old HP DL360p Gen8 running VMware vSphere 5.5), CPU usage increased from 15% to 25% while IOPS load dropped from 70% to below 10%.</p>
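<p>The host-level trade-off can be worked out from the utilisation figures quoted in the conclusion. The helper function is only illustrative, and since the IOPS load ended up below 10%, the computed reduction is a lower bound. These are presumably host-wide figures, so the CPU change is not directly comparable to the storage-only CPU comparison.</p>

```python
def relative_change(before, after):
    """Relative change from before to after, as a fraction of before."""
    return (after - before) / before


# Host utilisation in percent, as quoted in the conclusion.
iops_change = relative_change(70, 10)   # IOPS load: 70% down to (at most) 10%
cpu_change = relative_change(15, 25)    # CPU usage: 15% up to 25%

print(f"IOPS load change: {iops_change:+.1%}")
print(f"CPU usage change: {cpu_change:+.1%}")
```

<p>A reduction of at least 85.7% in IOPS load is consistent with the 85-95% range stated in the conclusion.</p>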
<p><br />
<br />
<em>Fine tuning of parameters (such as Java GC) as well as detailed analysis of memory usage is outside the scope of this brief paper and detailed information may be found elsewhere (<a href="https://hbase.apache.org/book/performance.html">51</a>,<a href="http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html">52</a>,<a href="http://www.cubrid.org/blog/textyle/428187">53</a>) for those interested.</em>
<br />
<br /></p>
<hr />
<blockquote>
<p><strong>Stian Ovrevage</strong></p>
<table style="border: 0"><tr><td width="100px"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/stianovrevage.jpg" /> </td>
<td>Stian is a senior consultant and founder at Peritus Consulting AS. He is currently managing the technical systems for a small FTTH ISP in Norway. He also does consulting for other clients when time permits. When not digging deep into technical challenges he enjoys the outdoors.<br /><br />Also on GitHub, LinkedIn, Facebook, Google+ and Twitter.</td></tr></table>
</blockquote>
<hr />
<blockquote>
<p><strong>Peritus Consulting Technical Reports</strong></p>
<table style="border: 0"><tr><td width="100px"> <img src="/images/2014-06-02-next-generation-monitoring-using-opentsdb-images/PeritusConsulting_small.png" /> </td>
<td> Technical reports are in-depth articles aimed at giving actionable advice on new technologies as well as recommended best practices based on tried and true solutions. We cover areas that lack good in-depth coverage online, but will not rewrite topics that are already covered satisfactorily elsewhere.
We also write tech notes, which are shorter pieces with thoughts and tips on both technology and how it should be used optimally.
Our official webpage (in Norwegian) is at www.peritusconsulting.no, articles are published on our GitHub Page, and we are also on Twitter, LinkedIn and Facebook.
</td></tr></table>
</blockquote>
<hr />
<p>Stian Øvrevåge</p>
<p>In this paper we will provide a step by step guide on how to install a single-instance of OpenTSDB using the latest versions of the underlying technology, Hadoop and HBase. We will also provide some background on the state of existing monitoring solutions.</p>