Stian Øvrevåge

A side quest in API development, observability, Kubernetes and cloud with a hint of database

2021-03-06T00:00:00+00:00

Quite often people ask me what I actually do. I have a hard time giving a short answer. Even to colleagues and friends in the industry.

Here I will try to show and tell how I spent an evening digging around in a system I helped build for a client.

Table of contents

Background
The (initial) problem
Side quest: Database optimizations
Determining the next bottleneck
Side quest: Cluster resources and burstable VMs
Conclusion

Background

I’m a consultant doing development, DevOps and cloud infrastructure.

For this specific client I mainly develop APIs using Golang to support new products and features as well as various exporting, importing and processing of data in the background.

I’m also the “ops” guy handling everything in AWS, setting up and maintaining databases, making sure the “DevOps” works and the frontend and analytics people can do their work with little friction. 99% of the time things work just fine. No data is lost. The systems very rarely have unforeseen downtime and the users can access the data they want with acceptable latency rarely exceeding 500ms.

A couple of times a year I assess the status of the architecture and set up new environments from scratch and update any documentation that has drifted. This is also a good time to do changes and add or remove constraints in anticipation of future business needs.

In short, the current tech stack that has evolved over a couple of years is:

Everything hosted on Amazon Web Services (AWS).
AWS managed Elastic Kubernetes Service (EKS) currently on K8s 1.18.
GitHub Actions for building Docker images for frontends, backends and other systems.
AWS Elastic Container Registry for storing Docker images.
Deployment of each system defined as a Helm chart alongside source code.
Actual environment configuration (Helm values) stored in repo alongside source code. Updated by GitHub Actions.
ArgoCD in cluster to manage status of all environments and deployments. Development environments usually automatically deployed on change. Push a button to deploy to Production.
Prometheus for storing metrics from the cluster and nodes themselves as well as custom metrics for our own systems.
Loki for storing logs. Makes it easier to retrieve logs from past Pods and aggregate across multiple Pods.
Elastic APM server for tracing.
Pyroscope for live CPU profiling/tracing of Go applications.
Betteruptime.com for tracking uptime and hosting status pages.

I might write up a longer post about the details if anyone is interested.

The (initial) problem

A week ago I upgraded our API from version 1, which was deployed in January, to version 2 with new features and better architecture.

One of the endpoints of the API returns an analysis of an object we track. I have previously reduced the number of database queries by 90% but it still requires about 50 database calls to three different databases. Getting and analyzing the data usually completes in about 300-400 milliseconds, returning an 11,000-line JSON document.

It’s also possible to just call /objects/analysis to get the analysis for all the 500 objects we are tracking. It takes 20 seconds but is meant for exports to other processes and not interactive use, so not a problem.

Since the product is under very active development the frontend guys just download the whole analysis for an object to show certain relevant information to users. It’s too early to decide on which information is needed more often and how to optimize for that. Not a problem.

So we need an overview of some fields from multiple objects in a dashboard / list. We can easily pull analysis from 20 objects without any noticeable delay.

But what if we just want to show more, 50? 200? 500? The frontend already has the IDs for all the objects and fetches them from /objects/id/analysis. So they loop over the IDs and fire off requests simultaneously.

Analyzing the network waterfall in Chrome DevTools indicated that the requests now took 20-30 seconds to complete! But looking closer, most of the time they were actually queued up in the browser. This is because Chrome only allows 6 concurrent TCP connections to the same origin when using HTTP1 (https://developers.google.com/web/tools/chrome-devtools/network/understanding-resource-timing).
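A back-of-the-envelope sketch of why the browser queue dominates. It assumes each analysis request takes about 350 ms (the midpoint of the 300-400 ms mentioned earlier) and that requests run in full waves of at most 6; both are simplifications, and the function name is made up for illustration.

```python
import math


def batched_duration(total_requests: int, max_concurrent: int, seconds_per_request: float) -> float:
    """Estimate wall-clock time when requests run in waves of at most max_concurrent."""
    waves = math.ceil(total_requests / max_concurrent)
    return waves * seconds_per_request


# Assumed figures: 500 objects, ~0.35 s per request.
http1_estimate = batched_duration(500, 6, 0.35)    # 6 connections per origin
ideal_estimate = batched_duration(500, 500, 0.35)  # everything in flight at once

print(f"6 concurrent connections: {http1_estimate:.1f}s")
print(f"all requests in parallel: {ideal_estimate:.2f}s")
```

With these assumptions the 6-connection case lands at roughly 29 seconds, which is in the same range as the 20-30 seconds seen in the waterfall.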

Fixing the (initial) problem

HTTP2 should fix this problem easily. By default HTTP2 is disabled in nginx-ingress. I add a couple of lines enabling it and update the Helm deployment of the ingress controller.

Verifying the (initial) fix

Some common development tools don’t support HTTP2, such as Postman. So I found h2load which can both help me verify HTTP2 is working and I also get to measure the improvement, nice!

Note that I’m not using the analysis endpoint since I want to measure the change from HTTP1 to HTTP2 and it will become apparent later that there are other bottlenecks preventing us from a linear performance increase when just changing from HTTP1 to HTTP2.

Also note that this is somewhat naive since it requests the same URL over and over which can give false results due to any caching. But fortunately we don’t do any caching yet.

Baseline simple request - HTTP1 1 connection, 20000 requests

Using 1 concurrent stream, 1 client and HTTP1 I get an estimate of performance pre-HTTP2:

h2load --h1 --requests=20000 --clients=1 --max-concurrent-streams=1 https://api.x.com/api/v1/objects/1

The results are as expected:

finished in 1138.99s, 17.56 req/s, 18.41KB/s
requests: 20000 total, 20000 started, 20000 done, 19995 succeeded, 5 failed, 0 errored, 0 timeout

Overview from Elastic APM. Duration is very acceptable at around 20ms. No errors. And about 25% of the time spent doing database queries.

Container CPU usage. Nothing special.

Database query latency. The vast majority under 5ms. Acceptable.

Number of DB queries per second.

HTTP response latency.

Number of HTTP requests per second. Unsurprisingly the number of database queries is identical to the number of HTTP requests. Latency of HTTP requests also tracks the latency of the (single) database query.

For http2 we set max concurrent streams to the same as number of requests:

h2load --requests=200 --clients=1 --max-concurrent-streams=200 https://api.x.com/api/v1/objects/1

Which results in almost half the latency and, going by the h2load summaries, roughly nine times the throughput:

finished in 1.23s, 162.65 req/s, 158.06KB/s
requests: 200 total, 200 started, 200 done, 200 succeeded, 0 failed, 0 errored, 0 timeout

So HTTP2 is working and providing significant latency improvements. Success!
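To compare the two runs without reading numbers off by eye, a small sketch that parses the h2load "finished in" summary lines quoted above. The regular expression only covers that one line format, and the helper name is hypothetical. Note that the two runs used different request counts (20000 vs 200), so the ratio is indicative only.

```python
import re

SUMMARY = re.compile(r"finished in ([\d.]+)s, ([\d.]+) req/s")


def parse_summary(line: str) -> tuple[float, float]:
    """Return (seconds, requests_per_second) from an h2load 'finished in' line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not an h2load summary line: {line!r}")
    return float(match.group(1)), float(match.group(2))


http1 = parse_summary("finished in 1138.99s, 17.56 req/s, 18.41KB/s")
http2 = parse_summary("finished in 1.23s, 162.65 req/s, 158.06KB/s")

# Throughput ratio between the HTTP2 run and the HTTP1 baseline.
speedup = http2[1] / http1[1]
print(f"throughput ratio: {speedup:.1f}x")
```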

Baseline complex request - HTTP1 1 connection, 20000 requests

We start by establishing a baseline with 1 connection querying over and over.

h2load --h1 --requests=20000 --clients=1 --max-concurrent-streams=1

Latency increases as much more computation is done and data is returned. But the latency is consistent which is good. We also see that the database is becoming the bottleneck for where most time is spent.

CPU usage increased to 15%. Lower increase than expected considering the complexity involved in serving the requests.

Database query latency still mostly under 5ms.

Number of database queries increases by a factor of 10 compared to HTTP requests.

HTTP latency.

HTTP requests per second.

Verifying the fix for assumed workload

So we verified that HTTP2 gives us a performance boost. But what happens when we fire away 500 requests to the much heavier /analysis endpoint?

These graphs are not as pretty as the ones above. This is mainly due to the sampling interval of the metrics and that we need several datapoints to accurately determine the rate() of a counter.
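A simplified illustration of why short bursts look blurry in rate graphs. It assumes a hypothetical 15 second scrape interval and a burst of 500 requests that falls entirely inside one interval. The real Prometheus rate() also handles counter resets and extrapolation, which this sketch ignores.

```python
def rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a monotonically increasing counter.

    samples: (timestamp_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical scrapes every 15 s; the counter jumps by 500 between t=15 and t=30.
samples = [(0, 1000), (15, 1000), (30, 1500), (45, 1500)]

burst_interval = rate(samples[1:3])  # averaged over the one interval with the burst
whole_window = rate(samples)         # averaged over the full 45 s window

print(f"{burst_interval:.1f} req/s vs {whole_window:.1f} req/s")
```

The same 500 requests show up as very different rates depending on the window, and neither number reveals the true peak inside the interval.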

Complex request - HTTP1 6 connections, 500 requests

finished in 32.25s, 14.88 req/s, 2.29MB/s
requests: 500 total, 500 started, 500 done, 500 succeeded, 0 failed, 0 errored, 0 timeout

In summary it so far seems to scale linearly with load. Most of the time is spent fetching data from the database. Still very predictable low latency on database queries and the resulting HTTP response.

Complex request - HTTP2 500 “connections”, 500 requests

So now we unleash the beast. Firing all 500 requests at the same time.

finished in 16.66s, 30.02 req/s, 3.55MB/s
requests: 500 total, 500 started, 500 done, 500 succeeded, 0 failed, 0 errored, 0 timeout

CPU on API still doing good. A slight hint of CPU throttling due to CFS, which is used when you set CPU limits in Kubernetes.

Important about Kubernetes and CPU limits
Even with CPU limits set to 1 (100% of one CPU), your container can still be throttled at much lower CPU usage. Check out this article for more information.

Whoopsie. The average database query latency has increased drastically, and we have a long tail of very slow queries. Looks like we are starting to see signs of bottlenecks on the database. This might also be affected by our maximum of 60 concurrent connections to the database, resulting in queries having to wait their turn before executing.

It’s hard to judge the peak rate of database queries due to limited sampling of the metrics.

Now individual HTTP requests are much slower due to waiting for the database.

Here is just a random trace from Elastic APM to see if the increased database latency is concentrated to specific queries or tables or just general saturation. Indeed there is a single query responsible for half the time taken for the entire request! We better get back to that in a bit and dig further.

In an ideal world all 500 requests should start and complete in 200-300ms regardless. Since that is not happening it’s an indication that we are now hitting some other bottleneck.

Looking at the graphs it seems we are starting to saturate the database. The latency for every request is now largely dependent on the slowest of the 10-12 database queries it depends on. And as we are stressing the database the probability of slow queries increases. The latency for the whole process of fetching 500 requests is again largely dependent on the slowest requests.

So this optimization gives on average better performance, but more variability of the individual requests, when the system is under heavy load.
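The "slowest query decides" effect can be put into numbers. This sketch assumes that queries are slow independently of each other and that 1% of queries are slow under load; both figures are illustrative assumptions, not measurements from the system.

```python
def p_any_slow(p_slow_query: float, query_count: int) -> float:
    """Probability that at least one of query_count independent queries is slow."""
    return 1 - (1 - p_slow_query) ** query_count


# Assumed: 1% of queries are slow, 12 queries per request.
p_request = p_any_slow(0.01, 12)

# A batch of 500 requests issues 500 * 12 queries in total.
p_batch = p_any_slow(0.01, 500 * 12)

print(f"one request hits a slow query: {p_request:.1%}")
print(f"the batch hits a slow query:   {p_batch:.1%}")
```

Under these assumptions roughly one request in nine contains a slow query, and the batch as a whole is practically guaranteed to wait for at least one.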

Side quest: Database optimizations

It seems we are saturating the database. Before throwing more money at the problem (by increasing database size) I like to know what the bottlenecks are. Looking at the traces from APM I see one query that is consistently taking 10x longer than the rest. I also confirm this in the AWS RDS Performance Insights that show the top SQL queries by load.

When designing the database schema I came up with the idea of having immutability for certain data types. So instead of overwriting row with ID 1, we add a row with ID 1 Revision 2. Now we have the history of who did what to the data and can easily track changes and roll back if needed. The most common use case is just fetching the last revision. So for simplicity I created a PostgreSQL view that only shows the last revision. That way clients don’t have to worry about the existence of revisions at all. That is now just an implementation detail.

When it comes to performance that turns out to be an important implementation detail. The view is using SELECT DISTINCT ON (id) ... ORDER BY id, revision DESC. However many of the queries to the view are ordering the returned data by time, and expect the data returned from the database to already be ordered chronologically. Using EXPLAIN ANALYZE on the queries, this always results in a full table scan instead of using indexes, and is what’s causing this specific query to be slow. Without going into details it seems there is no simple and efficient way of having a view with the last revision and querying that for a subset of rows ordered again by time.
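A runnable miniature of the revision pattern, using Python's built-in sqlite3 module so it works without a database server. SQLite has no DISTINCT ON, so the view here uses a correlated MAX(revision) subquery instead of the PostgreSQL syntax above. Table, view and column names are hypothetical, and this says nothing about how PostgreSQL plans the real query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE objects (
        id INTEGER,
        revision INTEGER,
        name TEXT,
        updated_at TEXT,
        PRIMARY KEY (id, revision)
    );

    -- Only the newest revision of each id is visible through the view.
    CREATE VIEW objects_latest AS
    SELECT o.*
    FROM objects o
    WHERE o.revision = (SELECT MAX(r.revision) FROM objects r WHERE r.id = o.id);
    """
)

# Rows are never overwritten; a change is stored as a new revision.
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [
        (1, 1, "pump", "2021-01-01"),
        (1, 2, "pump v2", "2021-03-01"),
        (2, 1, "valve", "2021-02-01"),
    ],
)

# Clients query the view and re-order by time, unaware of revisions.
latest = conn.execute(
    "SELECT id, revision, name FROM objects_latest ORDER BY updated_at"
).fetchall()
print(latest)
```

The second ORDER BY on top of the "latest revision" selection is the combination that turned out to be expensive in the real system.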

For the foreseeable future this does not actually impact real world usage. It’s only apparent under artificially large loads under the worst conditions. But now we know where we need to refactor things if performance actually becomes a problem.

Determining the next bottleneck

Whenever I fix one problem I like to know where, how and when the next problem or limit is likely to appear. When increasing the number of requests and streams I expected to see increasing latency. But instead I see errors appear like a cliff:

finished in 27.33s, 36.59 req/s, 5.64MB/s
requests: 5000 total, 1002 started, 1002 done, 998 succeeded, 4002 failed, 4000 errored, 0 timeout

Consulting the logs for both the nginx load balancer and the API there are no records of failing requests. Since nginx does not pass the HTTP2 connection directly to the API, but instead “unbundles” it into HTTP1 requests, I suspect there might be issues with connection limits or even available ports from nginx to the API. But maybe it’s a configuration issue. By default nginx does not limit the number of connections to a backend (our API). But there is actually a default limit on the number of HTTP2 requests that can be served over a single connection, and it happens to be 1000. That fits the roughly 1000 requests that were started before the errors appeared.

I leave it at that. It’s very unlikely we’ll be hitting these limits any time soon.

Side quest: Cluster resources and burstable VMs

When load testing the first time around sometimes Grafana would also become unresponsive. That’s usually a bad sign. It might indicate that the underlying infrastructure is also reaching saturation. That is not good since it can impact what should be independent services.

Our Kubernetes cluster is composed of 2x t3a.medium on-demand nodes and 2x t3a.medium spot nodes. These VM types are burstable. You can use 20% per vCPU sustained over time without problems. If you exceed those 20% you start consuming CPU credits faster than they are granted, and once you run out of CPU credits processes will be forcibly throttled.

Of course Kubernetes does not know about this and expects 1 CPU to actually be 1 CPU. In addition Kubernetes will decide where to place workloads based on their stated resource requirements and limits, and not their actual resource usage.

When looking at the actual metrics two of our nodes are indeed out of CPU credits and being throttled. The sum of factors leading to this is:

We have not yet set resource requests and limits making it harder for Kubernetes to intelligently place workloads
Using burstable nodes having some additional constraints not visible to Kubernetes
Old deployments lying around consuming unnecessary resources
Adding costly features without assessing the overall impact

I have not touched on the last point yet. I started adding Pyroscope to our systems since I simply love monitoring All The Things. The documentation does not go into specifics but emphasizes that it’s “low overhead”. Remember that our budget for CPU usage is actually 40% per node, not 200%. The Pyroscope server itself consumes 10-15% CPU which seems fair. But investigating further the Pyroscope agent also consumes 5-6% CPU per instance. This graph shows the CPU usage of a single Pod before and after turning off Pyroscope profiling.

5-6% CPU overhead on a highly utilized service is probably worth it. But when the baseline CPU usage is 0% CPU and we have multiple services and deployments in different environments we are suddenly using 40-60% CPU on profiling and less than 1% on actual work!
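The budget arithmetic as a small sketch. The 2 vCPUs and 20% baseline per vCPU are the t3a.medium figures discussed above. The number of profiled Pods on one node (6) and the midpoint overheads (5.5% per agent, 12.5% for the server) are assumptions picked for illustration.

```python
def node_budget_pct(vcpus: int, baseline_pct_per_vcpu: float) -> float:
    """Sustained CPU budget of a burstable node, in percent of one vCPU."""
    return vcpus * baseline_pct_per_vcpu


def profiling_overhead_pct(agents: int, pct_per_agent: float, server_pct: float = 0.0) -> float:
    """Total CPU spent on profiling: one agent per profiled Pod plus the server."""
    return agents * pct_per_agent + server_pct


# t3a.medium: 2 vCPUs with a 20% baseline each.
budget = node_budget_pct(vcpus=2, baseline_pct_per_vcpu=20)

# Hypothetical node: 6 profiled Pods at 5.5% each plus the Pyroscope server at 12.5%.
overhead = profiling_overhead_pct(agents=6, pct_per_agent=5.5, server_pct=12.5)

print(f"budget {budget:.0f}%, profiling overhead {overhead:.1f}%")
print("credits will drain" if overhead > budget else "within baseline")
```

In this made-up layout profiling alone exceeds the sustained budget, so the node burns CPU credits even when the business workload is idle.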

The outcome of this is that we need to separate burstable and stable load deployments. Monitoring and supporting systems are usually more stable resource wise while the actual business systems are much more variable, and suitable for burst nodes. In practice we add a node pool of non-burst VMs and use NodeAffinity to stick Prometheus, Pyroscope etc to those nodes. Another benefit of this is that the supporting systems needed to troubleshoot problems are now less likely to be impacted by the problem itself, making troubleshooting much easier.

Conclusion

This whole adventure only took a few hours but resulted in some specific and immediate performance gains. It also highlighted the weakest links in our application, database and infrastructure architecture.

End of 2020 rough database landscape

2020-11-27T00:00:00+00:00

There seems to exist a database for every niche, mood or emotion. And they seem to change just as fast.

How do you balance the urge for the new and shiny without risking too much headache down the road?

This post is an attempt to lay out the rough landscape of databases that you might encounter or consider as of late 2020.

There will be broad generalizations for brevity.

The goal is not to be exhaustive or take all possible precautions. Consider it a starting point for further research and planning.

TLDR: Scroll to the diagrams or view the big picture.

Table of contents

Background
- Project phase overview
Planning
- Database categories
  - SQL
  - NoSQL
  - KeyValue
  - Timeseries
  - Graph
  - Other nice things
The Landscape
- SQL
- NoSQL
- KeyValue
- Timeseries
- Graph
Further reading
Conclusion

Background

I’m a consultant doing development, DevOps and cloud infrastructure. I also have the occasional side project trying out the Tech Flavor of the Month.

Project phase overview

The typical phases in projects I’m involved in follow no scientific or trademarked methodology, so YMMV:

Starting out

Get something working as fast as possible. Take all the shortcuts. Use some opinionated framework or platform.

Moving from development to production

People like it, people use it. Move the thing from a single “pet server” to a more robust cloud environment.

Scaling production

Bottlenecks and scaling problems start to emerge. Refactor or replace some pieces to remove the bottlenecks.

Challenges

Moving between these phases might be a major PITA if the wrong shortcuts were taken in the previous phases.

This of course applies to all technology choices and not just databases. But we have to start somewhere, right?

Planning

When starting out I try to envision all the phases of the project and which directions it may take in the future.

First I want the technology or software I choose to be instantly usable. A Docker image. Great. An apt-get install. Sweet. npm install. Sure, why not. Downloading a tarball. Installing some C dependencies. Setting some flags. Compiling. Symlinking and fixing permissions. Creating some configuration from scratch. Making my own systemd service definitions. Going back and doing every step again because it failed. Mkay, no thanks, I’m out.

At least for me it’s a plus if it’s easy to deploy on Kubernetes since I use it for everything already. I always have a cluster or three lying around so I can get a prototype or five up and running quickly before later spending money for cloud hosting.

Does the thing have momentum and a community? If it does it probably has high quality tooling either by the vendor or the open source community (preferably both). It probably also has lots of common questions answered on blogs and StackOverflow and GitHub issues.

So we managed to build something and the audience likes it.

How easy is it to move it from a development environment into something stable and low-maintenance? For databases that would typically involve using a managed service for hosting it. You do not want to be responsible for operating your own databases. Is it common enough that there are competitors in the marketplace offering it as a managed service? If there is only a single option expect prices to be very steep. Preferably also a managed service by one of the big known cloud platforms. They are usually cheaper. They are less likely to vanish. It might make integration with other systems easier later.

We hit some problems either because of raw scale or some type of usage we did not anticipate in the beginning.

Are there compatible implementations that might solve some common problems? Typically this is because an implementation has to make a decision about its trade-offs. For a database system this is usually around the CAP theorem. A database system (or anything that keeps state) can be:

Partition Tolerant - The system still works if a node or the network between nodes fail.
Available - All requests receive a response.
Consistent - The data we read is the current data and not an earlier state.

But you can only have two at the same time. And distributed systems tend to need to be partition tolerant. So we are stuck between consistency and availability.

It might be good to have an idea of the CAP trade-offs an implementation has made, and whether there are compatible implementations with different trade-offs that can be used if later we find out we need to tweak our trade-offs for speed and/or scale.

More information about CAP theorem here and here. Jepsen have also extensively tested many popular databases to see how they break and if they are true to their stated trade-offs.

Database categories

Databases can be roughly sorted into categories. I’ll keep it simple and use the everyday lingo and not go into details about semantics and definitions (forgive me).

https://www.prisma.io/dataguide/intro/comparing-database-types

SQL

The oldest category is the relational database, also known as SQL based on the typical interface used to access these databases.

In general these databases have tables with names, a set of pre-defined columns and an arbitrary number of rows. You should have an idea of the data types to be stored in each column (such as text or numbers).

The downside of this is that you have to start with a rough model of the data you want to store and work with. The benefit of this is that later you know something about the model of the data you are working with. Most of the time I’ll happily do this in the database rather than handle all the potential inconsistencies in all systems that use that database.

Main contenders: PostgreSQL. MySQL & MariaDB.

NoSQL

All the rage the last decade. You put data in you get data out. The data is structured but not necessarily predefined. Think JSON object with values, arrays and lists.

The benefit is productivity when developing. The drawback is that you may pay a price for those shortcuts later if you’re not careful.

Main contender: MongoDB.

KeyValue

Technically a sub-category of NoSQL, and should probably be called caches. But I feel it deserves its own category.

A hyper-fast hyper-simple type of database. It has two columns. A key (ID) and value. The value can be anything, a string, a number, an entire JSON object or a blob containing binary data.

These are typically used in combination with another type of database. Either by storing very commonly used data for even quicker access. Or for certain types of simple data that requires insane speed or throughput and you don’t want to overload the main database.

Main contender: Redis.

Timeseries

A lesser known type of database optimized for storing a time series. A time series is a specific data type where the index is typically the time of a measurement. And the measurement is a number.

A time series is almost never changed after the fact. So these databases can be optimized for writing huge amounts of new data and reading and calculating on existing data. At the cost of performance for deleting or updating old data which is sloooow. Since the values are always numbers that tend to change somewhat predictably compression and deduplication can save us massive amounts of storage.
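A toy demonstration of why predictable numbers compress so well. It delta-encodes a hypothetical counter and compares the zlib-compressed sizes. Real time series databases use more specialised encodings, so treat this purely as an illustration of the principle.

```python
import struct
import zlib


def delta_encode(values: list[int]) -> list[int]:
    """Keep the first value, then store each value as the difference from the previous one."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: list[int]) -> list[int]:
    """Rebuild the original values by accumulating the differences."""
    out: list[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def packed_size(values: list[int]) -> int:
    """Size in bytes after packing as 64-bit integers and compressing with zlib."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return len(zlib.compress(raw))


# Hypothetical counter sampled 1000 times, growing by 3 at every step.
series = [1_000_000 + 3 * i for i in range(1000)]
deltas = delta_encode(series)

print("raw values:", packed_size(series), "bytes")
print("deltas:    ", packed_size(deltas), "bytes")
```

After delta encoding almost every stored number is the same small value, which a general-purpose compressor can collapse into very little space.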

Main contenders: Prometheus, InfluxDB, TimescaleDB (plugin for PostgreSQL).

Graph

Graph databases are cool. In a graph database the relationship between objects is a primary feature. Whereas in SQL you need to join an element from one table with another object in another table with some kind of common identifier.

For most simple use cases a regular SQL database will do fine. But when the number of objects stored (rows) and the number of intermediary tables (joins) become large it gets slow, or expensive, or both.

I don’t have much experience with graph databases but I suspect they are less suited to general tasks and should be reserved for solving specific problems.

Main contenders: Neo4j. Redis + RedisGraph.

PS: Graph databases and GraphQL are completely separate things.

Other nice things

When researching this post I’ve come across things that look promising but are hard to categorize or fall in their own very niche categories.

Dgraph - A GraphQL and backend in one.
PrestoDB - An SQL interface on top of whatever database or storage you want to connect.
RethinkDB - A NoSQL database focused on real-time streaming/updating clients.
FoundationDB - A transactional key-value store by Apple.
ClickHouse - An SQL database that stores data (on disk) in columns instead of rows. Makes for blazingly fast analytical and aggregation queries.
Amazon Quantum Ledger Database - A managed distributed ledger database (aka blockchain).
EDB Postgres Advanced Server - An Oracle compatible PostgreSQL variant.

The Landscape

How to use these maps:

Version compatibility is in parentheses. I have not mapped every version and how much breaking they are compared to previous versions but included some notes where I know there might be issues.

API/Protocol/Interface - This is decided by the framework, tool or driver you want to use. Sometimes it might be easier to choose the framework first and then a fitting database protocol. Or you might be lucky to choose the database features you need first and then select frameworks, tools and drivers that support it.

I think interfaces are really important when creating and choosing technology. I had a presentation about it a while ago and I think it’s still relevant.

Engine - Database implementations that are independent but try to be compatible. If there are alternatives to the “original” implementation they might have made different trade-offs with regards to the CAP theorem or solve other specific problems.

Big three managed - Available managed services by the big three clouds, Amazon (AWS), Google (GCP) or Microsoft (Azure). Having an option to host in the big three is most likely the cheapest method as well as having a variety of other managed services to build a complete system in a single cloud.

Vendor managed - If the database vendor or backing company offers an official managed service. They are usually hosted on the big three. Potentially a large cost premium over the raw compute power.

Self-hosted - Implementations you can run on your own computer or server.

Legend
	The checklist icon marks potential compatibility issues. For most use cases not a problem. PS: The absence of this icon does not automatically mean compatibility.

	I put the lightning icon on the self-hosted implementations that have what seems to be stable Kubernetes operators available. In short, a Kubernetes operator makes running a stateful system, such as a database, on Kubernetes much easier. It might allow for longer time before migrating to a managed service.

SQL

Compatibility:

PostgreSQL - Yugabyte

PostgreSQL - CockroachDB

MySQL - MariaDB

Kubernetes Operators:

PostgreSQL (CrunchyData)

PostgreSQL (Zalando)

Yugabyte

CockroachDB

Percona Operator for MySQL (Percona XtraDB Cluster)

NoSQL

PS: There are some breaking changes from MongoDB 3.6 to 4 so make sure the tools you intend to use are compatible with the database version you intend on using.

Kubernetes Operators:

MongoDB

Percona Distribution for MongoDB

ScyllaDB

Elastic Stack

KeyValue

Kubernetes Operators:

Redis (Spotahome)

Timeseries

Kubernetes Operators:

Prometheus-Stack

VictoriaMetrics

Graph

Kubernetes Operators:

ArangoDB

Conclusion

Congratulations if you made it this far!

I did this research primarily to reduce my own analysis paralysis on various projects so I can get-back-to-building. If you learned something as well, great stuff!

And if you want my advice, just use PostgreSQL unless you really know about some special requirements that necessitates using something else :-)

Mini-post: Down-scaling Azure Kubernetes Service (AKS)

2019-06-04T00:00:00+00:00

We discovered today that some implicit assumptions we had about AKS at smaller scales were incorrect.

Suddenly new workloads and jobs in our Radix CI/CD could not start due to insufficient resources (CPU & memory).

Even though it only caused problems in development environments with smaller node sizes it still surprised some of our developers, since we expected the size of development clusters to have enough resources.

I thought it would be a good chance to go a bit deeper and verify some of our assumptions and also learn more about various components that usually “just work” and aren’t really given much thought.

First I do a kubectl describe node <node> on 2-3 of the nodes to get an idea of how things are looking:

Resource                       Requests          Limits
--------                       --------          ------
cpu                            930m (98%)        5500m (585%)
memory                         1659939584 (89%)  4250M (228%)

So we are obviously hitting the roof when it comes to resources. But why?

Node overhead

We use Standard DS1 v2 instances as AKS nodes and they have 1 CPU core and 3.5 GiB memory.

The output of kubectl describe node also gives us info on the Capacity (total node size) and Allocatable (resources available to run Pods).

Capacity:
 cpu:                            1
 memory:                         3500452Ki
Allocatable:
 cpu:                            940m
 memory:                         1814948Ki

So we have lost 60 millicores / 6% of CPU and 1685504Ki (1646Mi) / 48% of memory. The next question is if this increases linearly with node size (the percentage of resources lost is the same regardless of node size) or is fixed (always reserves 60 millicores and 1646Mi of memory), or a combination.

I connect to another cluster that has double the node size (Standard DS2 v2) and compare:

Capacity:
 cpu:                            2
 memory:                         7113160Ki
Allocatable:
 cpu:                            1931m
 memory:                         4667848Ki

So for this node size the loss is 69 millicores / 3.5% of CPU and 2445312Ki (2388Mi) / 34% of memory.

So CPU reservations are close to fixed regardless of node size while memory reservations are influenced by node size but luckily not linearly.
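The same comparison as a few lines of Python, using the Capacity and Allocatable values from the kubectl output above. kubectl reports memory in Ki, so the sketch converts to Mi by dividing by 1024. The function name is just for illustration.

```python
KI_PER_MI = 1024


def reserved(capacity_ki: int, allocatable_ki: int) -> tuple[float, float]:
    """Return (reserved Mi, reserved share of capacity in percent)."""
    reserved_ki = capacity_ki - allocatable_ki
    return reserved_ki / KI_PER_MI, 100 * reserved_ki / capacity_ki


# Values copied from `kubectl describe node` on each node size.
ds1 = reserved(capacity_ki=3500452, allocatable_ki=1814948)
ds2 = reserved(capacity_ki=7113160, allocatable_ki=4667848)

print(f"DS1 v2: {ds1[0]:.0f}Mi reserved ({ds1[1]:.0f}% of memory)")
print(f"DS2 v2: {ds2[0]:.0f}Mi reserved ({ds2[1]:.0f}% of memory)")
```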

What causes this “waste”? Reading up on kubernetes.io gives a few clues. Kubelet will reserve CPU and memory resources for itself and other Kubernetes processes. It will also reserve a portion of memory to act as a buffer whenever a Pod is going beyond its memory limits to avoid risking System OOM, potentially making the whole node unstable.

To figure out what these are configured to we log in to an actual AKS node’s console and run ps ax|grep kube and the output looks like this:

/usr/local/bin/kubelet --enable-server --node-labels=node-role.kubernetes.io/agent=,kubernetes.io/role=agent,agentpool=nodepool1,storageprofile=managed,storagetier=Premium_LRS,kubernetes.azure.com/cluster=MC_clusters_weekly-22_northeurope --v=2 --volume-plugin-dir=/etc/kubernetes/volumeplugins --address=0.0.0.0 --allow-privileged=true --anonymous-auth=false --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cgroups-per-qos=true --client-ca-file=/etc/kubernetes/certs/ca.crt --cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.2.0.10 --cluster-domain=cluster.local --enforce-node-allocatable=pods --event-qps=0 --eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-gates=PodPriority=true,RotateKubeletServerCertificate=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --image-pull-progress-deadline=30m --keep-terminated-pod-volumes=false --kube-reserved=cpu=60m,memory=896Mi --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=110 --network-plugin=cni --node-status-update-frequency=10s --non-masquerade-cidr=0.0.0.0/0 --pod-infra-container-image=k8s.gcr.io/pause-amd64:3.1 --pod-manifest-path=/etc/kubernetes/manifests --pod-max-pids=-1 --rotate-certificates=false --streaming-connection-idle-timeout=5m

To log in to the console of a node, go to the MC_resourcegroup_clustername_region resource-group and select the VM. Then go to Boot diagnostics and enable it. Go to Reset password to create yourself a user and then Serial console to log in and execute commands.

We can see --kube-reserved=cpu=60m,memory=896Mi and --eviction-hard=memory.available<750Mi which adds up to 1646Mi which is pretty close to the 1685Mi that was the gap between Capacity and Allocatable.

We also do this on a Standard DS2 v2 node and get --kube-reserved=cpu=69m,memory=1638Mi and --eviction-hard=memory.available<750Mi.

So we can see that the kube-reserved memory grows almost linearly with node size and seems to always be about 20-25% of capacity, while the CPU reservations are almost the same. The memory eviction buffer is fixed at 750Mi, which means proportionally more waste as nodes get smaller.
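Instead of reading the flags by eye, a small script can pull the memory reservations out of the kubelet command line. This is only a sketch in Python: the function name is made up for illustration, and it assumes that both flags are present and that memory is given in Mi, as in the output above.

```python
import re


def reserved_memory_mi(kubelet_cmdline):
    """Sum kube-reserved memory and the hard eviction threshold, in Mi."""
    kube_reserved = re.search(r"--kube-reserved=\S*?memory=(\d+)Mi", kubelet_cmdline)
    eviction_hard = re.search(
        r"--eviction-hard=\S*?memory\.available<(\d+)Mi", kubelet_cmdline
    )
    if kube_reserved is None or eviction_hard is None:
        raise ValueError("expected both --kube-reserved and --eviction-hard")
    return int(kube_reserved.group(1)) + int(eviction_hard.group(1))


# Shortened command line with only the two relevant flags from the DS1 v2 node.
ds1_cmdline = (
    "/usr/local/bin/kubelet "
    "--eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5% "
    "--kube-reserved=cpu=60m,memory=896Mi"
)

print(reserved_memory_mi(ds1_cmdline))  # 1646
```

The gap between Capacity and Allocatable reported by the node is slightly larger than this sum, so treat the result as an estimate.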

CPU

	Standard DS1 v2	Standard DS2 v2
VM capacity	1.000m	2.000m
kube-reserved	-60m	-69m
Allocatable	940m	1.931m
Allocatable %	94%	96.5%

Memory

	Standard DS1 v2	Standard DS2 v2
VM capacity	3.500Mi	7.113Mi
kube-reserved	-896Mi	-1.638Mi
Eviction buf	-750Mi	-750Mi
Allocatable	1.814Mi	4.667Mi
Allocatable %	52%	65%

Node pods (DaemonSets)

We have some Pods that run on every node, and they are installed by default by AKS. We get the resource limits of these by describing either the pods or the daemonsets.

CPU

	Standard DS1 v2	Standard DS2 v2
Allocatable	940m	1.931m
kube-system/calico-node	-250m	-250m
kube-system/kube-proxy	-100m	-100m
kube-system/kube-svc-redirect	-5m	-5m
Available	585m	1.576m
Available %	58%	79%

Memory

	Standard DS1 v2	Standard DS2 v2
Allocatable	1.814Mi	4.667Mi
kube-system/kube-svc-redirect	-32Mi	-32Mi
Available	1.782Mi	4.635Mi
Available %	50%	65%

So for Standard DS1 v2 nodes we have about 0.5 CPU and 1.7GiB memory per node for pods. And for Standard DS2 v2 nodes it’s about 1.5 CPU and 4.6GiB memory.

kube-system pods

Now let’s add some standard Kubernetes pods we need to run. As far as I know these are pretty much fixed for a cluster and not related to node size or count.

Deployment	CPU	Memory
kube-system/kubernetes-dashboard	100m	50Mi
kube-system/tunnelfront	10m	64Mi
kube-system/coredns (x2)	200m	140Mi
kube-system/coredns-autoscaler	20m	10Mi
kube-system/heapster	130m	230Mi
Sum	460m	494Mi

Third party pods

Deployment	CPU	Memory
grafana	200m	500Mi
prometheus-operator	500m	1.000Mi
prometheus-alertmanager	100m	225Mi
flux	50m	64Mi
flux-helm-operator	50m	64Mi
Sum	900m	1.853Mi

Radix platform pods

Deployment	CPU	Memory
radix-api-prod/server (x2)	200m	400Mi
radix-api-qa/server (x2)	100m	200Mi
radix-canary-golang-dev/www	40m	500Mi
radix-canary-golang-prod/www	40m	500Mi
radix-platform-prod/public-site	5m	10Mi
radix-web-console-prod/web	10m	42Mi
radix-web-console-qa/web	5m	21Mi
radix-github-webhook-prod/webhook	10m	30Mi
radix-github-webhook-qa/webhook	5m	15Mi
Sum	415m	1.718Mi

If we add up the resource usage of these groups of workloads and see the total available resources on our 4 node Standard DS1 v2 clusters we are left with 0.56 CPU cores (14%) and 3GB of memory (22%):

Workload	CPU	Memory
kube-system	460m	494Mi
third-party	900m	1.853Mi
radix-platform	415m	1.718Mi
Sum	1.775m	4.065Mi
Available on 4x DS1	2.340m	7.128Mi
Available for workloads	565m	3.063Mi
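The arithmetic behind this table can be reproduced with a few lines of Python. This is just a sketch: the function name headroom is my own, and the numbers are the per-node availability and the per-group requests from the tables above.

```python
def headroom(node_count, available_per_node, requests):
    """Cluster-wide resources left after subtracting all listed requests."""
    return node_count * available_per_node - sum(requests.values())


cpu_requests_m = {"kube-system": 460, "third-party": 900, "radix-platform": 415}
memory_requests_mi = {"kube-system": 494, "third-party": 1853, "radix-platform": 1718}

# 4 x Standard DS1 v2, each with 585m CPU and 1782Mi memory available for pods.
print(headroom(4, 585, cpu_requests_m))       # 565
print(headroom(4, 1782, memory_requests_mi))  # 3063
```

Changing node_count and available_per_node makes it easy to compare other cluster layouts with the same platform workloads.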

Though it is surprising that we lost this many resources before being able to deploy our actual customer applications, there should still be a bit of headroom.

Going further I checked the resource requests on 8 customer pods deployed in 4 environments (namespaces). Even though none of them had a resource configuration in their radixconfig.yaml files they still had resource requests and limits. Not surprising since we use LimitRange to set default resource requests and limits. The surprise was that half of them had 50Mi of memory and the other half 500Mi, seemingly at random.

It turns out that we did an update to the LimitRange values a few days ago but that only applies to new Pods, so depending on if the Pods got re-created for any reason they may or may not have the old request of 500Mi, which in our case of small clusters will quickly drain the available resources.

Read more about LimitRange here: kubernetes.io, and here is the commit that eventually trickled down to reduce memory usage: github.com

Pod scheduling

Depending on the weight between CPU and memory requests and how often things get destroyed and re-created you may find yourself in a situation where you have enough resources in your cluster but new workloads are still Pending. This can happen when one resource type (e.g. CPU) is filled before another (e.g. memory), leading one or more resources to be stranded and unlikely to be utilized.

Imagine for example a cluster that is already utilized like this:

	CPU	Memory
node0	94%	86%
node1	80%	89%
node2	98%	60%

A workload that requests 15% CPU and 20% memory cannot be scheduled since no node fulfils both requirements. In theory there is probably a CPU intensive Pod on node2 that could be moved to node1, but Kubernetes does not do re-scheduling to optimize utilization. It can do re-scheduling based on Pod priority (medium.com) and there is an incubator project (akomljen.com) that can try to drain nodes with low utilization.
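The example can be checked with a short Python sketch. It is a simplification of what the scheduler does: the function names are hypothetical, the numbers are the utilization percentages from the table above, and it assumes all three nodes are the same size so that percentages can be added up.

```python
# Used (CPU %, memory %) per node, taken from the table above.
nodes = {"node0": (94, 86), "node1": (80, 89), "node2": (98, 60)}


def nodes_that_fit(nodes, cpu_request, memory_request):
    """Names of the nodes with enough free CPU and enough free memory."""
    return [
        name
        for name, (cpu_used, memory_used) in nodes.items()
        if 100 - cpu_used >= cpu_request and 100 - memory_used >= memory_request
    ]


def cluster_free(nodes):
    """Free CPU and memory summed over all nodes, in percent of one node."""
    free_cpu = sum(100 - cpu_used for cpu_used, _ in nodes.values())
    free_memory = sum(100 - memory_used for _, memory_used in nodes.values())
    return free_cpu, free_memory


print(cluster_free(nodes))            # (28, 65)
print(nodes_that_fit(nodes, 15, 20))  # []
```

Summed over the cluster there is 28% of a node's CPU and 65% of a node's memory free, which is more than the request, yet no single node can take the Pod.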

So for the foreseeable future, keep in mind that resources can get stranded and that comparing the sum of cluster resources with the sum of cluster resource demand might be misleading.

calico-node

The biggest source of waste on our small clusters is calico-node which is installed on every node and requests 25% of a CPU core while only using 2.5-3% CPU.

The request is originally set here github.com but I have not looked into why that number was chosen. Next steps would be to do some benchmarking of calico-node to smoke out its performance characteristics to see if it would be safe to lower the resource requests, but that is out of scope for now.

Conclusion

By increasing node size from Standard DS1 v2 to Standard DS2 v2 we also increase the available CPU from 58% per node to about 79% per node (1.576m of 2.000m). Available memory increases from 50% to about 65% per node (4.635Mi of 7.113Mi).
With a total platform requirement of 3-4GB of memory and 4.6GB available on Standard DS2 v2 we might have more resources for actual workloads on a 1-node Standard DS2 v2 cluster than a 3-node Standard DS1 v2 cluster!
Beware of stranded resources limiting the utilization you can achieve across a cluster.

Disk performance on Azure Kubernetes Service (AKS) - Part 1: Benchmarking

2019-02-23T00:00:00+00:00

Understanding the characteristics of disk performance of a platform might be more important than you think. If disk resources are not correctly matched to your workload, your performance will suffer and might lead you to incorrectly diagnose a problem as being related to CPU or memory.

The defaults might also not give you the performance you expect.

In this first post on troubleshooting some disk performance issues on Azure Kubernetes Service (AKS) we will benchmark Azure Premium SSD to find how workloads affect performance and which metrics to monitor when troubleshooting potential disk issues.

TLDR:

Disable Azure cache for workloads with a high number of random writes.
Use a P15 (256GB) or larger Premium SSD even though you might only need a fraction of it.

Table of contents

Background
- Metric Methodologies
- Storage Background
What to measure?
How to measure disk
How to measure disk on Azure Kubernetes Service
Test results
Conclusion

Microsoft Azure

If you don’t have an Azure subscription already you can try services for $200 for 30 days. The VM size Standard_B2s is Burstable, has 2vCPU, 4GB RAM, 8GB temp storage and costs roughly $38 / month. For $200 you can have a cluster of 3-4 B2s nodes plus traffic, loadbalancers and other additional costs.

See my blog post Managed Kubernetes on Microsoft Azure (English) for information on how to get up and running with Kubernetes on Azure.

I have no affiliation with Microsoft Azure except using them through work.

Corrections

February 2020: Some of my previous knowledge and assumptions were not correct when applied to a cloud + Docker environment, as explained by AKS PM Jesse Noller on GitHub.

One of the issues is that even accessing a “data disk” will incur IOPS on the OS disk, and throttling of the OS disk will also constrain IOPS on the data disks.

Background

I’m part of a team at Equinor building an internal PaaS based on Kubernetes running on AKS (Azure managed Kubernetes). We use Prometheus for monitoring each cluster as well as InfluxDB for collecting metrics from k6io which runs continuous tests on our public endpoints.

A couple of weeks ago we discovered some potential problems with both Prometheus and InfluxDB with memory usage and restarts. High CPU usage of type iowait suggested that there might be some disk issues contributing to the problems.

iowait: “Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.” (hpe.com). You can see iowait on your Linux system by running top and looking at the wa percentage.

PS: You can have a disk IO bottleneck even with low iowait, and a high iowait does not always indicate a disk IO bottleneck (ibm.com).

First off we need to benchmark the underlying disk to get an understanding of its performance limits and characteristics. That is what we will cover in this post.

Metric Methodologies

There are two helpful methodologies when monitoring information systems. The first one is Utilization, Saturation and Errors (USE) from Brendan Gregg and the second one is Rate, Errors, Duration (RED) from Tom Wilkie. RED is best suited when observing workloads and transactions while USE is best suited for observing resources.

I’ll be using the USE method here. USE can be summarised as:

For every resource, check utilization, saturation, and errors.
- resource: all physical server functional components (CPUs, disks, busses, …)
- utilization: the average time that the resource was busy servicing work
- saturation: the degree to which the resource has extra work which it can’t service, often queued
- errors: the count of error events

Storage Background

Disk usage has two dimensions, throughput/bandwidth(BW) and operations per second (IOPS), and the underlying storage system will have upper limits of how much data it can receive (BW) and the number of operations it can perform per second (IOPS).

Background - harddrive types: Hard drives come in two types, Solid State Disks (SSD) and spindle (HDD). An SSD is a microchip capable of permanently storing data while an HDD uses spinning platters to store data. HDDs have a fixed rate of rotation (RPM), typically 5.400 and 7.200 RPM for lower cost drives for home use and higher cost 10.000 and 15.000 RPM drives for server use. Over the last 20 years of HDDs their storage density has increased, but the RPM has largely stayed the same. A disk with twice the density (500GB to 1TB for example) can read twice as much data on a single rotation and thus increase the bandwidth significantly. However, reading or writing a random block still requires waiting for the disk to spin enough to reach the relevant sector on the disk. So IOPS has not increased much for HDDs and is still a low 125-150 IOPS for a 10.000 RPM enterprise disk. An SSD does not have any moving parts so it is able to reach MUCH higher IOPS. A low end Samsung 960 EVO with 500GB capacity costs $150 and can achieve a whopping 330.000 IOPS! (wikipedia.com)

Background - access patterns: The way a program uses storage also has a huge impact on the performance one can achieve. Sequential access is when we read or write a large file. When this happens the operating system and hard drive can optimize and “merge” operations so that we can read or write a much bigger chunk of data at a time. If we can read 1MB at a time 150 times per second we get 150MB/s of bandwidth. However, with fully random access, where the smallest chunk we read or write is a 4KB block, the same 150 IOPS would only give a bandwidth of 0.6MB/s!

Background - cloud vs physical: Now we know that HDDs are limited to a low IOPS, and that low IOPS combined with a random access pattern gives us a low overall bandwidth. There is a huge gotcha here when it comes to cloud. On Azure when using Premium Managed SSD the IOPS you are given is a factor of the disk size you provision (microsoft.com). A 512GB disk is limited to 2.300 IOPS and 150MB/s. With 100% random access in 4KB blocks that only gives about 9MB/s of bandwidth!
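The bandwidth numbers above are simply IOPS multiplied by the size of each operation. Here is that calculation as a small Python sketch. The function name is my own, and I assume decimal units (1MB = 1000KB) to match the rounded numbers in the text.

```python
def bandwidth_mb_per_s(iops, block_size_kb):
    """Bandwidth in MB/s when every operation moves block_size_kb kilobytes."""
    return iops * block_size_kb / 1000


print(bandwidth_mb_per_s(150, 1000))  # HDD, sequential 1MB chunks: 150.0
print(bandwidth_mb_per_s(150, 4))     # HDD, random 4KB blocks: 0.6
print(bandwidth_mb_per_s(2300, 4))    # 512GB Premium SSD, random 4KB blocks: 9.2
```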

Background - OS caching: To overcome some of the limitations of the underlying disk (mostly IOPS) there are potentially several layers of caching involved. Linux file systems can have writeback enabled which causes Linux to temporarily store data that is going to be written to disk in memory. This can give a big performance increase when there are sudden spikes of writes exceeding the performance of the underlying disk. It also increases the chance that operations can be merged, where several write operations to nearby areas of the disk can be executed as one. This caching works best for sudden peaks and will not necessarily be enough if there are continuous random writes to disk. This caching also means that even though an application thinks it has saved some data to disk it can be lost in the case of a power outage or other failure. Applications can also explicitly request direct access where every operation is persisted to disk before receiving a confirmation. This is a trade-off between performance and durability that needs to be decided based on the application itself and the environment.

Background - Azure caching: Azure also provides read and write cache for its disks which is enabled by default. As we will see soon for our use case it’s not a good idea to use.

What to measure?

These metrics are collected by the Prometheus node-exporter and follow its naming. I’ve also created a dashboard that is available on Grafana.com.

With the USE methodology as a guideline and the two separate but related “resources”, bandwidth and IOPS we can look for some useful metrics.

Utilization:

rate(node_disk_written_bytes_total) - Write bandwidth. The maximum is given by Azure and is 25MB/s for our disk size.
rate(node_disk_writes_completed_total) - Write operations. The maximum is given by Azure and is 120 IOPS for our disk size.
rate(node_disk_io_time_seconds_total) - Disk active time in percent. The time the disk was busy servicing requests. 100% means fully utilized.

Saturation:

rate(node_cpu_seconds_total{mode="iowait"}) - CPU iowait. The percentage of time a CPU core is blocked from doing useful work because it’s waiting for an IO operation to complete (typically disk, but can also be network).

Useful calculated metrics:

rate(node_disk_write_time_seconds_total) / rate(node_disk_writes_completed_total) - Write latency. How long from a write is requested until it’s completed.
rate(node_disk_written_bytes_total) / rate(node_disk_writes_completed_total) - Write size. How big the average write operation is. 4KB is minimum and indicates 100% random access while 512KB is maximum and indicates sequential access.
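Since both rates in each expression cover the same time window, the calculated metrics boil down to dividing the increase of one counter by the increase of another. The following Python sketch shows that calculation on two readings of the counters. The class, the function names and the sample values are hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """One reading of the node-exporter counters for a single disk."""
    write_time_seconds: float  # node_disk_write_time_seconds_total
    writes_completed: float    # node_disk_writes_completed_total
    written_bytes: float       # node_disk_written_bytes_total


def write_latency_seconds(before, after):
    """Average time per completed write between the two samples."""
    writes = after.writes_completed - before.writes_completed
    if writes <= 0:
        return None  # no writes in the window, latency is undefined
    return (after.write_time_seconds - before.write_time_seconds) / writes


def write_size_bytes(before, after):
    """Average size of a completed write between the two samples."""
    writes = after.writes_completed - before.writes_completed
    if writes <= 0:
        return None
    return (after.written_bytes - before.written_bytes) / writes


# Made-up readings: 1000 writes of 4096 bytes taking 2.5 seconds in total.
before = DiskSample(write_time_seconds=10.0, writes_completed=1000, written_bytes=4_096_000)
after = DiskSample(write_time_seconds=12.5, writes_completed=2000, written_bytes=8_192_000)

print(write_latency_seconds(before, after))  # 0.0025
print(write_size_bytes(before, after))       # 4096.0
```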

How to measure disk

The best tool for measuring disk performance is fio, even though it might seem a bit intimidating at first due to its insane number of options.

Installing fio on Ubuntu:

apt-get install fio

fio executes jobs described in a file. Here is the top of our jobs file:

[global]
ioengine=libaio   # sync|libaio|mmap
group_reporting
thread
size=10g          # Size of test file
cpus_allowed=1    # Only use this CPU core
runtime=300s      # Run test for 5 minutes

[test1]
filename=/tmp/fio-test-file
direct=1          # If value is true, use non-buffered I/O. Non-buffered I/O usually means O_DIRECT
readwrite=write   # read|write|randread|randwrite|readwrite|randrw
iodepth=1         # How many operations to queue to the disk
blocksize=4k

The fields we will be changing for the various tests are direct, readwrite, iodepth and blocksize. Save the contents in a file named jobs.fio, run a test with fio --section=test1 jobs.fio and wait until the test completes.

PS: To run these tests on higher performance hardware and better caching you might want to set runtime to 0 to have the test run continuously and monitor the metrics until performance reaches a steady-state.

How to measure disk on Azure Kubernetes Service

For this testing we use a standard Prometheus installation collecting data from node-exporter and visualizing data in Grafana. The dashboard I created for the testing can be found here: https://grafana.com/dashboards/9852.

By default Kubernetes will schedule a Pod to any node that has enough memory and CPU for our workload. Since one of the tests we are going to run is on the OS disk we do not want the Pod to run on the same node as any other disk-intensive application, such as Prometheus.

Look at which Pods are running with kubectl get pods -o wide and look for a node that does not have any disk-intensive application.

Then we tag that node with kubectl label nodes aks-nodepool1-37707184-2 tag=disktest. This allows us later to specify that we want to run our testing Pod on that specific node.

A StorageClass in Kubernetes is a specification of an underlying disk that Pods can request usage of through volumeClaimTemplates. AKS comes with a default StorageClass managed-premium that has caching enabled. Most of these tests require the Azure cache disabled so create a new StorageClass managed-premium-retain-nocache:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: managed-premium-retain-nocache
provisioner: kubernetes.io/azure-disk
reclaimPolicy: Retain
parameters:
  storageaccounttype: Premium_LRS
  kind: Managed
  cachingmode: None

You can add it to your cluster with:

kubectl apply -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2019-02-23-disk-performance-on-aks-part-1/storageclass.yaml

Next we create a StatefulSet that uses a volumeClaimTemplate to request a 250GB Azure disk. This provisions a P15 Azure Premium SSD with 125MB/s bandwidth and 1100 IOPS:

kubectl apply -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2019-02-23-disk-performance-on-aks-part-1/ubuntu-statefulset.yaml

Follow the progress of the Pod creation with kubectl get pods -w and wait until it is Running.

When the Pod is Running we can start a shell on it with kubectl exec -it disk-test-0 -- bash

Once inside bash on the Pod, we install fio:

apt-get update && apt-get install -y fio wget

And save the contents of jobs.fio in the Pod:

wget https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2019-02-23-disk-performance-on-aks-part-1/jobs.fio

Now we can run the different test sections one by one. PS: If you don’t specify a section fio will run all the tests simultaneously, which is not what we want.

fio --section=test1 jobs.fio
fio --section=test2 jobs.fio
fio --section=test3 jobs.fio
fio --section=test4 jobs.fio
fio --section=test5 jobs.fio
fio --section=test6 jobs.fio
fio --section=test7 jobs.fio
fio --section=test8 jobs.fio
fio --section=test9 jobs.fio

Test results

Test 1 - Learning to dislike Azure Cache

Sequential write, 4K block size, Azure Cache enabled, OS cache disabled. See full fio test results.

I run the first tests on the OS disk of a Kubernetes node. The OS disks have Azure caching enabled.

The first 1-2 minutes of the test I get very good performance of 45MB/s and ~11.500 IOPS but that drops to 0 very quickly as the cache is full and busy writing things to the underlying disk. When that happens everything freezes and I cannot even execute shell commands. After stopping the test the system still hangs for a bit while the cache empties.

The maximum latency measured by fio was 108751k usec. Or about 108 seconds!

For the first try of these tests a 20-30 second period of very fast writes (250MB/s) caused a 7-8 minute hang while the cache emptied. Trying again caused another pattern of lower peak performance with shorter hangs in between. Very unpredictable. I’m not sure what to make of this. It’s not acceptable that a Kubernetes node becomes unresponsive for many minutes following a short burst of writing. There are scattered recommendations online of disabling caching for write-heavy applications. I have not found any way to measure the Azure cache itself, the results are unpredictable and potentially very impactful, and the cache makes it very hard to use the metrics we do have to evaluate application and storage behaviour. I’ve therefore concluded that it’s best to use data disks with caching disabled for our workloads (you cannot disable caching on an AKS node OS disk).

Test 2 - Disable Azure Cache - enable OS cache

Sequential write, 4K block size. Change: Azure cache disabled, OS caching enabled. See full fio test results.

If we swap the Azure cache for the Linux OS cache we see that iowait increases while the writing occurs. The application sees high write performance until the number of Dirty bytes reaches a threshold of about 3.7GB of memory. The performance of the underlying disk is 125MB/s and 250 IOPS. Here we are throttled by the 125MB/s limit of the Azure P15 Premium SSD.

Also notice that on sequential writes of 4K with OS caching the actual blocks written to disk are 512K, which saves us a lot of IOPS. This will become important later.

Test 3 - Disable OS cache

Sequential write, 4K block size. Change: OS caching disabled. See full fio test results.

By disabling the OS cache (direct=1) the results are consistent and predictable. There is no iowait since the application does not have multiple writes pending at the same time. Because of the 2-3ms latency of the disks we are not able to get more than about 400 IOPS. This gives us a meager 1.5MB/s even though the disk is limited to 1100 IOPS and 125MB/s. To reach that we need multiple simultaneous writes or a bigger IO depth (queue). Disk active time is also 0% which indicates that the disk is not saturated.

Test 4 - Increase IO depth

Sequential write, 4K block size, OS caching disabled. Change: IO depth 16. See full fio test results.

For this test we only increase the IO depth from 1 to 16. IO depth is the number of write operations fio will execute simultaneously. Since we are using direct these will be queued by the OS for writing. We are now able to hit the performance limit of 1100 IOPS. Disk active time is now steady at 100% indicating that we have saturated the disk.
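The relationship between IO depth, latency and IOPS in tests 3 and 4 can be estimated with Little's law: operations in flight divided by the time each one takes. The Python sketch below is a rough model, not a measurement. The function name is my own, 2.5ms is picked as the midpoint of the 2-3ms latency seen in test 3, and it assumes latency stays the same as the queue grows, which stops being true once the disk is throttled.

```python
def estimated_iops(io_depth, latency_ms, iops_limit):
    """IOPS from Little's law, capped at the provisioned limit of the disk."""
    unthrottled = io_depth * 1000 / latency_ms
    return min(unthrottled, iops_limit)


print(estimated_iops(1, 2.5, 1100))   # test 3, IO depth 1: 400.0
print(estimated_iops(16, 2.5, 1100))  # test 4, IO depth 16: 1100
```

With one operation in flight the latency is the bottleneck. With 16 in flight the estimate is far above what a P15 is allowed to do, so the 1100 IOPS limit takes over.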

Test 5 - Larger block size, smaller IO depth

Sequential write, OS caching disabled. Change: 128K block size, IO depth 1. See full fio test results.

We increase the block size to 128KB and reduce the IO depth to 1 again. The write latency for larger blocks increases to ~5ms which gives us 200 IOPS and 28MB/s. The disk is not saturated.

Test 6 - Enable OS cache

Sequential write, 256K block size, IO depth 1. Change: OS caching enabled. See full fio test results.

We have now enabled the OS cache/buffer (direct=0). We can see that the writes hitting the disk are now merged to 512KB blocks. We are hitting the 125MB/s limit with about 250 IOPS. Enabling the cache also has other effects: CPU suddenly shows significant IO wait. The write latency shoots through the roof. Also note that the writing continued for 30-40 seconds after the test was done. This also means that the bandwidth and IOPS that fio sees and reports are higher than what is actually hitting the disk.

Test 7 - Random writes, small block size

IO depth 1, OS caching enabled. Change: Random write, 4K block size. See full fio test results.

Here we go from sequential writes to random writes. We are limited by IOPS. The average size of the blocks actually written to disk and the IOPS required to hit the bandwidth limit both vary a bit throughout the test. The time taken to empty the cache is about as long as I ran the test (4-5 minutes).

Test 8 - Large block size

Random write, OS caching enabled. Change: 256K block size, IO depth 16. See full fio test results.

Increasing the block size to 256K makes us bandwidth limited to 125MB/s.
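Whether a test ends up limited by IOPS or by bandwidth depends on the size of the writes that reach the disk. The Python sketch below classifies a block size against the P15 limits used in these tests (1100 IOPS, 125MB/s). The function name is hypothetical, it assumes decimal units, and it ignores latency and cache merging, so block_size_kb should be the size actually written to disk.

```python
def limiting_factor(block_size_kb, iops_limit=1100, bandwidth_limit_mb_s=125):
    """Return which disk limit is reached first for a given write size."""
    bandwidth_at_iops_limit = iops_limit * block_size_kb / 1000
    if bandwidth_at_iops_limit > bandwidth_limit_mb_s:
        return "bandwidth"
    return "IOPS"


print(limiting_factor(4))    # IOPS
print(limiting_factor(256))  # bandwidth
```

With these limits the crossover is at roughly 114KB per write. Smaller writes run out of IOPS first and larger writes run out of bandwidth first.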

Conclusion

Access patterns and block sizes have a tremendous impact on the amount of data we are able to write to disk.

Managed Kubernetes on Microsoft Azure (English)

2017-12-29T00:00:00+00:00

A few days ago I wrote a walkthrough of setting up Azure Container Service (AKS) in Norwegian. Someone asked me for an English version of that, and here it is.

Kubernetes (K8s) is becoming the de-facto standard for deploying container-based applications and workloads. Microsoft currently has its managed Kubernetes offering (Azure Kubernetes Service, AKS) in preview, which makes it easy to create a Kubernetes cluster and deploy workloads without the skill and time required to manage day-to-day operations of a Kubernetes cluster, which today can be complex and time consuming.

In this post we will set up a Kubernetes cluster from scratch using Azure CLI.

Table of contents

Background
- Docker containers
- Container orchestration
Getting started with Azure Kubernetes - AKS
Bonus material
- Deploying services with Helm
  - Deploy MineCraft with Helm
- Kubernetes Dashboard
Conclusion

Microsoft Azure

If you don’t have an Azure subscription already you can try services for $200 for 30 days. The VM size Standard_B2s is Burstable, has 2vCPU, 4GB RAM, 8GB temp storage and costs roughly $38 / month. For $200 you can have a cluster of 3-4 B2s nodes plus traffic, loadbalancers and other additional costs.

We have no affiliation with Microsoft Azure except their sponsorship of our startup DataDynamics with cloud services for 24 months in their BizSpark program.

Background

Docker containers

We will not do a deep dive on Docker containers in this post, but here is a summary for those who are not familiar with it.

Docker is a way to package software so that it can run on the most popular platforms without worrying about installation, dependencies and to a certain degree, configuration.

In addition, a Docker container uses the operating system of the host machine when it runs. Because of this it’s possible to run many more containers on the same host machine compared to running virtual machines.

Here is an incomplete and rough comparison between a Docker container and a virtual machine:

	Virtual machine	Docker container
Image size	from 200MB to many GB	from 10MB to 300-400MB
Startup time	60 seconds +	1-10 seconds
Memory usage	256MB-512MB-1GB +	2MB +
Security	Good isolation between VMs	Not as good isolation between containers
Building image	Minutes	Seconds

PS The numbers for virtual machines are taken from memory. I tried starting a MySQL virtual appliance on my laptop but VMware Player refuses to run because of Windows Hyper-V incompatibility. VMware Workstation refuses to run because of license issues and Oracle VirtualBox repeatedly gives me a nasty bluescreen. Hooray!

Protip The smallest and fastest Docker images are built on Alpine Linux. For the webserver Nginx the Alpine-based image is 15MB compared to 108MB for the normal Debian-based image. PostgreSQL:Alpine is 38MB compared to 287MB with “full” OS. The latest version of MySQL is 343MB but will in version 8 support Alpine Linux as well.

To recap, some of the advantages of Docker containers are:

Compatibility across platforms, Linux, Windows, MacOS.
10-100x smaller size. Faster to download, build and upload.
Memory usage only for application and not base OS.
- Advantage when developing. Ability to run 10-20-30 containers on a development laptop.
- Advantage in production. Can reduce hardware/cloud costs considerably.
Near instant startup. Makes dynamic scaling of applications easier.

Download Docker for Windows here.

To start a MySQL database container from Windows CMD or Powershell:

docker run --name mysql -p 3306:3306 -e MYSQL_RANDOM_ROOT_PASSWORD=true mysql

Stop the container with:

docker kill mysql

You can search for already built Docker images on Docker Hub. It’s also possible to create private Docker repositories for your own software that you don’t want to be publicly available.

Container orchestration

Now that Docker container images have become the preferred way to package and distribute software on the Linux platform, a need has emerged for systems to coordinate running and deploying these containers, similar to the ecosystem of products VMware has built up around development and operation of virtual machines.

Container orchestration systems have the responsibility for:

Load balancing.
Service discovery.
Health checks.
Automatic scaling and restarting of host nodes and containers.
Zero downtime upgrades (rolling deploys).

Until recently the ecosystem around container orchestration has been fragmented, and the most popular alternatives have been:

Kubernetes (Originally from Google, now managed by CNCF, the Cloud Native Computing Foundation)
Swarm (From the maker of Docker)
Mesos (From Apache Software Foundation)
Fleet (From CoreOS)

But over the last year there has been a convergence towards Kubernetes as the preferred solution.

7 February
- CoreOS announces that they are removing Fleet from Container Linux and recommends Kubernetes
27 July
- Microsoft joins the CNCF
9 August
- Amazon Web Services joins the CNCF
29 August
- VMware and Pivotal join the CNCF
17 September
- Oracle joins the CNCF
17 October
- Docker announces native support for Kubernetes in addition to its own Swarm product
24 October
- Microsoft Azure announces the managed Kubernetes service AKS
29 November
- Amazon Web Services announces the managed Kubernetes service EKS

Especially the last two news items are important. Deploying and running your own Kubernetes-installation requires time and skills (Read how Stripe used 5 months to trust running Kubernetes in production, just for batch jobs.)

Until now the choice has been running your own Kubernetes cluster or using Google Container Engine which has been using Kubernetes since 2014. Many of us feel a certain discomfort by locking ourselves to one provider. But this is changing now that you can develop infrastructure on Kubernetes and choose between the 3 large cloud providers in addition to running your own cluster if wanted. *

* Kubernetes is a fast moving project, and features might be available on the different platforms on different timelines.

Getting started with Azure Kubernetes - AKS

Caveats

This guide is based on the documentation on Microsoft.com. Setting up an Azure Kubernetes cluster did not work in the beginning of December, but today, 23. December, it seems to work fairly well. But upgrading the cluster from Kubernetes 1.7 to 1.8, for example, does NOT work.

AKS is in Preview and Azure are working continuously to make AKS stable and to support as many Kubernetes-features as possible. Amazon Web Services has a similar closed invite-only Preview currently while working on stability and features.

Both Azure and AWS have said they expect their Kubernetes offerings to be ready for production in 2018.

Preparations

You need Azure-CLI (version 2.0.21 or newer) to execute the az commands.

All commands executed in Windows PowerShell.

Log on to Azure:

az login

You will get a link to open in your browser together with an authentication code. Enter the code on the webpage and az login will save the login information so that you will not have to authenticate again on the same machine.

PS The login information gets saved in C:\Users\Username\.azure\. Make sure nobody else can access these files, since anyone who can will have full access to your Azure account.

Activate ContainerService

Since AKS is in Preview/Beta, you explicitly have to activate it in your subscription to get access to the aks subcommands.

az provider register -n Microsoft.ContainerService
az provider show -n Microsoft.ContainerService

Create a resource group

Here we create a resource group named “my_aks_rg” in Azure region West Europe.

az group create --name my_aks_rg --location westeurope

Protip To see a list of all available Azure regions, use the command az account list-locations --output table. PS AKS might not be available in all regions yet!

Create Kubernetes cluster

az aks create --resource-group my_aks_rg --name my_cluster --node-count 3 --generate-ssh-keys --node-vm-size Standard_B2s --node-osdisk-size 128 --kubernetes-version 1.8.2

--node-count
- Number of agent(host) nodes available to run containers
--generate-ssh-keys
- Creates and prints an SSH key that can be used to SSH directly to the agent nodes.
--node-vm-size
- The size of the Azure VMs the agent nodes are created as. To see available sizes, use az vm list-sizes -l westeurope --output table and Microsoft's web pages.
--node-osdisk-size
- Disk size of the agent nodes in GB. PS Containers can be stopped and moved to another host if Kubernetes finds it necessary or if an agent node disappears. All data saved locally in the container will then be gone. To save data permanently, use Kubernetes PersistentVolumes and not the local agent node or container disks.
--kubernetes-version
- Which Kubernetes version to install. Azure does NOT necessarily install the latest version by default, and currently upgrading with az aks upgrade does not work. The latest version available right now is 1.8.2. Using the latest available version is recommended, since there are a lot of changes from version to version. The documentation is also much better for newer versions.

Save the output of the command in a file in a secure location. It contains keys that can be used to connect to the cluster with SSH, even though that should, in theory, not be necessary.

Install kubectl

kubectl is the client which performs all operations against your Kubernetes cluster. Azure CLI can install kubectl for you:

az aks install-cli

After kubectl is installed we need to get login information so that kubectl can communicate with the Kubernetes cluster.

az aks get-credentials --resource-group my_aks_rg --name my_cluster

The login information is saved in C:\Users\Username\.kube\config. Keep these files secure as well.

Protip When you have several Kubernetes clusters, you can list them with kubectl config get-contexts and change which one kubectl talks to with kubectl config use-context my_cluster.

Inspect cluster

To check that the cluster and kubectl work, we start with a couple of commands.

See all agent nodes and status:

> kubectl get nodes
NAME                       STATUS    AGE       VERSION
aks-nodepool1-16970026-0   Ready     15m       v1.8.2
aks-nodepool1-16970026-1   Ready     15m       v1.8.2
aks-nodepool1-16970026-2   Ready     15m       v1.8.2

See all services, pods and deployments:

> kubectl get all --all-namespaces

NAMESPACE     NAME                                          READY     STATUS    RESTARTS   AGE
kube-system   po/kubernetes-dashboard-6fc8cf9586-frpkn      1/1       Running   0          3d

NAMESPACE     NAME                          CLUSTER-IP     EXTERNAL-IP     PORT(S)           AGE
kube-system   svc/kubernetes-dashboard      10.0.161.132   <none>          80/TCP            3d

NAMESPACE     NAME                             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deploy/kubernetes-dashboard      1         1         1            1           3d

NAMESPACE     NAME                                    DESIRED   CURRENT   READY     AGE
kube-system   rs/kubernetes-dashboard-6fc8cf9586      1         1         1         3d

This is just some of the output from this command. You do not have to know what the resources in the kube-system namespace do. That is part of the point of having Microsoft manage the cluster for us.

Namespaces Kubernetes has a concept called Namespaces. Resources in one namespace do not automatically have access to resources in another namespace. The services that run Kubernetes itself use the namespace kube-system. By default, the kubectl command only shows you resources in the default namespace, unless you specify --all-namespaces or --namespace=xx.

Start some nginx containers

In Kubernetes the smallest unit you run is called a Pod: one or more containers that run together. In this guide each Pod runs a single container.

nginx is a fast and flexible web server.

Now that the cluster is up we can start rolling out services and deployments on it.

Let's start by creating a Deployment consisting of 3 containers, all running the nginx:mainline-alpine image from Docker Hub.

nginx-dep.yaml looks like this:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:mainline-alpine
        ports:
        - containerPort: 80

Load this into the cluster with kubectl create:

kubectl create -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2017-12-23-managed-kubernetes-on-azure/nginx-dep.yaml

This command creates the resources described in the file. kubectl can read files either from your local disk or from a web URL.

After making changes to a resource definition (.yaml file), you can update the resources in the cluster with kubectl replace -f resource.yaml.

We can verify that the Deployment is ready:

> kubectl get deploy
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3         3         3            3           10m

We can also get the actual Pods that are running:

> kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-569477d6d8-dqwx5   1/1       Running   0          10m
nginx-deployment-569477d6d8-xwzpw   1/1       Running   0          10m
nginx-deployment-569477d6d8-z5tfk   1/1       Running   0          10m

Logs We can view logs from one Pod with kubectl logs nginx-deployment-569477d6d8-xwzpw. But since in this case we don't know which Pod ends up getting an incoming request, we can view logs from all the Pods that have the app=nginx label: kubectl logs -lapp=nginx. The label app=nginx is our own choice, made in nginx-dep.yaml when we set spec.template.metadata.labels: app: nginx.

Making nginx available with a service

To send traffic to our new Pods we need to create a Service. A service consists of one or more Pods which are chosen based on different criteria, for example which labels they have and whether the Pods are Running and Ready.
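To make the selection rule concrete, here is a minimal Python sketch of how a label selector picks Pods. It is a toy model and not Kubernetes code: the Pod class, the select_pods function and the example Pods are made up for illustration, and in a real cluster readiness is reported by Kubernetes itself.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Toy stand-in for a Kubernetes Pod (hypothetical, for illustration only)."""
    name: str
    labels: dict = field(default_factory=dict)
    ready: bool = True


def select_pods(pods, selector):
    """Return the names of Ready Pods whose labels contain every selector pair."""
    return [
        pod.name
        for pod in pods
        if pod.ready and all(pod.labels.get(k) == v for k, v in selector.items())
    ]


pods = [
    Pod("nginx-a", {"app": "nginx"}),
    Pod("nginx-b", {"app": "nginx"}, ready=False),  # not Ready, so it gets no traffic
    Pod("redis-a", {"app": "redis"}),               # label does not match
]

endpoints = select_pods(pods, {"app": "nginx"})
print(endpoints)
```

Only the Pod that both carries the label and is Ready ends up behind the service in this model.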

Let's create a service which forwards traffic to all Pods that have the label app: nginx and are listening on port 80. In addition we make the service available via a LoadBalancer:

nginx-svc.yaml looks like this:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  type: LoadBalancer
  ports:
  - port: 80
    name: http
    targetPort: 80
  selector:
    app: nginx

We tell Kubernetes to create our service with kubectl create as usual:

kubectl create -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2017-12-23-managed-kubernetes-on-azure/nginx-svc.yaml

We can then wait and see which IP-address Azure assigns our service:

> kubectl get svc -w
NAME         CLUSTER-IP   EXTERNAL-IP     PORT(S)        AGE
nginx        10.0.24.11   13.95.173.255   80:31522/TCP   15m

PS It can take a few minutes for Azure to allocate and assign a Public IP for us. In the meantime <pending> will appear under EXTERNAL-IP.
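If you want a script to wait for the address instead of watching the terminal, the text output can be inspected programmatically. The following is a small sketch in Python under stated assumptions: the external_ip function is hypothetical, and it assumes the whitespace-separated column layout shown above with no empty cells. For real automation, structured output such as kubectl get svc nginx -o json is likely a more robust choice.

```python
def external_ip(table_text, service_name):
    """Return the EXTERNAL-IP of a service from `kubectl get svc` text output.

    Returns None while the address is still <pending>, when it is <none>,
    or when the service is not listed at all.
    """
    lines = [line.split() for line in table_text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    ip_column = header.index("EXTERNAL-IP")
    for row in rows:
        if row[0] == service_name:
            value = row[ip_column]
            return None if value in ("<pending>", "<none>") else value
    return None


assigned = """
NAME         CLUSTER-IP   EXTERNAL-IP     PORT(S)        AGE
nginx        10.0.24.11   13.95.173.255   80:31522/TCP   15m
"""

waiting = """
NAME         CLUSTER-IP   EXTERNAL-IP     PORT(S)        AGE
nginx        10.0.24.11   <pending>       80:31522/TCP   1m
"""

print(external_ip(assigned, "nginx"))
print(external_ip(waiting, "nginx"))
```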

A simple Welcome to nginx webpage should now be available on http://13.95.173.255 (remember to replace with your own External-IP).

We can also delete the service and deployment afterwards:

kubectl delete svc nginx
kubectl delete deploy nginx-deployment

Scaling the cluster

If we want to change the number of agent nodes running Pods we can do that via Azure-CLI:

az aks scale --name my_cluster --resource-group my_aks_rg --node-count 5

Currently all nodes will be created with the same size as when we created the cluster. AKS will probably get support for node-pools next year. That will allow for creating different groups of nodes with different size and operating systems, both Linux and Windows.

Delete cluster

You can delete the whole cluster like this:

az aks delete --name my_cluster --resource-group my_aks_rg --yes

Bonus material

Here is some bonus material if you want to go a bit further with Kubernetes.

Deploying services with Helm

Helm is a package manager and library of software that is ready to be deployed on a Kubernetes cluster.

Start by downloading the Helm-client. It will read login information etc. from the same location as kubectl automatically.

Install the Helm-server (Tiller) on the Kubernetes cluster and update the package library:

helm init
helm repo update

See available packages (Charts) with helm search.

Deploy MineCraft with Helm

Let's deploy a MineCraft server installation on our cluster, just because we can :-)

helm install --name stians --set minecraftServer.eula=true stable/minecraft

--set overrides one or more of the standard values configured in the package. The MineCraft package is made in a way where it does not start without accepting the user license agreement by setting the variable minecraftServer.eula. All the variables that can be set in the MineCraft package are documented here.
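The idea behind --set can be sketched in a few lines of Python: a dotted key path points into the package's nested default values, and the override replaces only that one entry. This is a simplified model and not how Helm is implemented; the apply_set function and the chart defaults below are hypothetical, and real Helm also handles lists, escaping and type conversion.

```python
import copy


def apply_set(defaults, assignment):
    """Return a copy of `defaults` with one `key.path=value` override applied."""
    path, value = assignment.split("=", 1)
    values = copy.deepcopy(defaults)
    node = values
    keys = path.split(".")
    # Walk (or create) the nested dictionaries down to the parent of the final key.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return values


# Hypothetical chart defaults, for illustration only.
chart_defaults = {"minecraftServer": {"eula": "false", "difficulty": "easy"}}

values = apply_set(chart_defaults, "minecraftServer.eula=true")
print(values)
```

Everything that is not mentioned in the override keeps its default value, which is why a single --set is enough to accept the license agreement.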

Then we wait for Azure to assign us a Public IP:

> kubectl get svc -w
stians-minecraft   10.0.237.0   13.95.172.192   25565:30356/TCP   3m

Now we can connect to our MineCraft server on 13.95.172.192:25565!

Kubernetes Dashboard

Kubernetes also has a graphical web user interface which makes it a bit easier to see which resources are in the cluster, view logs and even open a remote shell inside a running Pod, among other things.

> kubectl proxy
Starting to serve on 127.0.0.1:8001

kubectl encrypts and tunnels the traffic to the Kubernetes API servers. The dashboard is available on http://127.0.0.1:8001/ui/.

Conclusion

I hope you enjoy Kubernetes as much as I have. The learning curve can be a bit steep in the beginning, but it does not take long before you are productive.

Look at the official guides on Kubernetes.io to learn more about defining different types of resources and services to run on Kubernetes. PS: There are big changes from version to version so make sure you use the documentation for the correct version!

Kubernetes also has a very active Slack community on kubernetes.slack.com that is worth checking out.

Managed Kubernetes on Microsoft Azure (Norwegian)

2017-12-25T00:00:00+00:00

Update 29. Dec: There is an English version of this post here.

Kubernetes (K8s) is becoming the de-facto standard for deployments of container-based applications. Microsoft now has a preview of its managed Kubernetes service (Azure Kubernetes Service, AKS), which makes it easy to create a Kubernetes cluster and roll out services without needing the skills and time for the day-to-day operation of the Kubernetes cluster itself, which today can be relatively complicated and time-consuming.

In this post we set up a Kubernetes cluster from scratch using the Azure CLI.

Table of contents

Background
- Docker containers
- Container orchestration
Getting started with Azure Kubernetes - AKS
Bonus material
- Rolling out services with Helm packages
  - MineCraft server with Helm
- Kubernetes Dashboard
Conclusion

Microsoft Azure

If you do not already have Azure, you can try the services with $200 of credit for 30 days. The VM type Standard_B2s is burstable, has 2 vCPUs, 4 GB RAM and 8 GB of temp storage, and costs ~$38 per month. For $200 you can run a cluster of 3-4 B2s nodes plus traffic costs, load balancers and other necessary services.

We have no affiliation with Microsoft, apart from them sponsoring our startup DataDynamics with cloud services for 24 months through their BizSpark program.

Background

Docker containers

We will not cover Docker containers in depth in this post, but here is a short summary for those who are not familiar with the technology.

Docker is a way of packaging software so that it can run on all popular platforms without spending much time on dependencies, setup and configuration.

In addition, a Docker container uses the operating system of the host machine when it runs. This means you can run many more containers on the same host compared with virtual machines.

Here is an incomplete and rough comparison between a Docker container and a virtual machine:

	Virtual machine	Docker container
Image size	from 200MB to many GB	from 10MB to 300-400MB
Startup time	60 seconds +	1-10 seconds
Memory usage	256MB-512MB-1GB +	2MB +
Security	Good isolation between VMs	Weaker isolation between containers
Building an image	Minutes	Seconds

PS The numbers for virtual machines are from memory. I tried to start a MySQL virtual appliance on my laptop, but VMware Player refuses to run because of incompatibility with Windows Hyper-V, VMware Workstation refuses to run because of an expired license, and Oracle VirtualBox gives a nasty bluescreen time after time. Hooray!

Protip The smallest and fastest Docker images are built on Alpine Linux. For the web server Nginx, the Alpine-based image is 15MB compared with 108MB for the Debian-based image. PostgreSQL:Alpine is 38MB compared with 287MB. The latest version of MySQL is 343MB, but version 8 will support Alpine Linux as well.
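To put the image sizes in perspective, here is a small Python calculation using the numbers from the Protip above. The link speed is a made-up assumption that is only there to turn megabytes into seconds; the download_seconds helper is hypothetical as well.

```python
# Image sizes in MB, as quoted in the Protip above.
image_sizes_mb = {
    "nginx": {"regular": 108, "alpine": 15},
    "postgresql": {"regular": 287, "alpine": 38},
}

# Hypothetical download speed, only used to make the difference tangible.
LINK_MBIT_PER_S = 50


def download_seconds(size_mb, link_mbit_per_s=LINK_MBIT_PER_S):
    """Seconds needed to pull an image of `size_mb` megabytes over the link."""
    return size_mb * 8 / link_mbit_per_s


# How many times larger the regular image is than the Alpine-based one.
ratios = {
    name: sizes["regular"] / sizes["alpine"]
    for name, sizes in image_sizes_mb.items()
}

for name, sizes in image_sizes_mb.items():
    print(
        f"{name}: {ratios[name]:.1f}x smaller, "
        f"{download_seconds(sizes['regular']):.1f}s vs "
        f"{download_seconds(sizes['alpine']):.1f}s to download"
    )
```

For both images the Alpine variant is a bit more than 7 times smaller, and the saving is paid back every time a node has to pull the image.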

Some of the advantages of Docker containers are:

Compatibility across platforms: Linux, Windows and macOS.
10-100x smaller size. Faster to download, faster to build, faster to upload.
Memory is used only by the application, not by a separate OS.
- An advantage during development: you can run 10, 20 or 30 Docker containers simultaneously on a laptop.
- An advantage in production: you can reduce hardware costs considerably.
Startup in a few seconds. Makes dynamic scaling of applications much easier.

Download Docker for Windows here.

Then start a MySQL database from Windows CMD or PowerShell:

docker run --name mysql -p 3306:3306 -e MYSQL_RANDOM_ROOT_PASSWORD=true mysql

Stop the container with:

docker kill mysql

You can search for ready-made Docker images on Docker Hub. It is also possible to create private Docker repositories for your own software that should not be available to the outside world.

Container orchestration

As Docker containers have become the preferred way of packaging and distributing software on the Linux platform over the last couple of years, a need has emerged for systems that can coordinate the operation and rollout of these containers. This is not unlike the ecosystem of products VMware has built around the development and operation of virtual machines.

The job of a container orchestration system is to take care of:

Load balancing.
Service discovery.
Health checks.
Automatic scaling and restarting of hosts and containers.
Upgrades without downtime (rolling deploy).

Until recently the ecosystem around container orchestration has been fragmented, and the most popular alternatives have been:

Kubernetes (originally from Google, now governed by CNCF, the Cloud Native Computing Foundation)
Swarm (from the company behind Docker)
Mesos (from the Apache Software Foundation)
Fleet (from CoreOS)

But over the last year there has been a convergence towards Kubernetes as the preferred solution.

7 February
- CoreOS announces that it is removing Fleet from Container Linux and recommends Kubernetes
27 July
- Microsoft joins CNCF
9 August
- Amazon Web Services joins CNCF
29 August
- VMware and Pivotal join CNCF
17 September
- Oracle joins CNCF
17 October
- Docker announces native support for Kubernetes in addition to its own Swarm product
24 October
- Microsoft Azure announces managed Kubernetes with the AKS service
29 November
- Amazon Web Services announces managed Kubernetes with the EKS service

The last two announcements are especially important. Operating your own Kubernetes installation takes time and skill. (Read how Stripe spent 5 months becoming comfortable with operating its own Kubernetes cluster, just for batch jobs.)

Until now the choice has been between running your own Kubernetes cluster and using Google Container Engine, which has been built on Kubernetes since 2014. Many of us feel a certain discomfort about locking ourselves to one provider. This is now different: you can develop infrastructure on Kubernetes and choose almost freely * between the three large cloud providers, in addition to running your own cluster if you want to.

* Kubernetes is developing quickly, and features often do not become available on the different platforms at the same time.

Creating an Azure Kubernetes cluster

Caveats

This walkthrough is based on the documentation on Microsoft.com. Setting up an Azure Kubernetes cluster did not work at the beginning of December, but as of today, 23 December, it seems to work fairly well. Upgrading a cluster, for example from Kubernetes 1.7 to 1.8, does NOT work, however.

AKS is in Preview and Azure is working continuously to make AKS stable and to support as many Kubernetes features as possible. Amazon Web Services currently has a similar closed, invite-only Preview while it also works on stability and features.

Both Azure and AWS say they expect their Kubernetes services to be ready for production environments during 2018.

Preparations

You need Azure CLI (version 2.0.21 or newer) to run the commands.

All commands are run in Windows PowerShell.

Azure login

Log on to Azure:

az login

You will get a link to open in your browser, together with an authentication code. Enter the code on the web page and az login saves the login information so that you do not have to authenticate again on the same machine.

PS The login information is saved in C:\Users\Username\.azure\. You have to make sure nobody copies these files. Anyone who has them gets full access to your Azure account.

Activate ContainerService

Since AKS is in Preview/Beta, you have to activate it explicitly to get access to the aks commands.

az provider register -n Microsoft.ContainerService
az provider show -n Microsoft.ContainerService

Create a resource group

Here we create a resource group named “min_aks_rg” in the Azure region West Europe.

az group create --name min_aks_rg --location westeurope

Protip To see a list of available Azure regions, use the command az account list-locations --output table. PS AKS might not be available in all regions yet.

Create the Kubernetes cluster

az aks create --resource-group min_aks_rg --name mitt_cluster --node-count 3 --generate-ssh-keys --node-vm-size Standard_B2s --node-osdisk-size 256 --kubernetes-version 1.8.2

--node-count
- Number of host machines available to run containers
--generate-ssh-keys
- Creates and outputs an SSH key that can be used to SSH directly to the hosts.
--node-vm-size
- Which type of Azure VM the cluster should consist of. To see available sizes, use az vm list-sizes -l westeurope --output table and Microsoft's web pages.
--node-osdisk-size
- Disk size of the hosts in GB. PS Containers can be stopped and moved to another host when needed or if a host disappears. All data stored locally in the container is then lost. To store things permanently, use PersistentVolumes and not the local disk on the host.
--kubernetes-version
- Which Kubernetes version to install. Azure does NOT install the latest version by default, and as of today az aks upgrade does not work well enough. The latest available version as of today is 1.8.2. It is an advantage to use the latest version, since Kubernetes improves a lot from version to version. The documentation is also much better for newer versions.

Save the text the command outputs in a file in a safe place. It contains keys that can be used to connect to the cluster with SSH, even though that should, in theory, not be necessary.

Install kubectl

kubectl is the client that performs all operations against your Kubernetes cluster. Azure CLI can install kubectl for you:

az aks install-cli

After kubectl is installed, we need to get login information so that kubectl can communicate with the Kubernetes cluster.

az aks get-credentials --resource-group min_aks_rg --name mitt_cluster

The login information is saved in C:\Users\Username\.kube\config. Keep these files secret as well.

Protip When you have several different Kubernetes clusters, you can list them with kubectl config get-contexts and switch which one kubectl talks to with kubectl config use-context mitt_cluster.

Inspect cluster

To check that the cluster and kubectl work, we start with a few commands.

See all hosts and their status:

> kubectl get nodes
NAME                       STATUS    AGE       VERSION
aks-nodepool1-16970026-0   Ready     15m       v1.8.2
aks-nodepool1-16970026-1   Ready     15m       v1.8.2
aks-nodepool1-16970026-2   Ready     15m       v1.8.2

See all services, pods and deployments:

> kubectl get all --all-namespaces

NAMESPACE     NAME                                          READY     STATUS    RESTARTS   AGE
kube-system   po/kubernetes-dashboard-6fc8cf9586-frpkn      1/1       Running   0          3d

NAMESPACE     NAME                          CLUSTER-IP     EXTERNAL-IP     PORT(S)           AGE
kube-system   svc/kubernetes-dashboard      10.0.161.132   <none>          80/TCP            3d

NAMESPACE     NAME                             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deploy/kubernetes-dashboard      1         1         1            1           3d

NAMESPACE     NAME                                    DESIRED   CURRENT   READY     AGE
kube-system   rs/kubernetes-dashboard-6fc8cf9586      1         1         1         3d

I have only included a small excerpt of the output from this command. You do not need to understand what all the resources in the kube-system namespace do. Not having to is the point of letting Microsoft manage the cluster itself.

Namespaces Kubernetes has a concept called Namespaces. Resources in one namespace do not automatically have access to resources in another namespace. The services that Kubernetes itself uses are installed in the kube-system namespace. The kubectl command normally only shows you resources in the default namespace unless you specify --all-namespaces or --namespace=xx.

Starting some nginx containers

In Kubernetes the smallest unit you run is called a Pod: one or more containers that run together. In this guide each Pod runs a single container.

nginx is a fast and flexible web server.

Now that the cluster is up and running we can start rolling out services and deployments on it.

We start by creating a Deployment consisting of 3 containers that all run the nginx:mainline-alpine image from Docker Hub.

nginx-dep.yaml looks like this:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:mainline-alpine
        ports:
        - containerPort: 80

Load this into the cluster with kubectl create:

kubectl create -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2017-12-23-managed-kubernetes-on-azure/nginx-dep.yaml

This command creates the resources described in the file. kubectl can read files either locally from your machine or from a URL.

After you have made changes to a resource definition (.yaml file), you can update the resources in the cluster with kubectl replace -f resource.yaml.

We can verify that the Deployment is ready:

> kubectl get deploy
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3         3         3            3           10m

We can also get the actual Pods that have been started:

> kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-569477d6d8-dqwx5   1/1       Running   0          10m
nginx-deployment-569477d6d8-xwzpw   1/1       Running   0          10m
nginx-deployment-569477d6d8-z5tfk   1/1       Running   0          10m

Logs We can view logs from one Pod with kubectl logs nginx-deployment-569477d6d8-xwzpw. But since in this case we do not know which Pod ends up receiving incoming requests, we can view logs from all Pods that have the app=nginx label: kubectl logs -lapp=nginx. That we use app=nginx here is something we decided ourselves in nginx-dep.yaml when we set spec.template.metadata.labels: app: nginx.

Making nginx available with a service

To communicate with our new Pods we need to create a service (Service). A service consists of one or more Pods that are selected based on different criteria, among them which labels they have and whether the Pods in question are Running and Ready.

Now we create a service that routes traffic to all Pods that have the label app: nginx and listen on port 80. In addition we make the service available via a LoadBalancer:

nginx-svc.yaml looks like this:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  type: LoadBalancer
  ports:
  - port: 80
    name: http
    targetPort: 80
  selector:
    app: nginx

We ask Kubernetes to create our service with kubectl create as usual:

kubectl create -f https://raw.githubusercontent.com/StianOvrevage/stian.tech/master/images/2017-12-23-managed-kubernetes-on-azure/nginx-svc.yaml

Then we can see which IP address Azure has assigned to our service:

> kubectl get svc -w
NAME         CLUSTER-IP   EXTERNAL-IP     PORT(S)        AGE
nginx        10.0.24.11   13.95.173.255   80:31522/TCP   15m

PS It can take a couple of minutes for Azure to assign our service a Public IP. In the meantime <pending> will be shown under EXTERNAL-IP.

A simple Welcome to nginx web page should now be available on http://13.95.173.255 (remember to replace it with your own External-IP).

We now have a load-balanced nginx service with 3 servers ready to receive traffic.

For good measure we can delete the service and deployment afterwards:

kubectl delete svc nginx
kubectl delete deploy nginx-deployment

Scaling the cluster

If you want to change the number of hosts/nodes running Pods, you can do that via Azure CLI:

az aks scale --name mitt_cluster --resource-group min_aks_rg --node-count 5

Currently all nodes are created with the same size as when the cluster was created. AKS will probably get support for node pools during the next year. You will then be able to create groups of nodes with different sizes and operating systems, both Linux and Windows.

Delete cluster

You can delete the whole cluster like this:

az aks delete --name mitt_cluster --resource-group min_aks_rg --yes

Bonus material

Here is some bonus material if you want to go a bit further with Kubernetes.

Rolling out services with Helm

Helm is a package manager and a library of software that is ready to be rolled out in a Kubernetes cluster.

Start by downloading the Helm client. It automatically gets login information etc. from the same place as kubectl.

Install the Helm server (Tiller) on the Kubernetes cluster and update the package library:

helm init
helm repo update

See available packages (Charts) with: helm search.

Rolling out MineCraft with Helm

Let's roll out a MineCraft installation on our cluster, because we can :-)

helm install --name stian-sin --set minecraftServer.eula=true stable/minecraft

--set overrides one or more of the default values set in the package. The MineCraft package is made so that it does not start unless you have agreed to the terms of use in the variable minecraftServer.eula. All the variables that can be overridden in the MineCraft package are documented here.

Then we wait a little for Azure to assign a Public IP:

> kubectl get svc -w
stian-sin-minecraft   10.0.237.0   13.95.172.192   25565:30356/TCP   3m

And just like that we can connect to Minecraft on 13.95.172.192:25565.

Kubernetes Dashboard

Kubernetes also has a graphical web interface that makes it a bit easier to see which resources are in the cluster, view logs and open a remote shell inside a running Pod, among other things.

> kubectl proxy
Starting to serve on 127.0.0.1:8001

kubectl encrypts and tunnels the traffic to the Kubernetes API servers. The dashboard is available on http://127.0.0.1:8001/ui/.

Conclusion

I hope you have got a taste for Kubernetes. The learning curve can be a bit steep in the beginning, but it does not take very long before you are productive.

Look at the official guides on Kubernetes.io to learn more about how to define different types of resources and services to run on Kubernetes. PS: There are big changes from version to version, so make sure you use the documentation for the correct version!

Kubernetes also has a very active Slack community on kubernetes.slack.com. There is also a channel there for Norwegian Kubernetes users: #norw-users.

Next generation monitoring with OpenTSDB

2014-06-02T19:56:40+00:00

In this paper we will provide a step-by-step guide on how to install a single instance of OpenTSDB using the latest versions of the underlying technology, Hadoop and HBase. We will also provide some background on the state of existing monitoring solutions.

Table of contents

Abstract
Background
The monitoring revolution
Setting up a single node OpenTSDB instance on Debian 7 Wheezy
Installing OpenTSDB
Feeding data into OpenTSDB
Performance comparison

Background

Since its inception in 1999, rrdtool (the underlying storage mechanism of the once universal MRTG) has been the base of many popular monitoring solutions: Cacti, collectd, Ganglia, Munin, Observium, OpenNMS and Zenoss, to name a few.

There are a number of problems with the current approach and we will highlight some of these here.

Please note that this includes Graphite and its backend Whisper, which is based on the same basic design as rrdtool and has some of the same limitations.

Performance problems - Welcome to I/O-hell

When MRTG and rrdtool were created, preserving disk space was more important than preserving disk operations, and the default collection interval was 5 minutes (which many still use). The way rrdtool is designed, it requires quite a few random reads and writes per datapoint. It also re-reads old data, computes the average and writes it back according to the RRA rules defined, which causes additional I/O load. In 2014 memory is cheap, disk storage is cheap and CPU is fairly cheap. Disk I/O operations (IOPS), however, are still very expensive in terms of hardware. The recent maturing of SSDs provides extreme amounts of IOPS for a reasonable price, but the drives are small compared with spindle drives. The result is that in order to scale you currently need either many small SSDs to get the required space, or many low-IOPS spindle drives to get the required IOPS:

Samsung EVO 840 1TB SSD - 98.000 IOPS - 470 USD

Seagate Barracuda 3TB - 240 IOPS - 110 USD

You would need $44.880 (408 drives) worth of spindle drives in order to match a single SSD drive in terms of I/O-performance. On the other hand a $2.000 array of spindle drives would get you a net ~54 TB of space. The cost of SSD to reach the same volume would be $25.380. Not to mention the cost of servers, power, provisioning, etc.
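The arithmetic behind these figures can be reproduced with a few lines of Python. The numbers are the ones from the two drives listed above; the variable names are only for illustration, and drive counts are rounded down to whole drives as in the text.

```python
# Figures from the two drives listed above (2014 consumer prices).
ssd = {"iops": 98_000, "size_tb": 1, "price_usd": 470}
hdd = {"iops": 240, "size_tb": 3, "price_usd": 110}

# Spindle drives needed to match the IOPS of a single SSD.
hdd_for_iops = ssd["iops"] // hdd["iops"]
hdd_cost_for_iops = hdd_for_iops * hdd["price_usd"]

# Capacity you get from roughly $2,000 worth of spindle drives.
hdd_for_budget = 2_000 // hdd["price_usd"]
capacity_tb = hdd_for_budget * hdd["size_tb"]

# SSDs needed to reach the same capacity, and what they cost.
ssd_for_capacity = capacity_tb // ssd["size_tb"]
ssd_cost_for_capacity = ssd_for_capacity * ssd["price_usd"]

print(f"{hdd_for_iops} spindle drives (${hdd_cost_for_iops}) match one SSD in IOPS")
print(f"{hdd_for_budget} spindle drives give {capacity_tb} TB")
print(f"{ssd_for_capacity} SSDs (${ssd_cost_for_capacity}) give the same capacity")
```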

Note: These are the cheapest available bulk consumer drives; comparable OEM drives (SSD, spindle) for an HP server will be 6 to 30 times more expensive.

In rrdtool version 1.4, released in 2009, rrdcached was introduced as a caching daemon for buffering multiple data updates and reducing the number of random I/O operations by writing several related datapoints in sequence. It took a couple of years before this new feature was implemented in most of the common open source monitoring solutions.
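The effect of buffering can be illustrated with a toy model in Python. This is not how rrdcached is implemented: the BufferedWriter class, the count-based flush threshold and the file names are all hypothetical, and the model simply assumes that every flush of a file costs one random write no matter how many datapoints it carries.

```python
from collections import defaultdict


class BufferedWriter:
    """Toy model of an update cache (hypothetical; not the rrdcached implementation)."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.pending = defaultdict(list)  # file name -> buffered (timestamp, value)
        self.random_writes = 0            # incremented once per flush of a file

    def update(self, filename, timestamp, value):
        self.pending[filename].append((timestamp, value))
        if len(self.pending[filename]) >= self.flush_threshold:
            self.flush(filename)

    def flush(self, filename):
        if self.pending[filename]:
            # All buffered points for this file go to disk in one sequential write.
            self.random_writes += 1
            self.pending[filename] = []

    def flush_all(self):
        for filename in list(self.pending):
            self.flush(filename)


def simulate(flush_threshold, files=100, updates_per_file=12):
    """Feed 5-minute samples for `files` metrics and count the random writes."""
    writer = BufferedWriter(flush_threshold)
    for step in range(updates_per_file):
        for i in range(files):
            writer.update(f"host{i}.rrd", step * 300, 0.0)
    writer.flush_all()
    return writer.random_writes


unbuffered = simulate(flush_threshold=1)   # every update hits the disk
buffered = simulate(flush_threshold=12)    # one hour of samples per write
print(unbuffered, buffered)
```

In this model, an hour of 5-minute samples for 100 files costs 1200 random writes without buffering and 100 with it.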

For a good introduction to the internals of rrdtool/rrdcached updates and the problems with I/O scaling, look at the presentation by Sebastian Harl, How to Escape the I/O Hell.

Scaling problems

Most of today’s monitoring systems do not easily scale out. Scale-out, or scaling horizontally, is when you can add new nodes in response to increased load. Scaling up by replacing existing hardware with state-of-the-art hardware is expensive and only buys you limited time before the next, even more expensive, hardware upgrade becomes necessary. Many systems offer distributed polling but none offer the option of spreading out the disk load. For example, you can scale Zenoss for high availability but not for performance.

Loss of detail

Current RRD-based systems aggregate old data into averages in order to save storage space. Most technicians do not have the in-depth knowledge needed to tune the aggregation rules and leave the default values as is. Using Cacti as an example, the Cacti documentation shows that in a very short time, 2 months, data is averaged to a single data point PER DAY. For systems such as Internet backbones, where traffic varies a lot during a day from bottom (30% utilization, for example) to peak (90% utilization, for example), only the average of 60% is shown in the graphs. This in turn makes troubleshooting by comparing old data difficult. It makes trending based on peaks and bottoms impossible, and it may also lead to wrong or delayed strategic decisions on where to invest in added capacity.
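To make the loss concrete, here is a small sketch with a made-up traffic curve. The sinusoidal shape and the 5 minute sample interval are assumptions for illustration only; the 30%, 60% and 90% levels are the ones used in the example above.

```python
import math
from statistics import mean

SAMPLES_PER_DAY = 288  # one sample every 5 minutes

# Hypothetical backbone link: utilization swings between 30% and 90% over a day.
utilization = [
    60 + 30 * math.sin(2 * math.pi * i / SAMPLES_PER_DAY)
    for i in range(SAMPLES_PER_DAY)
]

daily_average = mean(utilization)  # the only value a one-point-per-day archive keeps
daily_peak = max(utilization)      # the value capacity planning actually needs
daily_bottom = min(utilization)

print(f"average={daily_average:.1f}% peak={daily_peak:.1f}% bottom={daily_bottom:.1f}%")
```

Once the day has been reduced to its average, the peak and the bottom cannot be recovered from the stored data.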

Lack of flexibility

In order to collect, store and graph new kinds of metrics an operator needs a certain level of programming skill and experience with the internals of the monitoring system. Adding new metrics can take anywhere from hours to weeks depending on the skill and experience of the operator. Creating new graphs based on existing metrics is also very difficult on most systems and not within reach for the average operator.

The monitoring revolution

We are currently at the beginning of a monitoring revolution. The advent of cloud computing and big data has created a need for measuring lots of metrics for thousands of machines at small intervals. This has sparked the creation of completely new monitoring components. One of the components where we now have improved alternatives is for efficient metric storage.

The first is OpenTSDB, a “Scalable, Distributed, Time Series Database” that began development at StumbleUpon in 2011 and aims at solving some of the problems with existing monitoring systems. OpenTSDB is built on top of Apache HBase, a scalable and performant database that in turn builds on top of Apache Hadoop. Hadoop is a series of tools for building large and scalable distributed systems. Back in 2010 Facebook already had 2000 machines in a Hadoop cluster with 21PB (that is 21.000.000 GB) of combined storage.

The second is an interesting newcomer, InfluxDB, that began development in 2013 and has the goal of offering scalability and performance without the requirements of HBase/Hadoop.

In addition to advances in performance these alternatives also decouple storage of metrics and display of graphs and abstract the interaction in simple and well-defined APIs. This makes it easy for developers to create improved frontends rapidly and this has already resulted in several very attractive open-source frontends such as Metrilyx (OpenTSDB), Grafana (InfluxDB, Graphite, soon OpenTSDB), StatusWolf (OpenTSDB), Influga (InfluxDB).

Setting up a single node OpenTSDB instance on Debian 7 Wheezy

In the rest of this paper we will set up a single node OpenTSDB instance. OpenTSDB builds on top of HBase and Hadoop and scales to very large setups easily. But it also delivers substantial performance on a single node, which can be deployed in less than an hour. There are plenty of guides on installing a Hadoop cluster, but here we will focus on the natural first step of getting a single node running using recent releases of the relevant software:

OpenTSDB 2.0.0 - Released 2014-05-05
HBase 0.98.2 - Released 2014-05-01
Hadoop 2.4.0 - Released 2014-04-07

If you later need to deploy a larger cluster, consider using a framework such as Cloudera CDH or Hortonworks HDP. These are open-source platforms that package Apache Hadoop components and provide a fully tested environment and easy-to-use graphical frontends for configuration and management. It is recommended to have at least 5 machines in an HBase cluster supporting OpenTSDB.

This guide assumes you are somewhat familiar with using a Linux shell/command prompt.

Hardware requirements

CPU cores: Max (Limit to 50% of your available CPU resources)
RAM: Min 16 GB
Disk 1 - OS: 10 GB - Thin provisioned
Disk 2 - Data: 100 GB - Thin provisioned

Operating system requirements

This guide is based on a recently installed Debian 7 Wheezy 64bit installed without any extra packages. See the official documentation for more information.

All commands are entered as root user unless otherwise noted.

Pre-setup preparations

We start by installing a few tools that we will need later.

apt-get install wget make gcc g++ cmake maven

Create a new ext3 partition on the data disk /dev/sdb:

(echo "n"; echo "p"; echo ""; echo ""; echo ""; echo "t"; echo "83"; echo "w") | fdisk /dev/sdb

mkfs.ext3 /dev/sdb1

ext3 is the recommended filesystem for Hadoop.

Create a mountpoint /mnt/data1 and add it to the file system table and mount the disk:

mkdir /mnt/data1
echo "/dev/sdb1     /mnt/data1    ext3    auto,noexec,noatime,nodiratime   0   1" | tee -a /etc/fstab
mount /mnt/data1

Using noexec for the data partition will increase security as nothing on the data partition will be allowed to ever execute.
Using noatime and nodiratime increases performance since the read access timestamps are not updated on every file access.

Installing java from packages

Installing Java on Linux can be quite challenging due to licensing issues, but thanks to the people over at Launchpad.net, who provide a repository with a custom Java package, this can now be done quite easily.

We start by adding the launchpad java repository to our /etc/apt/sources.list file:

echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list
echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list

Add the signing key and download information from the new repository:

apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
apt-get update

Run the java installer:

apt-get install oracle-java7-installer

Follow the instructions on screen to complete the Java 7 installation.

Installing HBase

OpenTSDB has its own HBase installation tutorial here. It is very brief and does not use the latest versions or snappy compression.

Download and unpack HBase:

cd /opt
wget http://apache.vianett.no/hbase/hbase-0.98.2/hbase-0.98.2-hadoop2-bin.tar.gz
tar xvfz hbase-0.98.2-hadoop2-bin.tar.gz
export HBASEDIR=`pwd`/hbase-0.98.2-hadoop2/

Increase the system-wide limitations of open files and processes from the default of 1000 to 32000 by adding a few lines to /etc/security/limits.conf:

echo "root    -       nofile  32768" | tee -a /etc/security/limits.conf
echo "root    -       nproc   32000" | tee -a /etc/security/limits.conf
echo "*       -       nofile  32768" | tee -a /etc/security/limits.conf
echo "*       -       nproc   32000" | tee -a /etc/security/limits.conf

The settings above will only take effect if we also add a line to /etc/pam.d/common-session:

echo "session required  pam_limits.so" | tee -a /etc/pam.d/common-session
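After logging in again, a quick way to check whether the new limits are in effect is to ask the kernel from a fresh process. The sketch below uses Python's standard resource module on Linux; the helper names and the target values (taken from the lines above) are ours.

```python
import resource

def current_limits():
    """Return the (soft, hard) limits of this process for open files and processes."""
    return {
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE),
        "nproc": resource.getrlimit(resource.RLIMIT_NPROC),
    }

def meets_target(limits, nofile=32768, nproc=32000):
    """True if both soft limits are unlimited or at least the configured values."""
    def ok(soft, target):
        return soft == resource.RLIM_INFINITY or soft >= target
    return ok(limits["nofile"][0], nofile) and ok(limits["nproc"][0], nproc)

if __name__ == "__main__":
    limits = current_limits()
    print(limits, "OK" if meets_target(limits) else "limits not applied yet")
```

The limits are applied per login session, so run the check from a new shell rather than the one used to edit the files.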

Install snappy

Snappy is a compression algorithm that values speed over compression ratio, and this makes it a good choice for high throughput applications such as Hadoop/HBase. Due to licensing issues Snappy does not ship with HBase and needs to be installed on top.

The installation process is a bit complicated and has caused headaches for many people (me included). Here we will show a method of installing snappy and getting it to work with the latest version of HBase and Hadoop.

Compression algorithms in HBase

Compression is the method of reducing the size of a file or text without losing any of the contents. There are many compression algorithms available; some focus on creating the smallest compressed file at the cost of time and CPU usage, while others achieve a reasonable compression ratio while being very fast.

Out of the box HBase supports gz (gzip/zlib), snappy and lzo. Only gz is included due to licensing issues. Unfortunately gz is a slow and costly algorithm compared to snappy and lzo. In a test performed by Yahoo (see slides here, page 8) gz achieves 64% compression in 32 seconds, lzo 47% in 4.8 seconds and snappy 42% in 4.0 seconds. lz4 is another algorithm considered for inclusion that is even faster (2.4 seconds) but requires much more memory.
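Put side by side, the quoted figures show the trade-off clearly. The sketch below only does arithmetic on the three results cited above; the function name is ours.

```python
# (compression achieved in %, seconds taken) as quoted from the Yahoo test.
RESULTS = {
    "gz": (64, 32.0),
    "lzo": (47, 4.8),
    "snappy": (42, 4.0),
}

def versus_gz(name):
    """Return (speedup over gz, share of gz's compression retained)."""
    gz_ratio, gz_seconds = RESULTS["gz"]
    ratio, seconds = RESULTS[name]
    return gz_seconds / seconds, ratio / gz_ratio

for name in RESULTS:
    speedup, retained = versus_gz(name)
    print(f"{name:7s} {speedup:4.1f}x the speed of gz, {retained:.0%} of its compression")
```

In this test snappy runs 8 times faster than gz while keeping roughly two thirds of the compression, which is the trade we want for a write-heavy store.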

For more information look at the Apache HBase Handbook - Appendix C - Compression

Building native libhadoop and libsnappy

In order to use compression we need the common Hadoop library, libhadoop.so, and the snappy library, libsnappy.so. HBase ships without libhadoop.so, and the libhadoop.so that ships in the Hadoop package is only for 32-bit operating systems, so we need to compile these files ourselves.

Start by downloading and installing ProtoBuf. Hadoop requires version 2.5+, which unfortunately is not available as a Debian package.

wget --no-check-certificate https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar zxvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure && make && make install
export LD_LIBRARY_PATH=/usr/local/lib/

Download and compile Hadoop:

apt-get install zlib1g-dev libsnappy-dev
wget http://apache.uib.no/hadoop/common/hadoop-2.4.0/hadoop-2.4.0-src.tar.gz
tar zxvf hadoop-2.4.0-src.tar.gz
cd hadoop-2.4.0-src/hadoop-common-project/
mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy

Copy the newly compiled native libhadoop library into /usr/local/lib, then create the folder in which HBase looks for it and create a symbolic link from there to /usr/local/lib/libhadoop.so:

cp hadoop-common/target/native/target/usr/local/lib/libhadoop.* /usr/local/lib
mkdir -p $HBASEDIR/lib/native/Linux-amd64-64/
cd $HBASEDIR/lib/native/Linux-amd64-64/
ln -s /usr/local/lib/libhadoop.so* .

Install snappy from Debian packages:

apt-get install libsnappy-dev

Configuring HBase

Now we need to do some basic configuration before we can start HBase. The configuration files are in $HBASEDIR/conf/.

conf/hbase-env.sh

A shell script setting various environment variables related to how HBase and Java should behave. The file contains a lot of options and they are all documented by comments so feel free to look around in it.

Start by setting the JAVA_HOME, which points to where Java is installed:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle/

Then increase the size of the Java heap from the default of 1000 MB, which is a bit low:

export HBASE_HEAPSIZE=8000

conf/hbase-site.xml

An XML file containing HBase specific configuration parameters.

<configuration>

  <property>
    <name>hbase.rootdir</name>
    <value>/mnt/data1/hbase</value>
  </property>

  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/mnt/data1/zookeeper</value>
  </property>

</configuration>
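A typo in this file is easy to miss, so it can be worth reading it back before starting HBase. The following is a minimal sketch using Python's standard XML parser; the helper name is ours and the embedded document simply mirrors the two properties above. In practice you would read $HBASEDIR/conf/hbase-site.xml instead.

```python
import xml.etree.ElementTree as ET

def read_hbase_site(xml_text):
    """Return the <property> entries of an hbase-site.xml document as a dict."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }

EXAMPLE = """
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>/mnt/data1/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/mnt/data1/zookeeper</value>
  </property>
</configuration>
"""

settings = read_hbase_site(EXAMPLE)
for name, value in settings.items():
    print(f"{name} = {value}")
```

A malformed file makes ET.fromstring raise a ParseError with the offending line and column, which is usually quicker to act on than a failed HBase start.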

Testing HBase and compression

Now that we have installed snappy and configured HBase we can verify that HBase is working and that the compression is loaded by doing:

$HBASEDIR/bin/hbase org.apache.hadoop.hbase.util.CompressionTest /tmp/test.txt snappy

This should output some lines with information and end with SUCCESS.

Starting HBase

HBase ships with scripts for starting and stopping it, namely start-hbase.sh and stop-hbase.sh. You start HBase with

$HBASEDIR/bin/start-hbase.sh

Then look at the log to ensure it has started without any serious errors:

tail -fn100 $HBASEDIR/bin/../logs/hbase-root-master-opentsdb.log

If you want HBase to start automatically on boot you can use a process management tool such as Monit or simply put it in /etc/rc.local:

/opt/hbase-0.98.2-hadoop2/bin/start-hbase.sh

Installing OpenTSDB

Start by installing gnuplot, which is used by the native webui to draw graphs:

apt-get install gnuplot

Then download and install OpenTSDB:

wget https://github.com/OpenTSDB/opentsdb/releases/download/v2.0.0/opentsdb-2.0.0_all.deb
dpkg -i opentsdb-2.0.0_all.deb

Configuring OpenTSDB

The configuration file is /etc/opentsdb/opentsdb.conf. It has some of the basic configuration parameters but not nearly all of them. Here is the official documentation with all configuration parameters.

The defaults are reasonable but we need to make a few tweaks, the first is to add this:

tsd.core.auto_create_metrics = true

This will make OpenTSDB accept previously unseen metrics and add them to the database. This is very useful in the beginning when feeding data into OpenTSDB. Without this you will have to use the command mkmetric for each metric you want to store, and you will get errors that might be hard to trace if the metric you create does not match what is actually sent.

Then we will add support for chunked requests via the HTTP API:

tsd.http.request.enable_chunked = true
tsd.http.request.max_chunk = 16000

Some tools and plugins (such as our own improved collectd to OpenTSDB plugin) send multiple data points in a single HTTP request for increased efficiency and require this setting to be enabled.
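To give an idea of what such a batched write looks like, here is a minimal sketch that packs two data points into one POST against the /api/put endpoint of the OpenTSDB 2.0 HTTP API. The helper name, the host, the metric values and the check against 16000 bytes are our assumptions; in particular we assume that max_chunk limits the size of the request body.

```python
import json
import time
import urllib.request

def build_put_request(points, host="localhost", port=4242):
    """Build a POST to /api/put carrying several data points in one request."""
    body = json.dumps(points).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/api/put",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

now = int(time.time())
points = [
    {"metric": "df.bytes.free", "timestamp": now, "value": 1024,
     "tags": {"host": "server1", "mount": "/mnt/data1"}},
    {"metric": "df.bytes.free", "timestamp": now, "value": 2048,
     "tags": {"host": "server2", "mount": "/mnt/data1"}},
]
request = build_put_request(points)
print(len(request.data), "bytes in one request")

# Uncomment once a TSD is listening on the given host and port:
# urllib.request.urlopen(request)
```

Whether a request is actually sent with chunked transfer encoding depends on the client library, so keeping the setting enabled avoids surprises when mixing collectors.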

Creating HBase tables

Before we start OpenTSDB we need to create the necessary tables in HBase:

env COMPRESSION=SNAPPY HBASE_HOME=$HBASEDIR /usr/share/opentsdb/tools/create_table.sh

Starting OpenTSDB

Since version 2.0.0 OpenTSDB ships as a Debian package and includes SysV init scripts. To start OpenTSDB as a daemon running in the background we run:

service opentsdb start

And then check the logs for any errors or other relevant information:

tail -f /var/log/opentsdb/opentsdb.log

If the server is started successfully the last line of the log should say:

13:42:30.900 INFO  [TSDMain.main] - Ready to serve on /0.0.0.0:4242

You can now open your new OpenTSDB instance in a browser at http://hostname:4242.

Feeding data into OpenTSDB

It is not within the scope of this paper to go into details about how to feed data into OpenTSDB but we will give a quick introduction here to get you started.

A note on metric naming in OpenTSDB

Each datapoint has a metric name such as df.bytes.free and one or more tags such as host=server1 and mount=/mnt/data1. This is closer to the proposed Metrics 2.0 standard for naming metrics than the traditional naming of df.bytes.free.server1.mnt-data. This makes it possible to create aggregates across tags and combine data easily using the tags.
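As an illustration, this is how the example datapoint above looks in the line-based put syntax that OpenTSDB accepts on its TSD port. The helper name and the timestamp and value are made up for the example.

```python
def put_line(metric, timestamp, value, tags):
    """Format one data point as: put <metric> <timestamp> <value> <tagk=tagv> ..."""
    tag_part = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_part}"

line = put_line(
    "df.bytes.free",
    1400000000,
    1024,
    {"host": "server1", "mount": "/mnt/data1"},
)
print(line)
# put df.bytes.free 1400000000 1024 host=server1 mount=/mnt/data1
```

With the traditional scheme the same information would be baked into a single name such as df.bytes.free.server1.mnt-data, and aggregating free space across all hosts would require matching on name patterns instead of simply leaving out the host tag.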

OpenTSDB stores the datapoints for a given metric and tag combination in one HBase row per hour. But due to an HBase issue it still has to scan every row that matches the metric, ignoring the tags, even though it will only return the data that also matches the tags. This results in a lot of data being read, and reads will be very slow if there is a large number of data points for a given metric. The default for the collectd-opentsdb plugin is to use the read plugin name as metric and other values as tags. In my case this results in 72.000.000 datapoints per hour for this metric. When generating a graph all of this data has to be read and evaluated before the graph can be drawn. 24 hours of data is over 1.7 billion datapoints for this single metric and results in read times of 5-15 minutes for a simple graph.

A solution to this is to use shift-to-metric, as mentioned in the OpenTSDB user guide. Shift-to-metric simply means moving one or more data identifiers from the tags to the metric name in order to reduce the cardinality (number of values) for a metric, and hence the time required to read out the data we want. We have modified the collectd-opentsdb java plugin to shift the tags to metrics, and this increases read performance by ~1000x, bringing read times down to 10-100ms. Read the section about collectd below for more information on our modified plugin.
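The idea can be sketched in a few lines. This is not the code of our modified plugin, only an illustration of the transformation; the function name, metric and tag names are hypothetical.

```python
def shift_to_metric(metric, tags, shift_keys):
    """Move the values of shift_keys from the tags into the metric name.

    Returns the new metric name and the tags that remain.
    """
    parts = [metric] + [tags[key] for key in shift_keys]
    remaining = {key: val for key, val in tags.items() if key not in shift_keys}
    return ".".join(parts), remaining

metric, tags = shift_to_metric(
    "net.bytes",
    {"host": "sw01", "interface": "eth0"},
    shift_keys=["host"],
)
print(metric, tags)
# net.bytes.sw01 {'interface': 'eth0'}
```

Because HBase rows are keyed by metric first, a query for net.bytes.sw01 only has to scan the rows of that one host instead of the rows of every host reporting net.bytes. The price is that aggregating across hosts now means querying several metrics.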

tcollector

tcollector is the default agent for collecting and sending data from a Linux server to an OpenTSDB server. It is based on Python, and plugins / addons can be written in any language. It ships with the most common plugins to collect information about disk usage and performance, cpu and memory statistics and also for some specific systems such as mysql, mongodb, riak, varnish, postgresql and others. tcollector is very lightweight and features advanced de-duplication in order to reduce unneeded network traffic.

The commands for installing dependencies and downloading tcollector are

aptitude install git python
cd /opt
git clone https://github.com/OpenTSDB/tcollector.git

Configuration is in the startup script tcollector/startstop. You will need to uncomment TSD_HOST and set its value to point to your OpenTSDB server.

To start it run

/opt/tcollector/startstop start

This is also the command you want to add to /etc/rc.local in order to have the agent automatically start at boot. Logfiles are saved in /var/log/tcollector.log and they are rotated automatically.

peritus-tc-tools

We have developed a set of tcollector plugins for collecting statistics from

ISC DHCPd server, about number of DHCP events and DHCP pool sizes
OpenSIPS, total number of subscribers and registered user agents
Atmail, number of users, admins, sent and received emails, logins and errors

As well as a high performance replacement for smokeping called tc-ping.

These plugins are available for download from our GitHub page.

collectd-opentsdb

collectd is the system statistics collection daemon and is a widely used system for collecting metrics from various sources. There are several options for sending data from collectd to OpenTSDB but one way that works well is to use the collectd-opentsdb java write plugin.

Since collectd is a generic metric collection tool the original collectd-opentsdb plugin will use the plugin name (such as snmp) as the metric, and use tags such as host=servername, plugin_instance=ifHCInOctets and type_instance=FastEthernet0/1.

As mentioned in the note on metric naming in OpenTSDB this can be very inefficient when data needs to be read again resulting in read performance potentially thousands of times slower than optimal (<100ms). To alleviate this we have modified the original collectd-opentsdb plugin to store all metadata as part of the metric. This gives metric names such as ifHCInBroadcastPkts.sw01.GigabitEthernet0 and very good read performance.

The modified collectd-opentsdb plugin can be downloaded from our GitHub repository.

Monitoring OpenTSDB

To monitor OpenTSDB itself install tcollector as described above on the OpenTSDB server and set TSD_HOST to localhost in /opt/tcollector/startstop.

You can then go to http://opentsdb-server:4242/#start=1h-ago&end=1s-ago&m=sum:rate:tsd.rpc.received%7Btype=*%7D&o=&yrange=%5B0:%5D&wxh=1200x600 to view a graph of amount of data received in the last hour.

Performance comparison

Lastly we include a little performance comparison between the latest version of OpenTSDB+HBase+Hadoop, a previous version of OpenTSDB+HBase+Hadoop that we have used for a while as well as rrdcached which ran in production for 4 years at a client.

The workload is gathering and storing metrics from 150 Cisco switches with 8200 ports/interfaces every 5 seconds. This equals about 15.000 points per second.
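For a feel of the volume, the quoted figures can be broken down as follows. The number of values per port is derived from the figures above and is not something we measured separately.

```python
SWITCH_PORTS = 8200
POLL_INTERVAL_S = 5
POINTS_PER_SECOND = 15_000

points_per_poll = POINTS_PER_SECOND * POLL_INTERVAL_S  # values gathered per 5 second cycle
values_per_port = points_per_poll / SWITCH_PORTS       # implied counters per port
points_per_day = POINTS_PER_SECOND * 86_400

print(points_per_poll, round(values_per_port, 1), points_per_day)
# 75000 9.1 1296000000
```

In other words the storage backend has to absorb roughly 1.3 billion data points per day.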

Figure 1 - Data received by OpenTSDB per second

Collection

Even though it is not the primary focus, we include some data about collection performance for completeness. Collection is done using the latest version of collectd and the builtin SNMP plugin.

NB #1: There is a memory leak in the way collectd’s SNMP plugin uses the underlying libsnmp library and you might need to schedule a restart of the collectd service as a workaround for that if handling large workloads.

NB #2: Due to limitations in the libnetsnmp library you will run into problems if polling many (1000+) devices with a single collectd instance. A workaround is to run multiple collectd instances with fewer hosts.

Figure 2 shows that collection through SNMP polling consumes about 2200 MHz. We optimized some of the data types and definitions in collectd when moving to OpenTSDB and achieved a 20% performance increase in the polling as seen in Figure 3.

Figure 2 - CPU Usage - SNMP polling and writing to RRDcached

Figure 3 - CPU Usage - SNMP polling and sending to OpenTSDB

Writing to the native rrdcached write plugin consumes 1300 MHz while our modified collectd-opentsdb plugin consumes 1450 MHz. It is probably possible to create a much more efficient write plugin with more advanced knowledge of concurrency and using a lower level language such as C.

Storage

When considering storage performance we will look at CPU usage and disk IOPS since these are the primary drivers of cost in today’s datacenters.

collectd + rrdcached

CPU usage - 1300 MHz, see Figure 2 above.

Figure 4 - Disk write IOPS - Fluctuating between 10 and 170 IOPS during the 1 hour flush period.

OpenTSDB + HBase 0.96 + Hadoop 1

Figure 5 - CPU usage - 1700 MHz baseline with peaks of 7000 MHz during Java Garbage Collection (GC) (untuned).

Figure 6 - Disk write IOPS - 5 IOPS average with peaks of 25 IOPS during Java GC. We also see that disk read IOPS are much higher and this is due to regular compaction of the database and can be tuned. Reads in general can be reduced by increasing caching with more RAM if necessary.

OpenTSDB + HBase 0.98 + Hadoop 2

Figure 7 - CPU usage - 1200 MHz baseline with peaks of 5000-6000 MHz during Java GC (untuned).

Figure 8 - Disk write IOPS - < 5 IOPS average with peaks of 25 IOPS during Java GC. Much less read IOPS during compaction compared to HBase 0.96.

Conclusion

Even without tuning, a single-instance OpenTSDB installation is able to handle significant amounts of data before running into I/O problems. This comes at a cost in CPU: currently OpenTSDB consumes > 300% of the CPU cycles that rrdcached uses for storage. But this is offset by an 85-95% reduction in disk load. In absolute terms for our particular setup (one 2 year old HP DL360p Gen8 running VMware vSphere 5.5) CPU usage increased from 15% to 25% while IOPS load dropped from 70% to < 10%.

Fine tuning of parameters (such as Java GC) as well as detailed analysis of memory usage is outside the scope of this brief paper and detailed information may be found elsewhere (51,52,53) for those interested.

Stian Ovrevage

Stian is a senior consultant and founder at Peritus Consulting AS. He is currently managing the technical systems for a small FTTH ISP in Norway. He also does consulting for other clients when time permits. When not digging deep into technical challenges he enjoys the outdoors.

Also on [GitHub][59], [LinkedIn][58], [Facebook][56], [Google+][57] and [Twitter][55].

Peritus Consulting Technical Reports

Technical reports are in-depth articles aimed at giving actionable advice on new technologies as well as recommended best practices based on tried and true solutions. We cover areas that lack good in-depth coverage online but will not re-write topics that are already covered in a satisfactory way elsewhere. We also write tech notes, which are shorter pieces with thoughts and tips on both technology and the way technology should be used optimally. Our official webpage (in Norwegian) is at [www.peritusconsulting.no][60], articles are published on our [GitHub Page][61], and we are also on [Twitter][62], [LinkedIn][64] and [Facebook][63].
