As part of my never-ending quest to improve how I build cool things, I've been working for some time on building out infrastructure to help automate and monitor how my apps and servers are doing. I've written about horizontal scaling before[1], but today I'd like to get into one specific facet of its implementation:
automated network discovery, and how we use it at FarmGeek[2] to build reliable applications.
So let's say you have a few servers - a load balancer, two application servers and a database server, for example. Everything's working fine until BAM, one of your application servers crashes. To make things worse, nobody finds out about it, because there's no alerting in place. Your HAProxy checks do work, though, so the dead node quietly leaves the connection pool as expected.
Your server capacity just silently halved, without any notifications and with no way of recovering from the problem. That's not good.
There are a bunch of problems with the "standard" setup being described here:
- There's no way of knowing what resources are available across the servers that are currently running - every server suffers from a kind of "network blindness".
- HAProxy's checks fail silently.
- There's no way of handling IP changes or new servers without manually editing HAProxy's config.
Using Consul, and with some help from Diplomat and Envoy, we aim to fix all three of these issues.
The first problem on this list can be solved with the help of a handy little idea known as Automated Service Discovery. One such implementation is Consul[1] by the lovely fellows at HashiCorp[2], which is our weapon of choice at FarmGeek[3].
There are three core things Consul can do which help us:
- It provides a distributed Key-Value store which allows us to persist configuration data across a network, thus allowing our services to become more portable and easier to run in parallel - as they can share configuration data between each other without relying on a datastore being present.
- It provides a DNS service for services on the network which allows our servers to become more "Network Aware" with almost zero extra work. The DNS service also doubles as a simple Load Balancer.
- It provides health checks against those services, and will remove them from the DNS pool if they begin to fail.
Of course, Consul does a heap of other things for us, but we'll focus on these three today as they're the most relevant to solving our problem.
I'm not going to go over installing Consul here, as there's a brilliant tutorial on Consul.io[4], but I will explain services, as they're the key to how we achieve a fully distributed system.
A Service is defined in Consul with (you guessed it) a Service Definition. A Service Definition outlines what kind of service we're describing, which port it listens on, and what we have to run to check its health. I recommend at least running service checks on the database and the application instances. You can write the check however you want (bash script, ruby script, etc.) - the main stipulation is that it exits with a non-zero status when the service is in a less-than-perfect state. This allows Consul to decide whether a service is unhealthy, which in turn allows Consul to remove dodgy services from the pool of connections.
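To make that concrete, here's a minimal sketch of what a service definition might look like - the service name, port, check script path and interval are all placeholder values for illustration, and the exact shape of the `check` block depends on your Consul version:

{
  "service": {
    "name": "postgres",
    "port": 5432,
    "check": {
      "script": "/usr/local/bin/check_postgres.sh",
      "interval": "10s"
    }
  }
}

Drop a file like this into the agent's configuration directory (e.g. /etc/consul.d/) and Consul will register the service and run the check on the given interval.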
Another important point is how Consul's DNS API works. Yes - Consul has a DNS API. The way it works is simple: you send it a specially crafted domain to resolve and it hands you back the IP of a healthy node for that service, picked at random. It can even give you a more detailed answer - including the service's port - if you ask for the SRV record. Very cool. But the question is, how do you get your app (or any tool for that matter) to send DNS requests to Consul? At FarmGeek, we're using Dnsmasq to achieve this. All you need to do is install Consul using their guide, install Dnsmasq, and then create a `/etc/dnsmasq.d/10-consul` file with the following contents:
server=/consul/127.0.0.1#8600
Restart Dnsmasq and you'll be able to resolve Consul's *.consul domains without breaking your regular DNS resolution. Simple!
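If you want to sanity-check the setup, you can query the resolvers directly - the service name here is just an example:

# ask Consul directly on its own DNS port
dig @127.0.0.1 -p 8600 postgres.service.consul SRV

# or, once Dnsmasq is forwarding *.consul, go through the normal resolver
dig postgres.service.consul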
Consul allows our servers to talk to one another and to check on the services running on them, but how do our apps talk to Consul? Consul has a DNS and an HTTP API for us to use, and Diplomat[1] is a lightweight Ruby wrapper for the HTTP API. At FarmGeek, we use it to store basic configuration data amongst our servers that we'd traditionally provide via environment variables.
To use Diplomat, simply add it to your Gemfile, then use Diplomat's static methods anywhere you'd like to get or set key-value data.
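For instance, seeding the key-value store for the database example below might look something like this. The key names and values are placeholders; `Diplomat.get` is the shortcut used in Diplomat's README, and I'm assuming the matching `Diplomat.put` shortcut here:

require 'diplomat'

# Write a few configuration values into Consul's KV store...
Diplomat.put('project/db/name', 'myapp_production')
Diplomat.put('project/db/user', 'myapp')
Diplomat.put('project/db/pass', 'supersecret')

# ...and read one back from any server on the network.
Diplomat.get('project/db/name') # => "myapp_production"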
An example use-case would be configuring Rails' database connection. The example used in the README looks like this:
<% if Rails.env.production? %>
production:
  adapter: postgresql
  encoding: unicode
  host: <%= Diplomat::Service.get('postgres').Address %>
  database: <%= Diplomat.get('project/db/name') %>
  pool: 5
  username: <%= Diplomat.get('project/db/user') %>
  password: <%= Diplomat.get('project/db/pass') %>
  port: <%= Diplomat::Service.get('postgres').ServicePort %>
<% end %>
However, since we have DNS resolution working now, we could instead have Consul balance our database connections by setting the host to `postgres.service.consul` - if we have more than one postgres service available on the network, we'll be handed one of them at random automatically.
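In other words, the same production block could be simplified to something along these lines - a sketch, reusing the key names from the README example above:

production:
  adapter: postgresql
  encoding: unicode
  # Consul's DNS hands back the address of a healthy postgres node at random
  host: postgres.service.consul
  port: 5432 # with plain A-record lookups you rely on a well-known port
  database: <%= Diplomat.get('project/db/name') %>
  pool: 5
  username: <%= Diplomat.get('project/db/user') %>
  password: <%= Diplomat.get('project/db/pass') %>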
At this point our servers are aware of one another, our services are aware of one another, and our apps are able to share configuration. The final step is to connect our apps to our services. Usually this is straightforward. In the case of HAProxy, however, it's a bit more tricky.
So we came up with Envoy[2], a really simple NodeJS script that FarmGeek have released on Github under the MIT license to connect HAProxy to Consul. It's designed to be very hackable and lightweight, and it should run on each HAProxy server.
Envoy will reload your config simply by calling `service haproxy reload`, so it may require sudo.
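If you'd rather not run Envoy as root, one option is a narrow sudoers rule - the `envoy` user below is hypothetical, so substitute whichever account the script runs under:

# /etc/sudoers.d/envoy
# Allow the (hypothetical) envoy user to reload HAProxy and nothing else
envoy ALL=(root) NOPASSWD: /usr/sbin/service haproxy reload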
To use Envoy, clone the repository onto your server, add an HAProxy template based on the sample one in the repository, and run it (as a service, preferably). Envoy will periodically poll Consul for changes, and if it finds any, it'll rewrite your HAProxy config and reload it. Simple! I've outlined an example configuration below to explain what Envoy does:
global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    chroot /var/lib/haproxy
    daemon
    maxconn 4096
    stats timeout 30s
    stats socket /tmp/haproxy.status.sock mode 660 level admin
    user haproxy
    group haproxy
    # Default ciphers to use on SSL-enabled listening sockets.
    # For more information, see ciphers(1SSL).
    ssl-default-bind-ciphers RC4-SHA:AES128-SHA:AES256-SHA

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option redispatch
    retries 3
    maxconn 2000
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

listen stats :1234
    mode http
    stats enable
    stats uri /
    stats refresh 2s
    stats realm Haproxy\ Stats
    stats auth username:password

frontend incoming
    bind *:80
    reqadd X-Forwarded-Proto:\ http
    mode http
    acl api hdr_dom(host) -i api.farmer.io
    acl web hdr_dom(host) -i farmer.io
    <% if (services.indexOf('api') > -1) { %>
    use_backend api if api
    <% } %>
    <% if (services.indexOf('web') > -1) { %>
    use_backend web if web
    <% } %>

frontend incoming_ssl
    bind *:443 ssl crt /etc/ssl/ssl_certification.crt no-sslv3 ciphers RC4-SHA:AES128-SHA:AES256-SHA
    reqadd X-Forwarded-Proto:\ https
    mode http
    acl api hdr_dom(host) -i api.farmer.io
    acl web hdr_dom(host) -i farmer.io
    <% if (services.indexOf('api') > -1) { %>
    use_backend api if api
    <% } %>
    <% if (services.indexOf('web') > -1) { %>
    use_backend web if web
    <% } %>

<% services.forEach(function(service) { %>
backend <%= service %>
    # Redirect to https if it's available
    redirect scheme https if !{ ssl_fc }
    # Data is proxied in http mode (not tcp mode)
    mode http
    <% backends[service].forEach(function(node) { %>
    server <%= node['node'] + ' ' + node['ip'] + ':' + node['port'] %>
    <% }); %>
<% }); %>
I won't go over how HAProxy works, as there are plenty of guides on the internet for that, but let's dive into the areas which aren't "standard" compared to most configs:
frontend incoming
    bind *:80
    reqadd X-Forwarded-Proto:\ http
    mode http
    acl api hdr_dom(host) -i api.farmer.io
    acl web hdr_dom(host) -i farmer.io
    <% if (services.indexOf('api') > -1) { %>
    use_backend api if api
    <% } %>
    <% if (services.indexOf('web') > -1) { %>
    use_backend web if web
    <% } %>
Line 5 - `acl api hdr_dom(host) -i api.farmer.io` - uses HAProxy's access control list system to set the "api" flag if the incoming traffic is requesting the hostname `api.farmer.io`. On line 8, we then use that flag to decide whether to route to the api backend. However, we can only do that if Consul actually knows about a backend of the same name, so on line 7 we check that Consul has a matching service before we emit the use_backend rule.
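For example, if Consul only knows about the api service when Envoy renders the template, the generated frontend would come out roughly like this - the web use_backend line is simply omitted:

frontend incoming
    bind *:80
    reqadd X-Forwarded-Proto:\ http
    mode http
    acl api hdr_dom(host) -i api.farmer.io
    acl web hdr_dom(host) -i farmer.io
    use_backend api if api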
<% services.forEach(function(service) { %>
backend <%= service %>
    # Redirect to https if it's available
    redirect scheme https if !{ ssl_fc }
    # Data is proxied in http mode (not tcp mode)
    mode http
    <% backends[service].forEach(function(node) { %>
    server <%= node['node'] + ' ' + node['ip'] + ':' + node['port'] %>
    <% }); %>
<% }); %>
In this segment, we're taking all the services that Envoy has found through Consul and spitting them out as backend definitions. Part of this includes spitting out every healthy node attached to each service, which can be seen in lines 7-9.
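Rendered against a hypothetical api service with two healthy nodes, that loop would produce something like this - the node names, IPs and ports are made up:

backend api
    # Redirect to https if it's available
    redirect scheme https if !{ ssl_fc }
    # Data is proxied in http mode (not tcp mode)
    mode http
    server app-01 10.0.0.11:8080
    server app-02 10.0.0.12:8080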
Now that we've connected up our systems, we've made a great stride towards building a more fault-tolerant system.