Hosting and Maintenance¶

This document gives an overview of the important concepts, components and tools we use for hosting and maintenance. As you will see, one of our key areas of expertise is automated monitoring, which we use to keep our systems stable.

This document is related to the ALM document, which follows similar philosophies and principles.

Overview¶

[Figure: Overview]

The following sections describe our environment in more detail.

Infrastructure as Code¶

This is one of our key concepts. All the code for creating our infrastructure and tools is stored as Ansible scripts in our GitLab. These scripts can be executed at any time through jobs in a build pipeline. Infrastructure as Code is also a key part of our recovery concept, described later in this document.
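As an illustration of how a pipeline job might invoke such a script, the following minimal sketch builds the command line for an `ansible-playbook` run. The playbook name `site.yml`, the inventory path and the host name are hypothetical; only the `-i`, `--limit` and `--check` options are standard Ansible flags.

```python
from typing import List, Optional


def build_playbook_command(
    playbook: str,
    inventory: str,
    limit: Optional[str] = None,
    check: bool = False,
) -> List[str]:
    """Build the argument list for an ansible-playbook run."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if limit:
        # Restrict the run to a single host or group.
        cmd += ["--limit", limit]
    if check:
        # Dry run: report what would change without changing it.
        cmd.append("--check")
    return cmd


# Hypothetical names, for illustration only.
command = build_playbook_command("site.yml", "inventory/production", limit="web01")
```

A pipeline job could then pass the resulting list to `subprocess.run(command, check=True)`, so that a failed playbook run also fails the job.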

Live Monitoring¶

By using sensors and heartbeats, we can monitor a host by checking:

If the host is running
CPU
Memory Usage
Disk I/O
Network
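The "is the host running" check can be pictured as a heartbeat timeout: a host that has not reported within a given window is treated as down. The sketch below is an assumption about how such a check might look, not our actual implementation; the 60-second timeout and the host names are made up.

```python
from typing import Dict, List


def stale_hosts(last_seen: Dict[str, float], now: float, timeout: float = 60.0) -> List[str]:
    """Return the hosts whose last heartbeat is older than `timeout` seconds."""
    return sorted(host for host, ts in last_seen.items() if now - ts > timeout)


# Timestamps are seconds since an arbitrary start; values are illustrative.
heartbeats = {"web01": 100.0, "db01": 30.0}
down = stale_hosts(heartbeats, now=120.0)
```

Here `db01` last reported 90 seconds ago and is flagged, while `web01` reported 20 seconds ago and is considered alive.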

We use Netdata, which monitors several thousand metrics every second, e.g. for:

MySQL
Docker
Apache
etc.
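Netdata exposes the collected metrics over a REST API. As a rough sketch, the helper below builds a query URL for the last minute of one chart; it assumes the v1 data endpoint is reachable on Netdata's default port 19999, and the host `localhost` and chart `system.cpu` are only examples.

```python
from urllib.parse import urlencode


def netdata_data_url(base_url: str, chart: str, seconds: int = 60) -> str:
    """Build a Netdata v1 data query for the last `seconds` seconds of a chart."""
    params = {
        "chart": chart,
        "after": -seconds,  # negative values are relative to "now"
        "format": "json",
    }
    return f"{base_url.rstrip('/')}/api/v1/data?{urlencode(params)}"


url = netdata_data_url("http://localhost:19999", "system.cpu")
```

The resulting URL can be fetched with any HTTP client, for example `curl` or `urllib.request.urlopen`.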

Extensive Logging¶

As you can see in the picture above, we collect the logs from every host and from every application in use. Because this produces a huge amount of data, we use Elasticsearch for the analysis. The analysis can also raise an alert, which is sent to our alerting system described below.
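To make the step from log analysis to alert concrete, here is a simplified sketch of a frequency-style rule: if a host logs at least a given number of errors in the analysed batch, an alert is produced. In our setup this work is done on top of Elasticsearch rather than in plain Python; the field names `host` and `level`, the event name and the threshold are assumptions chosen for the example.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def error_alerts(records: Iterable[Dict[str, str]], threshold: int = 3) -> List[Dict[str, Any]]:
    """Create one alert per host that logged at least `threshold` errors."""
    errors = Counter(
        r.get("host", "unknown") for r in records if r.get("level") == "ERROR"
    )
    return [
        {"resource": host, "event": "HighErrorRate", "count": count}
        for host, count in sorted(errors.items())
        if count >= threshold
    ]


# Illustrative log records.
logs = [
    {"host": "web01", "level": "ERROR"},
    {"host": "web01", "level": "ERROR"},
    {"host": "web01", "level": "ERROR"},
    {"host": "db01", "level": "ERROR"},
    {"host": "db01", "level": "INFO"},
]
alerts = error_alerts(logs)
```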

High-Quality Analysis¶

With Netdata and logging in place, we can provide a comprehensive analysis of all our components. This lets us trace problems back to their cause and even prevent problems that might appear in the future.

Flexible Distribution of Alerts¶

Netdata can distribute specific alerts to several channels.

Channels supported by Netdata:

E-mails (using the sendmail command)
Push notifications to your mobile phone (pushover.net)
Messages to your Slack team (slack.com)
Messages to your Alerta server
Messages to your Flock team
Messages to your Discord guild
Messages to your Telegram chat or group chat
SMS messages to your cell phone or any SMS-enabled device (Twilio)
SMS messages to your cell phone or any SMS-enabled device (messagebird.com)
SMS messages to your cell phone or any SMS-enabled device (smstools3)
Notifications to users on pagerduty.com
Push notifications to iOS devices (via prowlapp.com)
Notifications to Amazon SNS topics (aws.amazon.com)
Messages to your IRC channel on your selected network
Messages to a local or remote syslog daemon
Messages to Microsoft Teams (through webhook)
Messages to Rocket.Chat (through webhook)
Messages to Google Hangouts Chat (through webhook)
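Which alert goes to which channel is a routing decision. Netdata handles this through its own notification configuration; the sketch below only illustrates the idea with a hypothetical severity-to-channel table and is not a representation of that configuration.

```python
from typing import Dict, List

# Hypothetical routing table: alert severity -> notification channels.
ROUTES: Dict[str, List[str]] = {
    "critical": ["pagerduty", "sms", "email"],
    "warning": ["slack", "email"],
}
DEFAULT_CHANNELS: List[str] = ["email"]


def channels_for(severity: str) -> List[str]:
    """Return the channels an alert of the given severity is sent to."""
    return ROUTES.get(severity.lower(), DEFAULT_CHANNELS)


critical_channels = channels_for("CRITICAL")
```

Unknown severities fall back to e-mail, so no alert is silently dropped.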

Fallback Scenario¶

In case the alerting system itself has an outage, we have an external fallback platform called Matrix. All alerts are sent to this platform as well.
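The fallback works because every alert is sent to both platforms, not only to the second one after the first has failed. A minimal sketch of that fan-out, with the sender functions left as hypothetical callables, could look like this:

```python
from typing import Callable, Dict, List

Alert = Dict[str, str]
Sender = Callable[[Alert], None]


def send_everywhere(alert: Alert, senders: Dict[str, Sender]) -> List[str]:
    """Send the alert to every platform; return the names that accepted it."""
    delivered = []
    for name, send in senders.items():
        try:
            send(alert)
        except Exception:
            # An outage of one platform must not block the others.
            continue
        delivered.append(name)
    return delivered
```

If the primary alerting system raises an error, the alert still reaches the fallback platform, and the returned list shows where it actually arrived.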

Alert Deduplication¶

As seen in the picture, this component has four incoming channels:

Result of the Log Analysis
Host sensors
Host heartbeats
Applications

This component is critical because it acts as a single pool of alerts for the DevOps staff. We want to avoid duplicates so that we get a clear overview of what has happened.
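The principle behind deduplication can be sketched as follows: alerts that describe the same event on the same resource are merged into one entry with a counter. The key `(resource, event)` and the field name `duplicate_count` are assumptions for this example; the real component may use a different key.

```python
from typing import Any, Dict, Iterable, List, Tuple


def deduplicate(alerts: Iterable[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Merge alerts with the same resource and event into a single entry."""
    merged: Dict[Tuple[str, str], Dict[str, Any]] = {}
    for alert in alerts:
        key = (alert["resource"], alert["event"])
        if key in merged:
            # Same problem reported again: count it instead of listing it twice.
            merged[key]["duplicate_count"] += 1
        else:
            merged[key] = {**alert, "duplicate_count": 0}
    return list(merged.values())


incoming = [
    {"resource": "web01", "event": "HostDown", "source": "heartbeat"},
    {"resource": "web01", "event": "HostDown", "source": "sensor"},
    {"resource": "db01", "event": "DiskFull", "source": "sensor"},
]
pool = deduplicate(incoming)
```

The two `HostDown` reports for `web01` end up as one entry, so the DevOps staff sees two problems instead of three messages.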

Automatic Ticket Creation¶

Our alert deduplication component automatically creates tickets in GitLab when something goes wrong. This allows us to react to a potential problem as quickly as possible.
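Creating a ticket from an alert boils down to one call to the GitLab issues API. The sketch below only builds the HTTP request and does not send it; the GitLab URL, project path, token and the alert fields `severity`, `event`, `resource` and `text` are placeholders, and the way our deduplication component actually triggers this call is not shown.

```python
import json
import urllib.request
from typing import Dict
from urllib.parse import quote


def build_issue_request(
    gitlab_url: str, project: str, token: str, alert: Dict[str, str]
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a GitLab issue for an alert."""
    # A project path such as "group/name" must be URL-encoded in the API path.
    project_id = quote(project, safe="")
    url = f"{gitlab_url.rstrip('/')}/api/v4/projects/{project_id}/issues"
    payload = {
        "title": f"[{alert['severity']}] {alert['event']} on {alert['resource']}",
        "description": alert.get("text", ""),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values, for illustration only.
request = build_issue_request(
    "https://gitlab.example.com",
    "ops/hosting",
    "secret-token",
    {"severity": "critical", "event": "HostDown", "resource": "web01"},
)
```

Passing the request to `urllib.request.urlopen` would create the issue.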

CI / CD Pipelines and Communication¶

The tasks defined in the pipelines can be triggered from various sources, such as:

Schedule, e.g. running every night
Action of a developer in Git (our VCS)
Command from the communication platform (we use the ChatOps feature in Mattermost)

Each task is developed once, in line with our goal of DRY (don't repeat yourself), and can then be used by any of these triggers.
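The DRY idea can be sketched as a small task registry: a task is defined once and every trigger only looks it up by name. The task name `deploy`, the host name and the chat message format are invented for this example; the real triggers are GitLab schedules, Git events and Mattermost ChatOps commands.

```python
from typing import Callable, Dict

TASKS: Dict[str, Callable[[str], str]] = {}


def task(name: str):
    """Register a function as a pipeline task under the given name."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        TASKS[name] = func
        return func
    return register


@task("deploy")
def deploy(host: str) -> str:
    # The task body exists exactly once, whatever triggers it.
    return f"deploy started on {host}"


def on_schedule(host: str) -> str:
    """Trigger: nightly schedule."""
    return TASKS["deploy"](host)


def on_chat_command(message: str) -> str:
    """Trigger: chat command of the form '<task> <host>'."""
    name, host = message.split()
    return TASKS[name](host)
```

Both triggers end up in the same function, so a fix to the task benefits every trigger at once.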

Result of CI / CD Pipelines¶

The pipelines have three main goals:

Provisioning: Covers all the tools of the host
Maintenance: Covers the configuration of the applications
Deployment: Covers the installation and updates of the applications

Manage Configuration in Repository¶

The configuration of a host is managed in Git repositories. This lets us easily track changes and recover the system. This concept is explained in the next chapter.

Recovery¶

The following components are backed up according to the 3-2-1 rule (three copies of the data, on two different types of media, with one copy off-site) and can be recovered.

[Figure: Recovery]

Note: The recovery duration for the DB and User Data cannot be stated in advance, because it depends on the amount of data and the network speed.
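The 3-2-1 approach to backups is commonly stated as: at least three copies of the data, on at least two different types of media, with at least one copy off-site. A backup plan can be checked against this with a few lines of code; the locations and media in the example are invented and do not describe our actual backup targets.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupCopy:
    location: str
    medium: str
    offsite: bool


def satisfies_3_2_1(copies: Iterable[BackupCopy]) -> bool:
    """Check for three copies, two kinds of media and one off-site copy."""
    copies = list(copies)
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


# Invented example plan.
plan = [
    BackupCopy("production host", "ssd", offsite=False),
    BackupCopy("local backup server", "hdd", offsite=False),
    BackupCopy("remote repository", "hdd", offsite=True),
]
plan_ok = satisfies_3_2_1(plan)
```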

Tools¶

Function	Tool
Repo / Versioning	GitLab
Issues	GitLab
CI/CD Pipelines	GitLab
Communication	Mattermost
Provisioning / Deployment / Maintenance	Ansible
Alert Deduplication	Alerta
Sensors	Netdata and Beats
Heartbeats	curl
Log Collection	TD-Agent
Log Aggregation	Elasticsearch
Log Visualization	Kibana
Log Analysis	ElastAlert
Ping	Netdata API
Backup	Borg Backup