Hosting and Maintenance

This document gives an overview of the important concepts, components and tools in use. As you will see, one of our key areas of expertise is automated monitoring, which we use to keep our systems stable.

This document is related to the ALM document, which follows similar philosophies and principles.

Overview

[Figure: Overview]

The following features describe our environment in more detail.

Infrastructure as Code

This is one of our key concepts. All the code for creating our infrastructure and tools is stored as Ansible scripts in our GitLab. These scripts can be executed at any time through jobs in a build pipeline. This approach is also a key part of our recovery concept, described later in this document.

Live Monitoring

By using sensors and heartbeats, we can monitor a host by checking:

  • If the host is running
  • CPU
  • Memory Usage
  • Disk I/O
  • Network
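The heartbeat idea can be sketched as follows. This is a minimal illustration, not our actual setup: it assumes an alerting endpoint that accepts a JSON POST at /heartbeat with origin, tags and timeout fields and an API key in the Authorization header. The URL, host name and key are placeholders. The tools table lists curl for heartbeats; Python is used here only to make the request structure explicit.

```python
import json
import urllib.request


def build_heartbeat_request(base_url, origin, api_key, timeout_s=120, tags=()):
    """Build (but do not send) a heartbeat POST request.

    Assumption: the receiver expects JSON with 'origin', 'tags' and
    'timeout' and treats a missing heartbeat after 'timeout' seconds
    as a sign that the host is down.
    """
    payload = {"origin": origin, "tags": list(tags), "timeout": timeout_s}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/heartbeat",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Key " + api_key,  # assumed auth scheme
        },
        method="POST",
    )


def send_heartbeat(request, timeout_s=10):
    """Send the prepared request; return True on a 2xx response."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return 200 <= response.status < 300
```

A scheduler such as cron would call send_heartbeat at a fixed interval that is shorter than the configured timeout, so that a single delayed run does not raise an alert.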

We use Netdata to monitor several thousand metrics every second, covering for example:

  • MySQL
  • Docker
  • Apache
  • etc.

Extensive Logging

As you can see in the picture above, we collect the logs from every host and from every application in use. This is a huge amount of data, so we use Elasticsearch for analysis. This analysis can also result in an alert, which is sent to our alerting system described below.
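How a log analysis can turn into an alert is easiest to see with a frequency rule: if enough matching log lines arrive within a time window, an alert is raised. The sketch below is a simplified, in-memory illustration of that idea. It is not ElastAlert's implementation or rule syntax, and the class name, the "ERROR" match and the alert fields are assumptions made for the example.

```python
from collections import deque
from datetime import datetime, timedelta


class FrequencyRule:
    """Raise an alert when 'threshold' error lines occur within 'window'."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self._hits = deque()  # timestamps of recent matching lines

    def observe(self, timestamp, message):
        """Record one log line; return an alert dict or None."""
        if "ERROR" not in message:
            return None
        self._hits.append(timestamp)
        # Drop matches that have fallen out of the sliding window.
        while self._hits and timestamp - self._hits[0] > self.window:
            self._hits.popleft()
        if len(self._hits) >= self.threshold:
            count = len(self._hits)
            self._hits.clear()  # start over so one burst yields one alert
            return {"event": "error-burst", "count": count,
                    "at": timestamp.isoformat()}
        return None
```

In a real deployment the log lines would come from a query against the log store rather than from direct calls to observe.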

High-Quality Analysis

With Netdata and logging in place, we can provide a comprehensive analysis of all our components. This allows us to trace problems back to their cause and even to prevent problems that might otherwise appear in the future.

Flexible Distribution of Alerts

Netdata can distribute specific alerts to several channels.

Channels supported by Netdata:

  • E-mails (using the sendmail command)
  • Push notifications to your mobile phone (pushover.net)
  • Messages to your Slack team (slack.com)
  • Messages to your Alerta server
  • Messages to your Flock team
  • Messages to your Discord guild
  • Messages to your Telegram chat or group chat
  • SMS messages to your cell phone or any SMS-enabled device (Twilio)
  • SMS messages to your cell phone or any SMS-enabled device (messagebird.com)
  • SMS messages to your cell phone or any SMS-enabled device (smstools3)
  • Notifications to users on pagerduty.com
  • Push notifications to iOS devices (via prowlapp.com)
  • Notifications to Amazon SNS topics (aws.amazon.com)
  • Messages to your IRC channel on your selected network
  • Messages to a local or remote syslog daemon
  • Messages to Microsoft Teams (through webhook)
  • Messages to Rocket.Chat (through webhook)
  • Messages to Google Hangouts Chat (through webhook)

Fallback Scenario

In case of an outage of the alerting system, we have an external fallback platform, called Matrix. All alerts will be sent to this platform as well.
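Sending every alert to the fallback platform as well only helps if a failure of one channel does not stop delivery to the others. The sketch below illustrates that property under simple assumptions: each channel is a hypothetical callable that takes the alert and raises an exception on failure. The channel names are placeholders, and no real client library is used.

```python
def deliver(alert, channels):
    """Send 'alert' to every channel; return the names of channels that failed.

    'channels' maps a channel name to a callable that accepts the alert.
    A failing channel is recorded and skipped, so the remaining channels
    (for example the fallback platform) still receive the alert.
    """
    failed = []
    for name, send in channels.items():
        try:
            send(alert)
        except Exception:
            failed.append(name)
    return failed
```

The returned list can itself be used to raise a follow-up alert about the broken channel.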

Alert Deduplication

As seen in the picture, this component has four incoming channels:

  • Result of the Log Analysis
  • Host sensors
  • Host heartbeats
  • Applications

This component is critical because it acts as a single pool of alerts for the DevOps staff. We want to avoid duplicates so that we get a clear overview of what has happened.
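The principle of deduplication can be sketched in a few lines. This is a simplified illustration, not the logic of the deduplication tool itself: it assumes that two alerts describe the same problem when they share the same resource and event, regardless of which incoming channel reported them. The class and field names are chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    resource: str   # e.g. host or application name
    event: str      # e.g. "disk-full"
    severity: str
    source: str     # incoming channel, e.g. "sensor" or "log-analysis"
    duplicate_count: int = 0


class AlertPool:
    """Single pool of open alerts, keyed by (resource, event)."""

    def __init__(self):
        self._open = {}

    def receive(self, alert):
        """Store a new alert or fold it into an existing one.

        Returns (stored_alert, is_new).
        """
        key = (alert.resource, alert.event)
        existing = self._open.get(key)
        if existing is None:
            self._open[key] = alert
            return alert, True
        existing.duplicate_count += 1
        existing.severity = alert.severity  # keep the latest severity
        return existing, False

    def open_alerts(self):
        return list(self._open.values())
```

With this keying, a full disk reported by both a host sensor and the log analysis appears once in the pool, with a duplicate count of 1.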

Automatic Ticket Creation

Our alert deduplication component creates tickets in GitLab automatically when something goes wrong. This allows us to react to a potential problem as quickly as possible.
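As an illustration of what such a ticket request can look like, the sketch below prepares a call to the GitLab REST API endpoint for creating issues (POST /api/v4/projects/:id/issues, authenticated with a PRIVATE-TOKEN header). The GitLab URL, project ID, token, the alert fields and the "alert" label are placeholders, and how our component actually builds its tickets is not shown here.

```python
import json
import urllib.request


def build_issue_request(gitlab_url, project_id, token, alert):
    """Build (but do not send) a GitLab 'create issue' request for an alert.

    'alert' is a hypothetical dict with 'severity', 'event', 'resource'
    and an optional 'text' field.
    """
    body = {
        "title": f"[{alert['severity']}] {alert['event']} on {alert['resource']}",
        "description": alert.get("text", ""),
        "labels": "alert",  # placeholder label
    }
    return urllib.request.Request(
        url=f"{gitlab_url.rstrip('/')}/api/v4/projects/{project_id}/issues",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "PRIVATE-TOKEN": token},
        method="POST",
    )
```

The prepared request can be sent with urllib.request.urlopen. Creating tickets only for alerts that are new after deduplication avoids opening one issue per duplicate.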

CI / CD Pipelines and Communication

The tasks defined in the pipelines can be triggered from various sources, such as:

  • Schedule, e.g. running every night
  • Action of a developer in Git (our VCS)
  • Command from the communication platform. We use the ChatOps feature in Mattermost.

The tasks are developed once to satisfy our goal of DRY (don't repeat yourself). The same task can then be used from different triggers.
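The DRY idea of defining a task once and reaching it from several triggers can be sketched as a small registry. This is a conceptual illustration only. In practice the tasks are pipeline jobs, and the names used here (task, run_backup, the "/run" command format) are hypothetical.

```python
TASKS = {}


def task(name):
    """Decorator that registers a function as a named task."""
    def register(func):
        TASKS[name] = func
        return func
    return register


@task("backup")
def run_backup(host):
    # Placeholder body; a real task would start a pipeline job.
    return f"backup started on {host}"


def on_schedule(task_name, host):
    """Trigger used by a scheduler, e.g. every night."""
    return TASKS[task_name](host)


def on_chat_command(text):
    """Trigger used by a chat command such as '/run backup web-01'."""
    _, task_name, host = text.split()
    return TASKS[task_name](host)
```

Both triggers end up in the same function, so a change to the task is made in exactly one place.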

Result of CI / CD Pipelines

The pipelines have three main goals:

  • Provisioning: Covers all the tools of the host
  • Maintenance: Covers the configuration of the applications
  • Deployment: Covers the installation and updating of the applications

Manage Configuration in Repository

The configuration of a host is managed in Git repositories. This allows us to keep track of changes easily and to recover the system. The recovery concept is explained in the next chapter.

Recovery

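The following components are backed up according to the 3-2-1 rule (three copies of the data, on two different types of media, with one copy off-site) and can be recovered.

[Figure: Recovery]

Note: The recovery duration for the DB and User Data cannot be stated in advance because it depends on the amount of data and the network speed.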
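The 3-2-1 rule can be expressed as a small check: at least three copies of the data, on at least two different types of media, with at least one copy off-site. The sketch below is an illustration of the rule only. The BackupCopy type and the example locations are assumptions and do not describe our actual backup layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str   # free-text description of where the copy lives
    medium: str     # type of storage medium
    offsite: bool   # True if stored away from the primary site


def satisfies_3_2_1(copies):
    """Return True if the copies fulfil the 3-2-1 rule."""
    enough_copies = len(copies) >= 3
    enough_media = len({copy.medium for copy in copies}) >= 2
    has_offsite = any(copy.offsite for copy in copies)
    return enough_copies and enough_media and has_offsite
```

A check like this could run as a scheduled pipeline task and raise an alert when a backup target is missing.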

Tools

Function                                  Tool
Repo / Versioning                         GitLab
Issues                                    GitLab
CI/CD Pipelines                           GitLab
Communication                             Mattermost
Provisioning / Deployment / Maintenance   Ansible
Alert Deduplication                       Alerta
Sensors                                   Netdata and Beats
Heartbeats                                curl
Log Collection                            TD-Agent
Log Aggregation                           Elasticsearch
Log Visualizing                           Kibana
Log Analysis                              ElastAlert
Ping                                      Netdata API
Backup                                    Borg Backup

Last update: March 17, 2021 09:56:50