
Hosting and Maintenance

This document gives a comprehensive overview of the important concepts, components and tools in use. As you will see, one of our key areas of expertise is automated monitoring, which lets us provide excellent stability.

This document is related to the ALM document, which follows similar philosophies and principles.

Overview

[Figure: Overview]

The following sections describe our environment in more detail.

Infrastructure as Code

This is one of our key concepts. All code for creating our infrastructure and tools is kept as Ansible scripts in our GitLab. These scripts can be executed at any time through jobs in a build pipeline. Infrastructure as Code is also a key part of our recovery concept described later in this document.

Live Monitoring

By using sensors and heartbeats, we can monitor a host by checking:

  • If the host is running
  • CPU
  • Memory Usage
  • Disk I/O
  • Network

We use Netdata to monitor several thousand metrics every second, covering for example:

  • MySQL
  • Docker
  • Apache
  • etc.
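
As a minimal sketch of the heartbeat idea above, assume every host reports a timestamp at regular intervals and a host counts as down once its last heartbeat is older than a chosen limit. The function name `stale_hosts`, the two-minute limit and the host names are illustrative assumptions, not part of our setup.

```python
from datetime import datetime, timedelta


def stale_hosts(last_seen, now, max_age=timedelta(minutes=2)):
    """Return the hosts whose last heartbeat is older than max_age, sorted by name."""
    return sorted(host for host, seen in last_seen.items() if now - seen > max_age)


# Hypothetical heartbeat table: host name -> time of the last received heartbeat.
now = datetime(2021, 1, 12, 18, 0, 0)
last_seen = {
    "web-01": now - timedelta(seconds=30),  # recent heartbeat, host is considered up
    "db-01": now - timedelta(minutes=5),    # heartbeat too old, host is considered down
}

print(stale_hosts(last_seen, now))  # prints ['db-01']
```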

Extensive Logging

As you can see in the picture above, we collect the logs from every host and from every application in use. This is a huge amount of data, so we use Elasticsearch for analysis. This analysis can also result in an alert, which is sent to our alerting system described below.
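
How a log analysis can turn into an alert may be illustrated with a simple frequency rule: raise an alert when a given number of error entries falls within a time window. This is only a sketch of the idea in plain Python. It is not the rule format of our log analysis tool, and the window, threshold and function name `should_alert` are assumptions.

```python
from datetime import datetime, timedelta


def should_alert(error_times, now, window=timedelta(minutes=5), threshold=3):
    """Return True if at least `threshold` errors occurred within the last `window`."""
    recent = [t for t in error_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold


# Hypothetical error timestamps: 1, 2, 4 and 30 minutes before `now`.
now = datetime(2021, 1, 12, 18, 0, 0)
errors = [now - timedelta(minutes=m) for m in (1, 2, 4, 30)]

print(should_alert(errors, now))  # prints True: three errors in the last five minutes
```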

High-Quality Analysis

With Netdata and logging in place, we can provide a comprehensive analysis of all our components. This allows us to retrace problems and even to prevent problems that might otherwise appear in the future.

Alert Deduplication

As seen in the picture, this component has four incoming channels:

  • Result of the Log Analysis
  • Host sensors
  • Host heartbeats
  • Applications

This component is critical because it acts as a single pool of alerts for the DevOps staff. We want to avoid duplicates so that we get a clear overview of what has happened.
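
The deduplication idea can be sketched as follows, assuming two alerts are duplicates when they share the same resource, event and severity, regardless of the channel they arrived through. This is a simplified illustration and not the logic of the actual deduplication tool. The `Alert` class, its fields and the sample values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class Alert:
    resource: str   # e.g. the affected host
    event: str      # what happened
    severity: str
    source: str     # incoming channel: log analysis, sensor, heartbeat or application
    duplicate_count: int = 0


def deduplicate(incoming):
    """Collapse alerts with the same (resource, event, severity) into one pooled entry."""
    pool = {}
    for alert in incoming:
        key = (alert.resource, alert.event, alert.severity)
        if key in pool:
            pool[key].duplicate_count += 1   # same problem reported again
        else:
            pool[key] = replace(alert)       # store a copy, leave the input untouched
    return list(pool.values())


incoming = [
    Alert("web-01", "HighCPU", "warning", "sensor"),
    Alert("web-01", "HighCPU", "warning", "log-analysis"),
    Alert("db-01", "HeartbeatMissing", "critical", "heartbeat"),
]

for alert in deduplicate(incoming):
    print(alert.resource, alert.event, alert.duplicate_count)
```

In this run the two `HighCPU` reports for `web-01` end up as one pooled alert with a duplicate count of 1, so the DevOps staff see two entries instead of three.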

Automatic Ticket Creation

Our alert deduplication component automatically creates tickets in GitLab when something goes wrong. This allows us to react to a potential problem as quickly as possible.
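
As a rough sketch of what automatic ticket creation involves, the following builds (but does not send) a request against the GitLab REST endpoint for creating issues, `POST /api/v4/projects/:id/issues`. The host `gitlab.example.com`, the project path, the token placeholder and the function name `build_issue_request` are assumptions for illustration. How the alerting component actually talks to GitLab is not described here.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_issue_request(base_url, project, token, title, description, labels):
    """Build a POST request that would create a GitLab issue for an alert."""
    # The project path must be URL-encoded when used as the project identifier.
    url = f"{base_url}/api/v4/projects/{quote(project, safe='')}/issues"
    payload = {
        "title": title,
        "description": description,
        "labels": ",".join(labels),  # GitLab accepts a comma-separated label list
    }
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values; nothing is sent until the request is passed to urlopen().
request = build_issue_request(
    "https://gitlab.example.com",
    "ops/infrastructure",
    "<access-token>",
    "db-01: HeartbeatMissing",
    "No heartbeat received from db-01 for 5 minutes.",
    ["alert", "critical"],
)
print(request.get_method(), request.full_url)
```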

CI / CD Pipelines and Communication

The tasks defined in the pipelines can be triggered from various sources, such as:

  • Schedule, e.g. running every night
  • Action of a developer in Git (our VCS)
  • Command from the communication platform. We use the ChatOps feature in Mattermost.

Each task is developed once, in line with our goal of DRY (don't repeat yourself), and can then be invoked from any of these triggers.
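
The "define once, trigger from anywhere" idea can be sketched like this: a task is registered a single time and every trigger calls the same implementation. The registry, the `task` decorator and the trigger names are hypothetical and only illustrate the principle. They do not mirror our pipeline definitions.

```python
TASKS = {}


def task(name):
    """Register a function under `name` so every trigger can look it up."""
    def register(func):
        TASKS[name] = func
        return func
    return register


@task("deploy")
def deploy(host):
    # Single implementation of the task, shared by all triggers.
    return f"deploying to {host}"


def run(trigger, name, host):
    """Run a registered task and record which trigger started it."""
    result = TASKS[name](host)
    return f"[{trigger}] {result}"


# The same task started by a schedule, a Git action and a chat command.
for trigger in ("schedule", "git-push", "chatops"):
    print(run(trigger, "deploy", "web-01"))
```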

Result of CI / CD Pipelines

The pipelines have three main goals:

  • Provisioning: Covers all the tools of the host
  • Maintenance: Covers the configuration of the applications
  • Deployment: Covers installing and updating the application

Manage Configuration in Repository

The configuration of a host is managed in Git repositories. This allows us to keep track of changes easily and to recover the system. This concept is explained in the next chapter.

Recovery

The following components are backed up according to the 3-2-1 philosophy and can be recovered.

[Figure: Recovery]

Note: The recovery duration for the DB and User Data cannot be stated in advance, because it depends on the amount of data and the network speed.
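
The 3-2-1 philosophy is commonly stated as: keep at least three copies of the data, on at least two different types of storage media, with at least one copy off-site. A minimal check of that rule could look like the sketch below. The `BackupCopy` class and the sample locations are hypothetical and do not describe our actual backup targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str
    medium: str     # type of storage medium, e.g. "disk" or "object-storage"
    offsite: bool


def satisfies_3_2_1(copies):
    """Check for at least 3 copies, 2 distinct media and 1 off-site copy."""
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


copies = [
    BackupCopy("production host", "disk", offsite=False),
    BackupCopy("local backup server", "disk", offsite=False),
    BackupCopy("remote repository", "object-storage", offsite=True),
]

print(satisfies_3_2_1(copies))      # prints True
print(satisfies_3_2_1(copies[:2]))  # prints False: two copies, one medium, none off-site
```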

Tools

Function                                   Tool
Repo / Versioning                          GitLab
Issues                                     GitLab
CI/CD Pipelines                            GitLab
Communication                              Mattermost
Provisioning / Deployment / Maintenance    Ansible
Alert Deduplication                        Alerta
Sensors                                    Netdata and Beats
Heartbeats                                 curl
Log Collection                             TD-Agent
Log Aggregation                            Elasticsearch
Log Visualizing                            Kibana
Log Analysis                               ElastAlert
Ping                                       Netdata API
Backup                                     Borg Backup

Last update: January 12, 2021 18:06:10