Hosting and Maintenance
This document gives a comprehensive overview of all important concepts, components and tools in use. One of our key areas of expertise is automated monitoring, which helps us to provide a high level of stability.
This document is related to the ALM document, which follows similar philosophies and principles.
The following sections describe our environment in more detail.
Infrastructure as Code
This is one of our key concepts. All code for creating our infrastructure and tools is stored as Ansible scripts in our GitLab instance. These scripts can be executed at any time through jobs in a build pipeline. Infrastructure as Code is also a key part of our recovery concept, described later in this document.
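As a minimal sketch of how a pipeline job might invoke such a script, the helper below only assembles the `ansible-playbook` command line. The function name and the file names in the usage note are hypothetical; `-i`, `--limit` and `--check` are standard `ansible-playbook` options.

```python
from typing import List, Optional


def build_playbook_command(
    playbook: str,
    inventory: str,
    limit: Optional[str] = None,
    check: bool = False,
) -> List[str]:
    """Build the argument list for an ansible-playbook run (illustrative helper)."""
    cmd = ["ansible-playbook", "-i", inventory]
    if limit:
        # Restrict the run to a single host or group.
        cmd += ["--limit", limit]
    if check:
        # Dry run: report what would change without applying it.
        cmd.append("--check")
    cmd.append(playbook)
    return cmd
```

A job could pass the resulting list to `subprocess.run`, for example `build_playbook_command("site.yml", "inventory/production", limit="web01", check=True)` for a dry run against one host.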
By using sensors and heartbeats, we can monitor a host by checking:
- Whether the host is running
- Memory usage
- Disk I/O
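The three checks above can be sketched as a small function. This is an illustration only, not how Netdata evaluates its metrics; the `HostSample` type and all thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostSample:
    host: str
    last_heartbeat: float    # Unix timestamp of the last heartbeat received
    memory_used_pct: float   # 0..100
    disk_io_util_pct: float  # 0..100


def check_host(
    sample: HostSample,
    now: float,
    heartbeat_timeout: float = 60.0,   # assumed threshold, in seconds
    memory_limit_pct: float = 90.0,    # assumed threshold
    disk_io_limit_pct: float = 95.0,   # assumed threshold
) -> List[str]:
    """Return the list of problems found for one host (empty if healthy)."""
    problems = []
    if now - sample.last_heartbeat > heartbeat_timeout:
        # A missing heartbeat is treated as "host is not running".
        problems.append("no heartbeat")
    if sample.memory_used_pct > memory_limit_pct:
        problems.append("memory usage high")
    if sample.disk_io_util_pct > disk_io_limit_pct:
        problems.append("disk I/O high")
    return problems
```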
We use Netdata, which monitors several thousand metrics every second.
As you can see in the picture above, we collect the logs from every host and from every application in use. This is a huge amount of data, so we use Elasticsearch for analysis. This analysis can also result in an alert, which is sent to our alerting system described below.
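How a log analysis can turn into an alert may be easier to see in a small sketch. The code below does not use Elasticsearch; it only illustrates the idea of counting error entries per host and emitting an alert once an assumed threshold is reached. The entry layout, the threshold and the alert fields are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

LogEntry = Tuple[str, str, str]  # (host, level, message), an assumed layout


def find_error_spikes(
    entries: Iterable[LogEntry], threshold: int = 5
) -> List[Dict[str, object]]:
    """Return one alert per host whose ERROR count reaches the threshold."""
    errors = Counter(host for host, level, _ in entries if level == "ERROR")
    return [
        {"host": host, "check": "log_errors", "count": count}
        for host, count in sorted(errors.items())
        if count >= threshold
    ]
```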
With Netdata and logging in place, we can provide a comprehensive analysis of all our components. This allows us to retrace problems and even to prevent problems that might otherwise appear in the future.
Flexible distribution of Alerts
Specific alerts can be distributed by Netdata to several channels.
Channels supported by Netdata:
- e-mails (using the sendmail command)
- push notifications to your mobile phone (pushover.net)
- messages to your Slack team (slack.com)
- messages to your Alerta server
- messages to your Flock team
- messages to your Discord guild
- messages to your Telegram chat / group chat
- SMS messages to your cell phone or any SMS-enabled device (Twilio)
- SMS messages to your cell phone or any SMS-enabled device (messagebird.com)
- SMS messages to your cell phone or any SMS-enabled device (smstools3)
- notifications to users on pagerduty.com
- push notifications to iOS devices (via prowlapp.com)
- notifications to Amazon SNS topics (aws.amazon.com)
- messages to your IRC channel on your selected network
- messages to a local or remote syslog daemon
- messages to Microsoft Teams (through webhook)
- messages to Rocket.Chat (through webhook)
- messages to Google Hangouts Chat (through webhook)
In case of an outage of the alerting system, we have an external fallback platform, called Matrix. All alerts will be sent to this platform as well.
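The distribution logic described here, including the rule that every alert also reaches the fallback platform, might look roughly like the sketch below. It is not Netdata's configuration format; the routing table, the `severity` field and the sender callables are hypothetical stand-ins for real channel integrations.

```python
from typing import Callable, Dict, List

Alert = Dict[str, str]
Sender = Callable[[Alert], None]


def dispatch(alert: Alert, routes: Dict[str, List[Sender]], fallback: Sender) -> int:
    """Send an alert to the channels for its severity and always to the fallback.

    Returns the number of successful deliveries, fallback included.
    """
    delivered = 0
    for send in routes.get(alert["severity"], []):
        try:
            send(alert)
            delivered += 1
        except Exception:
            # A failing channel must not block the remaining ones.
            continue
    # Every alert is also sent to the external fallback platform.
    fallback(alert)
    return delivered + 1
```

Keeping the fallback call outside the routing table means it cannot be dropped by accident when routes are edited.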
As seen in the picture, this component has four incoming channels:
- Result of the Log Analysis
- Host sensors
- Host heartbeats
This component is critical because it acts as a single pool of alerts for the DevOps staff. Duplicate alerts are filtered out so that the staff gets a clear overview of what has happened.
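One common way to avoid duplicates in such a pool is to key each alert by host and check, and to suppress repeats inside a time window. The sketch below shows that technique under assumed names and an assumed five-minute window; it is not a description of the actual component.

```python
from typing import Dict, Tuple


class AlertPool:
    """Single pool of alerts that suppresses repeats of the same (host, check)."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self._last_seen: Dict[Tuple[str, str], float] = {}

    def add(self, host: str, check: str, timestamp: float) -> bool:
        """Return True if the alert is new, False if it repeats a recent one."""
        key = (host, check)
        last = self._last_seen.get(key)
        # Record every occurrence, so an alert that keeps firing stays suppressed.
        self._last_seen[key] = timestamp
        return last is None or timestamp - last > self.window_seconds
```

Because the timestamp is refreshed on every occurrence, the window slides: an alert is reported again only after it has been quiet for longer than the window.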
Automatic Ticket Creation
Our alert deduplication component automatically creates tickets in GitLab when something goes wrong. This allows us to react to a potential problem as quickly as possible.
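Creating such a ticket could, for example, go through the GitLab REST API, which exposes `POST /api/v4/projects/:id/issues` and accepts a `PRIVATE-TOKEN` header. The sketch below only builds the request; the function name, the base URL and the way the title is composed are assumptions, and the real component may work differently.

```python
import json
import urllib.parse
import urllib.request


def build_issue_request(
    base_url: str, project_id, token: str, title: str, description: str
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a GitLab issue."""
    # Project paths such as "group/project" must be URL-encoded.
    project = urllib.parse.quote(str(project_id), safe="")
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project}/issues"
    body = json.dumps({"title": title, "description": description}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Separating construction from sending keeps the request easy to inspect in tests.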
CI / CD Pipelines and Communication
The tasks defined in the pipelines can be triggered from various sources, like:
- Schedule, e.g. running every night
- Action of a developer in Git (our VCS)
- Command from the communication platform. We use the ChatOps feature in Mattermost.
Each task is developed once, in line with our DRY (don't repeat yourself) goal, and can then be invoked from any of these triggers.
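The idea of one task definition shared by several triggers can be sketched as follows. The registry, the task name `deploy`, the target names and the chat command format are all hypothetical; in practice the pipeline tooling plays this role.

```python
from typing import Callable, Dict

TASKS: Dict[str, Callable[[str], str]] = {}


def task(name: str):
    """Register a function as a named task (defined once, reused everywhere)."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        TASKS[name] = func
        return func
    return register


@task("deploy")
def deploy(target: str) -> str:
    return f"deploy to {target}"


def run_task(name: str, target: str, trigger: str) -> str:
    # All triggers end up here, so the task logic exists in one place only.
    return f"[{trigger}] {TASKS[name](target)}"


def on_schedule() -> str:
    return run_task("deploy", "staging", trigger="schedule")


def on_git_push(branch: str) -> str:
    target = "production" if branch == "main" else "staging"
    return run_task("deploy", target, trigger="git")


def on_chat_command(text: str) -> str:
    # Assumed command format: "<task> <target>", e.g. "deploy staging".
    name, target = text.split()
    return run_task(name, target, trigger="chatops")
```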
Result of CI / CD Pipelines
The pipelines have three main goals:
- Provisioning: Covers all the tools of the host
- Maintenance: Covers the configuration of the applications
- Deployment: Installs and updates the application
Manage Configuration in Repository
The configuration of a host is managed in Git repositories. This allows us to easily keep track of changes and to recover the system. This concept is explained in the next chapter.
The following components are backed up according to the 3-2-1 philosophy and can be recovered.
Note: The recovery duration for the DB and user data cannot be determined in advance, because it depends on the amount of data and the network speed.
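The 3-2-1 rule is commonly stated as: at least three copies of the data, on at least two different storage media, with at least one copy off-site. A minimal sketch of a check for that rule, using a hypothetical `BackupCopy` record, could look like this:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupCopy:
    location: str  # free-text label, e.g. a host or bucket name
    medium: str    # e.g. "disk", "tape", "object-storage"
    offsite: bool


def satisfies_3_2_1(copies: Iterable[BackupCopy]) -> bool:
    """Check: at least 3 copies, at least 2 media, at least 1 off-site copy."""
    copies = list(copies)
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )
```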
| Purpose | Tool |
|---|---|
| Repo / Versioning | GitLab |
| Provisioning / Deployment / Maintenance | Ansible |
| Sensors | Netdata and Beats |