Self-Healing and Monitoring: A Comprehensive Guide to Revolutionizing System Resilience Through Automation
In today’s fast-paced digital world, maintaining system reliability and minimizing downtime are critical for business success. This comprehensive guide explores how to enhance system resilience through advanced monitoring and self-healing mechanisms.
We will walk you through integrating Datadog for monitoring, setting up automated recovery scripts, and leveraging Node.js and webhooks to create a reliable self-healing system (with a focus on disk management).
By the end of this guide, you'll have a fully automated setup that can proactively manage system issues, ensuring smooth and uninterrupted operations.
Prerequisites
A Datadog Account: Create an active Datadog account for monitoring and alerts.
A Linux Server: This can be either on-premises or a cloud-based instance.
Internet Access: Required for installing software, setting up integrations, and accessing Datadog.
1. Create a Datadog Account
Sign Up for Datadog:
Visit Datadog's sign-up page and create a free trial account by entering your email and setting a password.
Complete the registration process, create a name for your organization, and verify your email if required.
Log in to your Datadog dashboard.
2. Deploy Your First Datadog Agent
Installation guides for your preferred operating system are listed in the Integrations > Agents section.
For a Local Ubuntu Server:
Install the Datadog Agent:
Obtain Your Datadog API Key:
- Navigate to Organization settings > API Keys to find your API key.
Install the Agent:
Run the following command on your Ubuntu server to install the Datadog Agent:
DD_API_KEY=8axxxxxxxxxxxxxxxxxxxxxxxxxxxf43 DD_SITE="datadoghq.com" bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"
- This command sets up the agent and automatically configures it with your API key.
Verify Agent Installation:
Check the agent status to ensure it is running:
systemctl status datadog-agent
sudo datadog-agent status
You should see output indicating that the agent is running and sending data.
To check logs, use:
tail -f /var/log/datadog/agent.log
For an Ubuntu Server Managing a Kubernetes Cluster (with kubectl and helm already installed):
Create a Datadog API Key Secret:
Execute the following command to create a Kubernetes secret containing your Datadog API key:
kubectl create secret generic datadog-secret --from-literal api-key=8axxxxxxxxxxxxxxxxxxxxxxxxxxxf43
Deploy the Datadog Agent in the Cluster:
Add the Datadog Helm repository and update (the first three commands install Helm and can be skipped if it is already installed):
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm repo add datadog https://helm.datadoghq.com
helm repo update
Create and configure datadog-values.yaml:
nano datadog-values.yaml
Add the following content:
datadog:
  apiKeyExistingSecret: datadog-secret
Deploy the Datadog Agent:
helm install datadog-agent -f datadog-values.yaml datadog/datadog
Confirm agents are running:
kubectl get all
3. Prepare a Disk for Monitoring
An alert will be triggered when disk usage reaches a defined threshold.
Add a New Disk to the Server:
- In your VMware or AWS cloud instance, add a new 2GB disk.
Adding a 2GB Disk in VMware Workstation
Open VMware Workstation:
- Launch VMware Workstation and select your virtual machine.
Open VM Settings:
Right-click on the virtual machine and select Settings.
Add a New Disk:
Click Add to open the Add Hardware Wizard.
Choose Hard Disk and click Next.
Select SCSI (recommended) or IDE and click Next.
Choose Create a new virtual disk and click Next.
Specify the disk size as 2 GB.
Choose the location to store the virtual disk file and click Next.
Click Finish to create the disk.
Adding a 2GB Disk in AWS
Log in to AWS Management Console:
Navigate to AWS Management Console.
Log in with your credentials.
Navigate to EC2 Dashboard:
- Go to Services > EC2.
Create a New EBS Volume:
In the left sidebar, click on Volumes under Elastic Block Store.
Click Create Volume.
Configure the volume:
Volume Type: Choose General Purpose SSD (gp3), Provisioned IOPS SSD (io1), etc.
Size: Enter 2 GiB.
Availability Zone: Select the same availability zone as your EC2 instance.
Click Create Volume.
Attach the EBS Volume to an EC2 Instance:
Go back to Volumes.
Select the volume you created.
Click Actions > Attach Volume.
Choose the instance you want to attach the volume to from the drop-down list.
Click Attach.
Log in to Your EC2 or VMware Instance.
Prepare the Volume for Use:
Verify Disk:
lsblk
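The new disk should be listed as /dev/xvdf or similar on AWS; on VMware it typically appears as /dev/sdb. The commands below use /dev/sdb, so substitute the device name that lsblk reports on your system.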
Install LVM: LVM (Logical Volume Management) is used to manage disk volumes because it offers flexibility, efficiency, and scalability. It allows dynamic resizing of partitions, easy addition and removal of disks, and improved storage utilization through aggregation and thin provisioning. LVM enhances performance with striping, supports snapshots for backups, simplifies administration, and can be combined with mirroring or RAID for high availability, making it an ideal choice for environments with dynamic storage needs.
Install LVM tools:
sudo apt update
sudo apt install -y lvm2
Set Up LVM:
Create a Physical Volume (PV):
sudo pvcreate /dev/sdb
Create a Volume Group (VG):
sudo vgcreate demoVG /dev/sdb
Create a Logical Volume (LV) using all available space:
sudo lvcreate -n demoLV -l 100%FREE demoVG
Format and Mount the Logical Volume:
Format the LV with ext4 filesystem:
sudo mkfs.ext4 /dev/demoVG/demoLV
Create a mount point and mount the LV:
sudo mkdir /demo
sudo mount /dev/demoVG/demoLV /demo
Verify the Mount:
df -h /demo
Configure Automatic Mounting:
Add the following entry to /etc/fstab for automatic mounting at boot:
echo '/dev/demoVG/demoLV /demo ext4 defaults 0 2' | sudo tee -a /etc/fstab
Verify fstab Configuration:
cat /etc/fstab
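If you want a quick reminder of what each of the six fields in that entry means, the sketch below splits an fstab line into named fields and flags two common mistakes. The helper names parseFstabLine and checkFstabEntry are made up for this illustration, and the parser assumes all six fields are present.

```javascript
// Sketch: parse one /etc/fstab entry into named fields and flag common mistakes.
// parseFstabLine and checkFstabEntry are illustrative helpers, not a library API.

function parseFstabLine(line) {
  // fstab fields are separated by spaces or tabs.
  const fields = line.trim().split(/\s+/);
  if (fields.length !== 6) {
    throw new Error(`expected 6 fields, got ${fields.length}`);
  }
  const [device, mountPoint, fsType, options, dump, pass] = fields;
  return { device, mountPoint, fsType, options, dump: Number(dump), pass: Number(pass) };
}

function checkFstabEntry(entry) {
  const problems = [];
  if (!entry.mountPoint.startsWith('/')) {
    problems.push('mount point should be an absolute path');
  }
  if (entry.mountPoint !== '/' && entry.pass === 1) {
    problems.push('fsck pass 1 is normally used only for the root filesystem');
  }
  return problems;
}

const entry = parseFstabLine('/dev/demoVG/demoLV /demo ext4 defaults 0 2');
console.log(entry);
console.log(checkFstabEntry(entry)); // []
```

Here the final field (2) tells fsck to check /demo after the root filesystem, and the fifth field (0) disables dump backups.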
4. Set Up the Webhook HTTPS Listener Using Node.js
If you prefer not to use Node.js, you could instead use a Python Flask service, or a serverless function (such as AWS Lambda, if you are on a cloud provider like AWS) to trigger the script. The method below, however, gives you direct control over the infrastructure.
First, create the script that Datadog's Webhook integration will trigger to clear the files in the /demo directory.
Example script (purge_demo.sh):
nano /tmp/purge_demo.sh
#!/bin/bash
# Purge every file under /demo and log what happened.
LOGFILE="/tmp/purge_demo.log"
echo "Running purge script at $(date)" >> "$LOGFILE"
# Directory to purge
TARGET_DIR="/demo"
# Check if the directory exists
if [ -d "$TARGET_DIR" ]; then
    echo "Purging all files in $TARGET_DIR..." >> "$LOGFILE"
    # ${TARGET_DIR:?} aborts if the variable is empty, so this never expands to "rm -rf /*"
    rm -rf "${TARGET_DIR:?}"/* >> "$LOGFILE" 2>&1
    echo "All files in $TARGET_DIR have been purged." >> "$LOGFILE"
else
    echo "Directory $TARGET_DIR does not exist." >> "$LOGFILE"
fi
Make the Script Executable:
chmod +x /tmp/purge_demo.sh
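The script above removes everything under /demo. If that is too aggressive for your environment, a narrower variant could delete only files with a particular extension. The following is a minimal sketch in Node.js; purgeByExtension is a hypothetical helper, and the default ".log" extension is an assumption rather than something the original script does.

```javascript
// Sketch: delete only files with a given extension from one directory.
// purgeByExtension is a hypothetical helper; the shell script removes everything.
const fs = require('fs');
const path = require('path');

function purgeByExtension(targetDir, extension = '.log') {
  const removed = [];
  // withFileTypes lets us skip subdirectories without an extra stat call.
  for (const dirent of fs.readdirSync(targetDir, { withFileTypes: true })) {
    if (dirent.isFile() && path.extname(dirent.name) === extension) {
      fs.rmSync(path.join(targetDir, dirent.name));
      removed.push(dirent.name);
    }
  }
  // Return the names of the deleted files so the caller can log them.
  return removed.sort();
}

// Example call on the server (not executed here):
//   console.log(purgeByExtension('/demo'));
```

Note that a file without a .log extension, such as a test file created with dd, would be left in place by this variant.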
Install Node.js:
Install the Node.js package repository and Node.js:
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs
Verify Installation:
node -v
npm -v
Create a Simple Node.js Webhook Listener:
Create a directory for the webhook listener and navigate to it:
mkdir ~/webhook_listener
cd ~/webhook_listener
Initialize a new Node.js project:
npm init -y
Install Express:
npm install express
Create the webhook_listener.js file:
nano webhook_listener.js
Add the following code to webhook_listener.js:
const express = require('express');
const { exec } = require('child_process');

const app = express();
const port = 6060;

app.post('/purge', (req, res) => {
  // Execute the purge script
  exec('/tmp/purge_demo.sh', (error, stdout, stderr) => {
    if (error) {
      console.error(`Error executing script: ${error.message}`);
      res.status(500).send('Internal Server Error');
      return;
    }
    if (stderr) {
      console.error(`Script stderr: ${stderr}`);
    }
    console.log(`Script output: ${stdout}`);
    res.send('Purge script executed');
  });
});

app.listen(port, () => {
  console.log(`Webhook listener running at http://localhost:${port}`);
});
The service will be actively listening for incoming HTTP POST requests on port 6060. When Datadog triggers the webhook, it will send an HTTP or HTTPS POST request to this specific URL. This request will prompt the execution of the purge script.
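Because this endpoint deletes files and may be reachable from the internet, you may want to reject requests that do not carry a shared secret. The sketch below shows one possible check. The header name x-purge-token and the PURGE_TOKEN environment variable are assumptions made for this example; the matching header would have to be added to the webhook's custom headers in Datadog.

```javascript
// Sketch: verify a shared-secret header before running the purge script.
// The header name "x-purge-token" and the PURGE_TOKEN variable are assumptions.
const crypto = require('crypto');

function isAuthorized(headers, expectedToken) {
  // Node.js lowercases incoming header names, so look the header up in lowercase.
  const supplied = headers['x-purge-token'];
  if (typeof supplied !== 'string' || !expectedToken) return false;
  // Hash both values so timingSafeEqual always compares equal-length buffers.
  const a = crypto.createHash('sha256').update(supplied).digest();
  const b = crypto.createHash('sha256').update(expectedToken).digest();
  return crypto.timingSafeEqual(a, b);
}

// Inside the Express route this could be used as:
//   if (!isAuthorized(req.headers, process.env.PURGE_TOKEN)) {
//     return res.status(401).send('Unauthorized');
//   }

console.log(isAuthorized({ 'x-purge-token': 'example' }, 'example')); // true
console.log(isAuthorized({}, 'example'));                             // false
```

The constant-time comparison is a defensive choice; a plain string comparison would also work functionally but can leak timing information.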
Make the Listener Persistent with PM2 (the service will continuously run in background):
Install PM2:
sudo npm install -g pm2
Start the webhook listener with PM2:
pm2 start webhook_listener.js
Verify PM2 Process:
pm2 list
To stop or restart the listener when needed:
pm2 stop webhook_listener.js
pm2 restart webhook_listener.js
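As an alternative to starting the script directly, PM2 can read its settings from a process file. The sketch below shows what such a file might look like for this listener; the restart limit and the environment variable are illustrative values, not requirements.

```javascript
// ecosystem.config.js: a sketch of a PM2 process file for the listener.
// The restart limit and NODE_ENV value are illustrative choices.
const ecosystem = {
  apps: [
    {
      name: 'webhook_listener',
      script: './webhook_listener.js',
      autorestart: true,   // restart the process if it exits
      max_restarts: 10,    // give up after 10 consecutive unstable restarts
      env: { NODE_ENV: 'production' },
    },
  ],
};

module.exports = ecosystem;
```

With this file saved in ~/webhook_listener, the listener can be started with pm2 start ecosystem.config.js. Running pm2 save followed by pm2 startup should let PM2 bring the listener back after a reboot.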
Test the Webhook Listener
Simulate a Webhook Request:
You can use curl to simulate a POST request to your webhook:
curl -X POST http://localhost:6060/purge
If everything is set up correctly, the Node.js listener should execute /tmp/purge_demo.sh and return a confirmation message.
5. [Optional] Expose the Local Server with a Tunnel
If your server is not a publicly reachable cloud instance, Datadog needs a temporary public URL to reach it. A tunnel provides one.
Install Localtunnel:
sudo npm install -g localtunnel
Start a Tunnel:
Start a Localtunnel to expose your local webhook listener:
lt --port 6060 --subdomain trigger-xxxx
The URL printed to the screen will look like https://trigger-xxxx.loca.lt.
Get Tunnel Password:
When you open the URL in a browser for the first time, you may be asked to enter the tunnel password. Retrieve it by running this command on the server:
wget -q -O - https://loca.lt/mytunnelpassword
The public IP address displayed is your password.
6. Configure Datadog Webhook
Create a Webhook in Datadog:
Log in to Datadog, navigate to Integrations, and search for Webhooks.
Click New Webhook and configure it:
Name: Run_Purge_Script
URL: https://trigger-xxxxx.loca.lt/purge (for a tunnel URL)
OR
URL: https://server_domainIP/purge (for a cloud instance)
Additional Options: Set as needed. [optional]
Click Save.
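Depending on how the monitor message is written, the webhook may also be called when the alert recovers, which would run the purge a second time. One way to guard against that is to send the alert transition in the webhook payload and check it in the listener. The sketch below assumes the webhook's custom payload is set to something like {"transition": "$ALERT_TRANSITION", "title": "$EVENT_TITLE"}; the field names transition and title are this example's own choice, and the transition values should be verified against the payloads you actually receive.

```javascript
// Sketch: decide whether an incoming Datadog webhook call should trigger a purge.
// Assumes a custom payload such as:
//   {"transition": "$ALERT_TRANSITION", "title": "$EVENT_TITLE"}
// The field names "transition" and "title" are assumptions of this sketch.

function shouldPurge(payload) {
  if (!payload || typeof payload.transition !== 'string') return false;
  // Act only when the monitor enters or re-enters the alert state.
  return ['Triggered', 'Re-Triggered'].includes(payload.transition.trim());
}

console.log(shouldPurge({ transition: 'Triggered' })); // true
console.log(shouldPurge({ transition: 'Recovered' })); // false
```

To use this in the Express listener, the JSON body has to be parsed first, for example with app.use(express.json()), after which the route can call shouldPurge(req.body) before executing the script.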
Test That Datadog Can Send a POST Request, Then Set Up a Datadog Monitor to Trigger the Webhook:
On the monitoring page, navigate to Synthetic Monitoring and Testing > New Test.
Click New API Test, choose HTTP, set the method to POST and the URL to https://server_domainIP/purge, then click Send.
You should get a success response.
Create a Monitor:
Navigate to the Infrastructure page in Datadog.
Hover your mouse over the host and click View Host Dashboard.
A graphical display of the metrics you can monitor will be shown here.
Click on the metric you want to monitor (e.g. disk usage by device), then click Create Monitor.
Configure the Monitor:
Set the query to trigger an alert when disk usage exceeds 90%:
max(last_5m):max:system.disk.in_use{device:/dev/mapper/demoVG-demoLV} by {device} > 0.9
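The threshold in this query is a fraction rather than a percentage, so 0.9 corresponds to 90% of the volume. The sketch below rebuilds the same query string from its parts and works out roughly how many bytes 90% represents on a nominal 2 GiB volume. The helper names are made up for this illustration, and the usable capacity of the real filesystem will be somewhat lower than 2 GiB because of LVM and ext4 overhead.

```javascript
// Sketch: build the monitor query string and show what the 0.9 threshold means.
// buildDiskQuery and bytesAtThreshold are illustrative helpers.

function buildDiskQuery(device, threshold, window = 'last_5m') {
  return `max(${window}):max:system.disk.in_use{device:${device}} by {device} > ${threshold}`;
}

function bytesAtThreshold(totalBytes, threshold) {
  // Round down to a whole number of bytes.
  return Math.floor(totalBytes * threshold);
}

const GiB = 1024 ** 3;
console.log(buildDiskQuery('/dev/mapper/demoVG-demoLV', 0.9));
console.log(bytesAtThreshold(2 * GiB, 0.9));
```

Changing the second argument of buildDiskQuery (for example to 0.8) gives the query for an earlier warning.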
Set the alert message:
Alert: Disk usage on /demo has exceeded 90%. Triggering purge script.
Add recipients:
@your_email@domain.com @webhook-Run_Purge_Script
Your webhook is also a recipient here. Typing the @ key in the message field brings up a list of recipients to select from.
Save and Activate the Monitor:
- Click Save to activate the monitor.
Navigate to the Monitors section to see a list of your configured monitors.
7. Verify the Self-Healing Process
Populate the /demo Directory:
Open a separate session to the server.
Write data to the /demo directory until it reaches about 95% capacity to simulate a full disk:
dd if=/dev/zero of=/demo/testfile bs=1M count=1900
Monitor Disk Usage: (open a new session to the server)
Change into /demo and run a continuous loop to monitor the directory:
cd /demo
while true; do date && ls -l && pwd && du -ms; sleep 2; done
Every two seconds, this loop prints the current date and time, the list of files, the current working directory, and the total size of the files in the /demo partition.
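If you would rather watch the usage fraction itself, Node.js can read filesystem statistics directly. The sketch below computes an in-use fraction from block counts; inUseFraction and checkPath are hypothetical helpers, and the result is only roughly comparable to what the Datadog Agent reports, since the Agent may account for reserved blocks differently.

```javascript
// Sketch: compute the in-use fraction of a filesystem from statfs-style numbers.
const fs = require('fs');

function inUseFraction({ blocks, bfree }) {
  // blocks = total blocks, bfree = free blocks (including root-reserved ones).
  if (blocks === 0) return 0;
  return (blocks - bfree) / blocks;
}

function checkPath(mountPoint) {
  // fs.statfsSync is available in recent Node.js releases, including 20.x.
  return inUseFraction(fs.statfsSync(mountPoint));
}

console.log(inUseFraction({ blocks: 1000, bfree: 50 })); // 0.95
// On the server: console.log(checkPath('/demo'));
```

A value above 0.9 here suggests the monitor threshold is about to be crossed.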
Ensure that the disk usage reaches 90% and triggers the Datadog monitor.
Check Webhook Execution:
Verify that the webhook is called and the purge script executes as expected.
An email will be sent automatically about the filled-up disk, and the /demo partition will be cleaned up.
You will notice that the email was triggered at about the same time the disk filled up. Within seconds, the script has been executed and any upcoming disruption averted.
Check the /demo directory to ensure that files are deleted when the threshold is crossed.
Another email will arrive to inform you that the alert has been resolved and closed.
By following these steps, you'll have a working self-healing system. While the script provided focuses on clearing files from /demo, it can be adapted to perform other actions, such as restarting a service, scaling an instance, or any other task. With Datadog monitoring disk usage and triggering a Node.js webhook, the system automatically executes the necessary script, keeping your infrastructure responsive and efficiently managed.
Please feel free to leave a like, comment, or ask a question if you need clarity on any of the steps. Happy Learning!