Skip to content

WorkingTitle Advanced Monitoring System

A production-ready, enterprise-grade monitoring system for the WorkingTitle application with comprehensive health monitoring, intelligent log analysis, automated recovery procedures, and centralized logging.

📋 System Overview

The WorkingTitle Advanced Monitoring System is a comprehensive, enterprise-grade solution designed to ensure the reliability and performance of your WorkingTitle application. This system provides real-time health monitoring, intelligent log analysis, automated recovery procedures, and centralized logging capabilities.

🎯 Key Features

🔍 Comprehensive Health Monitoring

  • Container Health Checks: Monitors both staging and production Docker containers with HTTP-based health verification
  • Database Monitoring: Tracks PostgreSQL database connectivity for both staging and production environments
  • Resource Monitoring: Monitors CPU, memory, and disk usage with configurable thresholds
  • SSL Certificate Monitoring: Automatically checks SSL certificate expiration (alerts when < 30 days)
  • External Service Monitoring: Verifies connectivity to critical external services (Google, GitHub, Docker Hub)

📊 Advanced Log Analysis

  • Centralized Log Collection: Aggregates logs from systemd journal, Docker containers, application logs, and nginx
  • JSON Processing: Uses jq for efficient JSON log processing with memory-efficient stream processing for large files
  • Pattern Recognition: Advanced error and performance pattern detection
  • Time-Range Analysis: Flexible time-based log analysis with custom date ranges
  • Report Generation: Comprehensive JSON and human-readable reports

🔄 Automated Recovery System

  • Intelligent Recovery: Multi-step recovery procedures for containers, databases, and system resources
  • Resource Cleanup: Automatic Docker resource cleanup and system maintenance
  • Network Recovery: Docker network recreation and connectivity restoration
  • Failure Tracking: Prevents infinite recovery loops with attempt limits
  • Rollback Support: Built-in backup and rollback capabilities

⚙️ Modern Systemd Integration

  • Service Management: Full systemd service integration with enhanced security settings
  • Timer-Based Scheduling: Replaces cron with systemd timers for better reliability
  • Centralized Logging: All logs captured by systemd journal with automatic rotation
  • Security Hardening: Comprehensive security settings including process isolation and resource limits

🛡️ Enterprise Security

  • No SSH Dependencies: Eliminates SSH-related security risks by running entirely on the server
  • Process Isolation: Private temp directories and system call filtering
  • Resource Limits: Memory and CPU limits to prevent resource exhaustion
  • Secure Credential Handling: Environment-based configuration management

🔄 How It Works

Step 1: Installation & Setup

  1. Prerequisites Check: Validates required dependencies (docker, systemctl, journalctl, jq)
  2. Backup Creation: Creates automatic backup of existing configuration before installation
  3. File Deployment: Copies all monitoring scripts and configuration to /var/www/workingtitle/
  4. Systemd Integration: Creates systemd service and timer files with enhanced security settings
  5. Service Activation: Enables and starts monitoring services and timers

Step 2: Health Monitoring Loop

  1. Container Checks: Verifies Docker containers are running and responding to HTTP requests
  2. Database Verification: Tests PostgreSQL connectivity for both staging and production databases
  3. Resource Analysis: Monitors CPU, memory, and disk usage against configurable thresholds
  4. SSL Validation: Checks SSL certificate expiration dates and connectivity
  5. External Connectivity: Tests connection to critical external services
  6. Alert Generation: Sends alerts based on failure severity and consecutive failure counts

Step 3: Automated Recovery

  1. Failure Detection: Triggers when consecutive failures exceed the configured threshold
  2. Resource Cleanup: Performs Docker resource cleanup and system maintenance
  3. Network Recovery: Recreates Docker networks and restarts Docker daemon if needed
  4. Database Recovery: Restarts PostgreSQL service and verifies database connectivity
  5. Container Recovery: Stops, removes, and recreates containers with proper networking
  6. Verification: Confirms all services are running and healthy after recovery

Step 4: Log Analysis & Reporting

  1. Log Collection: Aggregates logs from systemd journal, Docker containers, and application files
  2. JSON Processing: Converts all logs to structured JSON format using jq
  3. Pattern Analysis: Identifies error patterns, performance issues, and system trends
  4. Report Generation: Creates comprehensive JSON and human-readable reports
  5. Data Retention: Manages log rotation and cleanup based on configured retention policies

Step 5: Continuous Operation

  1. Timer-Based Execution: Uses systemd timers for scheduled health checks and log analysis
  2. Centralized Logging: All output captured by systemd journal with automatic rotation
  3. Resource Management: Monitors and manages system resources to prevent exhaustion
  4. Security Enforcement: Maintains process isolation and security constraints throughout operation

🏗️ System Architecture

Core Components

  • setup-monitoring-v2.sh: Installation and configuration management script
  • health-monitor-v2.sh: Main health monitoring service with comprehensive checks
  • log-analyzer-v2.sh: Advanced log analysis and reporting engine
  • auto-recovery-v2.sh: Automated recovery and maintenance procedures
  • shared-functions.sh: Common utilities and functions used across all components
  • monitoring-config-v2.env: Centralized configuration management

Systemd Services

  • workingtitle-monitor.service: Main monitoring service with enhanced security settings
  • workingtitle-check.timer: Scheduled health checks (every 5 minutes with randomized delay)
  • workingtitle-analyze.timer: Daily log analysis and report generation

Configuration Management

  • Centralized Config: All settings managed through monitoring-config-v2.env
  • Environment Variables: Secure credential handling through environment variables
  • Backup System: Automatic backup creation before any configuration changes
  • Rollback Support: Built-in rollback capabilities for failed installations

Logging & Monitoring

  • Systemd Journal: Primary logging mechanism with automatic rotation
  • JSON Processing: All logs processed through jq for structured analysis
  • Multi-Source Collection: Aggregates logs from containers, applications, and system services
  • Report Generation: Automated generation of JSON and human-readable reports

🚀 Quick Start

1. Setup Monitoring System

# Copy the setup script to the server and run it locally
scp setup-monitoring-v2.sh root@195.24.67.210:/tmp/
ssh root@195.24.67.210 "chmod +x /tmp/setup-monitoring-v2.sh && /tmp/setup-monitoring-v2.sh install"

2. Manual Health Check

# Check system health
/var/www/workingtitle/health-monitor-v2.sh check

# Or using systemd-run for better isolation
/var/www/workingtitle/health-monitor-v2.sh check-systemd

3. Analyze Logs

# Comprehensive log analysis
/var/www/workingtitle/log-analyzer-v2.sh analyze

# Analyze with custom time range
/var/www/workingtitle/log-analyzer-v2.sh analyze --since "1 hour ago"

📁 File Structure V2

workingtitle_gen/
├── monitoring_system/
│   ├── setup-monitoring-v2.sh             # Fully local setup script (no SSH/SCP)
│   ├── health-monitor-v2.sh               # Enhanced health monitoring with SSL checks
│   ├── log-analyzer-v2.sh                 # Advanced log analysis with jq processing
│   ├── auto-recovery-v2.sh                # Enhanced automated recovery
│   ├── monitoring-config-v2.env           # Centralized configuration V2
│   ├── shared-functions.sh                # Shared functions (DRY principle)
│   └── ADVANCED-MONITORING-README.md      # This file
└── [other project files...]

🔧 Configuration

All configuration is centralized in monitoring-config-v2.env with enhanced security and features:

# Server Configuration
SERVER_ALIAS="root@195.24.67.210"
WORKING_DIR="/var/www/workingtitle"
LOG_DIR="/var/log/workingtitle"

# Advanced Monitoring Settings
CHECK_INTERVAL=60
MAX_FAILURES=3
RECOVERY_ATTEMPTS=2
ALERT_EMAIL="text@workingtitle.ru"

# Resource Thresholds
DISK_USAGE_THRESHOLD=85
MEMORY_USAGE_THRESHOLD=90
CPU_USAGE_THRESHOLD=80

# Security Settings
ENABLE_SSL_CHECKS=true
ENABLE_EXTERNAL_CHECKS=true
ENABLE_PERFORMANCE_MONITORING=true

# Systemd Timer Configuration (replaces cron)
HEALTH_CHECK_INTERVAL="*:0/5"
LOG_ANALYSIS_INTERVAL="daily"
RANDOMIZED_DELAY=30

# Log Configuration
LOG_RETENTION_DAYS=7
LOG_ROTATION_DAYS=3
ANALYSIS_RETENTION_DAYS=30

🏥 Health Monitoring

Commands

# Start continuous monitoring
/var/www/workingtitle/health-monitor-v2.sh start

# Stop monitoring
/var/www/workingtitle/health-monitor-v2.sh stop

# Check current status
/var/www/workingtitle/health-monitor-v2.sh status

# Single health check
/var/www/workingtitle/health-monitor-v2.sh check

# Health check using systemd-run (recommended for timers)
/var/www/workingtitle/health-monitor-v2.sh check-systemd

Systemd Integration

The monitoring runs as a fully integrated systemd service with timers:

# Service management
sudo systemctl start workingtitle-monitor.service
sudo systemctl stop workingtitle-monitor.service
sudo systemctl restart workingtitle-monitor.service
sudo systemctl status workingtitle-monitor.service

# Timer management (replaces cron)
sudo systemctl start workingtitle-check.timer
sudo systemctl start workingtitle-analyze.timer
sudo systemctl list-timers | grep workingtitle

# View logs
journalctl -u workingtitle-monitor.service -f
journalctl -u workingtitle-monitor.service --since "1 hour ago"

📊 Log Analysis

Commands

# Comprehensive analysis
/var/www/workingtitle/log-analyzer-v2.sh analyze

# Analyze with custom time range
/var/www/workingtitle/log-analyzer-v2.sh analyze --since "1 hour ago"
/var/www/workingtitle/log-analyzer-v2.sh analyze --since "1 week ago"

# Search for specific patterns
/var/www/workingtitle/log-analyzer-v2.sh search --pattern "out of memory" --since "1 day ago"
/var/www/workingtitle/log-analyzer-v2.sh search --pattern "error" --since "2 hours ago" --max-results 20

# Focused analysis
/var/www/workingtitle/log-analyzer-v2.sh errors --since "1 day ago"
/var/www/workingtitle/log-analyzer-v2.sh performance --since "1 week ago"

# Generate comprehensive reports
/var/www/workingtitle/log-analyzer-v2.sh report --since "1 month ago"

Output Files

  • comprehensive-report-TIMESTAMP.json - Structured JSON report with jq processing
  • comprehensive-report-TIMESTAMP.txt - Human-readable report
  • errors-analysis.txt - Error pattern analysis
  • performance-analysis.txt - Performance metrics
  • search-results.txt - Search results with context

🔄 Automated Recovery

Commands

# Full system recovery
/var/www/workingtitle/auto-recovery-v2.sh full

# Specific recovery types
/var/www/workingtitle/auto-recovery-v2.sh containers
/var/www/workingtitle/auto-recovery-v2.sh database
/var/www/workingtitle/auto-recovery-v2.sh resources
/var/www/workingtitle/auto-recovery-v2.sh logs
/var/www/workingtitle/auto-recovery-v2.sh networking

📈 Monitoring Dashboard

Health Check Endpoints

  • Staging: http://195.24.67.210:3001/
  • Production: http://195.24.67.210:3000/

Log Locations

  • Systemd Journal: journalctl -u workingtitle-monitor.service
  • Application Logs: /var/log/workingtitle/
  • Container Logs: docker logs workingtitle_staging_app
  • Aggregated Logs: /var/log/workingtitle/aggregated.log

🚨 Alerting

Alert Types

  • CRITICAL: System failures requiring immediate attention
  • WARNING: Issues that need monitoring
  • INFO: Status updates and recoveries

Alert Channels

  • Systemd Journal: Primary logging mechanism with structured data
  • Email: Optional email alerts (configure in monitoring-config.env)
  • Console: Real-time console output with color coding
  • Centralized Logs: Aggregated log files for analysis

🔍 Troubleshooting

Common Issues

Health Check Fails

# Check container status
docker ps --filter name=workingtitle

# Check container logs
docker logs workingtitle_staging_app
docker logs workingtitle_prod_app

# Check systemd service
systemctl status workingtitle-monitor.service

# Check system resources
free -h
df -h

Monitoring Service Not Starting

# Check service status
systemctl status workingtitle-monitor.service

# Check service logs
journalctl -u workingtitle-monitor.service -n 50

# Check configuration
workingtitle-health check

# Restart service
systemctl restart workingtitle-monitor.service

Log Analysis Issues

# Check log directory permissions
ls -la /var/log/workingtitle/

# Run analysis with verbose output
workingtitle-logs analyze --since "1 hour ago" 2>&1 | tee analysis.log

# Check systemd journal
journalctl -u workingtitle-monitor.service --since "1 hour ago"

Debug Mode

# Enable debug logging
export DEBUG=1
workingtitle-health check

# Check systemd journal with debug info
journalctl -u workingtitle-monitor.service -f

📋 Maintenance

Daily Tasks

  • Monitor health check status: workingtitle-health status
  • Review error logs: workingtitle-logs errors --since "1 day ago"
  • Check resource usage: free -h && df -h

Weekly Tasks

  • Run comprehensive log analysis: workingtitle-logs analyze --since "1 week ago"
  • Review performance metrics: workingtitle-logs performance --since "1 week ago"
  • Clean up old log files: workingtitle-recovery logs

Monthly Tasks

  • Update monitoring thresholds in monitoring-config.env
  • Review and optimize recovery procedures
  • Generate monthly reports: workingtitle-logs report --since "1 month ago"
  • Update documentation

🔒 Security

Security Features

  • No SSH Dependencies: Eliminates SSH-related security risks
  • Systemd Security: Comprehensive security settings in service file
  • Limited Privileges: Restricted file system access
  • Process Isolation: Private temp directories and system call filtering
  • Resource Limits: Memory and CPU limits to prevent resource exhaustion

Best Practices

  • Regular security updates
  • Monitor access logs
  • Use strong authentication
  • Regular backup of configuration
  • Review systemd security settings

📚 Usage

Custom Health Checks

Add custom health checks by modifying health-monitor-advanced.sh:

check_custom_health() {
    # Your custom health check logic
    # Return 0 for success, 1 for failure
    return 0
}

Custom Alerts

Modify the send_alert function to add custom alert channels:

send_alert() {
    local message="$1"
    local severity="$2"

    # Add your custom alert logic here
    # e.g., Slack webhook, PagerDuty API, etc.
}

Integration with External Tools

The JSON reports can be integrated with: - Grafana: For visualization dashboards - Prometheus: For metrics collection - ELK Stack: For centralized logging - Splunk: For enterprise log analysis - Datadog: For APM and monitoring

🤝 Contributing

Adding New Checks

  1. Add check function to health-monitor-advanced.sh
  2. Update perform_health_check() function
  3. Test with workingtitle-health check
  4. Update documentation

Adding New Analysis

  1. Add analysis function to log-analyzer-advanced.sh
  2. Update main script logic
  3. Test with sample logs
  4. Update documentation

Adding New Templates

  1. Create template file in templates/ directory
  2. Update setup-monitoring-advanced.sh to process template
  3. Test template processing
  4. Update documentation

📞 Support

For issues or questions:

  1. Check the troubleshooting section
  2. Review systemd logs: journalctl -u workingtitle-monitor.service
  3. Run diagnostic commands
  4. Check configuration: cat /var/www/workingtitle/monitoring-config.env
  5. Create an issue with logs and configuration

🎯 Performance

Resource Usage

  • Memory: ~50MB for monitoring service
  • CPU: <1% average usage
  • Disk: ~100MB for logs per day
  • Network: Minimal (local operations only)

Scaling Considerations

  • Single Server: Optimized for single server deployment
  • Multiple Servers: Use centralized logging for multiple servers
  • High Load: Adjust thresholds in configuration
  • Large Logs: Use log rotation and aggregation

Note: This advanced monitoring system is designed for production use and includes comprehensive error handling, security features, automated recovery procedures, and enterprise-grade logging. Always test changes in a staging environment before deploying to production.