WorkingTitle Advanced Monitoring System

A production-ready, enterprise-grade monitoring system for the WorkingTitle application with comprehensive health monitoring, intelligent log analysis, automated recovery procedures, and centralized logging.

📋 System Overview

The system runs entirely on the server that hosts the WorkingTitle application and covers four areas: real-time health monitoring, log analysis, automated recovery, and centralized logging. The sections below describe each area, followed by installation, configuration, and day-to-day commands.

🎯 Key Features

🔍 Comprehensive Health Monitoring

Container Health Checks: Monitors both staging and production Docker containers with HTTP-based health verification
Database Monitoring: Tracks PostgreSQL database connectivity for both staging and production environments
Resource Monitoring: Monitors CPU, memory, and disk usage with configurable thresholds
SSL Certificate Monitoring: Automatically checks SSL certificate expiration and alerts when fewer than 30 days remain
External Service Monitoring: Verifies connectivity to critical external services (Google, GitHub, Docker Hub)
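
As a rough illustration of the SSL expiration check, the sketch below reads a certificate's `notAfter` date with `openssl` and compares the remaining days against the 30-day window mentioned above. The function names (`days_between`, `cert_days_remaining`, `check_ssl_expiry`) are hypothetical and not taken from health-monitor-v2.sh, and GNU `date` is assumed.

```shell
# Hypothetical helpers; names are illustrative, not from health-monitor-v2.sh.

# Whole days between two Unix timestamps (integer division, truncated toward zero).
days_between() {
    local now="$1" expiry="$2"
    echo $(( (expiry - now) / 86400 ))
}

# Print the days left on the certificate served by host:port.
cert_days_remaining() {
    local host="$1" port="${2:-443}" end_date expiry
    end_date=$(openssl s_client -connect "${host}:${port}" -servername "$host" \
        </dev/null 2>/dev/null | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
    [ -n "$end_date" ] || return 1
    expiry=$(date -d "$end_date" +%s) || return 1   # GNU date assumed
    days_between "$(date +%s)" "$expiry"
}

# Return 0 if the certificate has at least min_days left (default 30).
check_ssl_expiry() {
    local host="$1" min_days="${2:-30}" days
    days=$(cert_days_remaining "$host") || {
        echo "CRITICAL: cannot read certificate for $host"
        return 2
    }
    if [ "$days" -lt "$min_days" ]; then
        echo "WARNING: certificate for $host expires in $days days"
        return 1
    fi
    echo "OK: certificate for $host is valid for $days more days"
}
```

Calling `check_ssl_expiry example.com` (a placeholder host) would print one status line and return a non-zero code when the certificate is close to expiry or cannot be read.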

📊 Advanced Log Analysis

Centralized Log Collection: Aggregates logs from systemd journal, Docker containers, application logs, and nginx
JSON Processing: Uses jq to process JSON logs, streaming large files to keep memory usage low
Pattern Recognition: Advanced error and performance pattern detection
Time-Range Analysis: Flexible time-based log analysis with custom date ranges
Report Generation: Comprehensive JSON and human-readable reports
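
To make the jq-based processing more concrete, here is a minimal sketch that works on the line-delimited JSON produced by `journalctl -o json`. It relies on the standard journal fields `PRIORITY` and `MESSAGE`; the function names are hypothetical and the real log-analyzer-v2.sh may structure this differently. Using `inputs` with `reduce` lets jq handle one entry at a time instead of loading the whole log into memory.

```shell
# Count journal entries per syslog priority (0 = emerg ... 7 = debug).
# Reads one JSON object per line from stdin and prints a single JSON object.
summarize_priorities() {
    jq -c -S -n 'reduce inputs as $e ({}; .[$e.PRIORITY // "unknown"] += 1)'
}

# Print the MESSAGE of every entry with priority "err" (3) or more severe.
error_messages() {
    jq -r 'select(((.PRIORITY // "6") | tonumber) <= 3) | .MESSAGE'
}
```

A possible invocation is `journalctl -u workingtitle-monitor.service --since "1 hour ago" -o json | summarize_priorities`.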

🔄 Automated Recovery System

Intelligent Recovery: Multi-step recovery procedures for containers, databases, and system resources
Resource Cleanup: Automatic Docker resource cleanup and system maintenance
Network Recovery: Docker network recreation and connectivity restoration
Failure Tracking: Prevents infinite recovery loops with attempt limits
Rollback Support: Built-in backup and rollback capabilities

⚙️ Modern Systemd Integration

Service Management: Full systemd service integration with enhanced security settings
Timer-Based Scheduling: Replaces cron with systemd timers for better reliability
Centralized Logging: All logs captured by systemd journal with automatic rotation
Security Hardening: Comprehensive security settings including process isolation and resource limits

🛡️ Enterprise Security

No SSH Dependencies: Eliminates SSH-related security risks by running entirely on the server
Process Isolation: Private temp directories and system call filtering
Resource Limits: Memory and CPU limits to prevent resource exhaustion
Secure Credential Handling: Environment-based configuration management

🔄 How It Works

Step 1: Installation & Setup

Prerequisites Check: Validates required dependencies (docker, systemctl, journalctl, jq)
Backup Creation: Creates automatic backup of existing configuration before installation
File Deployment: Copies all monitoring scripts and configuration to /var/www/workingtitle/
Systemd Integration: Creates systemd service and timer files with enhanced security settings
Service Activation: Enables and starts monitoring services and timers
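
The prerequisites check in Step 1 could look roughly like the following. This is a sketch, not the code in setup-monitoring-v2.sh; `missing_commands` and `check_prerequisites` are hypothetical names, and only the four dependencies listed above are checked.

```shell
# Print each required command that is not found on PATH, one per line.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Return 0 only if docker, systemctl, journalctl, and jq are all available.
check_prerequisites() {
    local missing
    missing=$(missing_commands docker systemctl journalctl jq)
    if [ -n "$missing" ]; then
        echo "Missing dependencies: $(echo "$missing" | tr '\n' ' ')" >&2
        return 1
    fi
    echo "All prerequisites found"
}
```

Failing early with a list of every missing tool avoids a half-finished installation that stops at the first absent dependency.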

Step 2: Health Monitoring Loop

Container Checks: Verifies Docker containers are running and responding to HTTP requests
Database Verification: Tests PostgreSQL connectivity for both staging and production databases
Resource Analysis: Monitors CPU, memory, and disk usage against configurable thresholds
SSL Validation: Checks SSL certificate expiration dates and connectivity
External Connectivity: Tests connection to critical external services
Alert Generation: Sends alerts based on failure severity and consecutive failure counts
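
One way to track consecutive failures and derive an alert level from them is sketched below. The state file path, the function names, and the exact mapping from failure count to severity are assumptions for illustration; only `MAX_FAILURES` and the three alert levels come from this README.

```shell
# Hypothetical state file; the real script may track failures differently.
FAILURE_COUNT_FILE="${FAILURE_COUNT_FILE:-/tmp/workingtitle-failures.count}"
MAX_FAILURES="${MAX_FAILURES:-3}"

# Update the consecutive-failure counter from a check's exit status and print it.
record_check_result() {
    local status="$1" count
    count=$(cat "$FAILURE_COUNT_FILE" 2>/dev/null || echo 0)
    count="${count:-0}"
    if [ "$status" -eq 0 ]; then
        count=0                      # any success resets the streak
    else
        count=$((count + 1))
    fi
    echo "$count" > "$FAILURE_COUNT_FILE"
    echo "$count"
}

# Map the current streak to an alert level (assumed mapping).
alert_level_for() {
    local count="$1"
    if [ "$count" -eq 0 ]; then
        echo "INFO"
    elif [ "$count" -lt "$MAX_FAILURES" ]; then
        echo "WARNING"
    else
        echo "CRITICAL"
    fi
}
```

Resetting the counter on the first success means a single transient failure never escalates beyond a warning.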

Step 3: Automated Recovery

Failure Detection: Triggers when consecutive failures exceed the configured threshold
Resource Cleanup: Performs Docker resource cleanup and system maintenance
Network Recovery: Recreates Docker networks and restarts Docker daemon if needed
Database Recovery: Restarts PostgreSQL service and verifies database connectivity
Container Recovery: Stops, removes, and recreates containers with proper networking
Verification: Confirms all services are running and healthy after recovery
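
The attempt limit that keeps recovery from looping forever can be sketched as a small wrapper around any recovery command. `recover_with_limit` and `RECOVERY_DELAY` are hypothetical names; `RECOVERY_ATTEMPTS` mirrors the setting in monitoring-config-v2.env, and the actual logic in auto-recovery-v2.sh may differ.

```shell
RECOVERY_ATTEMPTS="${RECOVERY_ATTEMPTS:-2}"

# Run a recovery command until it succeeds or the attempt limit is reached.
# Usage: recover_with_limit <name> <command> [args...]
recover_with_limit() {
    local name="$1" attempt=1
    shift
    while [ "$attempt" -le "$RECOVERY_ATTEMPTS" ]; do
        echo "Recovery '$name': attempt $attempt of $RECOVERY_ATTEMPTS"
        if "$@"; then
            echo "Recovery '$name' succeeded"
            return 0
        fi
        sleep "${RECOVERY_DELAY:-5}"   # hypothetical pause between attempts
        attempt=$((attempt + 1))
    done
    echo "Recovery '$name' failed after $RECOVERY_ATTEMPTS attempts" >&2
    return 1
}
```

For example, `recover_with_limit containers docker restart workingtitle_staging_app` would try the restart at most `RECOVERY_ATTEMPTS` times and then give up with a non-zero exit code so that an alert can be raised.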

Step 4: Log Analysis & Reporting

Log Collection: Aggregates logs from systemd journal, Docker containers, and application files
JSON Processing: Converts all logs to structured JSON format using jq
Pattern Analysis: Identifies error patterns, performance issues, and system trends
Report Generation: Creates comprehensive JSON and human-readable reports
Data Retention: Manages log rotation and cleanup based on configured retention policies
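
The retention step can be approximated with `find`. The sketch below assumes that log files end in `.log` and live under `LOG_DIR`, which this README does not confirm; `cleanup_old_logs` is a hypothetical name.

```shell
LOG_DIR="${LOG_DIR:-/var/log/workingtitle}"
LOG_RETENTION_DAYS="${LOG_RETENTION_DAYS:-7}"

# Delete *.log files under a directory that were last modified more than N days ago.
# Prints each deleted path.
cleanup_old_logs() {
    local dir="${1:-$LOG_DIR}" days="${2:-$LOG_RETENTION_DAYS}"
    [ -d "$dir" ] || { echo "No such directory: $dir" >&2; return 1; }
    find "$dir" -type f -name '*.log' -mtime +"$days" -print -delete
}
```

Printing each path before deletion leaves a record in the journal of what was removed.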

Step 5: Continuous Operation

Timer-Based Execution: Uses systemd timers for scheduled health checks and log analysis
Centralized Logging: All output captured by systemd journal with automatic rotation
Resource Management: Monitors and manages system resources to prevent exhaustion
Security Enforcement: Maintains process isolation and security constraints throughout operation

🏗️ System Architecture

Core Components

setup-monitoring-v2.sh: Installation and configuration management script
health-monitor-v2.sh: Main health monitoring service with comprehensive checks
log-analyzer-v2.sh: Advanced log analysis and reporting engine
auto-recovery-v2.sh: Automated recovery and maintenance procedures
shared-functions.sh: Common utilities and functions used across all components
monitoring-config-v2.env: Centralized configuration management

Systemd Services

workingtitle-monitor.service: Main monitoring service with enhanced security settings
workingtitle-check.timer: Scheduled health checks (every 5 minutes with randomized delay)
workingtitle-analyze.timer: Daily log analysis and report generation
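
For orientation, a timer unit matching the "every 5 minutes with randomized delay" schedule might be generated as shown below. The unit contents are an assumption, not a copy of what setup-monitoring-v2.sh writes; in particular, the `workingtitle-check.service` target and the `write_check_timer` helper are hypothetical. `OnCalendar`, `RandomizedDelaySec`, and `Persistent` are standard systemd timer options.

```shell
HEALTH_CHECK_INTERVAL="${HEALTH_CHECK_INTERVAL:-*:0/5}"
RANDOMIZED_DELAY="${RANDOMIZED_DELAY:-30}"

# Write a timer unit for the scheduled health check into the given directory.
write_check_timer() {
    local unit_dir="${1:-/etc/systemd/system}"
    cat > "${unit_dir}/workingtitle-check.timer" <<EOF
[Unit]
Description=WorkingTitle scheduled health check

[Timer]
OnCalendar=${HEALTH_CHECK_INTERVAL}
RandomizedDelaySec=${RANDOMIZED_DELAY}
Persistent=true
# Assumption: the timer triggers a matching workingtitle-check.service
Unit=workingtitle-check.service

[Install]
WantedBy=timers.target
EOF
}
```

After writing or changing a unit file, `systemctl daemon-reload` is needed before the timer is enabled or restarted.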

Configuration Management

Centralized Config: All settings managed through monitoring-config-v2.env
Environment Variables: Secure credential handling through environment variables
Backup System: Automatic backup creation before any configuration changes
Rollback Support: Built-in rollback capabilities for failed installations
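
A backup-then-rollback flow for the configuration file could be sketched as follows. The function names and the timestamped `.bak` naming scheme are assumptions; the setup script's own backup format is not documented here.

```shell
# Copy a config file into a backup directory with a timestamp suffix
# and print the path of the backup.
backup_config() {
    local src="$1" backup_dir="$2" dest
    [ -f "$src" ] || { echo "Nothing to back up: $src" >&2; return 1; }
    mkdir -p "$backup_dir"
    dest="${backup_dir}/$(basename "$src").$(date +%Y%m%d-%H%M%S).bak"
    cp -p "$src" "$dest" && echo "$dest"
}

# Restore a previously created backup over the live file.
rollback_config() {
    local backup="$1" target="$2"
    cp -p "$backup" "$target"
}
```

Keeping the printed backup path in a variable makes it straightforward to roll back if a later installation step fails.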

Logging & Monitoring

Systemd Journal: Primary logging mechanism with automatic rotation
JSON Processing: All logs processed through jq for structured analysis
Multi-Source Collection: Aggregates logs from containers, applications, and system services
Report Generation: Automated generation of JSON and human-readable reports

🚀 Quick Start

1. Setup Monitoring System

# Copy the setup script to the server, then run it there (the script itself makes no SSH/SCP calls)
scp setup-monitoring-v2.sh root@195.24.67.210:/tmp/
ssh root@195.24.67.210 "chmod +x /tmp/setup-monitoring-v2.sh && /tmp/setup-monitoring-v2.sh install"

2. Manual Health Check

# Check system health
/var/www/workingtitle/health-monitor-v2.sh check

# Or using systemd-run for better isolation
/var/www/workingtitle/health-monitor-v2.sh check-systemd

3. Analyze Logs

# Comprehensive log analysis
/var/www/workingtitle/log-analyzer-v2.sh analyze

# Analyze with custom time range
/var/www/workingtitle/log-analyzer-v2.sh analyze --since "1 hour ago"

📁 File Structure V2

workingtitle_gen/
├── monitoring_system/
│   ├── setup-monitoring-v2.sh             # Fully local setup script (no SSH/SCP)
│   ├── health-monitor-v2.sh               # Enhanced health monitoring with SSL checks
│   ├── log-analyzer-v2.sh                 # Advanced log analysis with jq processing
│   ├── auto-recovery-v2.sh                # Enhanced automated recovery
│   ├── monitoring-config-v2.env           # Centralized configuration V2
│   ├── shared-functions.sh                # Shared functions (DRY principle)
│   └── ADVANCED-MONITORING-README.md      # This file
└── [other project files...]

🔧 Configuration

All configuration is centralized in monitoring-config-v2.env with enhanced security and features:

# Server Configuration
SERVER_ALIAS="root@195.24.67.210"
WORKING_DIR="/var/www/workingtitle"
LOG_DIR="/var/log/workingtitle"

# Advanced Monitoring Settings
CHECK_INTERVAL=60
MAX_FAILURES=3
RECOVERY_ATTEMPTS=2
ALERT_EMAIL="text@workingtitle.ru"

# Resource Thresholds
DISK_USAGE_THRESHOLD=85
MEMORY_USAGE_THRESHOLD=90
CPU_USAGE_THRESHOLD=80

# Security Settings
ENABLE_SSL_CHECKS=true
ENABLE_EXTERNAL_CHECKS=true
ENABLE_PERFORMANCE_MONITORING=true

# Systemd Timer Configuration (replaces cron)
HEALTH_CHECK_INTERVAL="*:0/5"
LOG_ANALYSIS_INTERVAL="daily"
RANDOMIZED_DELAY=30

# Log Configuration
LOG_RETENTION_DAYS=7
LOG_ROTATION_DAYS=3
ANALYSIS_RETENTION_DAYS=30
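
A script can pick up these settings by sourcing the file. The sketch below shows one common pattern, with fallback defaults and a basic sanity check on percentage thresholds; `load_config` and `is_valid_percentage` are hypothetical helpers rather than functions from shared-functions.sh.

```shell
# Source an env file so every variable it sets is exported, then apply defaults.
load_config() {
    local file="$1"
    [ -r "$file" ] || { echo "Cannot read config: $file" >&2; return 1; }
    set -a            # export everything defined while sourcing
    . "$file"
    set +a
    # Fall back to the documented values if the file leaves these unset
    : "${MAX_FAILURES:=3}" "${DISK_USAGE_THRESHOLD:=85}"
}

# Return 0 if the argument is a whole number between 0 and 100.
is_valid_percentage() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -le 100 ]
}
```

Typical use would be `load_config /var/www/workingtitle/monitoring-config-v2.env` followed by `is_valid_percentage "$CPU_USAGE_THRESHOLD"` before the value is used in a comparison.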

🏥 Health Monitoring

Commands

# Start continuous monitoring
/var/www/workingtitle/health-monitor-v2.sh start

# Stop monitoring
/var/www/workingtitle/health-monitor-v2.sh stop

# Check current status
/var/www/workingtitle/health-monitor-v2.sh status

# Single health check
/var/www/workingtitle/health-monitor-v2.sh check

# Health check using systemd-run (recommended for timers)
/var/www/workingtitle/health-monitor-v2.sh check-systemd

Systemd Integration

The monitoring runs as a fully integrated systemd service with timers:

# Service management
sudo systemctl start workingtitle-monitor.service
sudo systemctl stop workingtitle-monitor.service
sudo systemctl restart workingtitle-monitor.service
sudo systemctl status workingtitle-monitor.service

# Timer management (replaces cron)
sudo systemctl start workingtitle-check.timer
sudo systemctl start workingtitle-analyze.timer
sudo systemctl list-timers | grep workingtitle

# View logs
journalctl -u workingtitle-monitor.service -f
journalctl -u workingtitle-monitor.service --since "1 hour ago"

📊 Log Analysis

Commands

# Comprehensive analysis
/var/www/workingtitle/log-analyzer-v2.sh analyze

# Analyze with custom time range
/var/www/workingtitle/log-analyzer-v2.sh analyze --since "1 hour ago"
/var/www/workingtitle/log-analyzer-v2.sh analyze --since "1 week ago"

# Search for specific patterns
/var/www/workingtitle/log-analyzer-v2.sh search --pattern "out of memory" --since "1 day ago"
/var/www/workingtitle/log-analyzer-v2.sh search --pattern "error" --since "2 hours ago" --max-results 20

# Focused analysis
/var/www/workingtitle/log-analyzer-v2.sh errors --since "1 day ago"
/var/www/workingtitle/log-analyzer-v2.sh performance --since "1 week ago"

# Generate comprehensive reports
/var/www/workingtitle/log-analyzer-v2.sh report --since "1 month ago"

Output Files

comprehensive-report-TIMESTAMP.json - Structured JSON report with jq processing
comprehensive-report-TIMESTAMP.txt - Human-readable report
errors-analysis.txt - Error pattern analysis
performance-analysis.txt - Performance metrics
search-results.txt - Search results with context

🔄 Automated Recovery

Commands

# Full system recovery
/var/www/workingtitle/auto-recovery-v2.sh full

# Specific recovery types
/var/www/workingtitle/auto-recovery-v2.sh containers
/var/www/workingtitle/auto-recovery-v2.sh database
/var/www/workingtitle/auto-recovery-v2.sh resources
/var/www/workingtitle/auto-recovery-v2.sh logs
/var/www/workingtitle/auto-recovery-v2.sh networking

📈 Monitoring Dashboard

Health Check Endpoints

Staging: http://195.24.67.210:3001/
Production: http://195.24.67.210:3000/
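
An HTTP-based check against these endpoints can be sketched with `curl`. Treating any 2xx or 3xx response as healthy is an assumption, as are the function names; health-monitor-v2.sh may apply stricter rules.

```shell
# Treat any 2xx or 3xx status code as healthy (assumed policy).
is_healthy_status() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Fetch a URL and report whether it answered with a healthy status code.
check_endpoint() {
    local name="$1" url="$2" code
    # curl prints 000 as the status when the connection fails or times out
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || true
    if is_healthy_status "$code"; then
        echo "OK: $name responded with HTTP $code"
    else
        echo "FAIL: $name responded with HTTP $code"
        return 1
    fi
}
```

For instance, `check_endpoint staging http://195.24.67.210:3001/` and `check_endpoint production http://195.24.67.210:3000/` would each print one status line.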

Log Locations

Systemd Journal: journalctl -u workingtitle-monitor.service
Application Logs: /var/log/workingtitle/
Container Logs: docker logs workingtitle_staging_app
Aggregated Logs: /var/log/workingtitle/aggregated.log

🚨 Alerting

Alert Types

CRITICAL: System failures requiring immediate attention
WARNING: Issues that need monitoring
INFO: Status updates and recoveries

Alert Channels

Systemd Journal: Primary logging mechanism with structured data
Email: Optional email alerts (configure in monitoring-config-v2.env)
Console: Real-time console output with color coding
Centralized Logs: Aggregated log files for analysis

🔍 Troubleshooting

Common Issues

Health Check Fails

# Check container status
docker ps --filter name=workingtitle

# Check container logs
docker logs workingtitle_staging_app
docker logs workingtitle_prod_app

# Check systemd service
systemctl status workingtitle-monitor.service

# Check system resources
free -h
df -h

Monitoring Service Not Starting

# Check service status
systemctl status workingtitle-monitor.service

# Check service logs
journalctl -u workingtitle-monitor.service -n 50

# Check configuration
workingtitle-health check

# Restart service
systemctl restart workingtitle-monitor.service

Log Analysis Issues

# Check log directory permissions
ls -la /var/log/workingtitle/

# Run analysis with verbose output
workingtitle-logs analyze --since "1 hour ago" 2>&1 | tee analysis.log

# Check systemd journal
journalctl -u workingtitle-monitor.service --since "1 hour ago"

Debug Mode

# Enable debug logging
export DEBUG=1
workingtitle-health check

# Check systemd journal with debug info
journalctl -u workingtitle-monitor.service -f

📋 Maintenance

Daily Tasks

Monitor health check status: workingtitle-health status
Review error logs: workingtitle-logs errors --since "1 day ago"
Check resource usage: free -h && df -h

Weekly Tasks

Run comprehensive log analysis: workingtitle-logs analyze --since "1 week ago"
Review performance metrics: workingtitle-logs performance --since "1 week ago"
Clean up old log files: workingtitle-recovery logs

Monthly Tasks

Update monitoring thresholds in monitoring-config-v2.env
Review and optimize recovery procedures
Generate monthly reports: workingtitle-logs report --since "1 month ago"
Update documentation

🔒 Security

Security Features

No SSH Dependencies: Eliminates SSH-related security risks
Systemd Security: Comprehensive security settings in service file
Limited Privileges: Restricted file system access
Process Isolation: Private temp directories and system call filtering
Resource Limits: Memory and CPU limits to prevent resource exhaustion
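
The kind of systemd settings these features refer to can be illustrated with a drop-in file. The directives below are standard systemd options, but the selection, the paths in `ReadWritePaths`, and the limit values are assumptions chosen for illustration, not the contents of the shipped workingtitle-monitor.service. `write_hardening_dropin` is a hypothetical helper.

```shell
# Write a drop-in with hardening and resource limits for the monitor service.
write_hardening_dropin() {
    local dropin_dir="${1:-/etc/systemd/system/workingtitle-monitor.service.d}"
    mkdir -p "$dropin_dir"
    cat > "${dropin_dir}/hardening.conf" <<'EOF'
[Service]
# Process isolation
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/log/workingtitle /var/www/workingtitle
# System call filtering
SystemCallFilter=@system-service
# Resource limits (illustrative values)
MemoryMax=256M
CPUQuota=25%
EOF
}
```

After `systemctl daemon-reload`, running `systemd-analyze security workingtitle-monitor.service` gives an overview of the sandboxing that is in effect. Settings this strict should be tried in staging first, since recovery actions that touch Docker or PostgreSQL may need additional access.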

Best Practices

Regular security updates
Monitor access logs
Use strong authentication
Regular backup of configuration
Review systemd security settings

📚 Usage

Custom Health Checks

Add custom health checks by modifying health-monitor-v2.sh:

check_custom_health() {
    # Your custom health check logic
    # Return 0 for success, 1 for failure
    return 0
}
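
As a concrete example of the template above, a custom check for disk usage might look like this. It is only a sketch: `disk_usage_percent` and `check_disk_health` are hypothetical names, and the existing resource monitoring in health-monitor-v2.sh may already cover this case in a different way.

```shell
DISK_USAGE_THRESHOLD="${DISK_USAGE_THRESHOLD:-85}"

# Print the usage percentage (without the % sign) of the filesystem holding a path.
disk_usage_percent() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 while disk usage stays at or below the threshold, 1 otherwise.
check_disk_health() {
    local path="${1:-/}" usage
    usage=$(disk_usage_percent "$path") || return 1
    if [ "$usage" -gt "$DISK_USAGE_THRESHOLD" ]; then
        echo "Disk usage on $path is ${usage}% (threshold ${DISK_USAGE_THRESHOLD}%)"
        return 1
    fi
    return 0
}
```

The `-P` flag asks `df` for the POSIX output format, which keeps each filesystem on a single line and makes the `awk` field positions predictable.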

Custom Alerts

Modify the send_alert function to add custom alert channels:

send_alert() {
    local message="$1"
    local severity="$2"

    # Add your custom alert logic here
    # e.g., Slack webhook, PagerDuty API, etc.
}
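
A possible extension of `send_alert` that posts to a Slack-style incoming webhook is sketched below. `ALERT_WEBHOOK_URL` is a hypothetical setting that does not exist in monitoring-config-v2.env, the helper functions are illustrative, and the escaping only handles single-line messages.

```shell
# Escape backslashes and double quotes so a single-line string is safe inside JSON.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a minimal webhook payload of the form {"text":"[SEVERITY] message"}.
build_alert_payload() {
    local message="$1" severity="$2"
    printf '{"text":"[%s] %s"}' "$(json_escape "$severity")" "$(json_escape "$message")"
}

send_alert() {
    local message="$1" severity="$2"
    printf '[%s] %s\n' "$severity" "$message"       # console/journal output
    # ALERT_WEBHOOK_URL is a hypothetical setting; skip the webhook if it is unset
    [ -n "${ALERT_WEBHOOK_URL:-}" ] || return 0
    curl -s -o /dev/null --max-time 10 -H 'Content-Type: application/json' \
        -d "$(build_alert_payload "$message" "$severity")" "$ALERT_WEBHOOK_URL"
}
```

With the variable unset, `send_alert "Production container restarted" INFO` only prints the message, so the function stays safe to call on servers without a webhook configured.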

Integration with External Tools

The JSON reports can be integrated with:

Grafana: For visualization dashboards
Prometheus: For metrics collection
ELK Stack: For centralized logging
Splunk: For enterprise log analysis
Datadog: For APM and monitoring

🤝 Contributing

Adding New Checks

Add check function to health-monitor-v2.sh
Update perform_health_check() function
Test with workingtitle-health check
Update documentation

Adding New Analysis

Add analysis function to log-analyzer-v2.sh
Update main script logic
Test with sample logs
Update documentation

Adding New Templates

Create template file in templates/ directory
Update setup-monitoring-v2.sh to process template
Test template processing
Update documentation

📞 Support

For issues or questions:

Check the troubleshooting section
Review systemd logs: journalctl -u workingtitle-monitor.service
Run diagnostic commands
Check configuration: cat /var/www/workingtitle/monitoring-config-v2.env
Create an issue with logs and configuration

🎯 Performance

Resource Usage

Memory: ~50MB for monitoring service
CPU: <1% average usage
Disk: ~100MB of logs per day
Network: Minimal (local operations only)

Scaling Considerations

Single Server: Optimized for single server deployment
Multiple Servers: Use centralized logging for multiple servers
High Load: Adjust thresholds in configuration
Large Logs: Use log rotation and aggregation

Note: This advanced monitoring system is designed for production use and includes comprehensive error handling, security features, automated recovery procedures, and enterprise-grade logging. Always test changes in a staging environment before deploying to production.