WorkingTitle Advanced Monitoring System¶
A production-ready, enterprise-grade monitoring system for the WorkingTitle application with comprehensive health monitoring, intelligent log analysis, automated recovery procedures, and centralized logging.
📋 System Overview¶
The WorkingTitle Advanced Monitoring System is a comprehensive, enterprise-grade solution designed to ensure the reliability and performance of your WorkingTitle application. This system provides real-time health monitoring, intelligent log analysis, automated recovery procedures, and centralized logging capabilities.
🎯 Key Features¶
🔍 Comprehensive Health Monitoring
- Container Health Checks: Monitors both staging and production Docker containers with HTTP-based health verification
- Database Monitoring: Tracks PostgreSQL database connectivity for both staging and production environments
- Resource Monitoring: Monitors CPU, memory, and disk usage with configurable thresholds
- SSL Certificate Monitoring: Automatically checks SSL certificate expiration (alerts when < 30 days)
- External Service Monitoring: Verifies connectivity to critical external services (Google, GitHub, Docker Hub)
📊 Advanced Log Analysis
- Centralized Log Collection: Aggregates logs from systemd journal, Docker containers, application logs, and nginx
- JSON Processing: Uses
jqfor efficient JSON log processing with memory-efficient stream processing for large files - Pattern Recognition: Advanced error and performance pattern detection
- Time-Range Analysis: Flexible time-based log analysis with custom date ranges
- Report Generation: Comprehensive JSON and human-readable reports
🔄 Automated Recovery System
- Intelligent Recovery: Multi-step recovery procedures for containers, databases, and system resources
- Resource Cleanup: Automatic Docker resource cleanup and system maintenance
- Network Recovery: Docker network recreation and connectivity restoration
- Failure Tracking: Prevents infinite recovery loops with attempt limits
- Rollback Support: Built-in backup and rollback capabilities
⚙️ Modern Systemd Integration
- Service Management: Full systemd service integration with enhanced security settings
- Timer-Based Scheduling: Replaces cron with systemd timers for better reliability
- Centralized Logging: All logs captured by systemd journal with automatic rotation
- Security Hardening: Comprehensive security settings including process isolation and resource limits
🛡️ Enterprise Security
- No SSH Dependencies: Eliminates SSH-related security risks by running entirely on the server
- Process Isolation: Private temp directories and system call filtering
- Resource Limits: Memory and CPU limits to prevent resource exhaustion
- Secure Credential Handling: Environment-based configuration management
🔄 How It Works¶
Step 1: Installation & Setup
- Prerequisites Check: Validates required dependencies (docker, systemctl, journalctl, jq)
- Backup Creation: Creates automatic backup of existing configuration before installation
- File Deployment: Copies all monitoring scripts and configuration to
/var/www/workingtitle/ - Systemd Integration: Creates systemd service and timer files with enhanced security settings
- Service Activation: Enables and starts monitoring services and timers
Step 2: Health Monitoring Loop
- Container Checks: Verifies Docker containers are running and responding to HTTP requests
- Database Verification: Tests PostgreSQL connectivity for both staging and production databases
- Resource Analysis: Monitors CPU, memory, and disk usage against configurable thresholds
- SSL Validation: Checks SSL certificate expiration dates and connectivity
- External Connectivity: Tests connection to critical external services
- Alert Generation: Sends alerts based on failure severity and consecutive failure counts
Step 3: Automated Recovery
- Failure Detection: Triggers when consecutive failures exceed the configured threshold
- Resource Cleanup: Performs Docker resource cleanup and system maintenance
- Network Recovery: Recreates Docker networks and restarts Docker daemon if needed
- Database Recovery: Restarts PostgreSQL service and verifies database connectivity
- Container Recovery: Stops, removes, and recreates containers with proper networking
- Verification: Confirms all services are running and healthy after recovery
Step 4: Log Analysis & Reporting
- Log Collection: Aggregates logs from systemd journal, Docker containers, and application files
- JSON Processing: Converts all logs to structured JSON format using
jq - Pattern Analysis: Identifies error patterns, performance issues, and system trends
- Report Generation: Creates comprehensive JSON and human-readable reports
- Data Retention: Manages log rotation and cleanup based on configured retention policies
Step 5: Continuous Operation
- Timer-Based Execution: Uses systemd timers for scheduled health checks and log analysis
- Centralized Logging: All output captured by systemd journal with automatic rotation
- Resource Management: Monitors and manages system resources to prevent exhaustion
- Security Enforcement: Maintains process isolation and security constraints throughout operation
🏗️ System Architecture¶
Core Components
setup-monitoring-v2.sh: Installation and configuration management scripthealth-monitor-v2.sh: Main health monitoring service with comprehensive checkslog-analyzer-v2.sh: Advanced log analysis and reporting engineauto-recovery-v2.sh: Automated recovery and maintenance proceduresshared-functions.sh: Common utilities and functions used across all componentsmonitoring-config-v2.env: Centralized configuration management
Systemd Services
workingtitle-monitor.service: Main monitoring service with enhanced security settingsworkingtitle-check.timer: Scheduled health checks (every 5 minutes with randomized delay)workingtitle-analyze.timer: Daily log analysis and report generation
Configuration Management
- Centralized Config: All settings managed through
monitoring-config-v2.env - Environment Variables: Secure credential handling through environment variables
- Backup System: Automatic backup creation before any configuration changes
- Rollback Support: Built-in rollback capabilities for failed installations
Logging & Monitoring
- Systemd Journal: Primary logging mechanism with automatic rotation
- JSON Processing: All logs processed through
jqfor structured analysis - Multi-Source Collection: Aggregates logs from containers, applications, and system services
- Report Generation: Automated generation of JSON and human-readable reports
🚀 Quick Start¶
1. Setup Monitoring System¶
# Copy the setup script to the server and run it locally
scp setup-monitoring-v2.sh root@195.24.67.210:/tmp/
ssh root@195.24.67.210 "chmod +x /tmp/setup-monitoring-v2.sh && /tmp/setup-monitoring-v2.sh install"
2. Manual Health Check¶
# Check system health
/var/www/workingtitle/health-monitor-v2.sh check
# Or using systemd-run for better isolation
/var/www/workingtitle/health-monitor-v2.sh check-systemd
3. Analyze Logs¶
# Comprehensive log analysis
/var/www/workingtitle/log-analyzer-v2.sh analyze
# Analyze with custom time range
/var/www/workingtitle/log-analyzer-v2.sh analyze --since "1 hour ago"
📁 File Structure V2¶
workingtitle_gen/
├── monitoring_system/
│ ├── setup-monitoring-v2.sh # Fully local setup script (no SSH/SCP)
│ ├── health-monitor-v2.sh # Enhanced health monitoring with SSL checks
│ ├── log-analyzer-v2.sh # Advanced log analysis with jq processing
│ ├── auto-recovery-v2.sh # Enhanced automated recovery
│ ├── monitoring-config-v2.env # Centralized configuration V2
│ ├── shared-functions.sh # Shared functions (DRY principle)
│ └── ADVANCED-MONITORING-README.md # This file
└── [other project files...]
🔧 Configuration¶
All configuration is centralized in monitoring-config-v2.env with enhanced security and features:
# Server Configuration
SERVER_ALIAS="root@195.24.67.210"
WORKING_DIR="/var/www/workingtitle"
LOG_DIR="/var/log/workingtitle"
# Advanced Monitoring Settings
CHECK_INTERVAL=60
MAX_FAILURES=3
RECOVERY_ATTEMPTS=2
ALERT_EMAIL="text@workingtitle.ru"
# Resource Thresholds
DISK_USAGE_THRESHOLD=85
MEMORY_USAGE_THRESHOLD=90
CPU_USAGE_THRESHOLD=80
# Security Settings
ENABLE_SSL_CHECKS=true
ENABLE_EXTERNAL_CHECKS=true
ENABLE_PERFORMANCE_MONITORING=true
# Systemd Timer Configuration (replaces cron)
HEALTH_CHECK_INTERVAL="*:0/5"
LOG_ANALYSIS_INTERVAL="daily"
RANDOMIZED_DELAY=30
# Log Configuration
LOG_RETENTION_DAYS=7
LOG_ROTATION_DAYS=3
ANALYSIS_RETENTION_DAYS=30
🏥 Health Monitoring¶
Commands¶
# Start continuous monitoring
/var/www/workingtitle/health-monitor-v2.sh start
# Stop monitoring
/var/www/workingtitle/health-monitor-v2.sh stop
# Check current status
/var/www/workingtitle/health-monitor-v2.sh status
# Single health check
/var/www/workingtitle/health-monitor-v2.sh check
# Health check using systemd-run (recommended for timers)
/var/www/workingtitle/health-monitor-v2.sh check-systemd
Systemd Integration¶
The monitoring runs as a fully integrated systemd service with timers:
# Service management
sudo systemctl start workingtitle-monitor.service
sudo systemctl stop workingtitle-monitor.service
sudo systemctl restart workingtitle-monitor.service
sudo systemctl status workingtitle-monitor.service
# Timer management (replaces cron)
sudo systemctl start workingtitle-check.timer
sudo systemctl start workingtitle-analyze.timer
sudo systemctl list-timers | grep workingtitle
# View logs
journalctl -u workingtitle-monitor.service -f
journalctl -u workingtitle-monitor.service --since "1 hour ago"
📊 Log Analysis¶
Commands¶
# Comprehensive analysis
/var/www/workingtitle/log-analyzer-v2.sh analyze
# Analyze with custom time range
/var/www/workingtitle/log-analyzer-v2.sh analyze --since "1 hour ago"
/var/www/workingtitle/log-analyzer-v2.sh analyze --since "1 week ago"
# Search for specific patterns
/var/www/workingtitle/log-analyzer-v2.sh search --pattern "out of memory" --since "1 day ago"
/var/www/workingtitle/log-analyzer-v2.sh search --pattern "error" --since "2 hours ago" --max-results 20
# Focused analysis
/var/www/workingtitle/log-analyzer-v2.sh errors --since "1 day ago"
/var/www/workingtitle/log-analyzer-v2.sh performance --since "1 week ago"
# Generate comprehensive reports
/var/www/workingtitle/log-analyzer-v2.sh report --since "1 month ago"
Output Files¶
comprehensive-report-TIMESTAMP.json- Structured JSON report withjqprocessingcomprehensive-report-TIMESTAMP.txt- Human-readable reporterrors-analysis.txt- Error pattern analysisperformance-analysis.txt- Performance metricssearch-results.txt- Search results with context
🔄 Automated Recovery¶
Commands¶
# Full system recovery
/var/www/workingtitle/auto-recovery-v2.sh full
# Specific recovery types
/var/www/workingtitle/auto-recovery-v2.sh containers
/var/www/workingtitle/auto-recovery-v2.sh database
/var/www/workingtitle/auto-recovery-v2.sh resources
/var/www/workingtitle/auto-recovery-v2.sh logs
/var/www/workingtitle/auto-recovery-v2.sh networking
📈 Monitoring Dashboard¶
Health Check Endpoints¶
- Staging:
http://195.24.67.210:3001/ - Production:
http://195.24.67.210:3000/
Log Locations¶
- Systemd Journal:
journalctl -u workingtitle-monitor.service - Application Logs:
/var/log/workingtitle/ - Container Logs:
docker logs workingtitle_staging_app - Aggregated Logs:
/var/log/workingtitle/aggregated.log
🚨 Alerting¶
Alert Types¶
- CRITICAL: System failures requiring immediate attention
- WARNING: Issues that need monitoring
- INFO: Status updates and recoveries
Alert Channels¶
- Systemd Journal: Primary logging mechanism with structured data
- Email: Optional email alerts (configure in
monitoring-config.env) - Console: Real-time console output with color coding
- Centralized Logs: Aggregated log files for analysis
🔍 Troubleshooting¶
Common Issues¶
Health Check Fails¶
# Check container status
docker ps --filter name=workingtitle
# Check container logs
docker logs workingtitle_staging_app
docker logs workingtitle_prod_app
# Check systemd service
systemctl status workingtitle-monitor.service
# Check system resources
free -h
df -h
Monitoring Service Not Starting¶
# Check service status
systemctl status workingtitle-monitor.service
# Check service logs
journalctl -u workingtitle-monitor.service -n 50
# Check configuration
workingtitle-health check
# Restart service
systemctl restart workingtitle-monitor.service
Log Analysis Issues¶
# Check log directory permissions
ls -la /var/log/workingtitle/
# Run analysis with verbose output
workingtitle-logs analyze --since "1 hour ago" 2>&1 | tee analysis.log
# Check systemd journal
journalctl -u workingtitle-monitor.service --since "1 hour ago"
Debug Mode¶
# Enable debug logging
export DEBUG=1
workingtitle-health check
# Check systemd journal with debug info
journalctl -u workingtitle-monitor.service -f
📋 Maintenance¶
Daily Tasks¶
- Monitor health check status:
workingtitle-health status - Review error logs:
workingtitle-logs errors --since "1 day ago" - Check resource usage:
free -h && df -h
Weekly Tasks¶
- Run comprehensive log analysis:
workingtitle-logs analyze --since "1 week ago" - Review performance metrics:
workingtitle-logs performance --since "1 week ago" - Clean up old log files:
workingtitle-recovery logs
Monthly Tasks¶
- Update monitoring thresholds in
monitoring-config.env - Review and optimize recovery procedures
- Generate monthly reports:
workingtitle-logs report --since "1 month ago" - Update documentation
🔒 Security¶
Security Features¶
- No SSH Dependencies: Eliminates SSH-related security risks
- Systemd Security: Comprehensive security settings in service file
- Limited Privileges: Restricted file system access
- Process Isolation: Private temp directories and system call filtering
- Resource Limits: Memory and CPU limits to prevent resource exhaustion
Best Practices¶
- Regular security updates
- Monitor access logs
- Use strong authentication
- Regular backup of configuration
- Review systemd security settings
📚 Usage¶
Custom Health Checks¶
Add custom health checks by modifying health-monitor-advanced.sh:
check_custom_health() {
# Your custom health check logic
# Return 0 for success, 1 for failure
return 0
}
Custom Alerts¶
Modify the send_alert function to add custom alert channels:
send_alert() {
local message="$1"
local severity="$2"
# Add your custom alert logic here
# e.g., Slack webhook, PagerDuty API, etc.
}
Integration with External Tools¶
The JSON reports can be integrated with: - Grafana: For visualization dashboards - Prometheus: For metrics collection - ELK Stack: For centralized logging - Splunk: For enterprise log analysis - Datadog: For APM and monitoring
🤝 Contributing¶
Adding New Checks¶
- Add check function to
health-monitor-advanced.sh - Update
perform_health_check()function - Test with
workingtitle-health check - Update documentation
Adding New Analysis¶
- Add analysis function to
log-analyzer-advanced.sh - Update main script logic
- Test with sample logs
- Update documentation
Adding New Templates¶
- Create template file in
templates/directory - Update
setup-monitoring-advanced.shto process template - Test template processing
- Update documentation
📞 Support¶
For issues or questions:
- Check the troubleshooting section
- Review systemd logs:
journalctl -u workingtitle-monitor.service - Run diagnostic commands
- Check configuration:
cat /var/www/workingtitle/monitoring-config.env - Create an issue with logs and configuration
🎯 Performance¶
Resource Usage¶
- Memory: ~50MB for monitoring service
- CPU: <1% average usage
- Disk: ~100MB for logs per day
- Network: Minimal (local operations only)
Scaling Considerations¶
- Single Server: Optimized for single server deployment
- Multiple Servers: Use centralized logging for multiple servers
- High Load: Adjust thresholds in configuration
- Large Logs: Use log rotation and aggregation
Note: This advanced monitoring system is designed for production use and includes comprehensive error handling, security features, automated recovery procedures, and enterprise-grade logging. Always test changes in a staging environment before deploying to production.