Health Checks in Zeet
Health checks are critical tools for managing the lifecycle and reliability of applications in Kubernetes environments. They provide mechanisms to automatically monitor application health and make decisions based on the readiness and liveness of pods. This document explains what health checks are and what types exist, illustrates their use through examples, and addresses common issues and their solutions.
What Are Health Checks?
Health checks are automated tests conducted by the Kubernetes system on running containers to assess if they are operating as expected. These checks are essential for applications requiring high availability, as they ensure traffic is directed only to instances ready and capable of handling requests. When a container fails a health check, Kubernetes can automatically restart it or halt traffic to it, depending on the health check type that failed.
Types of Health Checks
Kubernetes supports three primary types of health checks:
1. Startup Checks
Startup checks determine if an application within a container has started correctly. They are vital for applications with a slow initialization process. By employing startup checks, Kubernetes delays marking the container as started and postpones liveness or readiness checks until the application is ready.
- Useful for: Preventing applications with lengthy startup times from being prematurely considered unhealthy or terminated by Kubernetes.
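As a sketch, a standard Kubernetes startup probe might look like the following; the endpoint path, port, and timing values are illustrative placeholders, not Zeet defaults:

```yaml
# Hypothetical startup probe: Kubernetes holds off liveness and readiness
# checks until this succeeds, and restarts the container if it keeps
# failing past failureThreshold * periodSeconds.
startupProbe:
  httpGet:
    path: /startup   # assumed endpoint; use whatever your app exposes
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to ~300s for slow initialization
```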
2. Liveness Checks
Liveness checks allow Kubernetes to ascertain if a container is running. Should a liveness check fail, Kubernetes restarts the container, offering a self-healing mechanism to rectify issues like deadlocks or memory leaks.
- Useful for: Ensuring application availability by restarting containers that have ceased to function correctly.
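A minimal liveness probe sketch, assuming the application serves a `/health-check` endpoint on port `8080` as in the scenarios below:

```yaml
# Hypothetical liveness probe: three consecutive failures mark the
# container unhealthy and trigger a restart.
livenessProbe:
  httpGet:
    path: /health-check
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
```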
3. Readiness Checks
Readiness checks determine if a container can accept traffic. Containers failing these checks are removed from service endpoints, ensuring only traffic-ready containers handle requests.
- Useful for: Regulating traffic to containers, guaranteeing that clients access only those prepared to serve requests.
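A readiness probe sketch under the same assumptions (the `/ready` path is a placeholder). Note that a failing readiness probe removes the pod from Service endpoints but does not restart it:

```yaml
# Hypothetical readiness probe: while it fails, traffic is withheld
# from the pod; the container keeps running.
readinessProbe:
  httpGet:
    path: /ready   # assumed endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 1
```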
Sample Health Check Scenarios
Production Environment
- Startup Check: An HTTP endpoint verifies the application's readiness to serve, ensuring it doesn't start receiving traffic prematurely.
- Liveness Check: An HTTP endpoint `/health-check` confirms vital application processes are operational. Failing this check triggers a container restart to remedy potential issues.
- Readiness Check: An HTTP endpoint conducts a swift, simple check to confirm the application's capacity to manage new requests, crucial for traffic management during deployments or dynamic scaling.
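The production setup above can be sketched as a single container spec combining all three checks. The image name and timing values are placeholders; only the `/health-check` path comes from the scenario itself:

```yaml
# Sketch of a production container with all three probe types.
containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    startupProbe:
      httpGet: { path: /health-check, port: 8080 }
      periodSeconds: 5
      failureThreshold: 12    # up to ~60s to finish starting
    livenessProbe:
      httpGet: { path: /health-check, port: 8080 }
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet: { path: /health-check, port: 8080 }
      periodSeconds: 5
      failureThreshold: 1
```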
Staging Environment
- Startup Check: Configured similarly to production, reflecting the staging environment's role as a production proxy.
- Liveness Check: Mirrors production with an HTTP endpoint `/health-check` to ensure the application's operational status.
- Readiness Check: Also akin to production, verifying the application's readiness to handle requests.
Development Environment
- Startup Check: Optionally disabled to accelerate container startups for development.
- Liveness Check: Optionally disabled, with container crashes serving as the restart trigger.
- Readiness Check: Typically disabled or simplified, given the testing nature of development environments and the non-critical need for high availability.
Common Issues and Solutions
Application Goes Offline During Deployment
- Cause: Inadequate startup check configuration leads to premature traffic handling.
- Solution: Implement startup checks to maintain service continuity until the new version is fully operational and traffic-ready.
Request Failures Due to Unresponsive Containers
- Cause: Absence or misconfiguration of liveness checks, resulting in non-responsive containers not being restarted.
- Solution: Utilize liveness checks to ensure automatic container restarts, addressing non-functionality issues.
Deployment Downtime Despite Health Checks
- Observation: Deployments might still face brief downtime, signaled by Bad Gateway errors or asset loading failures.
- Solution: Fine-tune health check parameters like timeout durations or initial delays to match application startup characteristics. Consider a pre-termination timeout to finalize ongoing requests before container termination.
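One common way to realize a pre-termination timeout in Kubernetes (values are illustrative) is a `preStop` sleep combined with a termination grace period:

```yaml
# Illustrative graceful-shutdown settings: the preStop sleep gives load
# balancers time to stop routing to the pod before SIGTERM arrives, and
# the grace period bounds total shutdown time.
terminationGracePeriodSeconds: 45
containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]   # drain in-flight requests first
```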
Sample Configuration with Math: Startup Check
An example startup health check configuration demonstrates how parameters affect the timing for determining container health status.
Configuration Example
- Probe: HTTP GET request to `/start-check` on port `8080`
- InitialDelaySeconds: `10`
- PeriodSeconds: `5`
- TimeoutSeconds: `2`
- FailureThreshold: `3`
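Expressed as a standard Kubernetes probe stanza, the same configuration reads:

```yaml
startupProbe:
  httpGet:
    path: /start-check
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
```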
Understanding the Math
- Time Before First Check: 10 seconds after container start, the delay before the probe first runs.
- Check Interval: 5 seconds, the frequency of health check execution after the initial delay.
- Timeout: 2 seconds, how long the probe waits for a response before counting the attempt as a failure.
- Failure Threshold: 3, the number of consecutive failed attempts before the container is deemed unhealthy and subject to restart.
Healthiness and Unhealthiness Calculation
- Time to Healthy: If the first check succeeds, the container is considered healthy right after the initial delay, i.e. about 10 seconds after start.
- Time to Unhealthy/Restart: Maximum duration before restart ≈ initial delay (10s) + (3 failed attempts × 5s interval) = 25 seconds. This is an upper bound; the exact moment depends on when within each period the failing probes fire.
Adjust these parameters based on your application's startup behavior to minimize downtime and optimize user experience.