Resiliency in Microservices: A Guide to Circuit Breaker Pattern

Jan 7, 2024

The Circuit Breaker pattern with fallback is a strategy used in microservices architecture to improve system resilience. Picture it as a safety net for your services. When a critical service, such as a product catalog, starts failing or slowing down under high traffic or errors, the circuit breaker trips and temporarily routes calls to a backup plan: the fallback mechanism. The fallback can display cached data or a simple message to users, which keeps one failing service from taking the whole system down. After a wait period, the circuit breaker checks whether the troubled service has recovered and automatically switches back to normal operation when it has. In short, it is a way to handle hiccups in microservices so that users get a smoother experience even when certain services hit a rough patch.

Scenario and Problem Explanation:

In a distributed environment, applications often make calls to remote resources or services. These calls may encounter faults, which can be transient or long-lasting. The Retry pattern is commonly used for transient faults, allowing the application to retry the operation until it succeeds. However, there are situations where continuous retries are not effective, and a more strategic approach is needed.

Scenario 1: Transient Faults and Retry Pattern: Transient faults are short-lived issues, like network hiccups or temporary unavailability of resources. The Retry pattern is employed to handle such faults by automatically retrying the operation for a certain period.

Example Scenario: A cloud application makes a request to a database, but due to a temporary spike in network latency, the initial request fails. The Retry pattern allows the application to retry the database operation until the transient issue resolves.
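A minimal sketch of the Retry pattern described above, assuming a bounded number of attempts with a doubling delay between them. The class and method names (SimpleRetry, callWithRetry) are hypothetical and not taken from any library, and the database call is simulated.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of any library: retries a call a bounded
// number of times and doubles the wait between attempts.
final class SimpleRetry {

    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;                    // remember the most recent failure
                if (attempt < maxAttempts) {
                    sleep(delay);            // back off before the next attempt
                    delay *= 2;
                }
            }
        }
        throw last;                          // every attempt failed
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated database query that fails twice, then succeeds.
        String row = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated network latency spike");
            }
            return "row-42";
        }, 5, 10);
        System.out.println(row + " after " + calls[0] + " attempts");
    }
}
```

Capping the number of attempts matters: an unbounded retry loop is exactly what turns a long-lasting fault into the wasted work described in the next scenarios.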

Scenario 2: Unanticipated Long-Term Faults and Quick Failure Response: Some faults are caused by events that might take a long time to fix. In such cases, continuously retrying an operation is not practical. The application should quickly acknowledge the failure and take appropriate action.

Example Scenario: An e-commerce website relies on a payment service. If the payment service is completely down, continuously retrying payment processing would be futile. Instead, the application should quickly identify the failure and notify the user about the payment issue.

Scenario 3: Cascading Failures and Immediate Failure Acceptance: In highly concurrent systems, a failure in one part of the system might lead to cascading failures if not handled properly. Blocking resources due to long timeouts can be detrimental. It’s better to fail immediately in certain situations to prevent resource exhaustion.

Example Scenario: An online ticket booking system experiences a surge in traffic, overwhelming a reservation service. If each new request tries to reserve a resource with a long timeout, it could lead to resource exhaustion.

Scenario 4: Optimizing Timeout Periods: Setting an appropriate timeout period is crucial. It shouldn’t be so short that the operation fails most of the time, even if the service is operational. Balancing the timeout duration is essential to avoid unnecessary failures and ensure a timely response.

Example Scenario: A file upload service receives requests from various clients. If the timeout for each upload request is set too short, even under normal load conditions, valid uploads might be prematurely marked as failures. Conversely, setting the timeout too long could lead to resource congestion during peak usage.
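The trade-off can be sketched with the JDK's CompletableFuture (Java 9 or later). The helper names below (UploadTimeoutDemo, callWithTimeout, simulateUpload) and the durations are assumptions for illustration; the upload is simulated with a sleep.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class UploadTimeoutDemo {

    // Runs the call on a background thread and stops waiting after
    // timeoutMillis, returning onTimeout instead of blocking the caller.
    static <T> T callWithTimeout(Supplier<T> call, long timeoutMillis, T onTimeout) {
        return CompletableFuture.supplyAsync(call)
                .completeOnTimeout(onTimeout, timeoutMillis, TimeUnit.MILLISECONDS)
                .join();
    }

    // Stand-in for a real upload that takes about durationMillis to finish.
    static String simulateUpload(long durationMillis) {
        try {
            Thread.sleep(durationMillis);
            return "uploaded";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        // Timeout shorter than a healthy upload: a valid request is marked as failed.
        System.out.println(callWithTimeout(() -> simulateUpload(300), 50, "timed out"));
        // Timeout with headroom: the same upload succeeds.
        System.out.println(callWithTimeout(() -> simulateUpload(300), 2_000, "timed out"));
    }
}
```

Note that completeOnTimeout only stops the caller from waiting; the background task keeps running, so a real client would also need to cancel or bound the underlying request.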

============================================================

Solution: Circuit Breaker Pattern

Implementing a circuit breaker pattern helps prevent cascading failures and provides a way to gracefully handle service failures.

In software, the Circuit Breaker pattern monitors calls to a remote service. If the service experiences a failure or becomes unresponsive, the Circuit Breaker intervenes and stops further calls to the service for a predetermined time. Meanwhile, the Circuit Breaker can redirect calls to a backup service, return cached data, or provide an error response.

This pattern helps prevent cascading failures, conserves resources, and allows the system to gracefully handle faults, improving overall resilience and stability.

The circuit breaker acts as a proxy, monitoring the operation’s success and deciding whether to allow it to proceed, return an exception immediately, or wait for a specified time before trying again.

This pattern is particularly useful in scenarios where faults might take longer to resolve, preventing the application from wasting resources on fault-prone operations.

How State Changes in the Circuit Breaker Pattern (the Proxy Concept: Controlling Request Flow, Monitoring Failures, Changing States):

The proxy in the Circuit Breaker pattern acts as an intelligent gatekeeper, regulating the flow of requests based on the health of the microservice.

Monitoring and Counting Failures (During Closed State): The proxy keeps track of the number of failures that occur when making requests to a particular microservice. It acts as a counter, incrementing each time an operation fails.

Transition to Open State (failure count > threshold): When the failure count exceeds a predefined threshold within a specified time period, the proxy transitions to the Open state. In this state, the proxy blocks requests from reaching the actual microservice and fails them immediately.

Timeout Handling (During Open State): The proxy starts and manages a timeout period during the Open state, giving the system time to recover from the fault. The circuit breaker remains in the Open state until the timeout period ends, then moves into the Half-Open state.

Controlled Request Pass & Success Check (During Half-Open State): In the Half-Open state, the circuit breaker allows a limited number of requests to reach the microservice. If those requests succeed, the circuit breaker switches to the Closed state and allows normal operations. If not, it returns to the Open state and blocks requests again for the defined timeout period.

Resetting and Transition From Half Open to Closed State: Based on the outcomes of the requests in the Half-Open state, the proxy may reset the failure count and return to the Closed state if it determines that the microservice has recovered.
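The state transitions described above can be sketched as a small state machine. This is a simplified, single-threaded illustration rather than a production implementation: the class name SimpleCircuitBreaker and its parameters are hypothetical, it counts consecutive failures instead of failures within a time window, and a single trial request decides the outcome of the Half-Open state. The clock is injected so the behavior can be demonstrated without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, single-threaded sketch of the proxy; not thread-safe.
final class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openTimeoutMillis;
    private final LongSupplier clock;      // returns the current time in milliseconds
    private State state = State.CLOSED;
    private int failureCount = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openTimeoutMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
        this.clock = clock;
    }

    State state() {
        // Open -> Half-Open once the timeout period has elapsed.
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openTimeoutMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> operation, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();         // fail fast without touching the service
        }
        try {
            T result = operation.get();
            failureCount = 0;              // success: reset the counter and close
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            failureCount++;
            // A failed trial in Half-Open, or too many failures in Closed, opens the circuit.
            if (state == State.HALF_OPEN || failureCount > failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        long[] now = {0};                  // fake clock, advanced by hand
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1_000, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("service down"); };
        for (int i = 0; i < 3; i++) {
            breaker.call(failing, () -> "fallback");
        }
        System.out.println(breaker.state());   // OPEN: the third failure exceeds the threshold of 2
        now[0] = 1_000;                         // the timeout period elapses
        System.out.println(breaker.state());   // HALF_OPEN
        System.out.println(breaker.call(() -> "live", () -> "fallback"));  // live
        System.out.println(breaker.state());   // CLOSED
    }
}
```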

Example Scenario:

Imagine you have a microservices-based e-commerce application. One of the essential services in your architecture is the Product Catalog service, which provides information about the products available for purchase. Your web application relies on this service to display product details on the product pages.

Now, let’s say the Product Catalog service experiences a sudden surge in traffic or encounters an issue, leading to slower response times and occasional failures. Without a Circuit Breaker pattern, these failures might propagate to the entire system, affecting the user experience on the website.

To implement the Circuit Breaker pattern with a fallback mechanism:

Monitoring: Set up monitoring for the Product Catalog service to detect response time and error rate thresholds.
Circuit Breaker State: Implement a circuit breaker that monitors the service. If the error rate or response time surpasses predefined thresholds, the circuit breaker transitions to an “open” state.
Fallback Mechanism: When the circuit is in an “open” state, the system switches to a fallback mechanism. In our example, the fallback mechanism could involve displaying cached product information or a generic message like “Product details temporarily unavailable.”
Automatic Retry: After a wait period, let a trial request through to the Product Catalog service (the “half-open” state). If it succeeds, the circuit breaker transitions back to “closed,” and normal operations resume. If it fails, the circuit breaker returns to the “open” state and waits again.

This implementation helps in isolating the failing service, preventing cascading failures, and maintaining a level of functionality for end-users even when a service is temporarily unavailable. It also avoids continuously hammering a failing service, allowing it time to recover.
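The fallback step for the Product Catalog example might look like the sketch below. It shows only the "cached data or generic message" part, not the breaker itself; CachedCatalogClient is a hypothetical name, and the remote call is represented by a plain function so the example runs without a network.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical client: serves the last known product details when the
// Product Catalog call fails, or a generic message if nothing is cached.
final class CachedCatalogClient {

    static final String UNAVAILABLE = "Product details temporarily unavailable.";

    private final Function<String, String> remoteCatalog;  // stand-in for the real remote call
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    CachedCatalogClient(Function<String, String> remoteCatalog) {
        this.remoteCatalog = remoteCatalog;
    }

    String productDetails(String productId) {
        try {
            String details = remoteCatalog.apply(productId);
            cache.put(productId, details);   // refresh the cache on every success
            return details;
        } catch (RuntimeException e) {
            // Fallback: cached copy if we have one, generic message otherwise.
            return cache.getOrDefault(productId, UNAVAILABLE);
        }
    }

    public static void main(String[] args) {
        boolean[] healthy = {true};
        CachedCatalogClient client = new CachedCatalogClient(id -> {
            if (!healthy[0]) {
                throw new IllegalStateException("catalog overloaded");
            }
            return "details for " + id;
        });
        System.out.println(client.productDetails("p1"));  // live response, now cached
        healthy[0] = false;
        System.out.println(client.productDetails("p1"));  // served from the cache
        System.out.println(client.productDetails("p2"));  // generic message
    }
}
```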

Use of Fallback Mechanisms: Movie Streaming Service Scenario

Scenario: Consider a movie streaming service where a recommendation microservice suggests personalized content to users. If this microservice fails due to high demand, the Circuit Breaker opens to prevent further requests.

Explanation: When the Circuit Breaker is open, a fallback mechanism is crucial to maintaining a positive user experience. In this case, the fallback could involve redirecting requests to a backup recommendation service, perhaps one that provides generic recommendations. Alternatively, the fallback mechanism might offer default recommendations based on overall popularity or user preferences, ensuring that users still receive valuable content recommendations even when the primary service is temporarily unavailable.
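One way to express this chain of fallbacks is to try each source in order and settle for a static default when all of them fail. The sketch below is an illustration under that assumption; RecommendationFallbacks, firstAvailable, and the movie titles are made up for the example.

```java
import java.util.List;
import java.util.function.Supplier;

final class RecommendationFallbacks {

    // Returns the result of the first source that does not throw,
    // or lastResort when every source fails.
    @SafeVarargs
    static <T> T firstAvailable(T lastResort, Supplier<T>... sources) {
        for (Supplier<T> source : sources) {
            try {
                return source.get();
            } catch (RuntimeException e) {
                // This source is unavailable; move on to the next one.
            }
        }
        return lastResort;
    }

    public static void main(String[] args) {
        Supplier<List<String>> personalized = () -> {
            throw new IllegalStateException("recommendation service overloaded");
        };
        Supplier<List<String>> backup = () -> List.of("Generic Pick A", "Generic Pick B");
        List<String> mostPopular = List.of("Popular Title 1", "Popular Title 2");

        // The primary fails, so the backup service's generic list is used.
        System.out.println(firstAvailable(mostPopular, personalized, backup));
        // Both services fail, so the static popularity list is used.
        System.out.println(firstAvailable(mostPopular, personalized, personalized));
    }
}
```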

==============================================================

Advantages of Circuit Breaker Pattern

Helps to prevent cascading failures.
Handles errors gracefully and provides better user experience.
Reduces application downtime.
Suitable for handling asynchronous communications.
State changes of the circuit breaker can be used for error monitoring.

=============================================================

Retry Pattern vs Circuit Breaker Pattern

The Retry Pattern and the Circuit Breaker Pattern serve different purposes in managing the resilience of applications in a distributed environment.

Retry Pattern (Transient Fault):

Purpose: The Retry pattern is designed to help an application deal with transient issues by attempting to perform an operation again, with the expectation that the subsequent attempts might succeed.
Example: If a service call fails due to a temporary network glitch, the Retry pattern allows the application to make multiple retries, assuming that the issue is temporary and may be resolved with subsequent attempts.

Circuit Breaker Pattern (Persistent Fault):

Purpose: In contrast, the Circuit Breaker pattern aims to prevent an application from continuously attempting an operation that is likely to fail, especially in scenarios where the fault is not transient and might take longer to resolve.
Example: If a service is experiencing a prolonged outage, the Circuit Breaker helps by stopping the application from repeatedly trying to use that service, conserving resources and preventing further system issues.

Implementing the Circuit Breaker Pattern with Hystrix:

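Implementing the Circuit Breaker pattern with Hystrix involves several steps. Note that Netflix has placed Hystrix in maintenance mode, and the spring-cloud-starter-netflix-hystrix starter is no longer part of Spring Cloud from the 2020.0 release train onward, so the steps below assume an older Spring Cloud version (Hoxton or earlier). Below is a step-by-step guide along with an explanation of how it works: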

1. Add Hystrix Dependency:

<!-- Maven Dependency -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
</dependency>

2. Enable Hystrix in Spring Boot:

import org.springframework.cloud.netflix.hystrix.EnableHystrix;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@EnableHystrix
public class YourApplication {
    public static void main(String[] args) {
        SpringApplication.run(YourApplication.class, args);
    }
}

3. Create a Method with Hystrix Command: Create a method that will be wrapped with a Hystrix command. Annotate this method with @HystrixCommand and provide a fallback method.

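import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service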
public class MyService {

    private final RestTemplate restTemplate;

    public MyService(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    @HystrixCommand(fallbackMethod = "fallbackMethod")
    public String performRemoteCall() {
        // Make a remote service call using RestTemplate
        return restTemplate.getForObject("http://remote-service/api/endpoint", String.class);
    }

    public String fallbackMethod() {
        // Fallback logic when the circuit is open or there's an error
        return "Fallback response";
    }
}

4. Apply Circuit Breaker Configuration: Configure Hystrix properties in your application.properties or application.yml file to set thresholds for circuit breaker behavior.

hystrix:
  command:
    default:
      circuitBreaker:
        requestVolumeThreshold: 20
        errorThresholdPercentage: 50
        sleepWindowInMilliseconds: 5000

How It Works:

Normal Operation: During normal operation (closed circuit), Hystrix allows requests to pass through to the underlying service.
Threshold Breach: When at least requestVolumeThreshold requests have arrived in the rolling window and the share of failed requests reaches errorThresholdPercentage, Hystrix opens the circuit.
Circuit Open: Once the circuit is open, Hystrix stops calling the underlying service and routes requests to the fallback method specified in the @HystrixCommand annotation.
Fallback Logic: The fallback method contains logic to handle failures gracefully, providing a default response or a cached result. It also runs when an individual call fails or times out while the circuit is closed.
Circuit Half-Open: After the sleep window (configured by sleepWindowInMilliseconds) elapses, Hystrix lets a single trial request pass through to test whether the underlying service has recovered.
Circuit Closed or Open: If the trial request succeeds, the circuit is closed, and normal operation resumes. If it fails, the circuit returns to the open state for another sleep window.
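To make the threshold rule concrete, the sketch below applies it to the values configured above (requestVolumeThreshold 20, errorThresholdPercentage 50). It is an illustration of the documented rule, not Hystrix source code; TripDecision and shouldOpen are hypothetical names, and the counts stand for requests seen in one rolling window.

```java
final class TripDecision {

    // Decides whether the circuit should open, given the request counts
    // observed in the current rolling window.
    static boolean shouldOpen(int totalRequests, int failedRequests,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false;                  // too little traffic to judge the service
        }
        int errorPercentage = failedRequests * 100 / totalRequests;
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        System.out.println(shouldOpen(19, 19, 20, 50));  // false: below the volume threshold
        System.out.println(shouldOpen(20, 9, 20, 50));   // false: 45% errors
        System.out.println(shouldOpen(20, 10, 20, 50));  // true: 50% errors
    }
}
```

The first case is the one that surprises people: 19 failures out of 19 requests do not open the circuit, because the volume threshold has not been met.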

Question: Can you provide an example of a real-world scenario where the Circuit Breaker pattern would be beneficial?

Answer: Consider a payment processing system in an e-commerce application. If the payment gateway experiences a sudden surge in failures, the Circuit Breaker pattern prevents continuous requests, protecting the system from complete failure. It redirects users to a fallback mechanism, such as allowing them to choose an alternative payment method or placing orders in a pending state until the payment gateway is stable again.

Question: Can you explain a situation where adjusting Circuit Breaker configuration parameters (threshold) would be necessary?

Scenario: Imagine a cloud-based e-commerce platform that experiences periodic traffic spikes during holiday sales or promotional events. The recommendation service, crucial for guiding users to relevant products, might receive an increased number of requests during these periods.

Explanation: In Hystrix, requestVolumeThreshold is the minimum number of requests that must arrive within the rolling statistics window before the error percentage is evaluated at all. Suppose it is set to 100 during normal operation. During promotional events, raising it to, say, 200 means the Circuit Breaker bases its decision on a larger sample, so a brief burst of errors at the start of a traffic spike is less likely to open the circuit. Genuine service failures, perhaps due to sustained load, still push the error percentage past errorThresholdPercentage and open the circuit to protect the system, while temporary load increases during promotions don't unnecessarily interrupt normal operation.