Building Resilient Microservices: Strategies for Handling Failures

The Java Trail
Jan 27, 2024

The goal of our system is to monitor changes in stock prices and promptly notify users of drastic price changes via email. To ensure the necessary level of fault tolerance, we opted for a microservices architecture built on Apache Kafka as a distributed event streaming platform and Kubernetes as the orchestration tool. Communication between services is facilitated by Kafka topics, allowing for seamless data flow and event-driven interactions. Our system collects data from the exchange through the Market Data Handler and then distributes it to subscribers of the Price Change topic via Kafka. Among these subscribers is a notification service responsible for sending notifications via email.

Let’s outline a scenario where fault tolerance and circuit breaker mechanisms are needed, along with approaches to make the system resilient.

Scenario 1: Dependency Service Outage: The Market Data Handler service experiences a temporary outage due to an issue with the stock exchange’s API or infrastructure.

Challenge: Without data from the Market Data Handler, the entire system may fail to provide accurate notifications, impacting user experience and trust.

Approaches:

  • Timeouts and Retries: Implement timeouts and retry mechanisms when calling external dependencies to mitigate transient failures. Use exponential backoff strategies for retries to avoid overwhelming the dependency (a minimal sketch follows this list).
  • Circuit Breaker: Employ a circuit breaker pattern to detect and isolate failures in the Market Data Handler. Once the circuit is tripped, redirect requests to a fallback mechanism or cached data until the dependency is restored.
  • Health Checks: Continuously monitor the health of the Market Data Handler service using health checks. If the service becomes unhealthy, divert traffic away from it until it recovers.
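
As a rough illustration of the timeouts-and-retries approach above, here is a minimal sketch using the JDK's HttpClient with a connection timeout, a per-request timeout, and a hand-rolled exponential backoff loop. The exchange endpoint URL, timeout values, and attempt limit are illustrative assumptions, not part of the real system.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class MarketDataClient {

    // Connection-level timeout so a hung exchange API cannot block the caller indefinitely
    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    public String fetchQuote(String symbol) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://exchange.example.com/quotes/" + symbol)) // hypothetical endpoint
                .timeout(Duration.ofSeconds(3))   // per-request timeout
                .build();

        long backoffMillis = 500;                 // initial delay between attempts
        int maxAttempts = 4;
        for (int attempt = 1; ; attempt++) {
            try {
                return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    throw e;                      // give up; let a circuit breaker or fallback take over
                }
                Thread.sleep(backoffMillis);      // wait before the next attempt
                backoffMillis *= 2;               // exponential backoff
            }
        }
    }
}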

Scenario 2: Email Service Degradation: The notification service experiences intermittent issues with sending emails due to network instability or SMTP server problems.

Challenge: Users may not receive timely notifications, leading to dissatisfaction and potential financial losses if they miss critical price changes.

Approaches:

  • Retry Mechanisms: Implement retry mechanisms with exponential backoff for failed email sending operations to handle transient issues.
  • Circuit Breaker: Utilize a circuit breaker pattern to detect when the email service is experiencing prolonged failures. Switch temporarily to an alternative communication channel (e.g., SMS notifications) to ensure users still receive alerts (see the sketch after this list).
  • Monitoring and Alerting: Implement monitoring and alerting systems to proactively detect email service degradation and intervene promptly.
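
To make the circuit breaker with an SMS fallback concrete, here is a minimal sketch using Resilience4j's programmatic API. The breaker name, thresholds, and the sendEmail/sendSms helpers are hypothetical placeholders.

import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

public class NotificationSender {

    // Opens after 50% of recent calls fail; stays open for 30s before probing again
    private final CircuitBreaker emailBreaker = CircuitBreaker.of("emailService",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)
                    .waitDurationInOpenState(Duration.ofSeconds(30))
                    .build());

    public void notifyUser(String userId, String message) {
        try {
            // Executed only while the circuit is closed or half-open
            emailBreaker.executeRunnable(() -> sendEmail(userId, message));
        } catch (CallNotPermittedException e) {
            // Circuit is open: email delivery is known to be degraded, switch channels
            sendSms(userId, message);
        } catch (Exception e) {
            // Single failure: recorded by the breaker; fall back for this notification
            sendSms(userId, message);
        }
    }

    private void sendEmail(String userId, String message) { /* SMTP call (hypothetical) */ }

    private void sendSms(String userId, String message) { /* SMS gateway call (hypothetical) */ }
}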

Scenario 3: Kafka Broker Failure: One of the Kafka brokers in the cluster experiences a hardware failure or network partition.

Challenge: The failure may disrupt the flow of events within the system, impacting data consistency and real-time processing.

Approaches:

  • Replication: Configure Kafka topics with replication to ensure data redundancy and fault tolerance. This allows Kafka to continue functioning even if a broker fails (a configuration sketch follows this list).
  • Automatic Rebalancing: Enable Kafka’s automatic partition rebalancing feature to redistribute partitions across healthy brokers in the cluster after a failure.
  • Circuit Breaker: Implement a circuit breaker mechanism within services that depend on Kafka topics to handle situations where Kafka becomes temporarily unavailable. Switch to offline processing or cached data to maintain functionality until Kafka is restored.
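
As a sketch of the replication approach, the Price Change topic could be created with a replication factor of 3 using Kafka's AdminClient. The broker addresses, topic name, partition count, and min.insync.replicas value below are illustrative assumptions.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "kafka-1:9092,kafka-2:9092,kafka-3:9092");          // assumed broker list

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: the topic survives the loss of a broker
            NewTopic priceChanges = new NewTopic("price-change", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));    // producers using acks=all need 2 live replicas

            admin.createTopics(Collections.singleton(priceChanges)).all().get();
        }
    }
}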

Scenario 4: High Traffic Spike: Suddenly, there is a significant surge in stock market activity, leading to a dramatic increase in the data volume processed by our system.

Challenge: The sudden spike in traffic may overwhelm the system, causing performance degradation or even service downtime.

Strategies to Handle Failures:

Strategy 1: Fire-and-Forget:

This approach logs an error upon any unsuccessful execution of a request and simply proceeds with further tasks. Any unsuccessful request to the email service is logged as an error, but the system continues its operations without addressing the failure.

(-) Not suitable for systems requiring guaranteed event handling; it can leave a potentially large number of unsuccessful requests in the log.

(-) Critical issues, such as notification failures, cannot be addressed effectively.
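
A minimal sketch of fire-and-forget, assuming a hypothetical sendEmail helper: the failure is logged and then simply ignored.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FireAndForgetNotifier {

    private static final Logger log = LoggerFactory.getLogger(FireAndForgetNotifier.class);

    public void notifyUser(String email, String message) {
        try {
            sendEmail(email, message);            // call the email service
        } catch (Exception e) {
            // Log and move on: the failure is recorded but never compensated
            log.error("Failed to send notification to {}", email, e);
        }
    }

    private void sendEmail(String email, String message) {
        // SMTP / email-provider call (hypothetical)
    }
}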

Strategy 2: Circuit Breaker & Fallback Approach:

Implementing a circuit breaker pattern helps prevent cascading failures and provides a way to gracefully handle service failures. Here’s how we can integrate a circuit breaker into the system:

  • Health Checks: Implement health checks within the notification service to monitor its availability and responsiveness. If the service becomes unresponsive or experiences errors beyond a threshold, the circuit breaker trips.
  • Fallback Mechanism: Define a fallback mechanism to handle scenarios where the notification service is unavailable. For instance, temporarily store the failed notifications in a queue for later processing or utilize alternative notification channels such as SMS or push notifications.
  • Retry Strategy: Configure the circuit breaker to attempt retries before opening the circuit if the failure is transient. Implement exponential backoff strategies to prevent overwhelming the service during recovery.

(+) Reduces the load on the email service during failures.
(+) Guarantees the processing of requests by re-executing the last unsuccessfully processed message.

(-) May lead to inefficient resource consumption or overflow of the storage or queue.
(-) Sacrifices parallel processing capabilities for notifications.

Strategy 3: Schedule and Retry:

This strategy involves scheduling failed operations and retrying them after a certain delay. In the notification service, unsuccessful requests to the email service are retried at intervals. This is particularly useful for scenarios where the service may be temporarily unavailable.

(+) Automates the workflow during temporary service outages.
(-) Not effective for scenarios with persistent failures.
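
A minimal sketch of schedule-and-retry using Spring's @Scheduled, assuming an in-memory queue of failed payloads and a hypothetical sendEmail helper; a durable store would be used in practice, and @EnableScheduling must be active.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

@Service
public class ScheduledRetryNotifier {

    // Failed notifications waiting for the next retry pass
    private final Queue<String> failed = new ConcurrentLinkedQueue<>();

    public void notifyUser(String payload) {
        try {
            sendEmail(payload);
        } catch (Exception e) {
            failed.add(payload);                  // park it for the scheduler
        }
    }

    // Every 60 seconds, retry everything that failed before this pass
    @Scheduled(fixedDelay = 60_000)
    public void retryFailedNotifications() {
        int pending = failed.size();              // snapshot so still-failing messages wait for the next pass
        for (int i = 0; i < pending; i++) {
            String payload = failed.poll();
            if (payload == null) break;
            try {
                sendEmail(payload);
            } catch (Exception e) {
                failed.add(payload);              // still failing: keep it for the next pass
            }
        }
    }

    private void sendEmail(String payload) { /* email service call (hypothetical) */ }
}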

Strategy 4: Rerunning Requests Using Kafka Message Queues:

Kafka utilizes topics to organize and manage messages. A Kafka topic is a log-like data structure where messages are stored in an append-only fashion.

  • When an operation, such as sending an email notification, fails, the system publishes the failed message to a specific Kafka topic dedicated to retries.
  • A consumer group listens to this retry topic and retries processing messages based on a chosen delay-calculation algorithm.

Consuming and Retrying:

  • A consumer group, designed to handle retries, listens to the retry topic.
  • For each message in the retry topic, the consumer calculates the delay using the chosen algorithm and then retries processing the message (a sketch of publishing to the retry topic follows).
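
A sketch of the producer side, assuming a retry topic named notification-retry and string-serialized payloads; the retry attempt and first-failure time travel as message headers so the retry consumer can apply one of the delay algorithms described next.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RetryPublisher {

    private final KafkaProducer<String, String> producer;

    public RetryPublisher(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // Called when sending an email notification fails: park the message on the retry topic
    public void publishForRetry(String key, String payload, int attempt, long firstFailureEpochMillis) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("notification-retry", key, payload);       // assumed retry topic name
        record.headers().add("retry-attempt",
                Integer.toString(attempt).getBytes(StandardCharsets.UTF_8));
        record.headers().add("first-failure-at",
                Long.toString(firstFailureEpochMillis).getBytes(StandardCharsets.UTF_8));
        producer.send(record);
    }
}

A separate consumer group subscribes to this retry topic, reads the headers, waits out the computed delay, and re-attempts the send or republishes the message with an incremented attempt counter.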

Delay Calculation Algorithms:

Constant Backoff: A constant delay, e.g., 5 seconds, is introduced between retry attempts for each unsuccessful message in the Kafka topic.

Jitter Backoff: A delay drawn at random from a fixed interval, e.g., between 1 second and 10 seconds, is used to introduce variability in the retry schedule.

Linear Backoff: The delay time increases linearly with each retry attempt.

Exponential Backoff: The delay time grows exponentially with each retry attempt, allowing for increasing intervals between retries.

Retry Schedule and Time Limit for Retry: For our stock price notification system, we picked exponential backoff with jitter, as it minimizes the load on the external service and spreads the peak load across a range of retry times. We opted to retry messages only for 24 hours, since the notifications become irrelevant after this time elapses. After 24 hours, the system stops attempting retries for a given message.
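
A sketch of this policy, assuming the attempt counter and first-failure timestamp are available (for example from the Kafka headers above); the base delay and jitter range are illustrative.

import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

public class RetryDelays {

    private static final Duration BASE = Duration.ofSeconds(5);    // delay before the first retry
    private static final Duration MAX_AGE = Duration.ofHours(24);  // retries stop after 24 hours

    // Returns the jittered exponential delay before the next attempt,
    // or empty when the 24-hour window has elapsed and the message should
    // instead go to the dead-letter queue described below.
    public static Optional<Duration> nextDelay(int attempt, Instant firstFailure) {
        if (Duration.between(firstFailure, Instant.now()).compareTo(MAX_AGE) > 0) {
            return Optional.empty();
        }
        long exponentialMillis = BASE.toMillis() * (1L << Math.min(attempt, 20)); // doubles with each attempt
        long jitteredMillis = ThreadLocalRandom.current()
                .nextLong(exponentialMillis / 2, exponentialMillis + 1);          // random point in the upper half
        return Optional.of(Duration.ofMillis(jitteredMillis));
    }
}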

When the retry limit is exceeded: Dead-Letter Queue (DLQ): To handle scenarios where a message has exceeded the maximum number of retries or the retry duration (in our case, 24 hours) has elapsed, a Dead-Letter Queue (DLQ) is utilized. This allows failed messages that require manual intervention or further analysis to be segregated.

(+) Allows parallel processing of requests.
(+) Natively supports scaling and fault tolerance.

(-) Requires a message queue.
(-) After the issues on the service side are resolved, messages in the queue are not processed immediately.

What if, instead of using a DLQ, we send the failed message back to the original topic from the consumer and store it there with some marking?

In this case there might be duplicate messages stored back in the topic: a topic may be consumed by multiple consumers, and if each of them sends the unsuccessful message back, duplicates can accumulate in the topic's partitions.

Strategy 5: Rescheduling with Circuit Breaker Pattern:

The combined rescheduling and circuit breaker approach addresses the previous challenges in the stock price notification system by providing a dynamic and adaptive strategy for handling failures.

When a request to the external service fails, the system employs a rescheduling mechanism (jitter backoff, exponential backoff, etc.), introducing delays to prevent the external service from being overwhelmed by rapid retries, especially during transient issues.

Simultaneously, a circuit breaker monitors the overall health of the external service; if it detects degradation, it temporarily halts requests to allow recovery, preventing the rescheduling mechanism from retrying continuously. This protects both the system and the external service from excessive load during prolonged periods of degradation.

This combined approach optimizes system performance with adaptive retry strategies during normal operation and ensures stability by reducing the load on the external service during challenging periods.

import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;

@Service
public class ExternalServiceClient {

    // Retries up to 3 times with exponential backoff (1s, then 2s) before failing;
    // requires @EnableRetry on a configuration class
    @Retryable(maxAttempts = 3, backoff = @Backoff(delay = 1000, multiplier = 2))
    public void performExternalRequest() {
        // Your external service request logic
    }
}


import java.time.Duration;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

@Configuration
public class CircuitBreakerConfiguration {   // renamed to avoid clashing with Resilience4j's CircuitBreakerConfig

    @Bean
    public CircuitBreakerConfig circuitBreakerConfig() {
        return CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                         // open the circuit when 50% of calls fail
                .waitDurationInOpenState(Duration.ofMillis(1000)) // stay open for 1 second before probing again
                .build();
    }

    @Bean
    public CircuitBreaker circuitBreaker(CircuitBreakerConfig circuitBreakerConfig) {
        return CircuitBreaker.of("externalService", circuitBreakerConfig);
    }
}


import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;

@Service
public class ExternalServiceClient {

    // Transient failures are retried with exponential backoff; persistent failures trip
    // the circuit breaker named "externalService", and calls are routed to the fallback
    @Retryable(maxAttempts = 3, backoff = @Backoff(delay = 1000, multiplier = 2))
    @CircuitBreaker(name = "externalService", fallbackMethod = "fallback")
    public void performExternalRequest() {
        // Your external service request logic
    }

    // Invoked when the circuit is open or the call ultimately fails
    public void fallback(Throwable t) {
        // e.g., queue the notification for later processing or switch to another channel
    }
}

Solving common issues in the retry & circuit breaker pattern:

Idempotence: Ensure no duplicate notifications are sent if the same failed request is retried. This reduces the risk of users receiving duplicate notifications, providing a seamless user experience.

A stock price change triggers a notification.

  • StockPriceChangeNotification{ symbol:'AAPL', changePercentage:2.5 }
  • Generate an idempotence key: idempotenceKey = hash(symbol + changePercentage)
  • Before sending the notification, check whether idempotenceKey has already been processed to prevent duplicates (a minimal sketch follows).
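
A minimal sketch of the idempotence check, assuming an in-memory set of processed keys; across multiple service instances, a shared store such as Redis with a TTL would be used instead.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentNotifier {

    // Keys of notifications that have already been sent
    private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();

    public void notifyOnce(String symbol, double changePercentage, String email) throws Exception {
        String idempotenceKey = hash(symbol + changePercentage);
        // add() returns false if the key was already present -> duplicate, skip sending
        if (!processedKeys.add(idempotenceKey)) {
            return;
        }
        sendEmail(email, symbol + " moved " + changePercentage + "%");
    }

    private static String hash(String input) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(input.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }

    private void sendEmail(String email, String message) { /* email service call (hypothetical) */ }
}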

Message Order and Timeliness: We can process the most important request first and then the less time-sensitive one. For a mail service, it could work like this:

  1. Message A was received at 10:00.
  2. Message A failed to deliver; the next delivery is scheduled for 11:00.
  3. Message B was received at 10:30 and contains the most up-to-date information on the subject (stock) of message A. In that case, the stock price update in message A is of no use.
  4. Message B has been sent successfully.
  5. At 11:00, message A would be sent with outdated information; it can be ignored, since we already received the latest information at 10:30 and no longer need the 10:00 update.

Message A: Failed and Queued for rescheduling at 11:00: 
StockPriceChangeNotification { symbol: 'GOOGL', changePercentage: 1.8,
timestamp: '2022-03-01T10:00:00' }

Message B: Successfully sent:
StockPriceChangeNotification { symbol: 'GOOGL', changePercentage: -0.5,
timestamp: '2022-03-01T10:30:00' }

Utilize a Key-Value storage to store timestamps. Compare timestamps during processing to prioritize more time-sensitive notifications.
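
A minimal sketch of the timestamp comparison, assuming an in-memory map keyed by symbol stands in for the key-value store.

import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaleNotificationFilter {

    // Latest event timestamp seen per stock symbol (a shared key-value store in production)
    private final Map<String, Instant> latestBySymbol = new ConcurrentHashMap<>();

    // Returns true if this notification is still the newest one for its symbol
    public boolean isFresh(String symbol, Instant eventTimestamp) {
        Instant latest = latestBySymbol.merge(symbol, eventTimestamp,
                (current, candidate) -> candidate.isAfter(current) ? candidate : current);
        // If a newer event has already been recorded, this message is outdated and can be skipped
        return !eventTimestamp.isBefore(latest);
    }
}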

Limits on Retries: Prevent endless retries and resource waste. This avoids system overload and unnecessary resource consumption due to continuous retries.

  • Email notifications fail to send due to a temporary service outage.
  • Set a retry limit (e.g., 3 attempts).
  • Move the message to a Dead-Letter Queue after the retry limit is exceeded, for manual inspection; this avoids system overload and unnecessary resource consumption (a routing sketch follows this list).
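
A sketch of the routing decision, assuming hypothetical notification-retry and notification-dlq topics and a per-message attempt counter.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RetryLimiter {

    private static final int MAX_ATTEMPTS = 3;                        // retry limit

    private final KafkaProducer<String, String> producer;

    public RetryLimiter(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // Decide where a failed message goes next: back onto the retry topic or into the DLQ
    public void handleFailure(String key, String payload, int attempt) {
        if (attempt >= MAX_ATTEMPTS) {
            // Exhausted: park for manual inspection instead of retrying forever
            producer.send(new ProducerRecord<>("notification-dlq", key, payload));     // assumed DLQ topic
        } else {
            producer.send(new ProducerRecord<>("notification-retry", key, payload));   // assumed retry topic
        }
    }
}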

There are many ways to handle service failures. The most common method is fire-and-forget, whose simplicity is appealing. The next level of processing, which reduces the number of unsuccessful requests, is the Circuit Breaker pattern. If you need to achieve successful processing of each request, you can use queues (a retry topic, or a dead-letter queue for failed retries) to store messages scheduled for reprocessing. To balance the number and frequency of requests to the service, you should choose an appropriate function for the delay before reprocessing. For most use cases, a constant or exponentially increasing jittered delay will be suitable.

To minimize the message processing time, you can use priority-enabled message queues, thus decreasing the delay before re-processing.
