Common microservice patterns with Resilience4j
In this post I want to explore a few ideas that are useful to keep in mind when building a distributed system, and how to implement them using a library called Resilience4j.
We'll look at three concepts:
- Rate Limiters
- Bulkheads
- Circuit Breakers
I'll explain the idea behind each of these concepts, when you would want to make use of them, and then we'll look at a small sample implementation of each one using Resilience4j in a ktor web project.
Rate Limiters
Let's start with what I think is the easiest of the three concepts to understand: the rate limiter. Rate limiters allow you to control how often you make a particular call within a given time frame.
Rate limiters can be used to control how many requests a client can make against your service according to some billing plan, to help prevent your service from being overwhelmed by a sudden spike in traffic, or to keep your own service from being a bad neighbor and overwhelming any downstream services that you call while processing requests.
Let's look at how we could add a rate limiter to an endpoint in a ktor service to prevent too many requests from overwhelming our service.
First we'll need to create a RateLimiterConfig:
val config = RateLimiterConfig {
    timeoutDuration(Duration.ofSeconds(5))
    limitRefreshPeriod(Duration.ofMinutes(1))
    limitForPeriod(10)
}
The timeoutDuration is how long a given call will wait to be let through the rate limiter before giving up. We use 5 seconds in our example, which is also the default if you don't explicitly configure it. limitRefreshPeriod is the length of each rate-limiting cycle: Resilience4j splits time into cycles, and the duration of a given cycle is controlled by limitRefreshPeriod. The number of requests the rate limiter will let through in a given refresh period is defined by limitForPeriod.
Once you have your RateLimiterConfig, you can pass it to a registry object, which you will then use to create any rate limiters you need.
val rateLimiterRegistry = RateLimiterRegistry.of(config)
val importantResourceRateLimiter = rateLimiterRegistry.rateLimiter("importantResource")
As you can see in the code above, creating a rate limiter is as simple as calling a function on the registry and giving it a name. The resulting RateLimiter object will be created using the default config passed into the registry when it was created. You can always provide a custom configuration for a particular rate limiter by passing a different config as the second argument to the .rateLimiter call.
val rateLimiterRegistry = RateLimiterRegistry.of(config)
val importantResourceRateLimiter = rateLimiterRegistry.rateLimiter("customRateLimiter", customConfigObject)
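The customConfigObject above is just a placeholder; as a minimal sketch, a stricter config for that particular rate limiter could look something like this (the values are purely illustrative):

// Hypothetical stricter limits for the "customRateLimiter" instance above.
val customConfigObject = RateLimiterConfig {
    timeoutDuration(Duration.ofMillis(500)) // fail fast instead of queueing callers for long
    limitRefreshPeriod(Duration.ofSeconds(1))
    limitForPeriod(2) // only let 2 calls through per one-second cycle
}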
Once we have our rate limiter we can use it to guard one of our service's endpoints.
get("/someImportantResource") {
try{
val data = importantResourceRateLimiter.executeSuspendFunction {
repository.getExpensiveData()
}
call.respond(mapOf("data" to data))
} catch(ex: RequestNotPermitted){
call.respond(HttpStatusCode.TooManyRequests)
}
}
In the happy path our request will be allowed through the rate limiter, we'll get the data out of our repository, and we'll respond to our client with that data. If the rate limit has been exceeded, though, the rate limiter will throw a RequestNotPermitted exception, which we can catch and translate into a 429 response to let the calling client know that they've sent us too many requests.
As you can see, Resilience4j makes applying this pattern in your service really simple. You can define rate limiters for all kinds of scenarios to help make your service more resilient :D
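One last note before moving on: each rate limiter also exposes an event publisher you can hook into for logging or metrics. A rough sketch (I'm just printing here; in a real service you'd wire this into your own logging or metrics setup):

// Record whenever the rate limiter rejects a call, so we can see how often clients hit the limit.
importantResourceRateLimiter.eventPublisher.onFailure { event ->
    println("Rate limiter rejected a call: $event")
}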
Circuit Breakers
The next concept we'll explore is the circuit breaker. This is another pattern you should follow when building a distributed system to increase overall resiliency.
Let's imagine that when a request comes in to your service, you need to reach out to Service B in order to process the request.
What if Service B goes down, or starts returning a large number of errors?
In such a situation a circuit breaker can help keep a failure in Service B from cascading into a larger system failure by stopping your calls to Service B. This can help prevent your service from using up valuable resources on a call that is destined to fail anyway.
You can configure your circuit breakers to "flip", or "open", based on an error rate. The circuit breaker monitors the result of the calls it wraps, and when the error rate exceeds the configured threshold, the circuit breaker will pre-emptively deny future calls from being attempted. The circuit breaker will remain in this "open" state for a given amount of time before transitioning to a "half open" state. In the half open state the circuit breaker lets a small number of requests through to see if they succeed. If enough of them succeed, the circuit breaker assumes the downstream problem has been resolved, automatically transitions back to the "closed" state, and normal operation resumes.
Resilience4j provides two different types of circuit breakers: "count based" and "time based". A count-based circuit breaker keeps track of a sliding window of the most recent calls and simply records each one as a success or a failure. When the failure rate within the current window exceeds the configured threshold, the circuit breaker will "open".
The time-based implementation works in a similar way, except that instead of tracking a sliding window of the last N calls, it tracks the calls made within a certain time frame. When the failure rate of the calls in that time frame exceeds the configured threshold, the circuit breaker will "open".
In our example code we'll create a count-based circuit breaker in Resilience4j and wrap a call to some imaginary downstream service. We'll keep the circuit breaker very basic and rely mostly on Resilience4j's defaults, but rest assured that the library's circuit breakers have a ton of different configuration options that you can play with.
We start by creating a circuit breaker config:
val circuitBreakerConfig = CircuitBreakerConfig {
    failureRateThreshold(25F)
    permittedNumberOfCallsInHalfOpenState(5)
    slidingWindowType(io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    slidingWindowSize(5)
    minimumNumberOfCalls(5)
}
Our circuit breaker will flip to the open state if 25% or more of its calls fail.
When it is in the half open state, it will let 5 calls through to determine whether the downstream service has recovered.
We will use a count-based sliding window sized at 5, and we will require a minimum of 5 calls before the circuit breaker makes any decision about downstream failures.
val circuitBreakerRegistry = CircuitBreakerRegistry.of(circuitBreakerConfig)
val serviceBCircuitBreaker = circuitBreakerRegistry.circuitBreaker("ServiceB")
We can then use the config object to create a circuit breaker registry. The registry in turn allows us to create the individual circuit breakers we use in our service. As with rate limiters, you can also pass a custom configuration to the .circuitBreaker() method to get a circuit breaker instance that adheres to your custom configuration rather than the registry default.
Now we can use our circuit breaker to guard a call to our downstream "Service B" service, which isn't always reliable.
get("/someUnreliableResource"){
try{
val data = serviceBCircuitBreaker.executeSuspendFunction {
serviceBWrapper.unreliableRemoteCall()
}
call.respond(mapOf("data" to data))
} catch(ex: CallNotPermittedException){
call.respond(HttpStatusCode.ServiceUnavailable)
}
}
When enough failures happen within our sliding window, the circuit breaker will "open" and begin to throw CallNotPermittedException, which we then translate into a 503 response for our calling client.
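If you want visibility into when a breaker opens and closes, each circuit breaker also exposes an event publisher you can subscribe to. A minimal sketch (again just printing; substitute your own logging or metrics):

// Log every state transition (e.g. CLOSED -> OPEN, OPEN -> HALF_OPEN) so we can see
// when calls to Service B are being cut off and when they recover.
serviceBCircuitBreaker.eventPublisher.onStateTransition { event ->
    println("Circuit breaker '${event.circuitBreakerName}' transitioned: ${event.stateTransition}")
}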
Once again we see that Resilience4j makes it super easy to apply a really useful pattern in our microservice code.
Bulkheads
For the final pattern of this post, we'll look at bulkheads. This pattern takes its name from an idea traditionally associated with ships. A ship's hull may be divided by bulkheads into several watertight compartments. If one compartment begins to take on water, it can be sealed off from the rest of the ship, preventing the flooding from spreading and sinking the entire vessel.
In a distributed system we use the bulkhead pattern to keep our service alive and functioning as well as possible when a downstream service is failing or misbehaving.
Let's imagine that we have two different endpoints in our service. Endpoint 1 calls Service A whenever it gets hit. Service A is usually very fast and very reliable. Endpoint 2 calls Service B whenever it gets hit. Service B is very slow and very unreliable.
If we don't use the bulkhead pattern, we could end up in a situation where we have enough resources tied up waiting on Service B that we can no longer process requests that don't require calling Service B at all. This would leave our service unresponsive to all of our clients, rather than just the ones that need to hit endpoint 2 (the endpoint that relies on Service B).
With a bulkhead, though, we can set limits so our service can always continue to serve requests against endpoint 1 (which relies on Service A), even when endpoint 2 is overwhelmed by a slow or failing Service B. This is obviously a better state to be in than failing for all clients on all calls just because of one bad downstream dependency.
Resilience4j once again comes to the rescue and saves us from having to implement this pattern by hand by giving us an easy-to-configure Bulkhead object. Let's look at how we could use one below.
val bulkheadConfig = BulkheadConfig {
    maxConcurrentCalls(4)
    maxWaitDuration(Duration.ofSeconds(1))
}
Following the same pattern as before, we create a config object for our bulkhead registry. This config sets a limit of 4 concurrent calls for each bulkhead, and gives a caller waiting to enter a full bulkhead a 1 second timeout before it gives up.
val bulkheadRegistry = BulkheadRegistry.of(bulkheadConfig)
val serviceBBulkhead = bulkheadRegistry.bulkhead("ServiceB")
val serviceABulkhead = bulkheadRegistry.bulkhead("ServiceA")
We can then use the config object to create a registry, and then use the registry to create bulkhead instances, just like we did with rate limiters and circuit breakers. We can also pass custom configs when we create a bulkhead if we wish.
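For example, since Service B is the slow one, instead of the default-config call above we could have created its bulkhead with a tighter limit. A quick sketch (the numbers are purely illustrative):

// Hypothetical custom config: allow fewer concurrent calls to the slow Service B
// so it can't tie up as many of our request-handling coroutines.
val serviceBBulkheadConfig = BulkheadConfig {
    maxConcurrentCalls(2)
    maxWaitDuration(Duration.ofMillis(250))
}

// Used in place of bulkheadRegistry.bulkhead("ServiceB") above.
val serviceBBulkhead = bulkheadRegistry.bulkhead("ServiceB", serviceBBulkheadConfig)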
Now let's look at our bulkheads in action on our endpoints
get("/endpoint1"){
try{
val data = serviceABulkhead.executeSuspendFunction {
serviceAWrapper.remoteCall()
}
call.respond(mapOf("data" to data))
} catch(ex: BulkheadFullException){
call.respond(HttpStatusCode.ServiceUnavailable)
}
}
get("/endpoint2"){
try{
val data = serviceBBulkhead.executeSuspendFunction {
serviceBWrapper.slowCall()
}
call.respond(mapOf("data" to data))
} catch(ex: BulkheadFullException){
call.respond(HttpStatusCode.ServiceUnavailable)
}
}
You can see that each call to a downstream service is wrapped by its bulkhead. When a given bulkhead becomes "saturated" it begins to throw BulkheadFullException, which we then translate into a 503 for our calling clients. If the bulkhead for either Service A or Service B becomes saturated, we stop trying to waste resources on calls to the struggling service, and instead keep those resources available for calls that are still healthy and do not depend on the failing downstream system.
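It's also worth noting that these patterns compose nicely, since each decorator simply wraps a suspend function. For example, the call to Service B could sit behind both its bulkhead and its circuit breaker. A quick sketch using the instances from the examples above (the nesting order is a judgment call):

// Hypothetical composition: the bulkhead caps how many coroutines can be waiting on
// Service B at once, while the circuit breaker stops calls entirely once Service B
// starts failing.
suspend fun guardedServiceBCall() =
    serviceBBulkhead.executeSuspendFunction {
        serviceBCircuitBreaker.executeSuspendFunction {
            serviceBWrapper.unreliableRemoteCall()
        }
    }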
That's it for this post. I hope you've found this helpful in exploring how Resilience4j can make your life a little easier.
As always, I've included links below that I found helpful in relation to the things we've discussed in this post. You can find the full sample project I wrote for this post here.
Useful Links
https://github.com/bltuckerdevblog/resilience4j-sample
https://resilience4j.readme.io/docs/ratelimiter
https://resilience4j.readme.io/docs/bulkhead
https://resilience4j.readme.io/docs/circuitbreaker
https://microservice-api-patterns.org/patterns/quality/qualityManagementAndGovernance/RateLimit.html
https://martinfowler.com/bliki/CircuitBreaker.html
https://dzone.com/articles/resilient-microservices-pattern-bulkhead-pattern