Microservices Workflows: Choreography Coordination Pattern

Naresh Waswani
Simpplr Technology


When working with a microservices architecture, you often come across situations where multiple services need to collaborate to execute a business use case (referred to as a workflow from here on). In such situations, the question arises: how should communication be coordinated across the participating services to achieve the business case?

There are two patterns for handling distributed workflows in microservices:

  1. Orchestration based Coordination
  2. Choreography based Coordination

In my earlier blog, I discussed Orchestration based coordination to handle workflow. In this blog, the focus is on Choreography based coordination to implement the workflow.

Choreography pattern for workflow

In this pattern, the participating microservices in the workflow are intelligent enough to talk to other services directly to complete the workflow without the need of a central orchestrator.

Each microservice knows what role it has to play as part of the overall workflow execution, and is designed to handle both positive flows and negative flows.

Let’s take the same example we used for Orchestration based coordination in the previous blog: the Food Order Placement workflow from the company Yet Another Food Order Platform.

Placing Food Order using Choreography Coordination Pattern

In this case, when someone places an Order for Food Delivery, the request is received by the first microservice in the workflow, which here is the Order service. The request flow goes as follows:

  1. On receiving the request, the Order service makes an order entry in its Data Storage and generates an Order ID. On successful completion, it communicates asynchronously with the Payment service, passing enough details for the Payment service to process the payment. In this case, the asynchronous communication happens by the Order service emitting a Business Domain event, “OrderPlaced”.
  2. The Payment service has been given the intelligence to understand what is supposed to happen when it receives an “OrderPlaced” event. It initiates the payment process and, on successful completion of the payment, emits a Business Domain event, “PaymentSuccess”.
  3. Next in the workflow chain is the Restaurant microservice. It is designed to understand the “PaymentSuccess” event. Once the payment is successfully done, it tries to confirm the food order with the Restaurant. On confirmation of the order, it publishes its own Domain event, “RestaurantConfirmedOrder”.
  4. The “RestaurantConfirmedOrder” event triggers four parallel activities: a) the Order microservice updates the order status in its Data Storage, b) the Notification microservice sends an email to the user about the successful placement of the order, c) the Delivery Partner microservice gets the order details for assigning a delivery person to pick up the order from the Restaurant and deliver it to the customer’s address, and d) the Loyalty service adds points to the customer’s wallet based on the Order amount.
  5. Once the Delivery person is assigned for the order pickup, Delivery Partner service publishes a Business Domain event — “DeliveryPartnerAssigned”.
  6. Both Order microservice and Notification service are designed to understand this domain event and take action of updating the Order status and notifying the user respectively.

If you notice here, every participating service knows only its own piece of the workflow: the set of actions it must perform when a specific Business Domain event is published.
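To make the six steps above more concrete, here is a minimal sketch of the happy path in Python. It is only an illustration under stated assumptions: the `EventBus` class is a hypothetical in-memory stand-in for a real message broker, every function name is made up for this example, payment and restaurant confirmation are assumed to always succeed, and the loyalty rule (1 point per 10 currency units) is invented because the actual rule is not specified here.

```python
from collections import defaultdict, deque


class EventBus:
    """Hypothetical in-memory stand-in for a real message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.log = []  # names of dispatched events, in order

    def subscribe(self, event_name, handler):
        self.handlers[event_name].append(handler)

    def publish(self, event_name, payload):
        self.queue.append((event_name, payload))

    def run(self):
        # Deliver queued events until no service has anything left to publish.
        while self.queue:
            event_name, payload = self.queue.popleft()
            self.log.append(event_name)
            for handler in self.handlers[event_name]:
                handler(payload)


bus = EventBus()
order_status = {}    # Order service data store
notifications = []   # emails "sent" by the Notification service
loyalty_points = {}  # Loyalty service wallet, keyed by customer


def place_order(order_id, customer_id, amount):
    order_status[order_id] = "PLACED"
    event = {"order_id": order_id, "customer_id": customer_id, "amount": amount}
    bus.publish("OrderPlaced", event)


def payment_on_order_placed(event):
    # Assumption: the payment always succeeds in this happy-path sketch.
    bus.publish("PaymentSuccess", event)


def restaurant_on_payment_success(event):
    # Assumption: the Restaurant always confirms the order.
    bus.publish("RestaurantConfirmedOrder", event)


def order_on_confirmed(event):
    order_status[event["order_id"]] = "CONFIRMED"


def notification_on_confirmed(event):
    notifications.append(f"Order {event['order_id']} placed")


def delivery_on_confirmed(event):
    # Assumption: a delivery person is always available.
    bus.publish("DeliveryPartnerAssigned", event)


def loyalty_on_confirmed(event):
    # Made-up rule for illustration: 1 point per 10 currency units.
    loyalty_points[event["customer_id"]] = event["amount"] // 10


def order_on_partner_assigned(event):
    order_status[event["order_id"]] = "DELIVERY_PARTNER_ASSIGNED"


def notification_on_partner_assigned(event):
    notifications.append(f"Delivery partner assigned for order {event['order_id']}")


# Each service declares only the events it cares about; nobody owns the whole flow.
bus.subscribe("OrderPlaced", payment_on_order_placed)
bus.subscribe("PaymentSuccess", restaurant_on_payment_success)
bus.subscribe("RestaurantConfirmedOrder", order_on_confirmed)
bus.subscribe("RestaurantConfirmedOrder", notification_on_confirmed)
bus.subscribe("RestaurantConfirmedOrder", delivery_on_confirmed)
bus.subscribe("RestaurantConfirmedOrder", loyalty_on_confirmed)
bus.subscribe("DeliveryPartnerAssigned", order_on_partner_assigned)
bus.subscribe("DeliveryPartnerAssigned", notification_on_partner_assigned)

place_order("o-1", "c-1", 250)
bus.run()
print(bus.log)
```

Notice that the overall sequence is visible only in `bus.log` after the fact. Nothing in the code describes the workflow as a whole, which is exactly the nature of choreography.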

The business flow above is implemented using event based asynchronous communication, but one can also use point-to-point asynchronous communication or even synchronous communication. The important point is that every service knows the part it has to play in the overall workflow and talks directly to other services for the actions to be performed. No central coordinator is needed.

Handling the success flow with Choreography makes it look easy. But things get really complex when you also need to support rainy day scenarios, and the more such error flows you have to handle, the more complicated the overall system becomes. Just imagine for a second that someone wants to understand how the overall workflow is implemented and what the possible alternate paths and error conditions within the workflow are. It would be simply crazy :)

Things do go wrong, so let’s see what happens when error situations occur.

As per business requirements, there are quite a few error situations which need to be handled:

  1. If the Restaurant service could not confirm the order, the payment made needs to be reversed.
  2. If the order could not be delivered within 45 minutes, the customer is given x points to compensate for the delay. The value of x depends on the value of the order placed.
  3. If the order successfully placed with the Restaurant service could not be delivered at all for some reason, maybe because no Delivery Partner was available during that time period, the entire amount needs to be reversed to the customer. At the same time, to compensate for this situation, x points are added to the customer’s wallet.
  4. And many more….

To handle these errors, the microservices will publish more Business Domain events, and every other microservice needs to be enhanced to handle those events and take action where applicable, which implicitly means additional communication flows.

Let’s take one error scenario: the order could not be confirmed with the Restaurant service after the payment was successfully made. In this case, the following domain events get published, and the microservices respond to them as follows:

Choreography way of handling error condition while placing the order

  1. Order microservice makes an entry in its own Data Store and publishes an event “OrderPlaced”.
  2. Payment microservice listens to the “OrderPlaced” event and initiates the payment process. On successful completion, it emits a Business Domain event “PaymentSuccess”.
  3. Restaurant microservice reacts to the “PaymentSuccess” event and tries to confirm the order with the Restaurant. For some reason, it fails to confirm the order and gives up. It then publishes an event “RestaurantOrderConfirmationFailed”.
  4. Because an error condition has occurred, an alternate path needs to be triggered to handle it. In this case, 3 microservices react to this event of “RestaurantOrderConfirmationFailed” — a) Payment microservice initiates the reversal of the payment, b) Notification microservice notifies the user that Order could not be placed successfully and any payment done will be reversed and c) Order microservice updates its Data Store to capture the latest status of the order.
  5. Payment microservice on successful reversal of the payment publishes an event “PaymentReversed”.
  6. Notification microservice responds to the “PaymentReversed” event and sends an email to the user giving updates of the payment reversal transaction.
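The error path above can be sketched in the same spirit. This is a rough illustration, not a production design: the in-memory queue stands in for a real broker, all function names and status strings are hypothetical, and the Restaurant is hard-coded to fail so that the compensating events show up.

```python
from collections import defaultdict, deque

handlers = defaultdict(list)  # event name -> subscribed handler functions
queue = deque()               # hypothetical in-memory stand-in for a broker
event_log = []                # names of dispatched events, in order


def publish(event_name, order_id):
    queue.append((event_name, order_id))


def run():
    while queue:
        event_name, order_id = queue.popleft()
        event_log.append(event_name)
        for handler in handlers[event_name]:
            handler(order_id)


order_status = {}    # Order service data store
payment_status = {}  # Payment service data store
emails = []          # emails "sent" by the Notification service


def place_order(order_id):
    order_status[order_id] = "PLACED"
    publish("OrderPlaced", order_id)


def payment_on_order_placed(order_id):
    payment_status[order_id] = "PAID"
    publish("PaymentSuccess", order_id)


def restaurant_on_payment_success(order_id):
    # Assumption for this sketch: the Restaurant cannot confirm and gives up.
    publish("RestaurantOrderConfirmationFailed", order_id)


def payment_on_confirmation_failed(order_id):
    # Compensating action: undo the payment made in the earlier step.
    payment_status[order_id] = "REVERSED"
    publish("PaymentReversed", order_id)


def notification_on_confirmation_failed(order_id):
    emails.append(f"Order {order_id} could not be placed; any payment will be reversed")


def order_on_confirmation_failed(order_id):
    order_status[order_id] = "FAILED"


def notification_on_payment_reversed(order_id):
    emails.append(f"Payment for order {order_id} has been reversed")


handlers["OrderPlaced"].append(payment_on_order_placed)
handlers["PaymentSuccess"].append(restaurant_on_payment_success)
handlers["RestaurantOrderConfirmationFailed"].append(payment_on_confirmation_failed)
handlers["RestaurantOrderConfirmationFailed"].append(notification_on_confirmation_failed)
handlers["RestaurantOrderConfirmationFailed"].append(order_on_confirmation_failed)
handlers["PaymentReversed"].append(notification_on_payment_reversed)

place_order("o-2")
run()
print(event_log)
```

Supporting this single failure needed two new events and four new handlers spread over three services, which hints at how quickly the handler count grows as more error paths are added.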

If the payment could not be reversed successfully, then it would trigger another path to handle it further. And if you see, handling all such alternate paths means publishing more and more business events, which means enhancing all the participating microservices to understand those events and react to them.

An obvious question might be coming to your mind: who manages the workflow state (how many services have completed their steps, retries, error conditions, etc.)?

Approach 1: Let the first point of interaction for the workflow manage the workflow state. All it really means is that this front service plays more than the role of just a Domain service.

In our case, it would be Order service. And every other service needs to update the status of their workflow processing to the Order service. Net net — more network bandwidth consumption and chattiness between Domain services and Order service as every step performed by other services needs to be communicated back to Order microservice. Handling more than the Domain behaviour makes the Order service more complex.

Although there is complexity and extra communication involved in this approach, there is an advantage as well: it is easy to find the current state of a workflow.
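A minimal sketch of Approach 1 might look like the following. The class, the method names and the status strings are all hypothetical. The assumption is that every other service reports each of its events back to the Order service, which folds them into a single workflow record.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowRecord:
    order_id: str
    status: str = "ORDER_PLACED"
    history: list = field(default_factory=list)  # every event seen so far


class OrderService:
    # Hypothetical mapping from reported events to a workflow status.
    STATUS_BY_EVENT = {
        "PaymentSuccess": "PAID",
        "RestaurantConfirmedOrder": "CONFIRMED",
        "DeliveryPartnerAssigned": "DELIVERY_PARTNER_ASSIGNED",
        "RestaurantOrderConfirmationFailed": "FAILED",
        "PaymentReversed": "REFUNDED",
    }

    def __init__(self):
        self.workflows = {}

    def place_order(self, order_id):
        self.workflows[order_id] = WorkflowRecord(order_id, history=["OrderPlaced"])

    def on_event(self, event_name, order_id):
        # Called for every step reported back by the other services.
        record = self.workflows[order_id]
        record.history.append(event_name)
        record.status = self.STATUS_BY_EVENT.get(event_name, record.status)

    def workflow_state(self, order_id):
        # The advantage of this approach: one local lookup answers the question.
        return self.workflows[order_id].status


order_service = OrderService()
order_service.place_order("o-1")
for reported in ("PaymentSuccess", "RestaurantOrderConfirmationFailed", "PaymentReversed"):
    order_service.on_event(reported, "o-1")
print(order_service.workflow_state("o-1"))
```

The price is visible in `STATUS_BY_EVENT`: the Order service now has to know about events from every other service, which is the extra responsibility and chattiness described above.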

Approach 2: Let every microservice manage its own workflow state. If the state of a workflow needs to be queried, then all the participating microservices need to be queried at runtime to get their current processing state. If you have multiple services participating in a workflow, then getting the state from each of them and building the final state of the workflow can be very complex, and you need to create another workflow to handle this Query use case. But the advantage is that the first service handling the workflow can now focus only on the Domain behaviour and need not understand the complexities of workflow state management, and there is no additional communication overhead between microservices.
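Approach 2 can be sketched as a query that fans out to every participant and stitches the answers together. Everything here is an assumption made for illustration: the `DomainService` class, the state names ("NOT_STARTED", "DONE", "FAILED") and the rule for deriving the overall state.

```python
class DomainService:
    """Hypothetical participant that tracks only its own step of the workflow."""

    def __init__(self, name):
        self.name = name
        self._state_by_order = {}

    def set_state(self, order_id, state):
        self._state_by_order[order_id] = state

    def get_state(self, order_id):
        return self._state_by_order.get(order_id, "NOT_STARTED")


def query_workflow_state(order_id, services):
    # In a real system each lookup would be a network call to that service.
    per_service = {service.name: service.get_state(order_id) for service in services}
    states = set(per_service.values())
    if "FAILED" in states:
        overall = "FAILED"
    elif states == {"DONE"}:
        overall = "COMPLETED"
    else:
        overall = "IN_PROGRESS"
    return overall, per_service


order = DomainService("order")
payment = DomainService("payment")
restaurant = DomainService("restaurant")
delivery = DomainService("delivery_partner")
participants = [order, payment, restaurant, delivery]

order.set_state("o-1", "DONE")
payment.set_state("o-1", "DONE")
restaurant.set_state("o-1", "DONE")

overall, detail = query_workflow_state("o-1", participants)
print(overall, detail)
```

Even this toy version has to decide how to combine partial answers, and a real one would also have to cope with a participant being slow or unreachable while the query runs.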

I hope that by now you have realised the advantages of the Choreography pattern:

  1. Highly responsive in nature: since there is no central orchestrator, there is no additional network hop. Multiple services can also react to a business event in parallel, which improves performance.
  2. Highly Scalable — no more central choking point like Orchestrator service; requests are directly handled by the domain services which can scale to handle more load.
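  3. Loosely coupled in nature: with event based communication, a service only publishes and reacts to Business Domain events, so the publisher does not need to know which services consume them.
  4. High Fault Tolerance: there is no central orchestrator that can become a single point of failure for the whole workflow.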
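The first advantage, several services reacting to one event at the same time, can be illustrated with a small `asyncio` sketch. Treat it as an approximation: in a real deployment each service runs in its own process, whereas here coroutines in one process model the overlap, and the handler factory and service names are made up for the example.

```python
import asyncio

timeline = []  # records when each handler starts and finishes


def make_handler(service_name):
    async def handle(event):
        timeline.append(f"start:{service_name}")
        # Stands in for I/O such as a database write or an email call.
        await asyncio.sleep(0.01)
        timeline.append(f"end:{service_name}")
        return service_name

    return handle


# The four services that react to "RestaurantConfirmedOrder" in this article.
subscribers = [
    make_handler(name)
    for name in ("order", "notification", "delivery_partner", "loyalty")
]


async def fan_out(event):
    # gather() runs all handlers concurrently and returns results in argument order.
    return await asyncio.gather(*(handler(event) for handler in subscribers))


handled_by = asyncio.run(
    fan_out({"name": "RestaurantConfirmedOrder", "order_id": "o-1"})
)
print(timeline)
```

All four handlers start before any of them finishes, so the time taken is roughly that of the slowest handler rather than the sum of all four.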

Every pattern has some downside as well, and the Choreography pattern is no exception. Here are some of the cons —

  1. It’s very difficult to visualize and understand the overall workflow as you do not have a central place to manage it. The workflow execution path is scattered across microservices, and it is difficult to understand how the dots are connected.
  2. No central place to get the state of a workflow. Although options are available to manage the workflow state, there is no clean owner.
  3. Error handling is tough. Should you need to add more business rules or error conditions, or enhance the business flow, you need to make changes to all the participating services which need to handle the change.

The Choreography coordination pattern supports high Responsiveness and Scalability, which could be among the important software characteristics for your workflow. But if you need better workflow management and error handling, then this pattern may not fit your need. Instead, you may want to check the Orchestration based coordination pattern, which handles these architectural characteristics well.

You see, it’s all about Trade-Offs :)

Hope you enjoyed reading this blog. Do share this blog with your friends if this has helped you in any way.

Happy Blogging…Cheers!!!

#DistributedWorkflow #OrchestratorWorkflow #ChoreographyWorkflow #MicroservicesCoordination #MicroservicesCommunication
