Wednesday, September 03, 2008

My take on CAP

The CAP theorem (Consistency, Availability, Partitioning) has been receiving quite a lot of interest lately, just to mention one of the many references.

What is CAP about?

First let me give credits here: I am deriving my inspiration from the theoretical insights found in this paper co-authored by one of my favorite woman scientists, Nancy Lynch from MIT. If you get a chance to read this paper, go ahead it will bring you some very useful fundamental understanding...

The CAP theorem is essentially a limitation on what you can do with clustered (web) services in the fashionable context of SOA.

The word 'cluster' is important here since that is what it is all about. In particular, the theorem states that you can't have all three properties (Consistency, Availability, Partitioning) in one and the same system (read: service). This implies that there is no perfect solution to building a high-throughput popular service, or is there? Let's first explore what each thing means...

Consistency

By consistency, the theorem refers to the property that changes (updates) to the service back-end are visible to later queries. Simplifying: if you add something to your shopping basket then it will appear there next time you retrieve your basket status. That sounds trivial, but it is not if the basket is spread over multiple physical server processes... Consistency is commonly ensured (between processes) by having some sort of distributed transaction coordinator, or (assuming a central back-end) a single centralized database.

Availability

The Lynch paper uses a very simple but sufficient definition of "availability": a system is available if every request to it returns. In other words: there is no infinite blocking.

Partitioning

Partitioning means the cut-off between two segments of the cluster. In other words, one or more nodes become unreachable for at least some time.


What is the Theorem saying?

You can't have all three of the above qualities, period. However, you can combine any two of them if you like. This is proven in the paper by Lynch et al. Also (and this is important) you can apply different combinations of qualities to parts of your system. Meaning: you can stress consistency in one part, availability in another part, and so on. For instance, order processing or payment processing can be done consistently and available (sacrificing partition tolerance) whereas querying the product catalog can be done differently (stressing partition tolerance in favor of consistency).

Does this contradict or invalidate Atomikos?

Not at all, quite the contrary: it makes Atomikos (and its third generation of TP monitors) all the more relevant. Why? Because Atomikos products can help you in making those parts consistent when you want them to be.

Virtually achieving all three qualities

If you embrace asynchronous messaging (a la JMS or email) and extreme transaction processing (XTP) then it is possible to asymptotically realize all three qualities (consistency, availability, partition-tolerance) provided that you do use a callback mechanism to communicate results (e.g., by sending a confirmation email). Here is how:


  • Queue requests in JMS.

  • Process each request transactionally (so failures will leave the request queued for retries).

  • The process that digests each request can be arbitrarily complex and use transactions (consistency) and return whenever it likes (thanks to the queuing, no reply is expected within a preset time frame).

  • Any lack of availability of the processing is recovered by the queues: failed requests will stay queued until the process in the back-end is in fact available again.



Now did I just break the CAP impossibility? More on this in a next post...