Introduction to Distributed System

This is an excerpt from https://github.com/aphyr/distsys-class

Don’t distribute

  • Rule 1: don’t distribute where you don’t have to
    • Local systems have reliable primitives. Locks. Threads. Queues. Txns.
      • When you move to a distributed system, you have to build from ground up.
    • Is this thing small enough to fit on one node?
      • “I have a big data problem”
        • Softlayer will rent you a box with 3TB of ram for $5000/mo.
        • Supermicro will sell a 6TB box for ~$115,000 total.
    • Modern computers are FAST.
      • Production JVM HTTP services I’ve known have pushed 50K requests/sec
        • Parsing JSON events, journaling to disk, pushing to S3
      • Protocol buffers over TCP: 10 million events/sec
        • 10-100 event batches/message, in-memory processing
    • Can this service tolerate a single node’s guarantees?
    • Could we just stand up another one if it breaks?
    • Could manual intervention take the place of the distributed algorithm?

Distributed queues

  • Kafka, Kestrel, Rabbit, IronMQ, ActiveMQ, HornetQ, Beanstalk, SQS, Celery, …
  • Journals work to disk on multiple nodes for redundancy
  • Useful when you need to acknowledge work now, and actually do it later
  • Send data reliably between stateless services
  • The only one I know that won’t lose data in a partition is Kafka
    • Maybe SQS?
  • Queues do not improve end-to-end latency
    • Always faster to do the work immediately
  • Queues do not improve mean throughput
    • Mean throughput limited by consumers
  • Queues do not provide total event ordering when consumers are concurrent
    • Your consumers are almost definitely concurrent
  • Likewise, queues don’t guarantee event order with async consumers
    • Because consumer side effects could take place out of order
    • So, don’t rely on order
  • Queues can offer at-most-once or at-least-once delivery
    • Anyone claiming otherwise is trying to sell you something
    • Recovering exactly-once delivery requires careful control of side effects
    • Make your queued operations idempotent
  • Queues do improve burst throughput
    • Smooth out load spikes
  • Distributed queues also improve fault tolerance (if they don’t lose data)
    • If you don’t need the fault-tolerance or large buffering, just use TCP
    • Lots of people use a queue with six disk writes and fifteen network hops where a single socket write() could have sufficed
  • Queues can get you out of a bind when you’ve chosen a poor runtime
Advertisements

Subjectivity aside, leave a reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s