Looking for a "modest data" streaming system

Each system has its own trade-offs

Big Data is what’s cool and sexy. But Google’s problems often aren’t the same as your problems. If you’re a company that relies heavily on data processing, you need to scale parts of your pipelines, and you don’t want to worry too much about the physical location of your services. I don’t know your situation, but chances are you don’t have petabytes of data. In that situation, as opposed to web development for example, you may find a shortage of ready-made frameworks. The big names like Spark, Storm, Hadoop or Beam may not help you.

What I’ve seen several times is that people reach for the big data tools only to end up frustrated because the trade-offs don’t match their problem. Once, we had some 100 GB of zipped data from a biological experiment and needed to identify interesting points and do some visualizations. A colleague of mine was still in the middle of writing a Spark job when my Unix pipeline composed of awk, sort, uniq and their friends finished on my laptop. The preprocessed data were then served to our web application from a PostgreSQL database that handled them without breaking a sweat. I feel confident the same general approach could be extended to a production setup.

By the way, the Spark job never finished, because there was some operational problem with the cluster at the time.

I’m now helping a company grow their data processing pipeline. They need reasonably low latency (on the order of seconds end to end), support for both Python and JVM languages, and an easy way to scale the system and prioritize requests, because they occasionally receive bursts of data.

Frameworks help by constraining you

From my point of view, it would be best to find a framework, even if that means adjusting the coding and architectural style of the application. If the trade-offs of the framework match your problem well enough, fitting your solution into its constraints is actually a benefit, not a drawback. It helps you keep the code organized, forces well-defined interfaces and eases communication about the code across the team. If the stars align, the framework may even help you with operations, e.g. with monitoring.

So far, Apache Storm has been the closest contender. Its latency is low enough that it can be plugged in as a backend to HTTP requests that need a response, and it has a DRPC (distributed RPC) server built exactly for this purpose. You don’t need to care which machine the individual functions run on, and you think in terms of deploying whole pipelines (topologies in Storm’s terminology) at once, not individual functions or workers.
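
To make that concrete, here is a minimal sketch of the client side, assuming the Storm 1.x API, where the DRPC client takes a config plus host and port (3772 is the default DRPC port); the host name and the "score" function name are placeholders. The call blocks until a topology registered under the same function name returns a result.

```java
import org.apache.storm.Config;
import org.apache.storm.utils.DRPCClient;

public class DrpcCall {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        // Host and function name are placeholders; 3772 is the default DRPC port.
        DRPCClient client = new DRPCClient(conf, "drpc.example.com", 3772);
        // Blocks until the topology registered under "score" emits a result.
        String result = client.execute("score", "some-input");
        System.out.println(result);
    }
}
```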

What I dislike, however, is that Storm is mostly JVM-centric and that your data sources (called spouts) need to actively poll the actual source, be it a message queue or the DRPC server.
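
To illustrate the polling model, here is a minimal spout sketch against the Storm 1.x API; the in-memory queue is a stand-in for a real broker client such as a Kafka consumer. Storm calls nextTuple() in a loop, so the spout has to actively pull data rather than having it pushed in.

```java
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class PollingSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // Stand-in for a real broker client (Kafka consumer, AMQP channel, ...).
    private LinkedBlockingQueue<String> source;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.source = new LinkedBlockingQueue<>();
    }

    @Override
    public void nextTuple() {
        // Storm invokes this repeatedly; the spout must actively pull.
        String message = source.poll();
        if (message != null) {
            collector.emit(new Values(message));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}
```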

Options similar to Apache Storm seem to be Concord.io and Twitter Heron. Unfortunately, neither of them seems a good enough fit to overcome my caution about adopting a fresh project that isn’t yet widely used or documented.

Looking further

I’m now also looking one level down, at simpler components. Instead of a full streaming framework, the app may use an architecture oriented around pipes and workers processing messages, replacing the current synchronous requests. For orchestration, platforms like Red Hat’s OpenShift, [Tectonic][tectonic], AWS [ECS][ecs] or plain [Kubernetes][kube] could help manage the deployment and monitoring of the individual services.
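
As a sketch of what such a worker could look like, here is a minimal consumer using RabbitMQ’s Java client, purely as an example broker; the host, the queue name and the process() helper are placeholders. Each replica takes one message at a time, so scaling out means simply running more replicas, and a broker-side priority queue could help absorb the bursts.

```java
import java.nio.charset.StandardCharsets;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder: broker address
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Durable queue named "tasks" (placeholder name).
        channel.queueDeclare("tasks", true, false, false, null);
        // At most one unacknowledged message per worker, so adding
        // replicas scales throughput without rebalancing anything.
        channel.basicQos(1);

        DeliverCallback onMessage = (consumerTag, delivery) -> {
            String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
            process(body); // hypothetical application logic
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        channel.basicConsume("tasks", false, onMessage, consumerTag -> { });
    }

    private static void process(String message) {
        // Placeholder for the actual work.
    }
}
```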

Follow me — @koraalkrabba — on Twitter to learn about progress.



Filip helps companies build lean infrastructure to test their prototypes, but also tackles scaling issues like high availability, cost optimization and performance tuning. He co-founded NeuronSW, an AI startup.

If you agree with the post, disagree, or just want to say hi, drop me a line: