Big Data vs. Mechanical Sympathy

Welcome to the showdown!

In one corner, we have Big Data: the giant, lumbering juggernaut. He’s unstoppable, folks! Throw more data at ‘em, and he crunches right through it… as long as you keep paying your cloud bill.

In the other corner, we have Mechanical Sympathy: swift as a leopard, agile as an impala (not Apache Impala - that’s one of Big Data’s groupies). His strategy is to go with the flow, save his energy, and adapt, adapt, adapt!

Who is going to win? We’ll have to wait and see!

Big Data is a term that we are all familiar with. Starting with the seminal MapReduce paper from Google in 2004, continuing with the release of Apache Hadoop in 2006, and reaching a fever pitch in the early 2010s, Big Data changed the way that we think about data processing. Data was no longer the domain of spreadsheets and SQL databases but of clusters, frameworks, and distributed systems. I do not want to downplay the significance of the Big Data movement. It enabled companies like Google, Facebook, and Twitter to exist. It changed the way that scientists could analyze particle collider experiments. To be fair, it changed the world.

Big Data: 1, Mechanical Sympathy: 0.

But here’s the deal. Most companies - even large enterprises and most internet startups - don’t operate at a scale that calls for big data. And there’s more: big data also means big overhead. For small data processing tasks, the overhead of planning execution, spinning up compute resources, scheduling jobs, loading data, shuffling it across the network, and sometimes writing checkpoints makes queries orders of magnitude slower. Most of us have pretty average data, but “Average Data” doesn’t pack a marketing punch.

Enter: Mechanical Sympathy

What about mechanical sympathy? If this is your first time hearing that term, I don’t blame you. There are not any Hortonworks or Clouderas of mechanical sympathy out there. There are no big frameworks with tons of mindshare. So what is mechanical sympathy, and why do I think it may just be the next important thing in data?

Mechanical sympathy is the practice of understanding and working with the hardware so that the systems you build are the most natural and efficient ones for that hardware to run. Rather than stacking abstractions on top of abstractions, we take a step back, take a step down, and figure out how to get the most out of these incredible machines on our desks, in our pockets, and in our data centers. To be clear, this does not mean everyone working with a computer must become an assembly programmer. By knowing more about how computers work and how data systems interact with them, we can start selecting more appropriate and efficient tools. Maybe we’ll still need Spark or Presto sometimes. But other times we can get by with DuckDB. Instead of executing a query on a cluster of virtual machines running containers with a programming language VM and runtime to fetch data from a cloud object store, we could, you know, just run it on our laptop. Or phone. Or watch… maybe.
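For a sense of what “just run it on your laptop” can look like, here is a minimal sketch using DuckDB’s Python API to run an aggregation in-process, straight over a local file. The file and column names (events.parquet, user_id, ts) are hypothetical placeholders, not from any real dataset.

```python
# A minimal sketch: the kind of aggregation you might otherwise ship to a
# cluster, run in-process on a laptop with DuckDB.
# The file and column names here are made up for illustration.
import duckdb

result = duckdb.sql("""
    SELECT date_trunc('day', ts) AS day,
           count(DISTINCT user_id) AS daily_users
    FROM 'events.parquet'      -- DuckDB reads the Parquet file directly
    GROUP BY day
    ORDER BY day
""").fetchall()

for day, users in result:
    print(day, users)
```

If the data ever genuinely outgrows one machine, the same query can move to a warehouse or a Spark job later; the point is that nothing about it requires a cluster on day one.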

Big Data: 1, Mechanical Sympathy: 1.

The problem right now is that few quality tools operate in harmony with our hardware. The past three-plus decades of software engineering have primarily focused on making code that is human-grokable (a Very Good Idea). However, we have done so at the expense of efficiency. Even though your computer today likely has at least 8 times the CPU cores running almost 20 times faster and with hundreds of times as much memory as your typical desktop in 1998, many applications are just as slow as they were on that old computer. As hardware has improved, we have met (or even outpaced) the improvements with costly abstractions.

As a small anecdote, I recently experimented with prototyping a data-loading tool that operates a bit closer to the metal than the existing options. I could start up a connector on the order of 10,000 times faster than other tools. This is not because I was doing anything outrageous. It is because I wasn’t doing anything I didn’t need to.

[Chart: Runtime of loading a data connector across three tools. Fivetran took about 74 seconds, Airbyte took about 10 seconds, and maat (the author's tool) took about 2.4 milliseconds.]

If I have not at least convinced you to think about this mechanical sympathy thing, I’d like to talk about one more point: carbon. Data centers - and even desktop and laptop computers - draw a lot of power. It is not unusual for CPUs today to draw ten times the power of a circa-1998 CPU. The amount of compute-per-watt, however, has drastically improved, so we can do way more work with less power than we could twenty or thirty years ago. However, we have decided to optimize for programmer hours instead of watts and work harder, not smarter. Sooner or later, this has to stop. Whether we voluntarily push for more efficient data processing (and computing in general) or wait for regulation to force it, we will eventually need to focus on cutting back our power consumption. Yes, innovations in power will likely mean that we have a cleaner, more efficient energy supply. Yes, innovations in hardware will likely make computer architecture even more efficient. And yes, we will still need to make more efficient software and use it to process data more efficiently.

The Punchline: It Depends

Having just talked up mechanical sympathy, let me say that Big Data is not going anywhere anytime soon, nor should it. While there are costs - monetary, complexity, and environmental - there are also incredible benefits to the kinds of processing for which we can use Big Data. The thoughtful use of Big Data in science and industry can be a net win for all of us. The important thing is to consider the trade-offs.

Let me tell you a story from my experience at a previous job. We migrated from running analytics on MySQL to BigQuery. This was an essential transition, and the company probably would not have survived without making this move. However, some queries that had taken 200 milliseconds now took about 5 seconds. On the other hand, queries that had taken 45 minutes (yes, minutes) now completed in under 10 seconds. We traded performance on small data for the ability to process big data. The loss was that the performance floor was much higher, the monetary cost was painful, and the total energy required increased. However, we continued offering our customers a valuable service that helped them be more efficient, and I am certain that this was a net win.

Big Data: ?, Mechanical Sympathy: ?.

Ding! Ding! Ding! It looks like this match doesn’t have a clear winner, but I would love to hear your thoughts!