Big Data: SQL Planning & Migration to Spark and Hadoop


I was in a meeting the other day discussing a problem that a client keeps running into. They need a platform to analyze trends in a rapidly growing data set, where the criteria are changing as fast as their business is changing, which, as it turns out, is pretty fast. Right now they are storing the data in a relational database and writing complex SQL queries to mine information from it. The DBA told us that he would run a query and then go to lunch, hoping it would be done by the time he got back. They need the results faster, and they know that their problem is just going to get worse as the data grows.

The knee-jerk reaction to a problem like this is to get a bigger database server. Sure, this may help right now when the data is only a few hundred gigabytes, but what happens when we are dealing with a few hundred terabytes? A few hundred petabytes? This kind of solution just does not scale.

The real answer here is to step back, examine the problem, understand what the goal is, and then design a process that can achieve that goal. In this case, the problem is that a business needs to be able to understand patterns and trends in a rapidly growing data set. The goal is to be able to do this quickly and consistently even as the data grows. One way to achieve this is to use something like Hadoop or Spark to build a cluster that can scale as the data scales.

There were concerns as soon as I brought this up: What about the schema? How do you write SQL for that? Why not just shard the database? Some of these concerns may be valid, but I feel we must evaluate this without emotion. Do people want to use the relational database because it is a better solution for the problem or because they feel comfortable with it?

I’m not sure it’s accurate to say that we are facing new problems these days, but the shape and size of our problems have changed. Now even the smallest company has something to gain from working with big data; anyone with a credit card can spin up a compute cluster. We should not be afraid to change our tools as our challenges change.

Technology is continuously evolving. This means our tools are continuously changing, and so must our processes for tackling new challenges. I believe that the system we came up with in that meeting will be the one to solve our client’s problem. If someone gave us the same problem five years ago or five years from now, we would probably have wildly different suggestions, but we would come to those suggestions in the same way: through deep understanding of both the problem and the technology available.