But data analysis itself is still prone to several problems. For example, there's a balancing act that data software developers have to perform when building tools for big data analysis, and there are only so many ways around it.
Speed vs. Scalability
When querying big data, data tool developers need to consider which is the more important priority: speed or scalability.
Let’s say you want to prioritize speed. Generally, this type of software pulls data straight from the source and stores it in a device’s memory or on disk. Queries resolve quickly, but there’s a significant downside: you can only hold one slice of your data at a time, and it’s usually summary data. It’s also nearly impossible to keep real-time data here, since it changes so frequently. In short, you can achieve fast queries, but you’ll never have the full library of data at your disposal.
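A minimal sketch of the speed-first approach, using hypothetical sales data: the tool pre-aggregates the source into an in-memory summary, so queries are instant, but only questions the summary anticipates can be answered. The data and function names are illustrative, not from any particular product.

```python
from collections import defaultdict

# Hypothetical raw "source" data: one row per sale (region, amount).
SOURCE_ROWS = [
    ("north", 120.0), ("south", 80.0), ("north", 45.5),
    ("east", 200.0), ("south", 60.0),
]

def build_summary(rows):
    """Pre-aggregate the source into per-region totals held in memory."""
    summary = defaultdict(float)
    for region, amount in rows:
        summary[region] += amount
    return dict(summary)

SUMMARY = build_summary(SOURCE_ROWS)

def total_sales(region):
    """Answers instantly from memory -- but only for the pre-chosen rollup.

    Any question the summary didn't anticipate (e.g. per-customer totals)
    can't be answered without going back to the source.
    """
    return SUMMARY.get(region, 0.0)
```

Note that `total_sales` never touches `SOURCE_ROWS` at query time; that's the whole point, and also the whole limitation.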
So what happens when you prioritize scalability? Here, a developer will grant you direct access to all the data in your system. Rather than accessing a mere subset or being forced to query old data, you can access unlimited data points in real time. The weakness is that queries take much longer to process—especially in environments dealing with petabytes of data.
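The scale-first approach, by contrast, can be sketched as a full scan: any ad-hoc question can be answered because every row is reachable, but every query pays the cost of reading the data. The event rows below are a made-up stand-in for a much larger store.

```python
# Hypothetical event log: (user_id, event, latency_ms) -- stands in for
# the full dataset a scale-first tool lets you reach directly.
EVENTS = [
    (1, "click", 12), (2, "view", 40), (1, "view", 35),
    (3, "click", 9), (2, "click", 15),
]

def scan_query(rows, predicate):
    """Full scan: answers any ad-hoc question, but touches every row.

    At petabyte scale this read cost is exactly the latency problem
    described above.
    """
    return [row for row in rows if predicate(row)]

clicks = scan_query(EVENTS, lambda r: r[1] == "click")
```

Because `predicate` is arbitrary, no question is off-limits—the opposite of the fixed-summary approach.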
Is There a Solution?
So is there a way to resolve the push-pull problem that speed and scalability present? At present, there’s no direct solution; when optimizing for one, you’re inherently going to create challenges for the other.
However, there are some strategies that can help you compensate for the inherent weaknesses of either approach, or help you develop the right solution for your business.
Favoritism. If your business needs demand one type of optimization over the other, you may simply choose a solution that tips to that side of the equation. For example, if you know you’re going to be making lots of common, high-level queries throughout the day, it may be in your best interest to optimize for speed over scalability.
The balancing act. Some solutions try to find a midpoint between speed and scalability, looking for the point where users become frustrated by speed and performance issues, then pushing scalability as far as it will go within those parameters. This midrange approach is ideal for many businesses without a strong priority in either direction.
Hardware upgrades. If you’re optimizing for speed, you can use hardware upgrades to buy yourself more scalability. However, these can be expensive—especially considering the recent surge in GPU and other processing unit prices. Getting enough memory and processing power to handle your large-scale queries may be feasible for a user or two, but not for an entire team of people (unless you have a generous hardware budget).
Caching. If you’re leaning toward scalability, but want to keep your performance reasonable, caching may be an option. The idea here is to keep a running cache of certain high-level information, so you can support fast querying for the vast majority of the queries you encounter. If an answer isn’t available in the cache, users can access the full database at the cost of speed.
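The caching strategy can be sketched in a few lines: repeated high-level queries are served from a cache, and only a cache miss falls through to a full scan of the backing data. The rows and names here are hypothetical.

```python
# Hypothetical sales rows; in practice this would be the full backing store.
ROWS = [("north", 120.0), ("south", 80.0), ("north", 45.5)]

CACHE = {}

def total_sales(region):
    """Serve repeated high-level queries from the cache.

    Fast path: the answer is already cached, no scan needed.
    Slow path: a cache miss falls back to scanning the full dataset,
    then stores the answer for next time.
    """
    if region in CACHE:
        return CACHE[region]  # fast path
    total = sum(amt for reg, amt in ROWS if reg == region)  # slow path
    CACHE[region] = total
    return total
```

The first query for each region pays the scan cost; every repeat is a dictionary lookup. Real systems also have to decide when cached answers go stale, which this sketch ignores.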
Projections. You can also use projections as a way to compensate for low performance, if you’re optimizing for scale. This solution is meant to provide an estimate for a given query, rather than scouring the entire database for an answer. This isn’t an acceptable measure if precision is a must.
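One common way to implement a projection is sampling: estimate an aggregate from a random subset of the data and scale the result up, rather than reading every row. This is a generic sketch, not any specific tool's method.

```python
import random

def estimated_sum(values, fraction=0.1, seed=0):
    """Project a total from a random sample instead of reading every value.

    Scales the sample sum by the inverse of the sampling fraction. Fast,
    but only an estimate -- unsuitable when exact answers are required.
    """
    k = max(1, int(len(values) * fraction))
    sample = random.Random(seed).sample(values, k)
    return sum(sample) * (len(values) / k)
```

With a 10% sample, the query reads a tenth of the data; the trade is a margin of error that shrinks as the sample grows.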
What’s more important to you, speed or scalability? You don’t need to choose one or the other, but you should know where your priorities lie. Different data analytics tools are going to excel in one department or the other, so make sure you understand your needs before moving forward with any new business intelligence (BI) or analytics tool.