Big data is something of a buzzword, not only in the BI space, but in the larger tech space overall. No one wants to implement data solutions anymore; everyone wants to implement big data solutions. Vendors aren’t interested in solving your data needs; they’re interested in solving your big data needs. Businesses aren’t hiring data scientists, but they’re constantly looking to fill big data scientist roles.
What does big data refer to?
The term “big data” has been used and abused so widely that it’s hard to pick out what it actually means. It’s close to losing all usefulness as a term for discussing a real concept in the BI space. Still, it does represent a real concept, and it’s something that any business professional needs to know and understand.
It may seem basic, but big data really just is what it says it is: data that’s bigger than normal. At a certain scale, datasets become too big to be handled by anything but the largest and most robust data solutions.
A smaller or mid-sized business might be comfortable storing all their data with a smaller, less expensive data storage solution, and using a basic BI tool to analyze that data. Larger businesses and enterprises will soon start to run up against the limits of their smaller data solutions, since they produce so much more data than smaller organizations.
Aspects of big data
So what actually makes data “big data?” How can a dataset, which lives as a bunch of ones and zeros on a hard drive in a server cabinet somewhere, be bigger than another given dataset that also only exists as files on a server?
In the simplest sense, a dataset’s size is how much space it takes up on a hard drive or other storage medium. But raw size isn’t the only thing that makes some data bigger than other data. There are three main ways data gets big: volume, velocity, and variety.
Volume: the actual quantity of data stored in the dataset. Even a dataset made up of simple quantitative data might be big data if it has enough entries. For example, think of the data that a credit card company might collect within a single day. Each individual data point may not be too complex (just the customer, what they bought, and how much they paid, plus some metadata), but across a network of millions of credit card users, this data starts to take up a lot of space.
Challenges with data volume
For companies like this, it’s a massive challenge to store all this information, even if it is just simple numbers and text. Customers need to be able to access this data, and in many cases they need everything they’ve ever generated, not just a few entries. The company also wants to analyze the data, since it holds valuable information about its customers’ spending habits. Because of the sheer amount of data involved, the business that generated it has to implement novel solutions for managing it. That’s big data.
How do I know I have a high volume of data?
The general threshold for whether a dataset has enough volume to be considered “big data” is whether it can still be stored using an off-the-shelf, consumer-grade storage solution. If it can’t, it’s big data. This threshold has changed over time as digital methods for storing information have become more efficient. In the 1990s, a dataset that took up gigabytes of space might have been considered big data, but with today’s storage methods, even datasets dozens of terabytes large might not require a big data solution.
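As a rough illustration of that threshold, the back-of-the-envelope sketch below multiplies the number of records by the size of each record and compares the result with a storage capacity. Every figure in it (the number of cardholders, transactions per day, bytes per record, and the 20 TB capacity) is a made-up assumption for illustration, not a number from any real company, and the function names are hypothetical.

```python
TB = 10**12  # one terabyte, in bytes (decimal definition)


def estimate_dataset_bytes(num_customers, transactions_per_customer_per_day,
                           bytes_per_record, days):
    """Estimate raw storage for a transaction log, ignoring compression and indexes."""
    records = num_customers * transactions_per_customer_per_day * days
    return records * bytes_per_record


def fits_on(capacity_bytes, dataset_bytes):
    """Return True if the dataset fits within the given storage capacity."""
    return dataset_bytes <= capacity_bytes


# Hypothetical credit card network: all numbers below are assumptions.
one_year = estimate_dataset_bytes(
    num_customers=50_000_000,
    transactions_per_customer_per_day=2,
    bytes_per_record=500,
    days=365,
)
ten_years = estimate_dataset_bytes(50_000_000, 2, 500, 3650)

print(f"One year:  {one_year / TB:.2f} TB")   # 18.25 TB
print(f"Ten years: {ten_years / TB:.2f} TB")  # 182.50 TB
print("One year fits on 20 TB: ", fits_on(20 * TB, one_year))
print("Ten years fit on 20 TB:", fits_on(20 * TB, ten_years))
```

Under these assumptions a single year of records still fits on ordinary storage, while a full decade of history does not, which is the point at which a business would start looking at big data solutions.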
Velocity: the speed at which new data is added to the dataset. As anyone who’s studied compound interest knows, even a small initial set can quickly get huge if it’s being added to continually.
Common examples of data velocity
A good example of a dataset that’s big due to velocity is a stock price. A company’s stock price is an aggregate of all the data a stock exchange collects about how the stock is selling. Each of those data points is just one number: the amount of money that the stock is selling for. For some companies, their stock might be bought and sold thousands of times a second. The aggregate of those sales changes constantly based on new information. Even though the system is only ever logging one data point at a time, the dataset will still get massive quickly.
Stock exchanges need special tools to keep up with this data. If an exchange can’t stay up-to-date on what price stocks are selling for, it can’t facilitate good trades for the members of the exchange. They need powerful big data solutions to keep up with these extremely rapid data updates.
There’s no hard rule for how fast data needs to accumulate to be considered big data. A business might not need a big data solution to track metrics that update a few dozen times an hour, but information that needs to be updated within a second will usually need a special tool.
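To see how much difference the update rate makes, the short sketch below compares the daily growth of a slowly updating metric with that of a high-frequency feed. The rates (36 updates an hour versus 5,000 events a second) and the 100-byte event size are assumptions chosen only to mirror the two cases described above.

```python
HOURS_PER_DAY = 24
SECONDS_PER_HOUR = 3_600


def daily_growth_bytes(events_per_hour, bytes_per_event):
    """Bytes added to a dataset per day at a steady event rate."""
    return events_per_hour * HOURS_PER_DAY * bytes_per_event


# Hypothetical workloads: the rates and event size are assumptions.
slow_metric = daily_growth_bytes(events_per_hour=36, bytes_per_event=100)
fast_feed = daily_growth_bytes(
    events_per_hour=5_000 * SECONDS_PER_HOUR,  # 5,000 events per second
    bytes_per_event=100,
)

print(f"Slow metric: {slow_metric / 10**3:.1f} KB per day")  # 86.4 KB
print(f"Fast feed:   {fast_feed / 10**9:.1f} GB per day")    # 43.2 GB
```

Each event is the same size in both cases. Only the rate changes, yet the fast feed grows by a factor of 500,000 more per day.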
Variety: the actual content of the data, such as whether it’s text, video, images, and so on, and whether the data is stored as raw information or in a format that can be easily analyzed. The more types of data that need to be stored as part of a dataset, the harder the set will be to manage as a whole.
How to store different types of data
Services that store a user’s personal data, like Google Drive or Dropbox, have to figure out how to store a wide range of data types. A given user’s data may not be big in terms of size, but it can be big in terms of how many different file formats must be stored at once. A user might use their Google Drive to store pictures, video, text files, spreadsheets, applications and programs, and audio files. Across the entire Google Drive user base, that’s a massive dataset just in terms of the different types of data that need to be stored.
Structured vs unstructured data
It also matters whether data is structured or unstructured. Structured data gives the programs designed to read it more context, such as whether the data needs to be read in a certain order or presented in a certain way. Unstructured or semi-structured data gives the system less context, if any at all. Structured data carries more metadata, which adds overhead when storing it, but unstructured data is generally harder to analyze.
To be considered a big dataset, a dataset usually needs to contain a blend of structured and unstructured data, plus a few different data types. Usually, these alternative data types include at least one type of qualitative data, which can’t be easily analyzed.
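To make the difference between structured and unstructured data concrete, here is a minimal sketch in Python. The record, the sentence, and the field names are all hypothetical. It stores the same purchase twice: once as a structured record with named fields, and once as free text that has to be pattern-matched before it can be analyzed.

```python
import re

# Structured: named fields tell a program exactly where each value lives.
structured = {"customer_id": 1042, "item": "coffee", "amount_usd": 4.50}

# Unstructured: the same facts as free text, with no schema to rely on.
unstructured = "Customer 1042 grabbed a coffee this morning and paid $4.50."

# Reading the amount from the structured record is a direct lookup.
amount_structured = structured["amount_usd"]

# Reading it from the text means guessing at a pattern, which may not
# hold for the next sentence that comes along.
match = re.search(r"\$(\d+(?:\.\d{2})?)", unstructured)
amount_unstructured = float(match.group(1)) if match else None

print(amount_structured, amount_unstructured)
```

The regular expression only works because this one sentence happens to contain a dollar sign followed by a number. A dataset full of differently worded text would need far more elaborate handling, which is part of why variety makes a dataset harder to manage.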
These three criteria aren’t mutually exclusive. Many, if not most, big datasets will have volume, velocity, and variety. The important thing to remember is that big data doesn’t necessarily need to have all three. There isn’t some sort of diagnostic test that experts use to divine whether or not data is big; if your data is big enough to start straining your current storage solutions, then it’s big enough to need new strategies.
What do you use big data for?
Many of the most valuable and interesting applications of data science are only possible through the use of big data. As a dataset gets bigger, and as long as it’s collected in a representative way, it tends to become a clearer reflection of reality. When data scientists perform analysis on big datasets, they can be more confident in the results, since estimates drawn from a big dataset are less affected by random noise than estimates drawn from a smaller one.
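One way to see why a bigger dataset supports more confident conclusions is the standard error of the mean, which shrinks roughly with the square root of the sample size. The sketch below is an illustration under stated assumptions: it draws two random samples from the same simulated population (a made-up mean of 100 and standard deviation of 15) and compares their standard errors. It assumes the samples are representative. A larger dataset reduces random noise, but it does not correct for biased collection.

```python
import math
import random
import statistics


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population parameters, chosen only for illustration.
population_mean, population_sd = 100.0, 15.0

errors = {}
for n in (100, 10_000):
    sample = [rng.gauss(population_mean, population_sd) for _ in range(n)]
    errors[n] = standard_error(sample)
    print(f"n={n:>6}: mean={statistics.mean(sample):.2f}, "
          f"standard error={errors[n]:.3f}")
```

With 100 times as many data points, the standard error comes out roughly ten times smaller, so the estimate of the mean is pinned down much more tightly.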
Data science initiatives
Every business wants to be confident in its data science models, but for large businesses and enterprises, it’s absolutely crucial. Enterprises can’t justify dedicating resources in a new way unless they’re completely sure that the data analysis they’re using to make their decisions reflects reality as fully as possible. In high-importance data analysis situations, like forecasting or other predictive analysis, businesses need to use big datasets to get the best results.
Machine learning applications
Big datasets are also important for machine learning. Machine learning has become as much of a buzzword as big data, if not more so, but machine learning algorithms can be very useful in the right situations.
Machine learning has many potential benefits for almost every kind of business, especially those that rely on human input to do a lot of their work. However, the algorithms that drive machine learning need a sufficiently large dataset to begin with. Machine learning algorithms have to learn their behaviors from somewhere, and the best way to do that is to train them on large sets of data.
These are just a few of the ways that businesses can leverage their big datasets to make decisions and draw out valuable insight. Big data isn’t limited to just the largest businesses; any business that collects data about its operations can unlock these benefits with the right tools.
As customers connect with businesses in more ways, and businesses connect more of their operations with the cloud, the amount of data accessible to any given company is steadily climbing. Even companies that may not think they collect any sort of useful data can get big data insights with the right tools. BI tools, especially ones designed for big data applications, can make all the difference.