Data is Important. In today’s digital world, organizations are relying more and more on data, information, and analytical insights to grow their business. In a small-to-medium business, however, you might hear something like ‘I don’t have any data,’ or ‘I don’t need any data employees, we’re not big enough for it to matter.’ In this article, we’ll show how ‘data’ is more than just databases, why it’s important to structure and organize your data efficiently, and why more data is not always better.
When many organizations think about their data, they think about what is stored in their databases and data warehouses. But data can also come from places other than dedicated data stores such as databases, data warehouses, and data lakes. Data can take the form of reports generated from other business systems (ERP, monitoring systems, etc.). Payroll entries can be useful data, especially total time and dollars related to a particular project or client. IoT and social media feeds can also contain valuable data, which an organization can use to make better decisions, faster.
In today’s business world, we hear terms like ‘data warehouse,’ ‘data lake,’ ‘machine learning,’ and others. Many organizations, especially small and medium-sized ones, are not sure exactly what these are or how they should use them. Essentially, each of these is used for the same purpose: decision making. Although they share that purpose, each helps an organization make decisions based on factual data in a different way.
In brief, let us take a look at each one of these:
- Data Warehouse – A large data set containing historical data, used to determine trends and patterns over longer periods of time. A data warehouse holds much larger data sets than an operational database, but primarily aggregations of data rather than raw data. This aggregated, historical data should be kept separate from operational databases, which should focus only on current data.
- Data Lake – A data lake also holds large amounts of historical data, but it differs from a data warehouse in one primary way: the data stored in a data lake is not structured. A data lake contains items such as exported reports, generated CSV files, image files, video files, and similar items that don’t lend themselves to typical database storage. These are still an important source of data for making decisions, but it’s much harder to run a typical ad-hoc query against them than against a traditional database or data warehouse.
- Machine Learning – This is a powerful tool for making business decisions based on trend data and mathematical predictive data models. Machine learning uses the data from data warehouses and data lakes to detect patterns and trends, and to make predictions based on similar characteristics. An example of this might be looking at all customer leads and predicting how likely each one is to become an actual sale. (It’s more complicated than this, and there are hundreds of use cases, but this should serve as a useful example of the importance of data and machine learning.)
Each of these has different use cases, and each requires that data be stored in a different way. For example, data may be stored in convenient tables for a data warehouse, but in a more expansive design for machine learning. The important point to take away here is that each of these needs data stored in the right way for its specific use case.
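To make the difference between raw and aggregated data concrete, here is a minimal sketch in Python. The client names, projects, hours, and rates are all made up for illustration, and `aggregate_by_project` is a hypothetical helper rather than part of any particular product. It rolls individual time entries (the kind of raw data an operational system records) up into one summary row per client and project, which is closer to what a data warehouse might hold.

```python
from collections import defaultdict

# Hypothetical raw time entries, as an operational system might record them:
# (client, project, hours, hourly_rate). All values are invented.
raw_entries = [
    ("Acme", "Website", 3.0, 100.0),
    ("Acme", "Website", 2.0, 100.0),
    ("Acme", "Support", 1.5, 80.0),
    ("Birch", "Audit", 4.0, 120.0),
]


def aggregate_by_project(entries):
    """Roll raw entries up into one summary row per (client, project)."""
    totals = defaultdict(lambda: {"hours": 0.0, "dollars": 0.0})
    for client, project, hours, rate in entries:
        row = totals[(client, project)]
        row["hours"] += hours
        row["dollars"] += hours * rate
    return dict(totals)


summary = aggregate_by_project(raw_entries)
for (client, project), row in sorted(summary.items()):
    print(client, project, row["hours"], row["dollars"])
```

The summary is smaller and easier to query for trends, while the raw entries keep the detail that a machine learning model might need. Which form to store depends on the use case.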
Everybody agrees that data is important, and even that all data is important. However, there are cases where additional data can be detrimental to the decision-making process. If data is not relevant to a particular data model or decision, it should not be included in that model. Adding irrelevant data can skew the final model and lead to a less-than-optimal decision.
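As a rough illustration of how irrelevant data can skew a result, the sketch below predicts whether a new sales lead will convert by copying the outcome of the most similar past lead. This is a deliberately simple nearest-neighbor approach chosen for the example, not a recommended modeling technique. The field names (`demos`, `customer_id`) and all values are assumptions invented for the demonstration.

```python
import math

# Hypothetical past leads. "demos" (product demos attended) is plausibly
# relevant to conversion; "customer_id" is an arbitrary number and is not.
past_leads = [
    {"demos": 5, "customer_id": 900, "converted": True},
    {"demos": 0, "customer_id": 105, "converted": False},
]
new_lead = {"demos": 4, "customer_id": 100}


def predict(new, history, features):
    """Return the outcome of the most similar past lead (1-nearest-neighbor)."""
    def distance(past):
        return math.sqrt(sum((new[f] - past[f]) ** 2 for f in features))
    return min(history, key=distance)["converted"]


# Using only the relevant field, the new lead resembles the converted one.
relevant_only = predict(new_lead, past_leads, ["demos"])

# Adding the irrelevant ID makes the new lead look closest to the lost one.
with_irrelevant = predict(new_lead, past_leads, ["demos", "customer_id"])

print(relevant_only, with_irrelevant)
```

With only the relevant field the prediction is a likely sale, but once the arbitrary ID number is included it dominates the comparison and flips the answer.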
In addition to using only relevant data when querying or analyzing, one must also consider missing or invalid data. Ideally, records with missing or invalid data should be scrubbed and adjusted to allow for more accurate modeling. Trying to estimate house prices, for example, isn’t going to work if half of the existing records have no price listed.
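A minimal sketch of that kind of check, using invented house listings, might look like the following. The validity rule (a price must be present and greater than zero) and the choice to set bad records aside rather than estimate their values are assumptions made for this example. In practice the right adjustment depends on the data and the decision being made.

```python
from statistics import median

# Hypothetical listings; None marks a missing price and -1 an invalid one.
listings = [
    {"sqft": 1200, "price": 240000},
    {"sqft": 1500, "price": None},
    {"sqft": 1800, "price": 330000},
    {"sqft": 2000, "price": -1},
    {"sqft": 2200, "price": 410000},
]


def is_valid(price):
    """Assumed rule: a usable price is present and greater than zero."""
    return price is not None and price > 0


def scrub(records):
    """Split records into those with usable prices and those without."""
    clean = [r for r in records if is_valid(r["price"])]
    rejected = [r for r in records if not is_valid(r["price"])]
    return clean, rejected


clean, rejected = scrub(listings)

# Share of records that could not be used; a high value is a warning sign.
missing_share = len(rejected) / len(listings)

# Any estimate should be based on the clean records only.
median_price = median(r["price"] for r in clean)

print(missing_share, median_price)
```

Reporting the share of rejected records alongside the estimate makes it easier to judge how much to trust the result.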
To summarize, all data is important, but organizing it in an efficient manner is the only way to obtain maximum value and insights from it. The ultimate goal in all of this data management is not to keep everything so that it’s available if it ever becomes useful, but to pull maximum value from it, to make business decisions based on it, and drive your organization forward.
Ascend Technologies can help your organization take control over its data, help you utilize that information to drive your business forward, and help bring you into the digital future.
Written by Andy Maser, Data Architect