How much data is generated on the Internet in one minute? Every time a user signs up to a social network, answers a customer satisfaction survey or fills in a form for a product, enormous quantities of data are gathered: names, surnames, addresses, emails, etc., but also tastes, purchasing power and education. Without some form of established order, databases become barely effective or even useless. To avoid chaos and maintain high-quality data, it is important not only to implement effective measures, but also to become familiar with the concepts that have arisen in the Big Data environment.
The concept of Big Data has become a recurrent topic in the technology field. It refers to the very large volumes of data that large companies are capable of gathering. Because of their size (terabytes or even petabytes), traditional systems are incapable of processing them. Alongside Big Data sits Deep Data: the cross-referencing and selection of data contained in a Big Data environment, carried out by analysts for a specific area, which produces non-redundant reports as a result.
The key difference between Big Data and Deep Data lies in which data are analysed. In Big Data, absolutely everything is analysed, whereas in Deep Data only certain data of interest are used. For example, if a company wants to know which products it will sell next Christmas, it does not need to know the location data of its customers or which language they use when visiting the site. It will, however, want to know about product trends and their relationship with age ranges and other demographic data.
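The Christmas example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`product`, `age`, `location`, `language`), the ten-year age bands and the sample events are all hypothetical, not taken from any real system.

```python
from collections import Counter

# Hypothetical choice: only these fields matter for the sales question.
FIELDS_OF_INTEREST = ("product", "age")

def age_range(age):
    """Map an age to a ten-year band, e.g. 24 -> '20-29'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def product_trends(events):
    """Count purchases per (product, age band), ignoring every other field."""
    counts = Counter()
    for event in events:
        # Deep Data step: keep only the fields of interest.
        selected = {field: event[field] for field in FIELDS_OF_INTEREST}
        counts[(selected["product"], age_range(selected["age"]))] += 1
    return counts

# Invented sample events; location and language are present but never used.
events = [
    {"product": "headphones", "age": 24, "location": "Madrid", "language": "es"},
    {"product": "headphones", "age": 27, "location": "Leeds", "language": "en"},
    {"product": "board game", "age": 41, "location": "Lyon", "language": "fr"},
]
trends = product_trends(events)
```

Location and language stay in the raw events but play no part in the result, which is the selection that distinguishes Deep Data from analysing everything.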
Another concept that cannot be ignored is the Data Lake. This is a storage system that holds data in its raw format in a single repository, from which detailed schemas can later be extracted. The term was coined by James Dixon in 2010, and the approach is used by companies like Amazon.
If a database is not well managed as a whole and contains erroneous or duplicate records, or streets and towns that are out of date, it will most likely stop being effective in the long term. Data in this state are known as Dirty Data. If the objective is to obtain high-quality data, it is necessary to implement collection, standardisation and deduplication methodologies that prevent the proliferation of corrupt data and Dirty Data.
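As a rough illustration of standardisation and deduplication, the following Python sketch cleans a handful of customer records and keeps one record per email address. The field names, the sample records and the decision to treat the email as the duplicate key are assumptions made for the example; a real Data Quality process would normally use richer matching rules.

```python
import re

def normalise(record):
    """Return a standardised copy: whitespace collapsed, emails lowercased,
    other fields in title case."""
    clean = {}
    for key, value in record.items():
        text = re.sub(r"\s+", " ", str(value)).strip()
        clean[key] = text.lower() if key == "email" else text.title()
    return clean

def deduplicate(records, key_fields=("email",)):
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        clean = normalise(record)
        key = tuple(clean[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(clean)
    return unique

# Invented sample data: the first two entries describe the same customer.
customers = [
    {"name": "ana  lópez", "email": "Ana.Lopez@Example.com ", "town": "madrid"},
    {"name": "Ana López", "email": "ana.lopez@example.com", "town": "Madrid"},
    {"name": "John Smith", "email": "john@example.com", "town": "leeds"},
]
result = deduplicate(customers)
```

Standardising before comparing is what allows the two spellings of the same customer to be recognised as one record.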
Big Data, Deep Data and the Data Lake all lead us to the concept of Data Quality. It may seem obvious that data management should focus on collecting and maintaining data properly, but how do we determine the quality of the data without a clear way to measure it? Data Quality sets out how to do this. In his book “Data Quality: High-impact Strategies”, Kevin Roebuck indicates the set of quantitative and qualitative variables that determine the quality of stored data: their degree of accuracy, whether they are complete, how up to date they are and their contextualisation.
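Two of these variables, completeness and how up to date the data are, lend themselves to a simple numerical measure. The sketch below is one possible way to compute them in Python and is not taken from Roebuck's book. The function names, the one-year freshness threshold and the sample records are assumptions for illustration.

```python
from datetime import date

def completeness(records, required_fields):
    """Share of required fields that are filled in across all records."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if str(record.get(field) or "").strip()
    )
    return filled / total

def freshness(records, today, max_age_days=365):
    """Share of records updated within the last max_age_days days."""
    if not records:
        return 0.0
    recent = sum(
        1 for record in records
        if (today - record["updated"]).days <= max_age_days
    )
    return recent / len(records)

# Invented sample data: one record lacks an email and is several years old.
records = [
    {"name": "Ana López", "email": "ana@example.com", "updated": date(2024, 5, 1)},
    {"name": "John Smith", "email": "", "updated": date(2021, 1, 15)},
]
completeness_score = completeness(records, ["name", "email"])  # 3 of 4 fields filled
freshness_score = freshness(records, today=date(2024, 6, 1))   # 1 of 2 records recent
```

Accuracy and contextualisation are harder to score automatically, since they usually require comparison against a trusted reference source or a judgement by an analyst.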
The degree to which all these variables are fulfilled will allow any company to tackle the process of business data analytics and generate relevant reports. To do so, it is essential to include procedures that identify possible duplicates and that correct or eliminate erroneous data by means of a Data Quality tool.