Good Data or Big Data?

Arnab Chakraborty
Nov 8, 2022

In an article earlier this year, the great Andrew Ng urged the ML community to be more data-centric than model-centric. It is great to see the industry increasingly recognizing the necessity of data, good data.

There has been significant development in the modelling space, where new models keep arriving with superior predictive capability, especially in deep learning. Yet the Tesla algorithm has recognized the moon as a yellow light, and we could arguably say that much of this has to do with the input data and the associated labelling.

What we failed to realize is that Data Science has two components: Data + Science (algorithms, code, math, and stats). For a considerable part of the time the term “Data Science” has existed, most of the focus was on the science part. Although we know that data is important, little emphasis has been given to ensuring data quality. When 80% of the Data Science job is preparing data, the data had better be clean. A recent article in MIT Technology Review states that hundreds of AI tools were built to detect COVID, but none of them helped. A review in the British Medical Journal examined 232 algorithms that try to predict how sick patients may get and found that none qualified as clinically “fit” for use.

What was the problem there? Primarily the quality of the data.

In Data Science we know that the more data, the merrier. I would say the more “good” data, the merrier. We have invested in technology and programming to come up with better storage, better programming, and better computing. The same focus has not been put on making the data clean. Part of the problem is that cleaning data is not considered the “sexiest” job in the industry.

The heading of this article asks a question: “Good data or Big data?” To me, the answer is neither; we need a combination of the two, “Good Big data”. The field of data science is expanding beyond the consumer internet, and the quality of data will decide its future course. Please remember that industries outside the consumer internet have existed for longer, and they have lots of data. But most of it is not fit for purpose.

The consumer internet is so successful today because its companies defined what they would need and started operating from there. Whilst data in the consumer internet space was collected with the goal of building models, that is not the case for most other industries. As a result, we have “Big” data of “Poor” quality, and when we try to fit models on top of that data, we often find that they do not perform as expected.

We need to shift our focus from “Big” data to “Good Big” data. A lot of emphasis needs to be placed on data cleaning and on making sure historical data fits model-building purposes. Organizations need to incentivise employees who take responsibility for cleaning data, correcting the labelling of historical data, or any other task that transforms “Big Data” into “Good Big Data”.
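To make this a little more concrete, here is a minimal sketch of what a first data-quality check could look like. It is an illustration, not a recommended tool: the function name `audit_records`, the field names, and the toy records are all made up for this example, and a real project would likely need many more checks than these three.

```python
from collections import defaultdict


def audit_records(records, feature_keys, label_key):
    """Count three common quality problems in a list of dict records.

    Hypothetical helper for illustration only; the checks and their
    definitions are assumptions, not an established standard.
    """
    missing = 0
    duplicates = 0
    seen = set()
    labels_by_features = defaultdict(set)

    for rec in records:
        # A record is incomplete if any expected field is absent or None.
        if any(rec.get(k) is None for k in (*feature_keys, label_key)):
            missing += 1
            continue

        features = tuple(rec[k] for k in feature_keys)
        row = (features, rec[label_key])

        # An exact repeat of an earlier row (same features, same label).
        if row in seen:
            duplicates += 1
        seen.add(row)

        # Track every label ever attached to this feature combination.
        labels_by_features[features].add(rec[label_key])

    # Feature combinations that were given more than one distinct label.
    conflicting = sum(1 for labels in labels_by_features.values() if len(labels) > 1)

    return {
        "missing": missing,
        "duplicates": duplicates,
        "conflicting_labels": conflicting,
    }


# Toy data, invented for this example.
records = [
    {"temp": 38.5, "cough": True, "label": "sick"},
    {"temp": 38.5, "cough": True, "label": "sick"},      # exact duplicate
    {"temp": 38.5, "cough": True, "label": "healthy"},   # same features, other label
    {"temp": None, "cough": False, "label": "healthy"},  # missing value
    {"temp": 36.6, "cough": False, "label": "healthy"},
]

report = audit_records(records, ["temp", "cough"], "label")
print(report)
```

Even a small report like this gives the people cleaning the data something measurable to work against, which is where incentives can start.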

Science alone will not lead us to a world where AI has the power to bring about the next industrial revolution, similar to what electricity did for civilisation. “Good” Data + Science will.

All views expressed here are my personal opinion.

