The Data-Centric Approach to Artificial Intelligence

Jun 21, 2022
3 min read

"Who has the data has the power." In 2022, nobody disagrees with O'Reilly on the potential of data. With technology reaching newer heights and exploring newer domains, data powers the process in every respect.

AI pioneer Andrew Ng, co-founder of the popular e-learning platform Coursera and former head of Google Brain, has argued that it's high time we shifted our focus from model performance to the quality of the data made available to existing systems. Ng defines data-centric AI precisely as "the discipline of systematically engineering the data required to develop a successful AI system."

Data vs. Code: A Closer Look

The model-centric approach targets the development and optimization of algorithms while treating the dataset as fixed. This limits how much the overall efficiency of a program can improve: past a certain point, tuning the architecture yields diminishing returns, and the system ends up narrowly tuned to whatever data it happens to have. A large section of the AI community believes that the hype around model architecture undermines the overall productivity of a project, because if you follow this strategy far enough, the quality of the data becomes the bottleneck no matter how sophisticated the program it is fed to.

And that's why data needs to be the bedrock of any AI system that aims to be coherent and accurate. Quality over quantity is, beyond doubt, the priority before implementation. Refined, consistently labelled data that is relevant to your project, improved iteratively through proper error analysis, promises a far more reliable outcome. Deploying an ML model is just the first few baby steps; collecting high-quality data steadily over time is what leads to accomplishment. Neglecting the importance of data in AI is like boarding a sinking ship. Data is food for AI.

The Data-Centric Approach

Going by the general pathway that most projects follow, when we adopt a data-centric approach, the first requirement is to collect the right data for the project. Hoarding compromised data that is irrelevant in the bigger scheme of things is a step backwards for accuracy. Quantity is not always the solution!

Labelling - Perhaps the most crucial step in the overall process, as this is where we obtain clean data by continually reducing whatever noise is present. To make the quality of the data systematic and consistent, one needs to compare the labels that several labellers assign independently to the same sample. On each revision of the labelling instructions, we reduce the points of disagreement until consistency is achieved.

For example, suppose a data labeller is given an image of two iguanas and is instructed to indicate their position. One labeller might draw a tight box around each iguana, while another draws a single box around both. Although neither is technically wrong, such inconsistencies will only puzzle the neural network.
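To make the consistency check concrete, here is a minimal Python sketch, with made-up box coordinates, that flags samples where two labellers' bounding boxes disagree too much as measured by intersection-over-union (IoU). The (x1, y1, x2, y2) box format and the 0.5 threshold are illustrative assumptions, not a fixed recipe.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def flag_inconsistent(samples, threshold=0.5):
    """Return ids of samples where the labellers' boxes overlap too little."""
    flagged = []
    for sample_id, (boxes_a, boxes_b) in samples.items():
        # Match each of labeller A's boxes with its best overlap from labeller B.
        scores = [max(iou(a, b) for b in boxes_b) for a in boxes_a]
        if min(scores) < threshold:
            flagged.append(sample_id)
    return flagged

# Labeller A boxed each iguana separately; labeller B drew one box around both.
samples = {
    "iguanas.jpg": (
        [(10, 20, 120, 180), (130, 40, 260, 200)],  # labeller A
        [(10, 20, 260, 200)],                       # labeller B
    )
}
print(flag_inconsistent(samples))  # -> ['iguanas.jpg']
```

Samples flagged this way go back to the labellers along with a revised instruction, which is exactly the disagreement-reduction loop described above.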

Error-analysis-driven iteration is the next step: training begins as soon as sufficient data has been collected. In contrast to the model-centric approach, where one waits until the entire dataset has been mustered, the data-centric option saves considerable time. And because the process is continuous, it is easy to detect whether the right data is being acquired and to make appropriate changes.
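As a rough illustration of one such iteration, the sketch below (using scikit-learn on synthetic data) trains a simple model with the code held fixed, treats confident disagreements between the model and the stored labels as candidate label errors, "fixes" them, and retrains. The 0.9 confidence threshold and the automatic relabelling rule are assumptions made for brevity - in practice a human would review the flagged samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y_true = make_classification(n_samples=1000, random_state=0)
y = y_true.copy()
noisy = rng.choice(len(y), size=100, replace=False)  # corrupt 10% of the labels
y[noisy] = 1 - y[noisy]

model = LogisticRegression(max_iter=1000).fit(X, y)  # hold the code fixed
for round_num in range(3):
    # Error analysis: samples where the model confidently disagrees with
    # the stored label are candidates for relabelling.
    proba = model.predict_proba(X)[:, 1]
    suspect = np.where(np.abs(proba - y) > 0.9)[0]
    # "Relabel" the suspects automatically here; in practice a human reviews them.
    y[suspect] = (proba[suspect] > 0.5).astype(int)
    model = LogisticRegression(max_iter=1000).fit(X, y)  # retrain on improved data
    print(f"round {round_num}: relabelled {len(suspect)} samples, "
          f"accuracy against clean labels = {model.score(X, y_true):.3f}")
```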

Nevertheless, timely supervision of the input data once the model goes into production is of paramount importance. Drifts are inevitable because data evolves, just as the real world does. Data drift occurs when a model encounters situations it was not adequately trained for because the distribution of the incoming data no longer matches the training distribution. For instance, if a program trained only on integer inputs is suddenly fed fractional values, data drift can be observed.
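One simple way to watch for data drift is a two-sample statistical test between the training distribution of a feature and its live distribution. The sketch below uses SciPy's Kolmogorov-Smirnov test on synthetic data; the 0.01 significance level and the simulated shift are illustrative choices.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # feature at training time
live_feature = rng.normal(loc=0.8, scale=1.0, size=500)    # shifted production inputs

statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Possible data drift (KS statistic = {statistic:.3f})")
else:
    print("Incoming data looks consistent with the training distribution")
```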

Concept drift, unlike data drift, is not as abrupt, but it surely degrades the performance of any model. Here it is the relationship between inputs and outputs that changes over time: a model trained on the last decade's commodity prices will never predict the actual price of a similar product in 2022. Likewise, one can never expect a dataset built to today's standards to yield successful predictions 10-20 years down the line.
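Because concept drift shows up as a slow decay in accuracy rather than a visible shift in the inputs, one common safeguard is to track model accuracy on a rolling window of labelled production samples. The sketch below is a minimal version of that idea; the window size and tolerance are assumptions chosen for illustration.

```python
from collections import deque

class ConceptDriftMonitor:
    """Alert when rolling accuracy falls well below the deployment baseline."""

    def __init__(self, baseline_accuracy, window=200, tolerance=0.10):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, prediction, actual):
        self.outcomes.append(1 if prediction == actual else 0)
        if len(self.outcomes) == self.outcomes.maxlen:
            rolling = sum(self.outcomes) / len(self.outcomes)
            if rolling < self.baseline - self.tolerance:
                return f"Possible concept drift: rolling accuracy {rolling:.2f}"
        return None

# Usage: after deployment, feed each labelled outcome to the monitor.
monitor = ConceptDriftMonitor(baseline_accuracy=0.92)
alert = monitor.record(prediction=1, actual=0)
if alert:
    print(alert)
```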

Thus, holding the code fixed and iteratively improving the data, not the other way around, is what is actually proposed by the likes of Andrew Ng, who have called for the shift toward data-centric AI.

Good Data

All the discussion above points to a specific objective for MLOps: ensuring the availability of premium-quality data in every phase of an ML project's lifecycle. The fact that good data, rather than big data, needs to be the approach for efficiency-oriented development raises a very basic question: how do you define good data? Drawing on the points above, good data is labelled consistently, covers the important cases, receives timely feedback from production so that drift can be caught, and is sized appropriately for the problem.

In conclusion, it is fair to say that it is high time the AI community overcame its bias toward complex model architectures and treated data as an asset. Unlocking the true potential of data must be our objective if we want our programs to run with greater efficiency and accuracy.