The world of AI is constantly advancing and reaching new heights, with a wave of related technologies making the field more remarkable with every passing day. Just when AI enthusiasts feel they have seen it all, Google’s artificial intelligence division DeepMind revealed its state-of-the-art visual language model, affectionately named Flamingo.
Here is a quick overview of Flamingo, the much-hyped visual language model (VLM) that is set to reshape the machine learning landscape.
DeepMind, in its paper “Flamingo: a Visual Language Model for Few-Shot Learning,” revealed the details of this model family, describing it as a distinctive few-shot-learning-based VLM built from pretrained vision and language components. According to the DeepMind researchers, Flamingo differs substantially from its precursors in the few-shot learning field and performs strongly without extensive task-specific training.
Few-shot learning models have been gaining popularity over the last decade, given their usage across multiple industries, yet the approach has struggled with multimodal tasks. Flamingo is therefore a striking development in few-shot learning: it can handle multimodal tasks quickly and precisely without requiring any “extra” fine-tuning.
The company had previously released a pretrained language model called Chinchilla with 70 billion parameters. By fusing this language model with a visual learning architecture, the researchers pushed the largest Flamingo model to 80 billion parameters.
Much to the amazement of AI enthusiasts, Flamingo outperformed two strong fine-tuned baseline models, CLIP and Florence, giving it a competitive edge over its predecessors. It is believed that with a larger annotation budget, Flamingo could outperform more such systems and generate high-quality outputs from only a handful of training examples.
The VLM is designed to take input in the form of text and images; it analyzes this data and produces text-only output. In keeping with the few-shot learning methodology, the model works impressively with a limited number of examples and interprets the data on its own, loosely simulating how the human mind approaches a new problem.
The model accepts interleaved visual data and text and is notably expressive in its output. This allows the VLM to handle open-ended tasks (such as visual question answering) and close-ended tasks (such as classification) equally well, without intervention or special tricks.
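As a rough illustration of what “interleaved” input looks like, here is a minimal Python sketch (not DeepMind’s actual code or tokenization; the `<image>` placeholder and the helper function are assumptions for the sketch) that assembles a few-shot prompt in which each placeholder marks where an image’s visual features would be injected:

```python
# Illustrative only: Flamingo's real preprocessing differs; the
# "<image>" placeholder and this helper are assumptions.

def build_few_shot_prompt(support_captions, query_text):
    """Assemble an interleaved image-text prompt.

    support_captions: captions for the few-shot example images,
    each preceded by an image slot; the query image comes last.
    """
    parts = [f"<image> {caption}" for caption in support_captions]
    parts.append(f"<image> {query_text}")  # query image + open question
    return " ".join(parts)

prompt = build_few_shot_prompt(
    ["A flamingo wading in shallow water.",
     "A parrot perched on a branch."],
    "Question: What bird is shown? Answer:",
)
print(prompt)
```

Given two support examples, the prompt interleaves three image slots with text, which is the few-shot pattern the model reads at inference time.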
For visual processing, Flamingo has a CLIP-style vision encoder that lets the model recognize the spatial features of an image, identifying and interpreting the intricate details of the data it is shown. Language processing is handled by a sophisticated pretrained language model that allows Flamingo to respond much as a human would, answering the user’s questions in open-ended text.
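To make that division of labor concrete, here is a toy numpy sketch, an assumption-laden stand-in rather than Flamingo’s real architecture or API: a frozen “vision encoder” pools an image into a feature vector, and a frozen “language model” turns those features plus a text prompt into text-only output.

```python
# Toy stand-in for the vision-encoder + language-model pipeline.
# None of these functions or embeddings reflect Flamingo's real code.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 4
VOCAB = ["flamingo", "parrot", "penguin"]

def _unit(v):
    return v / np.linalg.norm(v)

# Random fixed unit vectors standing in for learned token embeddings.
TOKEN_EMBED = {word: _unit(rng.normal(size=EMBED_DIM)) for word in VOCAB}

def vision_encoder(image):
    # CLIP-style stand-in: pool pixel rows into one feature vector.
    return image.reshape(-1, EMBED_DIM).mean(axis=0)

def language_model(prompt, visual_features):
    # Frozen-LM stand-in: emit the vocabulary token whose embedding
    # aligns best (dot product) with the visual features.
    scores = {w: float(visual_features @ e) for w, e in TOKEN_EMBED.items()}
    return max(scores, key=scores.get)

# A synthetic "image" whose pooled features match the flamingo embedding.
image = np.tile(TOKEN_EMBED["flamingo"], (16, 1))
answer = language_model("What bird is this?", vision_encoder(image))
print(answer)  # -> flamingo
```

The real model, of course, conditions a large transformer on the visual features rather than taking a dot product, but the overall flow is the same: image and text go in, text comes out.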
Flamingo is undoubtedly capable of achieving far more in the field of VLMs, and its applications could make it a real asset. However, the company still calls it too “computationally expensive” to release into the world.
And because its training datasets are invariably small and specialized, it cannot yet be considered the ultimate VLM for all multimodal tasks; considerable work remains before it can claim that spot.