The world of artificial intelligence (AI) is constantly evolving, and one of the most exciting recent developments is the rise of multimodal AI. Multimodal AI systems can understand and generate multiple types of data, such as text, images, audio, and video. This new frontier in AI holds great promise for improving the way humans interact with technology.
All major tech giants have announced advancements in this direction through their annual conferences. On May 13th, OpenAI announced the release of GPT-4o, an AI personal assistant that can generate and interact in text, images, audio, and video. As demonstrated in the GPT-4o launch, this is a significant step forward for human-AI interaction. (Ref: https://openai.com/index/hello-gpt-4o/)
The very next day, May 14th, Google, in their annual developer conference, Google I/O, announced a whole range of features and products that leverage multimodality. One of the demos that stood out as a competitor to OpenAI's GPT-4o was Google Astra, another personal assistant capable of doing pretty much everything that OpenAI claimed in their demo and announcement. Google has been experimenting with multimodality for quite some time through their Gemini models. (Ref: https://deepmind.google/technologies/gemini/project-astra/)
Another key player in the AI space, Inflection AI, introduced multimodality into Pi, their AI personal companion, by adding an audio interface that can engage in human-like conversations with the user, providing a voice call-like experience with the AI.
These are just a few examples, and as time progresses, we will continue to see more evolution in this area. We caught a glimpse of it through Microsoft Build, their annual developer conference, which detailed some high-impact use cases with multimodal AI models. (Ref: https://www.linkedin.com/pulse/age-ai-transformation-satya-nadella-soocc/)
With large AI vendors promoting advanced use cases, more applications are starting to utilize these cutting-edge capabilities. As AI moves beyond the realm of text-based interactions, networks across the globe are bound to experience an increase in high-volume traffic, posing new demands for better-performing networks. Some key factors that will drive the adoption of this new era of applications are:
These interactions will inevitably place more demands on both the upstream and downstream channels of a network. To visualize this user behavior, imagine users of AI assistants being on all-day voice and video calls with AI as they go about their day-to-day activities. Unlike the human participants in a WhatsApp video or voice call, who have limited time and patience, AI has effectively unlimited capacity to stay connected to the user. Sustained sessions like these will drive demand for higher network performance.
The first step for service providers preparing for this new era of applications, which is already gaining popularity among their users, is to be equipped to identify AI traffic, classify the type of content (audio, files, video, or text) being exchanged with various AI systems, and evaluate the network KPIs that reveal how the network performs under these conditions.
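To make this concrete, the sketch below shows one simple way such visibility could be approached: matching flows against a list of known AI-service domains, applying a rough media-type heuristic, and aggregating a few basic KPIs. The domain list, the Flow fields, and the packet-size thresholds are illustrative assumptions for this sketch only; they do not describe Sandvine's AppLogic implementation.

```python
from dataclasses import dataclass
from collections import defaultdict

# Illustrative domain patterns for AI assistants (assumed, not exhaustive).
AI_SERVICE_DOMAINS = ("openai.com", "gemini.google.com", "pi.ai")

@dataclass
class Flow:
    sni: str              # TLS Server Name Indication observed on the flow
    up_bytes: int         # bytes sent upstream (client -> server)
    down_bytes: int       # bytes sent downstream (server -> client)
    duration_s: float     # flow duration in seconds
    avg_packet_size: int  # mean packet size in bytes

def is_ai_traffic(flow: Flow) -> bool:
    """Tag a flow as AI-assistant traffic if its SNI matches a known AI domain."""
    return any(flow.sni.endswith(domain) for domain in AI_SERVICE_DOMAINS)

def media_type(flow: Flow) -> str:
    """Rough content-type guess from volume and packet size (illustrative thresholds)."""
    if flow.avg_packet_size > 1000 and flow.down_bytes > 5_000_000:
        return "video"
    if 200 < flow.avg_packet_size <= 1000:
        return "audio"
    return "text/files"

def ai_traffic_kpis(flows: list[Flow]) -> dict:
    """Aggregate per-media-type flow counts and throughput for AI-classified flows."""
    kpis = defaultdict(lambda: {"flows": 0, "up_mbps": 0.0, "down_mbps": 0.0})
    for f in flows:
        if not is_ai_traffic(f):
            continue
        bucket = kpis[media_type(f)]
        bucket["flows"] += 1
        bucket["up_mbps"] += (f.up_bytes * 8) / (f.duration_s * 1e6)
        bucket["down_mbps"] += (f.down_bytes * 8) / (f.duration_s * 1e6)
    return dict(kpis)

# Example: one long-lived "voice call with AI" style flow lasting 30 minutes.
sample = [Flow(sni="api.openai.com", up_bytes=40_000_000,
               down_bytes=60_000_000, duration_s=1800, avg_packet_size=600)]
print(ai_traffic_kpis(sample))
```

In practice, a production classification engine would rely on far richer signals than domain matching and packet-size heuristics, but even this toy aggregation illustrates the kind of per-content-type KPIs that make AI traffic visible on a network.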
At Sandvine, our focus on innovative classification solutions allows us to build for and support this evolving ecosystem. AppLogic, Sandvine's AI-powered classification engine, can play a crucial role in giving networks across the globe the visibility and data they need to be ready for this coming wave, helping service providers kickstart and accelerate their AI journey.