After months of anticipation for GPT-5, OpenAI has instead released GPT-4o, a flagship AI model that can process audio, visuals, and text in real time. This development follows the massive success of the original ChatGPT, which fascinated users with its uncanny ability to understand and generate human-like text. Regular users can access GPT-4o for free, while ChatGPT Plus subscribers gain priority access with higher prompt limits and the latest multimodal features. Continue reading to learn more about GPT-4o.
GPT-4o enables more natural human-computer interaction by accepting inputs spanning text, audio, and images and generating outputs across the same modalities. The model can respond to audio prompts in as little as 232 milliseconds, with an average of 320 milliseconds, close to human conversational response times. It matches GPT-4 Turbo's performance on English text and coding while significantly improving on non-English languages, and its API is both faster and 50% cheaper. Its true strength, however, lies in audio and visual understanding that surpasses existing models.
Before GPT-4o, users could talk to ChatGPT through Voice Mode with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). That Voice Mode was a pipeline of three separate models: one transcribing audio to text, another (GPT-3.5 or GPT-4) processing that text, and a third converting the output text back to audio. This multi-step process lost information before it ever reached the core model, which could not directly perceive nuances like tone, multiple speakers, background noise, laughter, singing or emotional expression. GPT-4o instead integrates these modalities through end-to-end training of a single unified neural network. With text, vision, and audio flowing through one model, GPT-4o is a genuinely multimodal AI.
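To make the difference concrete, here is a rough sketch of what the older three-model Voice Mode pipeline looks like when recreated with OpenAI's public speech-to-text, chat, and text-to-speech endpoints (a recent openai Python SDK is assumed, and the file names are purely illustrative). GPT-4o collapses all three hops into a single model.

```python
# A rough sketch of the older three-model "Voice Mode" pipeline, recreated
# with OpenAI's public APIs. File names are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Transcribe the user's audio to plain text
#    (tone, laughter and background context are lost at this step)
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2) Feed the transcript to a text-only chat model
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer_text = reply.choices[0].message.content

# 3) Convert the text answer back into speech
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer_text)
speech.write_to_file("answer.mp3")
```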
To improve GPT-4o's linguistic capabilities, OpenAI developed a new tokenizer that compresses text into far fewer tokens across 20 representative languages: Arabic, Bengali, Chinese, English, French, German, Gujarati, Hindi, Italian, Japanese, Kannada, Korean, Malayalam, Marathi, Oriya, Punjabi, Russian, Spanish, Tamil, and Telugu.
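You can see the effect of the new tokenizer yourself with the open-source tiktoken library, assuming a recent version that ships the o200k_base encoding used by GPT-4o alongside the cl100k_base encoding used by GPT-4 Turbo. The sample sentences below are illustrative only.

```python
# Minimal sketch: compare token counts between GPT-4 Turbo's tokenizer
# (cl100k_base) and GPT-4o's newer tokenizer (o200k_base).
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # used by GPT-4 / GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # used by GPT-4o

samples = {
    "English": "How can I help you today?",
    "Hindi": "आज मैं आपकी कैसे मदद कर सकता हूँ?",
    "Tamil": "இன்று நான் உங்களுக்கு எப்படி உதவ முடியும்?",
}

for lang, text in samples.items():
    print(f"{lang}: {len(old_enc.encode(text))} tokens (cl100k) "
          f"-> {len(new_enc.encode(text))} tokens (o200k)")
```

The gap is largest for non-Latin scripts, which is exactly where GPT-4o's non-English improvements come from.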
On standard benchmarks measuring text understanding, reasoning, and coding, GPT-4o performs on par with GPT-4 Turbo. However, it sets new highs for multilingual understanding, audio data like speech, and visual information like images.
OpenAI has built robust safeguards into GPT-4o's core architecture across all modalities, following a responsible and safe AI approach similar to Meta's Llama 3 and Microsoft's Phi-3. Techniques such as filtering training data and refining the model's behaviour post-training give it built-in safeguards, alongside novel systems to regulate voice outputs. Evaluations based on OpenAI's Preparedness Framework and its voluntary commitments indicate that GPT-4o poses no higher than a "Medium" risk across categories like cybersecurity, CBRN (chemical, biological, radiological, and nuclear), persuasion, and model autonomy. This multi-stage assessment combined automated and human evaluations throughout training, testing pre- and post-mitigation versions of the model with custom prompts and fine-tuning to probe its capabilities.
Over 70 external experts in fields such as social psychology, bias and fairness, and misinformation took part in extensive red-teaming exercises to identify potential risks amplified by GPT-4o's multimodal nature. While OpenAI acknowledges the risks posed by GPT-4o's audio capabilities, the initial launch is limited to text and image inputs with text outputs.
Thanks to two years of OpenAI research aimed at improving efficiency across the AI stack, ChatGPT users can now leverage GPT-4o's text and image capabilities: the model is available on the free tier, with up to 5x higher message limits for Plus subscribers. An alpha version of the revamped Voice Mode powered by GPT-4o will follow for ChatGPT Plus in the coming weeks. Developers can also integrate GPT-4o for text and vision tasks via OpenAI's API, with 2x faster performance, 50% lower cost and 5x higher rate limits compared to GPT-4 Turbo. Support for GPT-4o's audio and video capabilities will initially roll out to a small cohort of trusted partners in the coming weeks.
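For developers, a combined text and image request looks like the minimal sketch below, using the official openai Python SDK. The image URL is a placeholder, and audio/video inputs are not yet exposed through this endpoint.

```python
# Minimal sketch: call GPT-4o with a text prompt plus an image via the
# Chat Completions API. The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```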
As OpenAI continues to refine and expand GPT-4o's capabilities, we are eager to see what breakthroughs follow. At Hyperstack, we offer access to NVIDIA's powerful resources such as the NVIDIA A100 and NVIDIA H100 PCIe, which deliver exceptional performance and advanced features designed for accelerating powerful LLMs like GPT. These GPUs are built with specialised Tensor Cores that perform matrix operations far faster, enabling efficient training and inference for large language models. Multi-GPU systems and cutting-edge interconnect technologies like NVLink further boost the available compute, delivering faster training times and real-time inference.
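As a simple illustration of how Tensor Cores get used in practice, the sketch below runs an open Hugging Face causal language model in half precision on an NVIDIA GPU such as an A100 or H100. The model name is a small placeholder (GPT-4o's weights are not publicly available), and PyTorch plus transformers are assumed to be installed.

```python
# Minimal sketch: half-precision inference on an NVIDIA GPU, which routes
# the matrix multiplications through Tensor Cores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder open model; swap in any causal LM you can download
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16  # FP16 weights keep matmuls on Tensor Cores
).to("cuda")

inputs = tokenizer("Large language models run fastest when", return_tensors="pt").to("cuda")
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=40)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```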
Get started today to experience the power of high-end NVIDIA GPUs and lead innovation.
GPT-4o is OpenAI's latest flagship model that can reason across audio, vision, and text in real time. The model enables much more natural human-computer interaction.
GPT-4o matches GPT-4 Turbo's performance on text in English and code, while significantly improving on non-English languages. It excels at vision and audio understanding and can respond to audio inputs with near-human response times.
GPT-4o's text and image capabilities are currently rolling out across ChatGPT's free and paid tiers, with audio support coming to ChatGPT Plus soon. Developers can also access the model through OpenAI's API.
Yes. OpenAI has implemented safety measures such as filtered training data, post-training refinement and novel safety systems for voice outputs. The model has also undergone extensive evaluations and red teaming to identify and mitigate potential risks.
Unlike previous models that relied on separate pipelines for different modalities, GPT-4o is a single end-to-end model trained across text, vision, and audio, allowing seamless multimodal integration.
Yes, regular users can access GPT-4o for free, but ChatGPT Plus subscribers get priority access with higher message limits and the latest multimodal features.