The moment AI stopped being the talk of the town was the moment we truly entered the AI era. It has become so normalized in our society that it’s now integrated into our education, work, and everyday life.
However, one thing still limiting our access to AI is the lack of natural human-computer interaction. Only a handful of LLMs offer multimodal support, and even fewer do it for free or with real accuracy. OpenAI might’ve just solved that issue.
In this article, I’ll briefly discuss what GPT-4o is and some of my favorite use cases for the model so far.
Disclaimer: All video links provided below are courtesy of OpenAI.
What is GPT-4o?
GPT-4o (“o” stands for omni) is OpenAI’s newest LLM. It’s made to enable more natural human-computer interactions by expanding the model’s multimodal capacity and supercharging its nuance. It has an average audio response time of 320 milliseconds, which is close to human conversational response time.
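For developers, GPT-4o is also available through OpenAI’s standard API. Here’s a minimal sketch of a plain text call, assuming you have the openai Python package installed and an OPENAI_API_KEY environment variable set:

```python
# Minimal sketch: calling GPT-4o through OpenAI's Python SDK.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "In one sentence, what makes GPT-4o different?"},
    ],
)
print(response.choices[0].message.content)
```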
Here are a few nifty ways to use it:
Real-Time Translation
Ever find yourself lost in a foreign country without any means to communicate? OpenAI has you covered.
One of GPT-4o’s most significant features is its multilingual support. Combined with multimodal inputs, ChatGPT can translate from one language to another faster than, and almost as accurately as, a human translator. With audio response times as low as 232 milliseconds, ChatGPT with 4o can be your best friend whenever you’re traveling or speaking to someone who isn’t fluent in your language.
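OpenAI’s live demo runs on the model’s end-to-end audio pipeline, which isn’t fully exposed in the public API yet. As a rough approximation assembled from public pieces (my own sketch, not OpenAI’s demo code), you could chain Whisper transcription with a GPT-4o translation step:

```python
# Rough approximation of the translation workflow using public API pieces:
# Whisper for speech-to-text, GPT-4o for translation. This is a sketch,
# not the end-to-end audio model used in OpenAI's live demo.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the spoken audio (hypothetical recording of an Italian phrase).
with open("speech_italian.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Translate the transcript with GPT-4o.
translation = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Translate the user's text into English."},
        {"role": "user", "content": transcript.text},
    ],
)
print(translation.choices[0].message.content)
```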
Meeting AI Assistant
Meetings can be draining. It’s easy to doze off or let your attention drift elsewhere.
With GPT-4o, you can always stay on top of things by using it as an AI meeting assistant. It can act as a guide whenever someone asks you a question, take minutes of the meeting to revisit later, or clear things up when the discussion gets confusing.
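If you have a transcript of the meeting, a sketch like the one below could turn it into minutes. The prompt and the transcript.txt file are my own illustrative choices:

```python
# Sketch: turning a meeting transcript into minutes with GPT-4o.
# `transcript.txt` is a hypothetical file containing the meeting text.
from openai import OpenAI

client = OpenAI()

with open("transcript.txt") as f:
    transcript = f.read()

minutes = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Summarize this meeting into minutes: key decisions, "
                    "action items with owners, and open questions."},
        {"role": "user", "content": transcript},
    ],
)
print(minutes.choices[0].message.content)
```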
Harmonize
This is one of the craziest things I’ve seen from an AI. We’ve all become accustomed to AI taking inputs in different forms, but I’ve never seen a truly multimodal AI to the point that it can create beats, adjust tone, and actually harmonize to create music. What makes it better is that you can give it additional context as it goes along to nail the sound you’re looking for.
Complete Math Assignments
Okay, I know an AI that can do assignments isn’t out of the norm today — but wait until you see what GPT-4o can do.
This new model can answer mathematics questions in real time. Using the new desktop app, GPT-4o can take questions in the form of text, images, or video, and act like a tutor, giving you the answer you’re looking for while explaining how it got there. You can even narrow down your questions in real time by feeding it further context.
From what I’ve seen, GPT-4o also has greatly improved mathematical reasoning compared to previous GPT models.
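Through the API, the same tutoring flow can be approximated by sending a photo of the problem as an image input. A minimal sketch, with problem.jpg as a hypothetical image file:

```python
# Sketch: asking GPT-4o to tutor through a math problem from a photo.
# `problem.jpg` is a hypothetical image of a handwritten equation.
import base64

from openai import OpenAI

client = OpenAI()

with open("problem.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Solve this step by step and explain your reasoning."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```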
Perfect Text Generation
Gone are the days of garbled text in DALL-E 3 generations. With GPT-4o, ChatGPT can now render paragraphs of text in an image with little to no mistakes. You can even tweak how you want the text to look. Let me tell you, as someone who uses AI image generators daily, this is insane.
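In the public API, image generation is still exposed through the images endpoint with DALL-E 3. A small sketch; the prompt below is my own example, not one from OpenAI’s demos:

```python
# Sketch: generating an image with legible in-image text via the images API.
# The prompt is my own illustrative example.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt=('A chalkboard that reads "Welcome to the AI era" in neat '
            "cursive handwriting, photorealistic classroom lighting"),
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```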
Character Reference
Midjourney may have been the first AI image generator to release a character reference feature, but I think OpenAI has it beat on accuracy. Using GPT-4o, DALL-E 3 can now create consistent characters based on an input or a previous generation.
Summarize Information From A Video
Similar to the meeting AI assistant, GPT-4o can also transcribe and summarize an entire video. To do this, you must upload the video itself rather than a link. For example, here’s a snippet from OpenAI’s showcase of the model’s input and output:
Certainly! The presentation focused on techniques for maximizing the performance of large language models (LLMs) and was delivered by Colin Jarvis and John Allard from OpenAI. The session was part of OpenAI’s first developer conference and aimed to provide insights into optimizing LLMs for specific tasks.

Introduction: Colin Jarvis is the head of OpenAI’s solutions practice in Europe, and John Allard is an engineering lead on the fine-tuning product team. The presentation highlighted the challenges of optimizing LLMs, emphasizing that there is no one-size-fits-all solution.

Challenges of optimizing LLMs:
- Separating signal from noise: it’s challenging to identify the exact problem.
- Abstract performance metrics: measuring performance can be difficult.
- Choosing the right optimization: it’s hard to know which approach to use.

Optimization flow: The presenters introduced a framework for optimizing LLMs based on two axes: context optimization (what the model needs to know) and LLM optimization (how the model needs to act). The framework includes four quadrants: prompt engineering (the starting point for optimization), retrieval-augmented generation (RAG) for context optimization, fine-tuning for LLM optimization, and all of the above combined.

Prompt engineering: Strategies include writing clear instructions, splitting complex tasks into simpler subtasks, giving the model time to think, and testing changes systematically. It’s good for testing and learning early and for setting a baseline, but not good for introducing new information, replicating complex styles, or minimizing token usage.

Retrieval-augmented generation (RAG): RAG involves retrieving relevant documents and using them to generate responses. It’s good for introducing new information and reducing hallucinations, but not good for embedding broad domain knowledge, teaching new formats or styles, or minimizing token usage. The presenters shared a success story where they improved accuracy from 45% to 98% using RAG.

Fine-tuning: Fine-tuning involves continuing the training process on a smaller, domain-specific dataset, which improves performance on specific tasks and improves efficiency. It’s good for emphasizing existing knowledge, customizing structure or tone, and teaching complex instructions, but not good for adding new knowledge or for quick iteration. The presenters shared a success story from Canva, where fine-tuning improved performance significantly.

Best practices: Start with prompt engineering and few-shot learning, establish a baseline, and start small while focusing on quality.

Combining fine-tuning and RAG: The presenters highlighted the benefits of combining fine-tuning and RAG for optimal performance.

Application of theory: The presenters applied the theory to a practical challenge, the Spider 1.0 benchmark, achieving high accuracy using both RAG and fine-tuning.

Conclusion: The presentation concluded with a summary of the optimization flow and emphasized the importance of iteratively improving LLM performance using the discussed techniques.

Q&A: The presenters invited questions from the audience and were available for further discussion.
As someone who watched the video in its entirety, I can confirm that GPT-4o didn’t miss any key information. This is a huge evolution compared to its previous iteration.
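If you want to try this through the API, note that it doesn’t accept video files directly; the usual workaround (the approach OpenAI’s cookbook examples use) is to sample frames and pass them as images. A rough sketch, with the file name and sampling rate as my own assumptions:

```python
# Sketch: the public API doesn't take video directly, so sample frames
# and send them as base64 images. File name and sampling rate are my own.
import base64

import cv2  # pip install opencv-python
from openai import OpenAI

client = OpenAI()

video = cv2.VideoCapture("talk.mp4")  # hypothetical video file
frames = []
while True:
    ok, frame = video.read()
    if not ok:
        break
    _, buf = cv2.imencode(".jpg", frame)
    frames.append(base64.b64encode(buf).decode("utf-8"))
video.release()

# Keep roughly one frame every few seconds to stay within context limits.
sampled = frames[::100]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "These are frames from a talk. Summarize the talk."},
            *[{"type": "image_url",
               "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
              for f in sampled],
        ],
    }],
)
print(response.choices[0].message.content)
```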
Transcribe Illegible Text
Have you ever unearthed an old piece of paper with text you can barely — if at all — read? Let OpenAI do its magic.
GPT-4o combines multimodal support with enhanced natural language processing to turn illegible handwriting into readable text using contextual understanding. Here’s an example from Generative History on Twitter:
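In API terms, this is the same image-input pattern as the math example, just with a transcription prompt. A brief sketch, with letter.jpg as a hypothetical scan:

```python
# Sketch: transcribing hard-to-read handwriting from a scanned image.
# `letter.jpg` is a hypothetical scan; the prompt is my own wording.
import base64

from openai import OpenAI

client = OpenAI()

with open("letter.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe this handwritten letter. Use context to "
                     "infer words you can't read, and mark them in [brackets]."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```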
Create A Facebook Messenger Clone
I was browsing Twitter last night and found what might be the biggest case for GPT-4o’s improved capabilities. Sawyer Hood from Twitter wanted to test this new model by asking it to create a Facebook Messenger clone.
The result? It worked. Not only that, but GPT-4o did all of this in under six seconds. Sure, it’s just a single HTML file, but imagine the implications for front-end development in general.
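Reproducing the experiment is a one-prompt affair. The prompt below is my paraphrase of the idea, not Sawyer Hood’s exact wording:

```python
# Sketch: asking GPT-4o for a self-contained HTML app in one shot.
# The prompt is my own paraphrase of the experiment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Create a Facebook Messenger clone as a single HTML file "
                   "with inline CSS and JavaScript. Return only the code.",
    }],
)

# Save the generated markup so it can be opened in a browser.
with open("messenger_clone.html", "w") as f:
    f.write(response.choices[0].message.content)
```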
Understand Intonation
And now, we’re down to what I consider GPT-4o’s biggest accomplishment, though some might not agree. In the past, LLMs have always taken what we feed them at face value. They rarely consider our tone or phrasing when processing our inputs.
That’s why I’ve always considered a model that understands sarcasm to be science fiction. Well, OpenAI just proved me wrong.
All Said And Done
There’s a lot of talk about Gemini, Claude, and other LLMs potentially overtaking OpenAI in terms of nuance and features. Well, this is OpenAI’s answer to them.
GPT-4o is the first model I’ve seen that feels truly multimodal. Not only that, but it also solves some of the issues that plagued GPT-4 in the past, such as laziness and lack of nuance.
OpenAI is a company that’s been all too familiar with controversy in the past, but I have a gut feeling that people are going to forget about that soon with GPT-4o. I can’t wait to see where OpenAI takes LLMs from here. At this rate, GPT-5 may break the world.

Want to learn more about the recent OpenAI drama? You can read our article on Sam Altman here or our other articles like this one.