OpenAI has announced the release of the full version of its o1 reasoning model as well as the release of its video generation model Sora. The o1 announcement also included the announcement of a separate fine-tuning API as well.
O1’s chain-of-thought technique enables the model to generate complex, step-by-step thought processes before delivering responses, making it highly adept at tasks requiring nuanced reasoning. The models are trained on a mix of public, proprietary, and custom datasets. The different approach uses slower, more deliberate reasoning. o1 on the API also allows developers to specify a custom developer message that is included with every prompt from their end users.
Safety remains a cornerstone of the o1 series, with several evaluations being rolled out to avoid jailbreak attempts and biased behavior. OpenAI's published evaluations show o1 outperforming GPT-4o ability to avoid overrefusal in benign contexts. The model's reasoning capabilities extend to maintaining adherence to OpenAI's Instruction Hierarchy, ensuring that system directives take precedence over developer and user prompts. Despite these advances, challenges persist, particularly in areas like multimodal inputs, where achieving precise refusal boundaries is still a work in progress.
Red teaming played a role in testing the o1 models’ capabilities and limitations, with experts exploring areas such as cybersecurity, biological and radiological threats, and persuasive manipulation. While the model's safety mechanisms successfully resisted high-risk scenarios in most cases, its increased detail and depth in responses occasionally amplified risks when refusals were bypassed. They attempt to mitigate this by working with external evaluators and using their Preparedness Framework.
“The o1 large language model family is trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long chain of thought before responding to the user. Through training, the models learn to refine their thinking process, try different strategies, and recognize their mistakes.” - OpenAI
The new Sora model allows users to create videos up to 20 seconds long at 1080p resolution, using input formats ranging from textual descriptions to existing images and videos. Drawing on the foundations laid by DALL·E and GPT architectures, Sora leverages a diffusion-based approach to maintain consistency in visual elements across multiple frames. Its training is based on techniques like recaptioning for more faithful textual alignment.
Sora is built on the concept of visual patches, inspired by tokenization strategies in large language models. Videos are compressed into a lower-dimensional latent space and divided into spacetime patches for scalable representation and processing. OpenAI trained Sora on a hybrid of publicly available datasets, proprietary resources obtained through partnerships, and custom datasets designed in-house. Robust pre-training filtering mechanisms ensure the removal of explicit, violent, or sensitive content before data reaches the model.
Future iterations of Sora will continue to refine its capabilities and safeguards, with a focus on representation, provenance, and ethical alignment. Efforts to reduce biases in output and enhance classifier performance reflect the model's iterative development ethos.
Developers interested in learning more about o1 may check its system card, or if they are interested in learning more about Sora they may check its system card. Developers interested in learning more about Canvas, ChatGPT in Apple Intelligence, and other features from the 12 days of OpenAI should watch InfoQ in the coming days for additional coverage.