Overview
This model is a multimodal large language model that unifies image, audio and text understanding to support question answering, summarization and document intelligence workflows. It is designed to run entirely on local hardware, so no data ever leaves the device and inference works fully offline.
It extends the base family with integrated speech comprehension and optical character recognition, enabling end-to-end processing of rich content such as meeting recordings, training videos and complex business documents.
Capabilities
The model is aimed at everyday tasks that combine documents, media and text. Typical use cases include:
- Document intelligence — extracting structure from contracts, reports and scanned PDFs.
- Media analysis — captioning, search and summarization of long-form video.
- Assistant workflows — grounded answers, drafting and step-by-step reasoning.
For best results, keep prompts specific and provide context up front: the model responds better to clear, well-scoped instructions than to open-ended ones.
Quick start
Install the runtime and pull the weights with a single command. Once cached, the model loads in seconds and the first token streams almost immediately:
atomic pull <model>
atomic run <model> --prompt "Summarize this report"
You can also call it programmatically: pass any prompt to model.run() and stream the response token by token.
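The programmatic call mentioned above might look roughly like the sketch below. The real Python interface is not documented here, so everything in it is an assumption: StubModel is a hypothetical stand-in that replays a canned reply, and run() is assumed to return an iterator of text tokens. Only the consuming loop is meant to carry over to the real model object.

```python
from typing import Iterator, List


class StubModel:
    """Hypothetical stand-in for the real model object; not the actual API."""

    def __init__(self, canned_reply: str) -> None:
        self._canned_reply = canned_reply

    def run(self, prompt: str) -> Iterator[str]:
        # Assumption: run() yields the response one token at a time.
        # The prompt is ignored here; a "token" is simply one
        # whitespace-separated word of the canned reply.
        for word in self._canned_reply.split():
            yield word + " "


def stream_response(model: StubModel, prompt: str) -> str:
    """Print tokens as they arrive and return the assembled text."""
    pieces: List[str] = []
    for token in model.run(prompt):
        print(token, end="", flush=True)  # show each token immediately
        pieces.append(token)
    print()  # finish the streamed line
    return "".join(pieces).strip()


model = StubModel("The report covers quarterly revenue and risks.")
summary = stream_response(model, "Summarize this report")
```

Collecting the tokens while printing them keeps the full response available for later steps, such as saving the summary, without giving up streamed output.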
License
The weights are released under a permissive open license and are available for commercial use. Full terms are described in the model license agreement.