Overview

LocateAnything-3B is a vision-language model from NVIDIA built for visual grounding: you give it an image and a text prompt, and it returns the pixel locations of what you described, as bounding boxes or points. It has about 3.8B parameters and a 32K context window, and it is built on NVIDIA's Eagle VLM line. Grounding uses Parallel Box Decoding, a block-wise multi-token scheme that predicts whole boxes in parallel instead of generating coordinates one token at a time.

Because the model is small enough to fit on consumer hardware, you can run it inside Atomic Chat fully on-device. The images you analyze and the prompts you write stay on your own machine: nothing is uploaded to an external API, and no internet connection is required after the weights are downloaded.

What it is good at

LocateAnything-3B turns plain-language requests into spatial answers, which makes it useful anywhere you need to find things in an image rather than just describe them.

Open-vocabulary object detection: name any category in text ("forklift", "stop sign", "ripe tomato") and the model draws boxes around every match, including in dense and long-tail scenes where a fixed-class detector like YOLO would miss them.
Referring grounding with reasoning: point it at "the red car behind the bus" or "people wearing hats" and it reasons over the scene to work out which specific instances you mean, returning one box or several.
GUI and document localization: locate buttons, fields, OCR text, or layout regions from an instruction. The results can feed GUI agents, document-understanding pipelines, and any code that needs to act on screen coordinates.
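
For GUI agents and other code that acts on coordinates, a returned box usually has to be converted into pixel positions. The sketch below is illustrative only: the text above does not specify the model's output format, so it assumes boxes arrive as (x1, y1, x2, y2) normalized to the 0-1 range. The Box class and its methods are hypothetical helpers, not part of any LocateAnything API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical box in normalized [0, 1] coordinates (an assumed format)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def to_pixels(self, width: int, height: int) -> tuple[int, int, int, int]:
        """Scale the normalized corners to integer pixel coordinates."""
        return (
            round(self.x1 * width),
            round(self.y1 * height),
            round(self.x2 * width),
            round(self.y2 * height),
        )

    def click_point(self, width: int, height: int) -> tuple[int, int]:
        """Return the pixel at the box center, e.g. as a click target."""
        px1, py1, px2, py2 = self.to_pixels(width, height)
        return ((px1 + px2) // 2, (py1 + py2) // 2)


# Example: a button covering the middle of a 1920x1080 screenshot.
button = Box(0.25, 0.5, 0.75, 0.75)
print(button.to_pixels(1920, 1080))    # (480, 540, 1440, 810)
print(button.click_point(1920, 1080))  # (960, 675)
```

If the model you run reports absolute pixels or a different normalization range, only to_pixels would need to change.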

Running it locally

At 3.8B parameters the model needs roughly 8.4 GB of VRAM for inference in FP16. Quantized to INT4 that drops to about 2.1 GB, so the quantized model fits on an 8 GB card such as an RTX 4060. The model runs in BF16, which requires an Ampere-or-newer NVIDIA GPU (RTX 30/40/50-series, A100, H100). Context length is 32K. Download the weights with:

huggingface-cli download nvidia/LocateAnything-3B
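
The same download can be scripted from Python. This is a minimal sketch, assuming the huggingface_hub package is installed; the download_weights wrapper and the default local_dir name are illustrative choices, not something the model card prescribes.

```python
from pathlib import Path


def download_weights(local_dir: str = "LocateAnything-3B") -> Path:
    """Fetch the model repository into local_dir and return its path.

    The folder name is an arbitrary example; any writable path works.
    """
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="nvidia/LocateAnything-3B",
        local_dir=local_dir,
    )
    return Path(path)
```

Calling download_weights() once with a network connection should leave a complete copy of the repository in the chosen folder.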

You can load it through Hugging Face Transformers (4.57.1+) and the official Eagle worker code, or skip the setup and open it with one click in Atomic Chat, which manages the download and runtime for you.
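
Before attempting the Transformers route, it can help to check the two requirements stated above: Transformers 4.57.1 or later, and an Ampere-or-newer GPU. The sketch below is a rough pre-flight check, not an official one. It assumes "Ampere or newer" corresponds to CUDA compute capability 8.0 and up, and the function names are hypothetical.

```python
import re

MIN_TRANSFORMERS = (4, 57, 1)    # minimum version quoted in the text
MIN_COMPUTE_CAPABILITY = (8, 0)  # assumption: Ampere starts at capability 8.0


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '4.57.1' into (4, 57, 1), dropping suffixes such as '.dev0'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def environment_ok(transformers_version: str,
                   compute_capability: tuple[int, int]) -> bool:
    """Check the library version and GPU generation against the minimums."""
    return (
        parse_version(transformers_version) >= MIN_TRANSFORMERS
        and tuple(compute_capability) >= MIN_COMPUTE_CAPABILITY
    )


print(environment_ok("4.57.1", (8, 9)))  # True: new enough library, Ada GPU
print(environment_ok("4.56.2", (8, 9)))  # False: Transformers too old
print(environment_ok("4.57.1", (7, 5)))  # False: Turing predates Ampere
```

With PyTorch and Transformers installed, the two inputs can be read from transformers.__version__ and torch.cuda.get_device_capability().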

License

LocateAnything-3B is released under the NVIDIA License, listed as "other." It permits use, reproduction, and modification for academic and non-profit research only. Commercial use is not granted, so check the license terms on the model card before building anything you intend to ship or sell.

Frequently asked questions

LocateAnything-3B is a 3.8B-parameter vision-language model from NVIDIA, built on its Eagle line, for visual grounding and open-vocabulary object detection. You give it an image and a text prompt, and it returns the exact locations of the objects you named as bounding boxes or points. It handles open-ended categories, referring expressions, GUI elements, and OCR text rather than a fixed list of classes.

In FP16 it needs roughly 8.4 GB of VRAM for inference, covering weights, activations, and KV cache. Quantized to INT4 that falls to about 2.1 GB, which lets an 8 GB card like an RTX 4060 run it. The model uses BF16, so you need an Ampere-or-newer NVIDIA GPU (RTX 30/40/50-series, A100, or H100).
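
The quoted figures can be sanity-checked with simple arithmetic. This is a back-of-the-envelope sketch: it assumes 2 bytes per parameter for FP16, 0.5 bytes for INT4 (ignoring quantization scales), and 1 GB = 10^9 bytes, and it treats whatever remains of each quoted total as activations and KV cache.

```python
PARAMS = 3.8e9  # parameter count stated in the text

# Assumed storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"fp16": 2.0, "int4": 0.5}


def weight_gb(precision: str, params: float = PARAMS) -> float:
    """Memory taken by the weights alone, in GB (10^9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


# Compare the weight footprint with the totals quoted above.
for precision, quoted_total in (("fp16", 8.4), ("int4", 2.1)):
    weights = weight_gb(precision)
    remainder = quoted_total - weights
    print(f"{precision}: weights {weights:.1f} GB, remainder {remainder:.1f} GB")
# fp16: weights 7.6 GB, remainder 0.8 GB
# int4: weights 1.9 GB, remainder 0.2 GB
```

In both cases the weights account for about 90% of the quoted total, which is consistent with the remainder being runtime overhead.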

The weights are free to download from Hugging Face. The model is released under the NVIDIA License (non-commercial), which allows use, modification, and reproduction for academic and non-profit research only. Commercial use is not permitted, so review the license before using it in a paid product.

The model can run fully offline. Once you download the weights with huggingface-cli, it runs entirely on your own hardware with no internet connection. In Atomic Chat it runs on-device, so the images you analyze and the prompts you write never leave your machine. There is also a C++ ggml port (locate-anything.cpp) for CPU-only offline inference without a Python runtime.
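
If you load the model from Python rather than through Atomic Chat, one way to make sure nothing reaches the network is the Hugging Face Hub offline switch. This is a minimal sketch, assuming the weights are already in the local cache; the enable_offline_mode helper is a hypothetical name, while HF_HUB_OFFLINE is the environment variable the Hub client reads.

```python
import os


def enable_offline_mode() -> None:
    """Tell Hugging Face libraries to use only files already on disk.

    Call this before importing huggingface_hub or transformers, because
    the variable is read when those libraries are imported.
    """
    os.environ["HF_HUB_OFFLINE"] = "1"


enable_offline_mode()
print(os.environ["HF_HUB_OFFLINE"])  # 1
```

With the switch set, a load that needs a file missing from the cache fails with an error instead of trying to download it.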

Unlike a fixed-class detector such as YOLO, LocateAnything-3B does open-vocabulary detection: you can name any category in plain text and it finds it, with no retraining required. It performs strongly in dense scenes and on long-tail objects, and it can resolve referring expressions like "the person on the left holding a phone." That makes it a fit for GUI agents, robotics, and document-understanding pipelines that need spatial coordinates from language.
