Overview
LocateAnything-3B is a vision-language model from NVIDIA built for visual grounding: you give it an image and a text prompt, and it returns the pixel locations of what you described, as bounding boxes and points. It has about 3.8B parameters and a 32K context window, and it builds on NVIDIA's Eagle VLM line. Grounding relies on Parallel Box Decoding, a block-wise multi-token scheme that predicts whole boxes in parallel instead of generating coordinates one token at a time.
Because the model is small enough to fit on consumer hardware, you can run it inside Atomic Chat fully on-device. The images you analyze and the prompts you write stay on your own machine: nothing is uploaded to an external API, and no internet connection is required once the weights are downloaded.
What it is good at
LocateAnything-3B turns plain-language requests into spatial answers, which makes it useful anywhere you need to find things in an image rather than just describe them.
- Open-vocabulary object detection: name any category in text ("forklift", "stop sign", "ripe tomato") and the model draws boxes around every match, including in dense and long-tail scenes where a fixed-class detector like YOLO would miss them.
- Referring grounding with reasoning: point it at "the red car behind the bus" or "people wearing hats" and it reasons over the scene to resolve which specific instances you mean, returning one box or several.
- GUI and document localization: locate buttons, fields, OCR text, or layout regions from an instruction, which feeds GUI agents, document-understanding pipelines, and code that needs to act on screen coordinates.
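As an illustration of the last point, here is a minimal sketch of how downstream code might turn a returned box into a click target. The output format is not described here, so the sketch assumes a box arrives as pixel coordinates (x1, y1, x2, y2); the `Box` type and both helper names are hypothetical, not part of the model's API.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Assumed box layout: top-left (x1, y1) and bottom-right (x2, y2), in pixels."""
    x1: float
    y1: float
    x2: float
    y2: float


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds so later code never acts off-screen."""
    return Box(
        max(0.0, min(box.x1, width)),
        max(0.0, min(box.y1, height)),
        max(0.0, min(box.x2, width)),
        max(0.0, min(box.y2, height)),
    )


def click_point(box: Box) -> Tuple[int, int]:
    """Return the integer pixel at the center of the box."""
    return (round((box.x1 + box.x2) / 2), round((box.y1 + box.y2) / 2))


# Hypothetical box for a "Submit" button on a 1920x1080 screenshot.
button = clamp_box(Box(100, 40, 220, 80), width=1920, height=1080)
print(click_point(button))  # (160, 60)
```

If the model emits normalized or differently ordered coordinates, only the `Box` construction step would need to change.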
Running it locally
At 3.8B parameters the model needs roughly 8.4 GB of VRAM in FP16 for inference. Quantized to INT4 that drops to about 2.1 GB, so an 8 GB card such as an RTX 4060 can run the quantized model. It runs in BF16, which requires an Ampere-or-newer NVIDIA GPU (RTX 30/40/50-series, A100, H100). Context length is 32K. Download the weights with:
huggingface-cli download nvidia/LocateAnything-3B
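The memory figures quoted above can be approximated with back-of-the-envelope arithmetic: parameter count times bytes per parameter, plus some headroom. The sketch below assumes a flat 10% overhead, which happens to reproduce the quoted numbers. That factor is an assumption for illustration only; real usage also depends on image resolution, context length, and the runtime.

```python
def estimate_vram_gb(num_params: float, bits_per_param: int, overhead: float = 0.10) -> float:
    """Rough inference footprint in GB (1 GB = 1e9 bytes).

    overhead is an assumed flat allowance for activations and runtime
    buffers; it is not a measured value.
    """
    weight_bytes = num_params * bits_per_param / 8
    return weight_bytes * (1 + overhead) / 1e9


PARAMS = 3.8e9  # parameter count stated for the model

fp16_gb = estimate_vram_gb(PARAMS, 16)  # 2 bytes per parameter
int4_gb = estimate_vram_gb(PARAMS, 4)   # 0.5 bytes per parameter

print(f"FP16: {fp16_gb:.1f} GB")  # FP16: 8.4 GB
print(f"INT4: {int4_gb:.1f} GB")  # INT4: 2.1 GB
```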
You can load it through Hugging Face Transformers (4.57.1+) and the official Eagle worker code, or skip the setup and open it with one click in Atomic Chat, which manages the download and runtime for you.
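For the Transformers route, a loading script might look roughly like the sketch below. It assumes the checkpoint loads through `AutoModel` and `AutoProcessor` with `trust_remote_code=True`, as is common for models that ship custom code. The model card is the authority on the actual entry point and prompt format, which this sketch does not cover. The `supports_bf16` and `load_model` helpers are hypothetical names; the check relies on Ampere and newer GPUs reporting CUDA compute capability 8.0 or higher.

```python
MODEL_ID = "nvidia/LocateAnything-3B"


def supports_bf16(compute_capability) -> bool:
    """Ampere and newer NVIDIA GPUs report compute capability >= 8.0."""
    return tuple(compute_capability) >= (8, 0)


def load_model(model_id: str = MODEL_ID):
    """Load model and processor in BF16 on the first CUDA device.

    Assumption: the checkpoint is loadable via the Auto* classes with
    remote code enabled; check the model card for the real entry point.
    """
    # Imported lazily so the capability helper can be used without torch.
    import torch
    from transformers import AutoModel, AutoProcessor

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device found")
    if not supports_bf16(torch.cuda.get_device_capability()):
        raise RuntimeError("BF16 needs an Ampere-or-newer GPU")

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # newer Transformers releases also accept dtype=
        trust_remote_code=True,
    ).to("cuda").eval()
    return model, processor


print(supports_bf16((8, 6)))  # RTX 30-series: True
print(supports_bf16((7, 5)))  # Turing, e.g. RTX 20-series: False
```

Calling `load_model()` downloads the weights on first use if they are not already in the local Hugging Face cache.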
License
LocateAnything-3B is released under the NVIDIA License, listed as "other." It permits use, reproduction, and modification for academic and non-profit research only. Commercial use is not granted, so check the license terms on the model card before building anything you intend to ship or sell.
