OpenELM-1_1B-Instruct

Apple's 1.1B instruction-tuned OpenELM model, built with layer-wise scaling for efficient on-device English text generation.

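# Install dependencies and download the weights
pip install -U transformers sentencepiece
huggingface-cli download apple/OpenELM-1_1B-Instruct
# Generate with Apple's helper script; the Llama-2 tokenizer is gated, so pass a Hugging Face access token
python generate_openelm.py --model apple/OpenELM-1_1B-Instruct --hf_access_token [HF_ACCESS_TOKEN] --prompt 'Once upon a time there was' --generate_kwargs repetition_penalty=1.2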

# Query an OpenAI-compatible server running locally on port 8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple/OpenELM-1_1B-Instruct",
    "messages": [{"role": "user", "content": "Write a short poem about the sea."}]
  }'

from transformers import AutoModelForCausalLM, AutoTokenizer

# OpenELM ships custom modeling code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-1_1B-Instruct", trust_remote_code=True)
# OpenELM reuses the gated Llama-2 tokenizer and expects a BOS token at the start of the prompt.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_bos_token=True)
inputs = tokenizer("Once upon a time there was", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, repetition_penalty=1.2)
print(tokenizer.decode(out[0], skip_special_tokens=True))

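// Same request through the openai Node SDK, pointed at the local server; the apiKey value is a placeholder.
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "local" });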
const res = await client.chat.completions.create({
  model: "apple/OpenELM-1_1B-Instruct",
  messages: [{ role: "user", content: "Write a short poem about the sea." }]
});
console.log(res.choices[0].message.content);

At a glance

License: Apple Sample Code License (apple-amlr)
Context length: 2,048 tokens
Parameters: 1.08B (BF16)
Primary language: English
Minimum hardware: ~4 GB VRAM (BF16), under 1 GB at 4-bit
Strengths: parameter efficiency, on-device English chat

Overview

OpenELM-1_1B-Instruct is the 1.1-billion-parameter, instruction-tuned member of Apple's OpenELM family, released in April 2024. OpenELM stands for Open Efficient Language Models, and the lineup spans 270M, 450M, 1.1B, and 3B sizes, each in pretrained and instruction-tuned variants. The defining idea is layer-wise scaling: instead of giving every transformer block the same width, OpenELM makes the attention and feed-forward dimensions smaller in early layers and larger in later ones, spending parameters where they help accuracy most. Apple trained the models with its CoreNet library and published the full pipeline, from data preparation through evaluation.
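As a rough illustration of layer-wise scaling, the sketch below linearly interpolates the number of attention heads and the feed-forward multiplier from the first layer to the last. The function name, the `LayerShape` type, and the specific values (28 layers, model dimension 2048, head dimension 64, scaling ranges 0.5 to 1.0 and 0.5 to 4.0) are assumptions chosen for illustration, not values read from the released configuration, and the real implementation applies additional rounding and grouped-query attention.

```python
from dataclasses import dataclass


@dataclass
class LayerShape:
    index: int
    num_heads: int
    ffn_multiplier: float


def layerwise_scaling(num_layers, model_dim, head_dim,
                      alpha=(0.5, 1.0), beta=(0.5, 4.0)):
    """Linearly interpolate attention and FFN widths from first to last layer.

    alpha scales the attention width, beta scales the feed-forward width.
    All default values are illustrative assumptions.
    """
    shapes = []
    for i in range(num_layers):
        # t runs from 0.0 at the first layer to 1.0 at the last layer.
        t = i / (num_layers - 1) if num_layers > 1 else 0.0
        a = alpha[0] + (alpha[1] - alpha[0]) * t
        b = beta[0] + (beta[1] - beta[0]) * t
        num_heads = max(1, round(a * model_dim / head_dim))
        shapes.append(LayerShape(i, num_heads, round(b, 2)))
    return shapes


shapes = layerwise_scaling(num_layers=28, model_dim=2048, head_dim=64)
print(shapes[0])
print(shapes[-1])
```

Under these assumed settings the first layer gets half as many heads as the last, which is the sense in which parameters are shifted toward later layers.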

What it's good at

The model scores well for its size on standard benchmarks. On the LLM360 evaluation suite the 1.1B Instruct version averages 49.94, with 71.83 on HellaSwag and 41.55 on ARC-Challenge, ahead of the base OpenELM-1_1B. Apple reports 2.36% higher accuracy than OLMo-1.2B while using roughly half the pretraining tokens. Training drew on RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totaling about 1.8 trillion tokens of mostly English text. The model is therefore best suited to English prompts and short instruction-following tasks.

Running locally

The BF16 weights are about 2.2 GB (1.08B parameters at 2 bytes each), so the model runs on a GPU with roughly 4 GB of VRAM once activations and the KV cache are accounted for; 4-bit quantization brings the weights under 1 GB. Load it through Hugging Face Transformers with trust_remote_code=True, since OpenELM ships custom modeling code. It relies on the Llama-2 tokenizer and needs add_bos_token=True; Apple's generate_openelm.py script handles this and supports speculative decoding for faster inference. The 2,048-token context window limits it to short inputs.
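The weight sizes quoted here follow from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate only: the helper name is hypothetical, it counts weights alone, and it ignores activations, the KV cache, and the per-block scale metadata that real quantization formats add.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 1.08e9  # parameter count stated on this page

for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS, bits):.2f} GB")
```

The gap between the BF16 weight size and the roughly 4 GB VRAM figure is headroom for everything this estimate leaves out.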

License

The weights are published under the Apple Sample Code License (apple-amlr), which is more restrictive than permissive licenses such as Apache 2.0 or MIT. Apple releases the models without safety guarantees and recommends users run their own testing and filtering. Read the license terms before using the model commercially.

Frequently asked questions

What is OpenELM-1_1B-Instruct?

OpenELM-1_1B-Instruct is a 1.08-billion-parameter instruction-tuned language model released by Apple in April 2024. It belongs to the OpenELM family, which uses a layer-wise scaling strategy to allocate parameters efficiently across transformer layers. Apple released it alongside the full training and evaluation framework built on the CoreNet library.

How much VRAM does OpenELM-1_1B-Instruct need?

In BF16 precision the model's weights are about 2.2 GB, so it runs on a GPU with roughly 4 GB of VRAM and fits comfortably on most modern cards. With 4-bit quantization the memory footprint drops below 1 GB, and the model is small enough to run on CPU or Apple Silicon for testing.

Is OpenELM-1_1B-Instruct open source?

The weights are openly downloadable from Hugging Face, but the model is distributed under the Apple Sample Code License (apple-amlr) rather than a standard permissive license like Apache 2.0 or MIT. That license is more restrictive than typical open-source terms, so review it before any commercial deployment. Apple did release the complete training, fine-tuning, and evaluation code to support open research.

How do I run OpenELM-1_1B-Instruct?

Load it with Hugging Face Transformers using trust_remote_code=True, since OpenELM ships custom modeling code. The model uses the Llama-2 tokenizer (meta-llama/Llama-2-7b-hf) and needs add_bos_token=True. Apple also provides a generate_openelm.py helper script in the repo, and passing repetition_penalty=1.2 is recommended for cleaner generations.
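To make the effect of repetition_penalty concrete, here is a simplified pure-Python sketch of the commonly used rule: the score of every token that has already appeared is pushed down before the next token is sampled. The function name and the toy logits are hypothetical, and the sketch works on a plain list rather than the tensors a real generation loop would use.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [2.4, -1.0, 0.6, 3.0]  # toy scores for a 4-token vocabulary
print(apply_repetition_penalty(logits, generated_ids=[0, 1, 0]))
```

A penalty of 1.0 leaves the scores unchanged, and larger values discourage repeats more strongly, which is why a mild value such as 1.2 tends to reduce loops without distorting the output much.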

What is the context length of OpenELM-1_1B-Instruct?

OpenELM-1_1B-Instruct has a maximum context length of 2,048 tokens, set by the max_context_length value in its configuration. That window is short by current standards and suits short prompts, single-turn instructions, and lightweight chat rather than long-document tasks. The model was trained primarily on English data, so it is best used for English text.
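Because the prompt and the generated tokens share the same 2,048-token window, it can help to trim long prompts before calling generate. The sketch below shows one possible policy, assuming the prompt is already a list of token IDs: keep the leading token (normally BOS) and the most recent tokens that fit. The function name and the left-truncation policy are illustrative choices, not something prescribed by the model.

```python
MAX_CONTEXT = 2048  # context length stated above


def fit_prompt(token_ids, max_new_tokens, max_context=MAX_CONTEXT, keep_first=True):
    """Trim a prompt so that prompt plus generated tokens fit in the window."""
    budget = max_context - max_new_tokens
    if budget < 1:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    ids = list(token_ids)
    if len(ids) <= budget:
        return ids
    if keep_first:
        # Preserve the leading token (normally BOS) and the most recent tokens.
        tail = ids[len(ids) - (budget - 1):] if budget > 1 else []
        return ids[:1] + tail
    return ids[len(ids) - budget:]


long_prompt = list(range(3000))  # stand-in for real token IDs
trimmed = fit_prompt(long_prompt, max_new_tokens=128)
print(len(trimmed), trimmed[0], trimmed[-1])
```

Dropping the oldest tokens suits chat-style use where recent turns matter most; tasks that depend on the start of a document would need a different policy, such as summarizing or chunking.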