[ot] 3B / 3GB quantized edge language model

Undescribed Horrific Abuse, One Victim & Survivor of Many gmkarl at gmail.com
Fri Sep 22 17:57:57 PDT 2023


I'm wondering whether this would be useful for hobby finetuning. Only 8k
context length though (some models have 128k now), although ALiBi is
purported to extrapolate to context lengths longer than those it was
trained on.
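
The extrapolation claim follows from ALiBi adding a bias to the attention
scores that depends only on the relative distance between query and key,
with no learned position table to run off the end of. A minimal sketch of
that bias in plain PyTorch (not BTLM's actual code; the head count and
sequence lengths below are arbitrary):

import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric head-specific slopes 2^-1, 2^-2, ... (the simple
    # power-of-two head-count case from the ALiBi paper).
    return torch.tensor([2.0 ** (-8.0 * (i + 1) / num_heads)
                         for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # distance[i, j] = j - i: zero on the diagonal, more negative the
    # further a key lies in the past of the query.
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]        # (seq, seq)
    slopes = alibi_slopes(num_heads)                           # (heads,)
    bias = slopes[:, None, None] * distance[None, :, :]        # (heads, seq, seq)
    # Causal mask: a query may not attend to future keys.
    return bias.masked_fill(distance[None, :, :] > 0, float("-inf"))

# Toy sizes so this runs anywhere; nothing in the function references a
# maximum trained length, so the same call works at 8,192 or 16,384.
print(alibi_bias(num_heads=8, seq_len=16).shape)   # torch.Size([8, 16, 16])

Whether quality actually holds up past 8k is a separate empirical question;
the biases just keep the math well defined at any length.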

https://huggingface.co/papers/2309.11568

BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new
state-of-the-art 3 billion parameter open-source language model.
BTLM-3B-8K was trained on 627B tokens from the SlimPajama dataset with
a mixture of 2,048 and 8,192 context lengths. BTLM-3B-8K outperforms
all existing 3B parameter models by 2-5.5% across downstream tasks.
BTLM-3B-8K is even competitive with some 7B parameter models.
Additionally, BTLM-3B-8K provides excellent long context performance,
outperforming MPT-7B-8K and XGen-7B-8K on tasks up to 8,192 context
length. We trained the model on a cleaned and deduplicated SlimPajama
dataset; aggressively tuned the μP hyperparameters and
schedule; used ALiBi position embeddings; and adopted the SwiGLU
nonlinearity. On Hugging Face, the most popular models have 7B
parameters, indicating that users prefer the quality-size ratio of 7B
models. Compacting the 7B parameter model to one with 3B parameters,
with little performance impact, is an important milestone. BTLM-3B-8K
needs only 3GB of memory with 4-bit precision and takes 2.5x less
inference compute than 7B models, helping to open up access to a
powerful language model on mobile and edge devices. BTLM-3B-8K is
available under an Apache 2.0 license on Hugging Face:
https://huggingface.co/cerebras/btlm-3b-8k-base.
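
The 3GB-at-4-bit figure is what makes this interesting for edge hardware.
A hedged sketch of loading it that way with transformers + bitsandbytes
(these are standard library options; I haven't verified anything
BTLM-specific beyond it shipping custom model code on the Hub, hence
trust_remote_code, and bitsandbytes assumes a CUDA GPU):

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

model_id = "cerebras/btlm-3b-8k-base"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # ~0.5 bytes per weight
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,   # BTLM uses a custom architecture on the Hub
)

prompt = "A 3B parameter model on an edge device could"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

For hobby finetuning, a 4-bit load like this is roughly the starting point
for QLoRA-style setups, though I haven't tried it on this checkpoint.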

