[old][ai] Open Source Alternative to Megatron-based Language Models Released February

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Wed Mar 9 04:07:28 PST 2022


https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/

The full weight files weigh in at 308033580802 bytes (286.88 GiB).
The slim weight files, which usually means precision has been reduced
to float16 (sometimes float8), weigh in at 41112854242 bytes (38.29
GiB).

Traditionally the entire model is loaded into VRAM to evaluate it,
although it can also be streamed in and out, or distributed across
multiple machines with some hacks. There is additional overhead beyond
the weights themselves, and significantly more if the model is being
further trained (fine-tuned) for a specific task.
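
As a rough illustration of what those file sizes mean in practice,
here is a back-of-the-envelope memory estimate in Python (a sketch
only: the per-parameter byte counts are standard, but the comment
about training state is an inference from the file sizes above, not
something stated in the announcement):

```python
# Rough memory estimate for GPT-NeoX-20B, ignoring activations, the KV
# cache, and framework overhead, which all add to the totals below.

NUM_PARAMS = 20_556_201_984   # from the hyperparameter table below

def gib(n_bytes: int) -> float:
    return n_bytes / 2**30

fp16_weights = NUM_PARAMS * 2   # 2 bytes per parameter
fp32_weights = NUM_PARAMS * 4   # 4 bytes per parameter

print(f"fp16 weights: {gib(fp16_weights):6.2f} GiB")  # ~38.3 GiB, matching the slim files
print(f"fp32 weights: {gib(fp32_weights):6.2f} GiB")  # ~76.6 GiB

# The "full" files above (~287 GiB) are roughly 7.5x the fp16 weights,
# which is consistent with them also carrying training state (fp32 copies
# of the weights plus optimizer moments); this is an inference from the
# sizes, not a stated fact.
```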

https://arxiv.org/abs/2101.00027
https://arxiv.org/abs/2201.07311

# GPT-NeoX-20B

## Model Description

GPT-NeoX-20B is an autoregressive transformer language model trained
using [GPT-NeoX](https://github.com/EleutherAI/gpt-neox). "GPT-NeoX"
refers to the aforementioned framework, while "20B" represents the
number of trainable parameters.

**Hyperparameter**|**Value**
:-----:|:-----:
Num. parameters|20,556,201,984
Num. layers|44
D\_model|6,144
D\_ff|24,576
Num. Heads|64
Context Size|2,048
Vocab Size|50257/50432*
Positional Encoding|[Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)
Rotary Dimensions|25%
Tensor Parallel Size|2
Pipeline Parallel Size|4

\* The embedding matrix is padded up to 50432 in order to be divisible
by 128, but only 50257 entries are used by the tokenizer.

The model consists of 44 layers with a model dimension of 6144, and a
feedforward dimension of 24,576. The model dimension is split into 64
heads, each with a dimension of 96. Rotary Position Embedding is
applied to the first 24 dimensions of each head. The model is trained
with the same vocabulary size as in GPT-2/GPT-3, but with a new
tokenizer trained on [the Pile](https://pile.eleuther.ai/), our
curated pretraining dataset (described below).
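
As a sanity check on the parameter count in the table, here is a rough
tally from these hyperparameters (a sketch that ignores biases and
layer-norm parameters and assumes untied input and output embedding
matrices, so it lands slightly under the exact figure):

```python
# Back-of-the-envelope parameter count from the hyperparameters above.
d_model  = 6144
d_ff     = 24576
n_layers = 44
n_heads  = 64
vocab    = 50432                      # padded embedding rows

head_dim = d_model // n_heads          # 96; RoPE covers 96 // 4 = 24 dims
attn     = 4 * d_model * d_model       # Q, K, V and output projections
mlp      = 2 * d_model * d_ff          # up- and down-projections
embed    = 2 * vocab * d_model         # input embedding + output head (untied)

total = n_layers * (attn + mlp) + embed
print(f"{total:,}")   # 20,551,041,024: within ~0.03% of 20,556,201,984
```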


## Training data

GPT-NeoX-20B was trained on [the Pile](https://pile.eleuther.ai/), a
large-scale curated dataset created by EleutherAI.

## Training procedure

GPT-NeoX-20B was trained for 470 billion tokens over 150,000 steps on
96 40GB A100 GPUs for around three months. It was trained as an
autoregressive language model, using cross-entropy loss to maximize
the likelihood of predicting the next token correctly.
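
The objective is the standard next-token cross-entropy; a minimal
PyTorch-style sketch of that loss follows (illustrative code, not
taken from the GPT-NeoX training framework):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Autoregressive cross-entropy: position t is trained to predict token t+1.

    logits: [batch, seq_len, vocab_size] model outputs
    tokens: [batch, seq_len] input token ids
    """
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = tokens[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```

For scale, 470 billion tokens over 150,000 steps works out to roughly
3.1 million tokens, i.e. on the order of 1,500 sequences of 2,048
tokens, per optimizer step.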

## Intended Use and Limitations

GPT-NeoX-20B learns an inner representation of the English language
that can be used to extract features useful for downstream tasks. The
model is, however, best at what it was pretrained for: generating text
from a prompt.

Due to the generality of the pretraining set, it has acquired the
ability to generate completions across a wide range of tasks - from
programming to fiction writing.
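
For illustration, here is a minimal prompt-completion sketch using the
Hugging Face transformers API, assuming the checkpoint is published on
the Hub as EleutherAI/gpt-neox-20b (an assumption, not stated in this
announcement) and that a GPU with room for the ~40 GiB of float16
weights is available:

```python
# Minimal prompt-completion sketch; model identifier and hardware
# requirements are assumptions, not part of the announcement above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")   # needs a GPU (or GPUs) with room for the ~40 GiB of weights
model.eval()

prompt = "GPT-NeoX-20B is a 20 billion parameter language model that"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,    # sample rather than always taking the argmax token
        temperature=0.8,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```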

## Limitations and Biases

The core functionality of GPT-NeoX-20B is taking a string of text and
predicting the next token. While language models are widely used for
tasks other than this, there are a lot of unknowns with this work.
When prompting GPT-NeoX-20B it is important to remember that the
statistically most likely next token is often not the token that
produces the most "accurate" text. Never depend upon GPT-NeoX-20B to
produce factually accurate output.

GPT-NeoX-20B was trained on [the Pile](https://pile.eleuther.ai/), a
dataset known to contain profanity, lewd, and otherwise abrasive
language. Depending upon the use case, GPT-NeoX-20B may produce
socially unacceptable text. See Sections 5 and 6 of [the Pile
paper](https://arxiv.org/abs/2101.00027), or [the Pile
Datasheet](https://arxiv.org/abs/2201.07311), for a more detailed
analysis of the biases in the Pile.

As with all language models, it is hard to predict in advance how
GPT-NeoX-20B will respond to particular prompts, and offensive content
may occur without warning. We recommend having a human curate or
filter the outputs before releasing them, both to censor undesirable
content and to improve the quality of the results.

