[old][ai] Open Source Alternative to Megatron-based Language Models Released February

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Wed Mar 9 04:33:29 PST 2022


> The full weight files weigh in at 308033580802 bytes (286.88 GiB).
> The slim weight files, which usually means precision is reduced to
> float16 (sometimes float8), weigh in at 41112854242 bytes (38.29
> GiB).

Just a note that I might be wrong here about what full and slim mean.
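
A quick back-of-the-envelope (assuming the slim files really are plain
float16 at two bytes per parameter; that assumption is mine, not
confirmed):

    # rough arithmetic, assuming 2 bytes/param for the slim weights
    slim_bytes = 41112854242
    params = slim_bytes / 2          # ~2.06e10, about 20.6 billion parameters
    full_bytes = 308033580802
    print(full_bytes / params)       # ~15 bytes/param, well over the 4 bytes
                                     # that plain float32 weights would need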

>
> Traditionally the entire model is loaded into VRAM to evaluate it,
> although it can also be streamed in and out or distributed across
> multiple machines with some hacks. There is additional overhead beyond
> just the weights, and significantly more overhead if the model is
> further being trained for a specific task.
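
To put a number on that training overhead (a standard estimate for
mixed-precision Adam, ignoring activations entirely; the parameter
count is carried over from the guess above):

    # rough per-parameter memory accounting, assuming mixed-precision Adam
    infer = 2                      # fp16 weights only
    train = 2 + 2 + 4 + 4 + 4      # fp16 weights + fp16 grads + fp32 master
                                   # weights + Adam momentum + variance
    params = 20.6e9                # assumed from the slim-weight arithmetic
    print(params * infer / 2**30)  # ~38 GiB just to hold the model
    print(params * train / 2**30)  # ~307 GiB before any activations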

Can also add that people have been training models on low-end hardware
by tracing and training only a subset of the parameters at a time;
traditionally all are trained at once. Systems also support a form of
checkpointing that discards intermediate activations and regenerates
them when needed, as I've mentioned in a spamlog somewhere.
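
Roughly what that combination looks like, as a toy PyTorch sketch (a
minimal illustration, not the actual setup of any particular system):

    # train only a small subset of parameters and recompute activations
    # instead of storing them
    import torch
    from torch.utils.checkpoint import checkpoint

    model = torch.nn.Sequential(*[torch.nn.Linear(512, 512) for _ in range(8)])

    # freeze everything, then re-enable just the last layer
    for p in model.parameters():
        p.requires_grad = False
    for p in model[-1].parameters():
        p.requires_grad = True

    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad])

    x = torch.randn(4, 512)
    h = x
    for layer in model:
        # checkpoint() drops the intermediate activations on the forward
        # pass and recomputes them during backward, trading compute for RAM
        h = checkpoint(layer, h, use_reentrant=False)
    loss = h.pow(2).mean()
    loss.backward()
    opt.step()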

