I have successfully compiled and run GLM-130B on a local machine! It's now running in `int4` quantization mode and answering my queries.
I'll explain the installation below; if you have any questions, feel free to ask! github.com/THUDM/GLM-130B
130B parameters on 4x 3090s is impressive. GPT-3, for reference, is 175B parameters, though it's possible GPT-3 is over-parameterized for the data & compute it was trained on...
I feel like a #mlops hacker having got this to work! (Though it should be much easier than it was.)
To get GLM to work, the hardest part was building the FasterTransformer fork with CMake. I'm not a fan of CMake; I don't think anyone is.
I had to install the cuDNN libraries manually into my conda environment, then hack CMakeCache.txt to point to them...
Even then the generated compile command didn't work, so I edited it by hand to actually compile. I haven't used C++ seriously since I left #gamedev, but the muscle memory is there... github.com/thudm/fastertr…
The code from the repository doesn't compile as-is:
- #include "stdio.h" is missing in several places; likely I have a different Linux setup.
- A ':' scope-resolution character is missing in a std::vector declaration (written as std:vector).
You then have to do a couple of things:
- Sign up to get the full GLM weights; the ~260 GB download is quite fast.
- Convert the model to int4 with convert.py and check it's actually ~4x smaller (quick sanity check after this list).
- Edit config/model_glm_130b_int4.sh to change the path & add --from-quantized-checkpoint.
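A minimal sketch of the size check I mean — the paths here are placeholders for wherever your original and quantized checkpoints actually live:

```python
# Rough sanity check that the int4 checkpoint is ~4x smaller than the
# original fp16 one. Paths are hypothetical -- point them at your own dirs.
from pathlib import Path

def dir_size_gb(path):
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e9

orig = dir_size_gb("/data/glm-130b")        # original fp16 checkpoint
quant = dir_size_gb("/data/glm-130b-int4")  # output of the int4 conversion

print(f"original: {orig:.1f} GB, int4: {quant:.1f} GB, ratio: {orig / quant:.2f}x")
# Expect a ratio close to 4 (fp16 -> int4), minus any tensors left unquantized.
```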
Then, to run inference:
- Call `bash scripts/generate.sh` to get it working.
- If any dependency went wrong, you'll find out here.
- pip pulled in the wrong `apex` by default; I had to install it manually.
- Build it from source with `python setup.py install --cuda_ext --cpp_ext` (import check below) github.com/NVIDIA/apex
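A quick way to confirm the compiled extensions actually built and you didn't fall back to the pure-Python apex — the exact extension module names (`amp_C`, `fused_layer_norm_cuda`) are my assumption of what `--cuda_ext` produces:

```python
# Check that apex was built with the CUDA/C++ extensions rather than the
# pure-Python fallback. Module names are an assumption, not guaranteed.
try:
    import apex                      # base package
    import amp_C                     # should exist if --cuda_ext built
    import fused_layer_norm_cuda     # used by apex.normalization
    print("apex CUDA extensions look OK")
except ImportError as e:
    print(f"apex is missing compiled extensions, rebuild from source: {e}")
```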
From there, it takes 472.4s to load the model, but it's not on an SSD yet. (I downloaded the files to a terabyte-scale archive disk and should move the quantized model over now.)
Then you can type queries...
Quantization to int4 does hurt quality, but less than I would have thought. I think only the weights are quantized and inference still runs in float16... I need to check the paper again.
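My mental model of weight-only quantization, as a toy sketch — not GLM's actual kernels (those are fused CUDA ops, and I haven't re-checked the paper's exact scheme), just the idea that weights are stored in int4 with a scale and dequantized back to floating point for the matmul:

```python
import torch

def quantize_int4(w: torch.Tensor):
    """Toy symmetric weight-only quantization: one scale per output row.
    int4 covers [-8, 7]; real kernels pack two 4-bit values per byte."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequant_matmul(x, q, scale):
    # Dequantize on the fly, then matmul in floating point
    # (float16 on GPU in the real thing; float32 here so it runs on CPU).
    return x @ (q.float() * scale).t()

w = torch.randn(1024, 1024)
x = torch.randn(1, 1024)
q, s = quantize_int4(w)
err = (dequant_matmul(x, q, s) - x @ w.t()).abs().max().item()
print(f"max abs error from int4 weights: {err:.4f}")
```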
Inference takes about 65s to 70s for a single query. Results seem fine in general chat/trivia, but honestly I'm more interested in some of the specialized abilities... need to figure out prompts for those!
When I say "seems fine": it basically hallucinates like all the other models... so it's useless in production.
(e.g. it got basic facts wrong: OpenAI Five was not open-source, climate change and poverty are not a focus for OpenAI, etc.)
If you'd like me to try a prompt, let me know!
(Disclaimer: It doesn't seem to be on the same level as GPT-3, but it's an important step for less centralized LLMs.)
To reduce latency, I think more GPUs would definitely help, if that's an option.
From what I've seen & read, there appears to be model parallelism at work, as GPU utilization is pretty good — not perfect, but a consistent 91%.
The rig I'm using is a 4-slot setup with 3090s. I originally had some trouble powering it because they sold me a 1000W power supply, and it needed to be 2000W to be comfortable with full utilization and transient power spikes. timdettmers.com/2023/01/16/whi…
Working out the power usage and cost, it comes to around 1 euro cent per query, assuming 1 minute per query at near-full utilization and current electricity prices. (Probably an overestimate, though it doesn't include any hardware costs.)
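The back-of-the-envelope math, with my assumed numbers — rig power draw and electricity price are guesses, not measurements:

```python
# Rough per-query electricity cost. All inputs are assumptions:
# ~4x 3090 at near-full load plus the rest of the rig.
power_kw = 1.5            # assumed draw of the whole rig under load
query_seconds = 60        # ~1 minute per query
price_eur_per_kwh = 0.40  # assumed current European electricity price

cost_eur = power_kw * (query_seconds / 3600) * price_eur_per_kwh
print(f"{cost_eur * 100:.2f} euro cents per query")  # ~1 cent
```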
If you're on a continent where electricity prices have not been impacted by sanctions, it'd be much cheaper than 1 cent per query. Thanks for the calculations, @lovetheusers!
I broke the previous query down into parts; better than not breaking it down at all, but still not very good:
I'm struggling to find use cases for this beyond the tech-demo aspect of int4 quantization. If you have any ideas for what GLM could be useful for, let me know...
(Willing to do a few more queries to find out before giving up.)
I've tried about 10 different prompts and most of the outputs are nonsense. Lots of repetition; I think contrastive search would help the generation a lot... (Not worth pursuing otherwise.)
It likes to repeat questions, and then degenerates.
I can't type line breaks into their default console, but here's the query and its response (it got it wrong):
> Mountain A is 5 feet, Mountain B is 20 inches, Mountain C is 1 feet tall. Question: Which is the tallest? Answer:
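For reference, the arithmetic the model should have done — Mountain A is the tallest:

```python
# Convert everything to inches: 1 foot = 12 inches.
heights_in = {"A": 5 * 12, "B": 20, "C": 1 * 12}
print(max(heights_in, key=heights_in.get), heights_in)  # A {'A': 60, 'B': 20, 'C': 12}
```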
Why are models trained like Flan-T5 (on many different prompts) useful as bots, whereas larger models like GLM seem to be only statistical models of language?
Maybe all models need to be trained on a variety of specific tasks to understand what's expected.
Maybe the pretraining dataset that Google uses (C4), even ignoring the specialized tasks used by Flan, is better than the one used to train GLM?
Even as a statistical model of language it doesn't seem very good and quickly degenerates. I wonder if that's because of int4?
Assuming int4 is only slightly worse than int8 or float16, my conclusion is that how you train your model is so much more important than its size...
But since the code keeps 3 of the 4 GPUs at 100% even when not running queries, I'm turning GLM off for tonight.
Send queries for tomorrow!
Since it's a bilingual Chinese/English model, if you have any Chinese-language questions, I can test those too...
@Ted_Underwood on Mastodon: "I’m just glad to know 130B is doable; that’s the main thing."
Yes! A clear goal for anyone building local, privacy-centric LLMs could be to train a Flan-T5-style model that can run quantized at int4.
You could fit roughly 30B parameters per 24 GB GPU.
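The rough arithmetic behind that — the headroom split is my guess:

```python
# int4 weights take half a byte per parameter.
params = 30e9
weight_gb = params * 0.5 / 1e9          # 15 GB of weights
gpu_gb = 24
print(f"{weight_gb:.0f} GB of int4 weights, {gpu_gb - weight_gb:.0f} GB left "
      "for activations, KV cache and overhead")
# For GLM-130B itself: 130e9 * 0.5 / 1e9 = 65 GB, which fits on 4x 24 GB cards.
```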
I just found that it defaults to NUM_BEAMS=4, which explains why execution time didn't seem proportional to the number of tokens in the output.
I'll try TOP_P and TOP_K tomorrow; those are likely to be faster — let's see how the quality holds up.
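I haven't checked GLM's exact flag names yet, so here's the idea illustrated with a small Hugging Face model instead — the concepts are the same: beam search keeps several candidate sequences alive (roughly multiplying per-token work by the beam count), while top-p/top-k sampling advances a single sequence.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The tallest mountain in Europe is", return_tensors="pt")

# Beam search: 4 candidate sequences kept alive, ~4x the compute per token.
beams = model.generate(**inputs, num_beams=4, max_new_tokens=30)

# Nucleus / top-k sampling: one sequence, one forward pass per token.
sampled = model.generate(**inputs, do_sample=True, top_p=0.9, top_k=40,
                         max_new_tokens=30)

print(tok.decode(beams[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```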
GLM-130B knows a lot as a text-token model, but much of its knowledge seems hard to access. Significant work is needed to get it to perform as a chatbot, problem solver, etc.
Could be via clever prompting & search heuristics, or better training?
Write to GitHub and ask for the contact details of their Data Protection Officer. If they refuse, explain that providing those details is mandated by the GDPR. Ask the DPO what their policy is on personally identifying information under the GDPR. Post the response!
If you go via Support, you'll probably have to ask for the DPO contact three times, because frontline Google Support is not GDPR-aware and will refuse a few times to see if you're serious.
Not sure if intentional or incompetence!
Remember that the DPO is "protected from internal interference" from the organization itself (GitHub, Google), so if they give 'internal policy' as an excuse, remind them that their role requires them to prioritize data & privacy issues first and foremost.
I never realized just how fragile tokenization can be when you're crafting LLM prompts!
Say your model was trained to summarize with "\n\nTLDR:" and you decide to include an extra space after the ":" so that the space is excluded from the generated output: you now get different tokens.
The continuation might be "This research ...", but the statistics get messed up because of the extra space: the tokenizer would normally have tokenized " This" with the leading space folded into the token before the capital.
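A concrete check with the GPT-2 BPE tokenizer — GLM has its own tokenizer, so this is only to illustrate the mechanism:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# What the model saw at training time: ":" then " This" (space folded into the token).
print(tok.tokenize("\n\nTLDR: This research"))
# If your prompt already ends with ": ", the continuation has to start with "This"
# instead of " This" -- a different token, so different statistics.
print(tok.tokenize("This research"))
print(tok.encode(" This") == tok.encode("This"))  # False
```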
I'm not sure this really is "engineering"; it's more like prompt hacking...
With closed-source models, you may not even be able to see what's going on inside the tokenizer (or immediately after it), which makes the guessing game even worse!
It's been 36h since this thread, with many constructive discussions since!
One front I failed on goes something like this:
"I thought you were an AI coder. How come you want CoPilot to be withdrawn? Do you want to cancel large models?"
First, there's no risk of the CoPilot service being terminated or the technology abandoned. I don't want to see that, and that's not the objective of their lawsuit either.
Second, I think medium- to large-models are absolutely worth pursuing technologically!
I'll be honest: part of AI ethics today feels misguided, hard to pin down, disconnected from reality. That goes hand in hand with ethics washing.
However, I resonate with other parts, and that's where I draw my line: consent. This goes hand in hand with following the law.
AVIF is based on the AV1 codec (the successor to VP9/VP10), much like WEBP is based on the VP8 codec. Google is a driving force behind AV1, so it has an interest in promoting it instead of superior alternatives.
This means there'll be ~50% more energy used, and thus carbon, for internet bandwidth. 🙄
Most of our internal data-processing pipelines in #ML are based on JPEG-XL. For instance, the FFHQ dataset with "visually lossless" compression now takes 84 GB instead of 995 GB. It transfers faster, loads faster, trains faster.