
There is a saying I read in a book some time ago. I don't fully recall the wording and I am going to make zero attempt to look it up, but the sentiment was “the time taken to complete a task expands to fill the time available”. Essentially, the more time you have to do something, the longer it magically takes.

And this is very apt and true of compute: the more of it you have, the more of it you end up using, which requires more compute, which then gets used for more things, and so on and so forth until you realise you have almost a terabyte of RAM in your house. There is another project coming that will capitalise on this idea even further, as I am actually running out of power points and am beginning to worry a little that my garage might not be safe if I draw any more watts into it.

I know I still haven't gotten around to putting up the post about my AI agents, aka python and bash loops, and of course this post indirectly references them. I am not avoiding it, I just haven't gotten to it, because truthfully they don't feel finished yet despite their core functionality being static since... well, about this time last year.

But some of these loops run against a gemini instance on my NUC, and for some tasks it requires a metric fucktonne of RAM (over 40% of the NUC's available RAM!) so I can load up the appropriate context length. One of the tasks this ever-loaded LLM performs is writing very bad twitter posts about Yu-Gi-Oh! cards. Since both the llama 3.1 8b and the gemma 3 4b were not very well versed in esoteric deep cuts (look, it's not their fault, I did set the card pool the script cycles over to only be GOAT format and prior; this is knowledge that is two decades old), I had a line in the prompt saying if you don't know the card, make something up. And since it takes up a lot of my compute just to make up bullshit, I wondered how far I could push it. What's the smallest device and model I can use to produce semi-coherent posts?

Turns out the answer is, as you may have guessed from the title of this post, my old Raspberry Pi 3B. Man, I have had that little guy for a long time now. In 2017 it hosted a neat little project of mine that would check my WAN IP every morning before I woke up and, if it had changed, update my OVPN file and email it to me so I could get back on my VPN. Anyway. The LLM of choice (chosen by our robot overlord in a stroke of pure irony) is Qwen 2.5 0.5b. 500m params. Seems small, doesn't it? Yet that's still enough that I would need an A100 running for a few days to train it, which is most certainly not going to happen.

So how do we do it? Pretty easily tbh. This is a ready to go recipe that requires no fucking around or funny stuff. Compile time is just over 24 hours.

Install dependencies

sudo apt update && sudo apt install -y git build-essential cmake python3-pip libcurl4-openssl-dev

Increase swap so the linker & runtime don’t choke

sudo sed -i 's/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=2048/' /etc/dphys-swapfile
sudo systemctl restart dphys-swapfile
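
To make sure the change actually took, the usual suspects will tell you (nothing llama-specific here):

# should now report a 2G swap file, and free -h should agree
swapon --show
free -h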

Grab repo and compile

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_NATIVE=ON
cmake --build build --config Release -j2
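
If you want a sanity check before committing to the next step, the binaries land in build/bin; something like the below should do (the --version flag prints build info on recent llama.cpp builds, adjust if yours differs):

# confirm the binaries actually got built
ls -lh build/bin/llama-cli build/bin/llama-server

# print version / build info
./build/bin/llama-cli --version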

Install more deps and put on path

python3 -m pip install -U huggingface_hub --break-system-packages
export PATH="$HOME/.local/bin:$PATH"
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc

Download the model. It will create a models folder in whatever directory you run this from.

huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q3_k_m.gguf \
  --local-dir models --local-dir-use-symlinks False
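
Worth a quick look afterwards to see exactly where the gguf ended up (foreshadowing):

ls -lh models/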

Test it. You will note that I did not follow my own advice above: I downloaded from inside the models folder, so the gguf now lives in a models folder inside the models folder and has to be called as such. From the llama.cpp directory:

./build/bin/llama-cli -m models/models/qwen2.5-0.5b-instruct-q3_k_m.gguf \
  -t 2 -c 128 -b 32 -ub 16 -n 80 -ngl 0 --temp 1.2 \
  -p "Please give me a piece of trivia about the Yu-Gi-Oh! card 'Barrel Dragon' in a terse and factual style, with hashtags. Ensure it fits within 240 characters. If you do not recognise a card or do not have any trivia about it, fabricate a piece of convincing Yu-Gi-Oh! trivia."

Let's host an API so we can query it

cd /home/pi/llama.cpp
./build/bin/llama-server \
  -m models/qwen2.5-0.5b-instruct-q3_k_m.gguf \
  -t 2 -c 128 -ngl 0 --port 8080 --host 0.0.0.0
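
Once it's up, llama-server exposes a /health endpoint alongside the OpenAI-compatible routes, so a quick poke from the Pi itself confirms it's alive before we bother with anything fancier:

# should come back with a small JSON blob reporting the server status
curl -s http://127.0.0.1:8080/health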

We'd best set it up so it runs on boot. Create a service at /etc/systemd/system/llama-server.service and populate it as below

[Unit]
Description=llama.cpp server
After=network-online.target

[Service]
User=pi
WorkingDirectory=/home/pi/llama.cpp
ExecStart=/home/pi/llama.cpp/build/bin/llama-server -m /home/pi/llama.cpp/models/qwen2.5-0.5b-instruct-q3_k_m.gguf -t 2 -c 128 -ngl 0 --port 8080 --host 0.0.0.0
Restart=on-failure

[Install]
WantedBy=multi-user.target

Install it and start it.

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
sudo systemctl status llama-server
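
And if it sulks, the journal is where the answer will be:

# follow the service logs live
journalctl -u llama-server -f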

Hit it remotely

curl -s http://10.1.1.156:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model":"qwen2.5-0.5b",
    "messages":[
      {"role":"system","content":"You are a Yu-Gi-Oh! card trivia expert."},
      {"role":"user","content":"Please give me a single sentence of trivia about the Yu-Gi-Oh! card zz45 in a terse and factual style."}
    ],
    "max_tokens":50,
    "temperature":0.7
  }' \
| jq -r '.choices[0].message.content'

You will note I asked it for data about a card name that is nonsense, to see how it goes making up factoids about things it has zero knowledge of. I have set the max tokens very low because this one isn't smart enough to understand that I need it to fit into a tweet, so I just cut it off rather than massage the prompt like I was doing with gemma.
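
For the curious, the generation half of one of those loops isn't much more than this. My real ones are python and handle the actual posting, but a minimal bash sketch, assuming a hypothetical cards.txt with one card name per line and the server from above, looks something like:

#!/usr/bin/env bash
# Minimal sketch of the trivia generator. cards.txt (one card name per line)
# is hypothetical; the real loop also handles posting and scheduling.
while read -r card; do
  post=$(curl -s http://10.1.1.156:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg card "$card" '{
      model: "qwen2.5-0.5b",
      messages: [
        {role: "system", content: "You are a Yu-Gi-Oh! card trivia expert."},
        {role: "user", content: "Please give me a single sentence of trivia about the Yu-Gi-Oh! card \($card) in a terse and factual style, with hashtags. If you do not recognise the card, fabricate a piece of convincing trivia."}
      ],
      max_tokens: 50,
      temperature: 0.7
    }')" \
  | jq -r '.choices[0].message.content')
  # hard cut to tweet length rather than trusting the model to count characters
  echo "${post:0:240}"
done < cards.txt

The ${post:0:240} is what actually enforces the character limit, since as mentioned the model cannot be trusted to count.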

Here is one that I quite like. Good job little rpi!

[image: card]

It feels pretty good to breathe new life into my doodads. I have even printed it a neato little home, replicating the Power Mac G5. I get somewhat nostalgic for that body shape; I used one of these in my music production units back at uni. Those were some of my absolute favourite units to do and I learnt an absolute shitload. Here is its happy home inside my happy home.

[image: g5]

In order to prove the adage we opened the post with, I will vaguely allude to one of the projects I am intending to put onto the NUC with the compute that has become available now that I am not hoisting an LLM into memory every two hours, since the external GPU has other tasks (another thing I need to blog about, jesus they just pile up). I have become vaguely aware of this Diablo 2 project. I was very very very close to buying a 96GB NUC to supplement my current one, but I KNOW I will fill it too and “””require””” a third one, when what I really need is a set of APIs to turn things on and off again.

[image: wolfcastle]