Sometime last year I first started exploring local AI models and even made a thread about it somewhere on here with my initial attempts. Started with two 5060ti 16gb cards and Ollama and never really got anywhere useful with it. The biggest problem is that while Ollama is easy it's quite restrictive and inefficient... it's not really a good path forward if you're serious about local AI. Anyway, the AI landscape has changed significantly plus my workflow has also changed. In the meantime I sold one of the 5060tis for more than I paid for it and the other one is in my main workstation/gaming PC.
In the meantime, what I used the most for my web development and other projects is GitHub Copilot BUT at the end of this month they're going from including a massive amount of usage with a fixed price subscription (for $40/month I could do almost anything and everything using some really good, expensive models) to per token pricing. Based on my past usage I would be spending $500-1000/mo for what I just used this month using the new GHCP pricing. So I downgraded to the $10/mo plan so I have access to the most advanced frontier models as needed but so far I've been playing with my local AI models and they can do 80% of the work. We shall see how my usage goes.
Anyway, I made (mostly vibe-coded) a custom piece of software that uses Docker and Llama-cpp that gives me a nice web UI that I can use to manage devices and models. I also spent a few bucks on GPUs. Haha. More on that in a bit.
I can chat with it using the web interface but more useful than that it gives me an OpenAI-compatible API that I can integrate with stuff. Right now I mostly just use it in OpenCode or occasionally in the Continue VSCode plugin but I'm also working on integrating it with more things. But the main thing about my custom bit of software is that it supports multiple GPU vendors, mixed GPUs, and pooling (within the same vendor). Yes I could have done it with vLLM which is more powerful and performant, but that's more work to configure and this way it does what I want, the way I want. Plus I'm lazy and a nice web UI I can click stuff in is less work than configuring and managing vLLM.
My biggest challenge right now is properly implementing real time lookups and web searches. I mainly used Grok for this ($30/mo plan) but I've since gone down to the free plan and just spread my free usage across Grok, Gemini, ChatGPT, and Claude. $30 is $30! Once I have implemented web search and real time data access into my app I will use a lot less of the cloud services. Because I prefer to keep my data under my control. I have also been doing just fine without Claude Code, their usage is just too restrictive. Although from my understanding now that they rent a bunch of compute from xAI they loosened this. But I think if I do get another AI subscription it will be Cursor. We'll see.
Software aside, I have two "AI servers" now:
1. 1x AMD AI Pro R9700 32GB GPU, Intel Core i7-14700F CPU, 64GB DDR5, 1TB NVMe SSD. I'm also going to add an Arc A380 6GB card to this just as a cheap low power way to run small models concurrently with the larger models without powering on the other system. Currently I just use the CPU for this but it's more power-efficient to use a small GPU instead of the CPU and our power costs are pretty high here. Ultimately if local AI really does alleviate all my GHCP usage I will probably get a second R9700 but I need to get a better platform/motherboard first because the existing motherboard only runs the second PCI-E slot at x4 which will bottleneck the GPU. This is my primary AI server.
Originally I had two Arc B60s instead of the single AMD R9700 but they were just too unstable. I tried them in various computers but it was a mess. So I returned them and exchanged them for the AMD card. I'm much happier with it. Although 48GB of VRAM would have been great!
2. 3x NVIDIA RTX3050 8GB GPU, Intel Core i7-9800X, 64GB DDR4, 512GB NVMe SSD. This one was a hodgepodge of cheap leftover parts combined with a few other things I got a good deal, otherwise it's really not efficient and not the ideal route. It's a secondary server I use for testing and various smaller models but sometimes the NVIDIA CUDA stack just works better than the Vulkan stack I'm using on the other server for AMD. Initially I was using ROCm for AMD but that thing is so trash and so broken in so many ways AMD should be ashamed of themselves...
Yes a 64GB or 128GB Mac Mini or Studio would be more efficient but I love being able to tinker with stuff and my custom thing runs on Ubuntu so that wouldn't really do what I want.
For the models, there are so many to list that I'm playing with. Qwen3.6 really is insanely good for a local, not huge model!
Oh, and it's warm in here! I think my bedroom looks more like a datacenter (albeit a very sloppy one with a hodgepodge pile of desktop PCs) than a bedroom.
As the project gets more stable, secure, and reliable, I might post a link to my open source project, but for now, just wanted to share and discuss the hardware and local AI in general. Anyone else doing local AI at home? And if so, on what hardware, with what software, what models, and what workflow?
In the meantime, what I used the most for my web development and other projects is GitHub Copilot BUT at the end of this month they're going from including a massive amount of usage with a fixed price subscription (for $40/month I could do almost anything and everything using some really good, expensive models) to per token pricing. Based on my past usage I would be spending $500-1000/mo for what I just used this month using the new GHCP pricing. So I downgraded to the $10/mo plan so I have access to the most advanced frontier models as needed but so far I've been playing with my local AI models and they can do 80% of the work. We shall see how my usage goes.
Anyway, I made (mostly vibe-coded) a custom piece of software that uses Docker and Llama-cpp that gives me a nice web UI that I can use to manage devices and models. I also spent a few bucks on GPUs. Haha. More on that in a bit.
I can chat with it using the web interface but more useful than that it gives me an OpenAI-compatible API that I can integrate with stuff. Right now I mostly just use it in OpenCode or occasionally in the Continue VSCode plugin but I'm also working on integrating it with more things. But the main thing about my custom bit of software is that it supports multiple GPU vendors, mixed GPUs, and pooling (within the same vendor). Yes I could have done it with vLLM which is more powerful and performant, but that's more work to configure and this way it does what I want, the way I want. Plus I'm lazy and a nice web UI I can click stuff in is less work than configuring and managing vLLM.
My biggest challenge right now is properly implementing real time lookups and web searches. I mainly used Grok for this ($30/mo plan) but I've since gone down to the free plan and just spread my free usage across Grok, Gemini, ChatGPT, and Claude. $30 is $30! Once I have implemented web search and real time data access into my app I will use a lot less of the cloud services. Because I prefer to keep my data under my control. I have also been doing just fine without Claude Code, their usage is just too restrictive. Although from my understanding now that they rent a bunch of compute from xAI they loosened this. But I think if I do get another AI subscription it will be Cursor. We'll see.
Software aside, I have two "AI servers" now:
1. 1x AMD AI Pro R9700 32GB GPU, Intel Core i7-14700F CPU, 64GB DDR5, 1TB NVMe SSD. I'm also going to add an Arc A380 6GB card to this just as a cheap low power way to run small models concurrently with the larger models without powering on the other system. Currently I just use the CPU for this but it's more power-efficient to use a small GPU instead of the CPU and our power costs are pretty high here. Ultimately if local AI really does alleviate all my GHCP usage I will probably get a second R9700 but I need to get a better platform/motherboard first because the existing motherboard only runs the second PCI-E slot at x4 which will bottleneck the GPU. This is my primary AI server.
Originally I had two Arc B60s instead of the single AMD R9700 but they were just too unstable. I tried them in various computers but it was a mess. So I returned them and exchanged them for the AMD card. I'm much happier with it. Although 48GB of VRAM would have been great!
2. 3x NVIDIA RTX3050 8GB GPU, Intel Core i7-9800X, 64GB DDR4, 512GB NVMe SSD. This one was a hodgepodge of cheap leftover parts combined with a few other things I got a good deal, otherwise it's really not efficient and not the ideal route. It's a secondary server I use for testing and various smaller models but sometimes the NVIDIA CUDA stack just works better than the Vulkan stack I'm using on the other server for AMD. Initially I was using ROCm for AMD but that thing is so trash and so broken in so many ways AMD should be ashamed of themselves...
Yes a 64GB or 128GB Mac Mini or Studio would be more efficient but I love being able to tinker with stuff and my custom thing runs on Ubuntu so that wouldn't really do what I want.
For the models, there are so many to list that I'm playing with. Qwen3.6 really is insanely good for a local, not huge model!
Oh, and it's warm in here! I think my bedroom looks more like a datacenter (albeit a very sloppy one with a hodgepodge pile of desktop PCs) than a bedroom.
As the project gets more stable, secure, and reliable, I might post a link to my open source project, but for now, just wanted to share and discuss the hardware and local AI in general. Anyone else doing local AI at home? And if so, on what hardware, with what software, what models, and what workflow?