
Preliminary update on my local AI experience with Cline

Hello all, and happy Thanksgiving. I mostly wanted to share some thoughts about local AI, a bit of vibe coding, and my somewhat new Framework Desktop.

First of all, it's amazing what a little bit of optimization can do. I had initially struggled to even get the AI tools working, but things are much better now. I can now run AI stably for hours with no issues. Some nice discoveries:

amdgpu_top- A wonderful utility that shows how hard your local graphics card is working and how much memory it is using. I find it essential for knowing how much room the model and its context are actually taking up.

Context Cache- This governs how many tokens of context the model can hold. It defaults to 8k tokens for llama.cpp under Jan (the AI console). I had a session with Gemini and found that raising it to 32k is WAY better, to the tune of about 300%!

Edit the model.yaml file and include the following in YAML format:

parameters:
  ctx_len: 32768

This line raises the context length to 32k tokens (32768). I came up with this number from the memory math: the full 256k context window would have cost on the order of 32GB, so you might need to adjust the value for different models and context lengths. The process was mostly pasting the llama.cpp output from the logs into Gemini, which analyzed it and found the performance discrepancies. It's extraordinarily good at reading these logs, though it did take a LONG time to get there. The file you need to edit lives in the SAME directory as your model inside the Jan directory. It's called 'model.yaml', and you can use the find command to locate it.
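To see why the context length drives memory use so hard, here is a rough back-of-the-envelope KV-cache estimate in Perl. The layer count, KV-head count, and head dimension below are hypothetical placeholders (not the real Qwen3 numbers), and the formula ignores llama.cpp's exact bookkeeping, so treat the output as ballpark only:

#!/usr/bin/perl
# Back-of-the-envelope KV-cache size: 2 (K and V) x layers x kv_heads
# x head_dim x context_tokens x bytes per element. All model numbers
# here are hypothetical placeholders; plug in your own model's values.
use strict;
use warnings;

my $n_layers   = 48;    # hypothetical layer count
my $n_kv_heads = 4;     # hypothetical KV heads (grouped-query attention)
my $head_dim   = 128;   # hypothetical per-head dimension
my $bytes_elem = 2;     # fp16 cache entries

for my $ctx (8192, 32768, 262144) {
    my $bytes = 2 * $n_layers * $n_kv_heads * $head_dim * $ctx * $bytes_elem;
    printf "ctx %7d tokens -> ~%.1f GB of KV cache\n", $ctx, $bytes / 2**30;
}

With those placeholder numbers, 8k of context costs well under a gigabyte, 32k costs a few gigabytes, and 256k costs tens of gigabytes, which is why the jump from 8k to 32k is cheap but the full window is not.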

The Long Road Home

I had started out just wanting to use Cline with VSCodium to do some local coding, since I REALLY don't want BIG TECH to see my thoughts, especially for personal projects. I also like to understand HOW things work. My first misunderstanding was the correct number of model parameters to use. Gemini had suggested a 430b parameter model, but as expected this was TOTALLY wrong (the poor Framework Desktop ONLY has 128GB of RAM!), since that model is about 350GB in size when fully loaded. The thing about AI models is that the weights have to be FULLY loaded into memory, and that doesn't include the context, which can take up GIGABYTES on its own! The "correct" answer was a 30b qwen3 coder quantized to 6 bits. This gave me about 60-100 tokens per second, which is quite a reasonable speed.
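As a sanity check on those sizes, the weight footprint is roughly the parameter count times bits-per-weight divided by 8. This little Perl sketch ignores the KV cache and runtime overhead, so treat the results as lower bounds (the 6.5-bit figure for the big model is just an assumption to illustrate the point):

#!/usr/bin/perl
# Weight memory ~= parameter count * bits per weight / 8.
# Ignores KV cache, activations, and runtime buffers, so these are
# lower bounds. Bit widths are illustrative assumptions.
use strict;
use warnings;

my @models = (
    { name => '430b model at ~6.5-bit quant', params => 430e9, bits => 6.5 },
    { name => '30b model at 6-bit quant',     params => 30e9,  bits => 6 },
);

for my $m (@models) {
    my $gb = $m->{params} * $m->{bits} / 8 / 1e9;
    printf "%-30s -> ~%.0f GB of weights\n", $m->{name}, $gb;
}

That puts the 430b model at roughly 350GB of weights alone, which is why it never had a chance in 128GB of RAM, while the 30b model fits comfortably.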

Additionally, choosing the 'right' model really does depend on what you are doing. There are "instruct" models, which are good for planning; coding models, which are good for coding; "dense" models, which run EVERY parameter EVERY time (and are thus quite slow on the Strix Halo, aka the Framework Desktop); and mixture-of-experts models, which only activate the parts of the network needed for the current token. Dense models require EVERY parameter to be read on EVERY token, whereas a mixture of experts only touches the relevant experts, making it MUCH faster if you don't have HBM-class memory bandwidth available.
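On a machine like this, decode speed is mostly limited by how many bytes of weights have to be streamed from memory for each token, so tokens/sec is roughly memory bandwidth divided by active-weight bytes. The ~256 GB/s bandwidth figure for the Strix Halo and the ~3b active parameters for the MoE are rough assumptions, so treat this Perl sketch as an upper-bound estimate only:

#!/usr/bin/perl
# Rough decode-speed ceiling: every generated token has to stream the
# active weights from memory, so
#   tokens/sec ~= memory bandwidth / bytes of active weights per token.
# Bandwidth and active-parameter counts below are rough assumptions.
use strict;
use warnings;

my $bandwidth = 256e9;    # assumed ~256 GB/s for Strix Halo LPDDR5X
my $bits      = 6;        # 6-bit quantization

my %active = (
    'dense 30b (every weight, every token)' => 30e9,
    'MoE 30b (~3b active per token)'        => 3e9,
);

for my $name (sort keys %active) {
    my $bytes_per_token = $active{$name} * $bits / 8;
    printf "%-40s -> ~%.0f tokens/sec ceiling\n",
        $name, $bandwidth / $bytes_per_token;
}

The mixture-of-experts ceiling lands right around the 60-100 tokens per second I actually see, while a dense model of the same size would crawl along at around 10.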

So, at least as of this writing, the qwen3 coder 30b quantized to 6 bits works nicely with the context tweak mentioned above. I also recommend using Jan, as it seems more stable on Fedora 43, which itself seems the most stable option as of the time of this writing, Nov 25, 2025. I am VERY happy with the performance of the local coding and managed to make a PERL (I despise python!) program that renders a web page locally as a markdown document. It also generated a full test suite (which to my eyes SEEMED to work); the context was about 80,000 tokens for the entire conversation generating the test plan. It is also better to use Gemini to generate the 'test plan' request and then feed the output to the local model for coding (Gemini makes a nice description of what the testing suite should do).
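For flavor, here is a minimal sketch of the kind of web-page-to-markdown script I mean. This is NOT the code the model generated, just an illustration using LWP::UserAgent and a few naive regex passes; a real version should use a proper HTML parser:

#!/usr/bin/perl
# Minimal sketch: fetch a page and emit very rough Markdown.
# Not the generated program from this post, just an illustration.
# Real HTML should go through a proper parser (e.g. HTML::TreeBuilder).
use strict;
use warnings;
use LWP::UserAgent;

my $url = shift @ARGV or die "usage: $0 URL\n";
my $ua  = LWP::UserAgent->new( timeout => 15 );
my $res = $ua->get($url);
die "fetch failed: ", $res->status_line, "\n" unless $res->is_success;

my $html = $res->decoded_content;
$html =~ s/<script\b.*?<\/script>//gis;                          # drop scripts
$html =~ s/<style\b.*?<\/style>//gis;                            # drop styles
$html =~ s/<h1[^>]*>(.*?)<\/h1>/\n# $1\n/gis;                    # h1 -> '# '
$html =~ s/<h2[^>]*>(.*?)<\/h2>/\n## $1\n/gis;                   # h2 -> '## '
$html =~ s/<a[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/[$2]($1)/gis;   # links
$html =~ s/<p[^>]*>/\n\n/gis;                                    # paragraphs
$html =~ s/<[^>]+>//gs;                                          # strip the rest

print $html, "\n";

Save it as something like page2md.pl (the name is arbitrary) and run it against a URL, redirecting the output to a .md file.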