
Guide to LocPilot - LOCAL AI, NO MS!

LocPilot setup: local coding on a Framework Desktop / Strix Halo PC

I have recently put together a fairly satisfactory 'local vibe coding' or 'LocPilot' (local Copilot!) solution. Briefly, it involves several ingredients. For more advanced tasks, alas, this is a HYBRID setup: the coding happens locally, and the architecting happens with a 'bigger' chat model. You can do the planning locally with a bigger model too, but planning and coding locally is SLLLOOOWWW. It can be done, but I find the hybrid approach more realistic.

  1. Cline (an excellent open-source agentic coding thingy)
  2. VSCodium - the "free" as in telemetry-free build of MS VS Code, without the lock-in (and Copilot nonsense!)
  3. A Framework Desktop with 128 GB of RAM
  4. A 4 TB NVMe drive (I use a Samsung EVO Pro 980)
  5. Qwen3 Coder 30B A3B Instruct
  6. An "architect", i.e. a paid chat agent such as ChatGPT, Google Gemini, or Claude (you can omit this, but more advanced coding will be harder)
  7. Jan (model hosting made easy and open-sourcy)

General High-Level Walkthrough of the Arch

The setup involves the following high-level steps:

  1. Setting up the Framework (I use Fedora 43 as my OS)
  2. Figuring out which kernel to use for stability (as of this moment it's 6.18.3 with the backdated firmware, unless you stick with 6.17.x)
  3. Avoiding ROCm (it's a bit of a pain to get working)
  4. Setting the appropriate kernel settings so the Framework can load large models. Go to section 6 of the guide from this link; there is also a sketch of the general shape after this list.
  5. Installing Jan
  6. Getting the model and quantizing it to 6-bit (Q6_K), which gives you the best bang for your buck. Just look up "quantizing models with llama.cpp"; basically you run the conversion on the ORIGINAL non-GGUF safetensors download (see the sketch after this list).
  7. Modifying the model.yml to make it WAY faster (and changing the llama.cpp hardware options)
  8. Setting up VSCodium & Cline
  9. Importing the model into Jan
  10. Starting the API server (OpenAI-compatible) in Jan and running the 6-bit quantized model
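
For step 4, the exact values come from the linked guide, but the general shape of the change is raising the kernel's TTM/GTT limits so the iGPU can map most of the 128 GB. A minimal sketch on Fedora, with placeholder values you should take from that guide:

# Append the TTM limits to the kernel command line (placeholder values!)
sudo grubby --update-kernel=ALL --args="ttm.pages_limit=<pages_limit> ttm.page_pool_size=<pool_size>"
sudo reboot

For step 6, the rough flow with llama.cpp's own tooling looks like the following. This is only a sketch, assuming you have cloned and built llama.cpp and downloaded the original safetensors repo; the paths and file names are illustrative.

# Convert the ORIGINAL (non-GGUF) safetensors download into a GGUF file
python convert_hf_to_gguf.py /path/to/Qwen3-Coder-30B-A3B-Instruct \
    --outfile qwencoder-f16.gguf --outtype f16

# Then quantize it down to 6-bit (Q6_K)
./llama-quantize qwencoder-f16.gguf qwencoder-Q6_K.gguf Q6_K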

Model.yml - The Speed Increase Settings

This might be the most important section, and it took me the longest to figure out (lots of going back and forth with AI, benchmarks, and personal exploration!).

# Basic Model Info
name: qwencoder-Q6_K.gguf
model_path: /models/jan/qwencoder-Q6_K.gguf
size_bytes: 25092534080
embedding: false

# Prompt Template (ChatML)
prompt_template: |
  <|im_start|>system
  {system_message}<|im_end|>
  <|im_start|>user
  {prompt}<|im_end|>
  <|im_start|>assistant

# Inference Parameters
parameters:
  temperature: 0.2
  top_p: 0.9

# Hardware & Engine Settings
settings:
  engine: llama.cpp
  ctx_len: 262144
  n_batch: 8192
  n_ubatch: 1024
  n_gpu_layers: 99
  flash_attn: true
  # These two probably don't do anything here, since they belong in the llama.cpp config
  cache_type_k: q8_0
  cache_type_v: q8_0

Configuration Breakdown

| Item | Explanation |
| --- | --- |
| name / model_path | The file name and the exact location on your storage where the model lives. |
| size_bytes | The file size (approx. 23.4 GiB), used by the software to check whether you have enough RAM/VRAM. |
| embedding | Set to false because this is a chat/coding model, not a specialized embedding model for text search. |
| prompt_template | The "wrapper" that formats your text so the AI knows which parts are instructions and which are user questions. |
| temperature | Set at 0.2 to keep the AI focused and predictable, which is ideal for logical tasks like coding. |
| top_p | A filter that restricts the AI to the most likely word options (the top 90% of probability mass). |
| engine | Specifies llama.cpp as the underlying "motor" that performs the mathematical calculations. |
| ctx_len | The "memory" limit, allowing the AI to keep up to 262,144 tokens (roughly 150-200 pages of text) in context at once. |
| n_batch / n_ubatch | How many tokens the engine processes at once, which speeds up the initial reading of your prompt. |
| n_gpu_layers | Set to 99 to ensure the entire model is offloaded to the GPU rather than running on the slower CPU. |
| flash_attn | An optimization that speeds up processing and reduces memory usage during long conversations. |
| cache_type_k/v | These attempt to compress the AI's "short-term memory" (the KV cache) to save space, though here they are likely placeholders. |

Explanation of Values

| Config Item | Value | Why it's a good choice (optimality) |
| --- | --- | --- |
| name / path | qwencoder-Q6_K | The Q6_K format provides near-perfect logic at a reduced file size, making it the "gold standard" for high-end local setups. |
| temperature | 0.2 | A low value prevents the AI from being "creative" with code syntax, giving more predictable and syntactically correct output. |
| top_p | 0.9 | Acts as a safety net, discarding unlikely word choices while still allowing enough variety to avoid repetitive loops. |
| ctx_len | 262144 | This massive "memory" lets you paste entire library documentations or multiple source files without the AI losing context. |
| n_batch | 8192 | A high batch size speeds up the initial "ingestion" of your prompt by processing 8,192 tokens at a time. |
| n_gpu_layers | 99 | A high number ensures the entire model is offloaded to the GPU, which is significantly faster than running it on the CPU. |
| flash_attn | true | A mathematical optimization that lets the AI handle long conversations faster and with less memory overhead. |
| cache_type | q8_0 | Compressing the KV cache to 8-bit lets the huge 262k context window fit in memory without sacrificing reasoning quality. |

This model file lives in the Jan data directory, which you may need to locate yourself. In my setup it is /Models/jan/llamacpp/models/qwencoder-Q6_K.gguf, with /Models being a separate mount point. I told Jan to put its data files there instead of in my home directory because I wanted the model on a separate NVMe drive.

LLAMA.CPP model provider setup

The cache type settings seem to have NO effect in the model.yml; they appear to live in the Jan GUI instead. This may change, but as of this moment they are located under "Settings", "Model Providers", Llama.cpp, the gear icon; scroll down until you see "KV Cache K Type" and set it to "q8_0", then set "KV Cache V Type" to the same value.

For me this was probably the hardest thing to find, since most of the AI directions seem to think it lives in the model.yml (which they also seem to think is a JSON file!).

I also recommend setting the timeout for llama.cpp to 3600 to keep it from unloading the model.

Also enable MLOCK and continuous batching, and set ubatch to 1024.
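
For reference, these GUI knobs roughly correspond to the following llama-server invocation. This is only a sketch: flag names vary a bit between llama.cpp versions, and the model path, port, and key below are just examples from my setup.

./llama-server \
  -m /Models/jan/llamacpp/models/qwencoder-Q6_K.gguf \
  -c 262144 -ngl 99 -b 8192 -ub 1024 \
  --flash-attn \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --mlock --cont-batching \
  --host 127.0.0.1 --port 8080 --api-key <your-key>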

Jan API Server Setup

It's easiest to choose the Qwen3 Coder model as the default in the Jan chat UI if you want autostarts to work when using Jan. It is also EASY to run llama.cpp directly with similar parameters and skip Jan entirely, but I won't cover that here.

Basically, go to the API server page, set the API password, increase the wait timeout (if you want to), and choose "start". If you loaded the Qwen3 Coder model in your chat like the directions said, you will be ready for the Cline part.
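
Before pointing Cline at it, it is worth sanity-checking the endpoint from a terminal. Something like the following works; the URL, key, and model name are whatever Jan's API server page reports, mine are just examples.

# Confirm the server is up and the key works by listing the exposed models
curl -H "Authorization: Bearer <your-api-password>" http://127.0.0.1:1337/v1/models

# Quick chat completion against the quantized model
curl http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-password>" \
  -d '{"model": "qwencoder-Q6_K.gguf", "messages": [{"role": "user", "content": "Say hello in fish shell."}]}'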

Cline Setup

Install VSCodium. You may need to install the hack that lets you use the MS extension store (not that I would EVER encourage ANYONE to violate terms of service. You have been warned!). Then install the Cline extension.

Once Cline is installed, simply choose the endpoint listed in the Jan API server page and put in the passcode you set. I also like to enable "Skills" and disable browser access (it causes Cline to try to do image-based stuff, which makes the model mad). Under the API config I increase the context size to 256,000 and disable image capability. I also turn on compact caching (not sure how much it does, but the AI seems to like it).
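
Concretely, the provider settings I fill in look something like this (the field labels may differ slightly between Cline versions, and the URL, key, and model ID are whatever Jan reports):

API Provider: OpenAI Compatible
Base URL:     http://127.0.0.1:1337/v1     (copy the endpoint Jan's API server page shows)
API Key:      <the password you set in Jan>
Model ID:     qwencoder-Q6_K.gguf          (the model name exactly as Jan lists it)

The relevant chunk of my VSCodium settings.json ends up looking like this: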

{
    "workbench.colorTheme": "Solarized Dark",
    "terminal.integrated.profiles.linux": {
        "fish": {
            "path": "/home/<your_home_U-Use-Fish-right/bin/fish"
        }
    },
    "terminal.integrated.defaultProfile.linux": "fish",
    "comments.openView": "never",
    "git.openRepositoryInParentFolders": "always",

    /* Cline & AI Configuration */
    "cline.useCompactPrompt": true,
    "cline.autoCompact": true,
    "cline.contextWindow": 262144,
    "cline.providerSettings": {
        "openai-compatible": {
            "useCompactPrompt": true
        }
    }
}

Memlock memory fixes

You need to raise the memlock limits to work around the memlock issues found in Jan. Add the following lines to /etc/security/limits.conf:

* soft memlock unlimited
* hard memlock unlimited
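
After saving that, log out and back in (or reboot) and verify that the new limit is active for your session:

# Should report "unlimited" for max locked memory
ulimit -l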

Fish script to launch Jan and fix sqlite-vec for better and faster RAG

There is a bug in Jan that prevents it from loading a proper sqlite-vec, which makes RAG retrieval for docs MUCH harder. The fix is to drop the extension into the extracted AppImage, so YOU HAVE TO EXTRACT the AppImage in your Jan AppImage directory first or THIS WILL NOT work (see the note after the script for the extraction command). Be sure to change the paths as appropriate for your config; this script is for MY config. Also use this script to launch Jan itself.

#!/usr/bin/fish

# 1. Define Paths
set JAN_DIR "$HOME/jan/squashfs-root"
set TARGET_DIR "$JAN_DIR/usr/bin/resources/bin"
set EXT_DIR "$HOME/.local/share/jan.ai.app/extensions"
set SQLITE_VEC_URL "https://github.com/asg017/sqlite-vec/releases/download/v0.1.6/sqlite-vec-0.1.6-loadable-linux-x86_64.tar.gz"

# 2. Ensure the Target Directory exists inside the extracted AppImage
mkdir -p "$TARGET_DIR"

# 3. Get the extension if we don't have it yet
if not test -f "$EXT_DIR/sqlite-vec.so"
    echo "Downloading sqlite-vec..."
    mkdir -p "$EXT_DIR"
    set TEMP_DIR (mktemp -d)
    curl -L $SQLITE_VEC_URL -o "$TEMP_DIR/sqlite-vec.tar.gz"
    tar -xzf "$TEMP_DIR/sqlite-vec.tar.gz" -C "$TEMP_DIR"
    mv "$TEMP_DIR/vec0.so" "$EXT_DIR/sqlite-vec.so"
    rm -rf "$TEMP_DIR"
end

# 4. Copy the extension to the exact internal path Jan expects
# Jan looks for the filename 'sqlite-vec' (no extension) or 'sqlite-vec.so'
cp "$EXT_DIR/sqlite-vec.so" "$TARGET_DIR/sqlite-vec.so"
ln -sf "$TARGET_DIR/sqlite-vec.so" "$TARGET_DIR/sqlite-vec"

# 5. Launch the extracted binary
echo "Launching Optimized Jan..."
cd $JAN_DIR
./AppRun
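
To use it: if you have not already extracted the AppImage, the standard AppImage runtime flag for that is --appimage-extract (run it in your Jan AppImage directory so squashfs-root ends up where the script expects; your exact file name will differ), then make the script executable and launch Jan through it from now on. The script path below is just an example.

# One-time: extract the AppImage in the directory where you keep it
./Jan-*.AppImage --appimage-extract

# Then launch Jan via the script from now on
chmod +x ~/bin/launch-jan.fish
~/bin/launch-jan.fish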