Taking on local QLoRA fine-tuning with a single Turing-generation GPU, the RTX 2070 (in progress)

This article is in progress. I am writing down where I currently stand (environment setup → baseline measurement → getting the training loop to run), without polishing it after the fact. The accuracy comparison after training will be added once I finish the run.

What I’m doing, and why

I have a personal RAG system that I keep growing for my own use. It takes the raw notes I throw at it and automatically sorts them into kinds — lessons, knowledge, values, design decisions — then summarizes and structures them. It indexes all of that so it can be retrieved with a hybrid of vector search and keyword search plus re-ranking, and — this is the crucial part — it automatically injects only the relevant pieces of the accumulated knowledge into an AI assistant (such as Claude Code) on every turn. The conversations I have are themselves taken back in, re-sorted, and returned. It is a closed loop that keeps turning scattered notes into a long-term memory an AI can use (a second brain), and I have grown it while measuring retrieval accuracy as a number.

At the entrance of that loop sits the “sorting (classification)” step I just mentioned. What I am building this time is the classifier that handles that step, made as a fine-tuned small model that runs locally.

There are three motivations.

I want to actually learn fine-tuning (the method of training the weights) by doing it end to end with my own hands.
Classification is currently outsourced to a general-purpose large model on every call, and I want to replace that with a dedicated small model that is cheaper, faster, and more stable.
Above all, the training data is my own personal notes, and I do not want to upload them to an external cloud. That is why I set “complete it entirely on my own laptop” as the very first constraint.

The hardware on hand is a single gaming laptop with a Turing-generation GPU, the RTX 2070 (8GB VRAM). For training it looks modest next to the latest generation, but checking “how far can I get with the tools I already have” was part of the point of this challenge.

The whole flow is below. What I am doing follows the same discipline I run every day on my own RAG: “measure → change → measure.”

 ┌──────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌──────────────┐
 │ 1. Data  │ → │ 2.Baseline │ → │ 3. Train   │ → │ 4.Evaluate │ → │ 5. Compare   │
 │   prep   │   │  (before)  │   │  (QLoRA)   │   │  (after)   │   │ before/after │
 └──────────┘   └────────────┘   └────────────┘   └────────────┘   └──────────────┘
     same "measure → change → measure" loop as before/after evaluation

The dullest and most important step: facing the data

When people hear “fine-tuning” they picture the learning algorithm, but what actually ate my time was preparing the training data. An honest inventory of my data looked like this.

It is scarce and skewed to begin with (a mix of well-populated kinds and kinds with only a dozen-odd items)
There is no ground-truth data for the “none of the above” class, so I have to create it myself
The cleaned-up notes and the raw notes I actually want to classify are written in noticeably different styles

Clean data never falls from the sky. This grubby preprocessing is the real work, and how you decide to cut corners here directly shapes the result. In the end I shaped the data into a usable form, prepared a few hundred training examples, and split off a portion for evaluation.
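To make the "split off a portion for evaluation" step concrete, here is a minimal sketch of a per-class (stratified) split, which keeps the thin classes from vanishing out of either half. The function name, the 20% fraction, and the toy labels are all my own placeholders for illustration, not the exact script I ran.

```python
import random
from collections import defaultdict


def stratified_split(examples, eval_fraction=0.2, seed=0):
    """Split (text, label) pairs class by class so skewed classes stay represented."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, evaluation = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Hold out at least one item per class, but always leave one for training.
        # A class with a single item therefore stays entirely in the training set.
        n_eval = max(1, round(len(items) * eval_fraction))
        n_eval = min(n_eval, len(items) - 1)
        evaluation.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, evaluation


# Hypothetical, deliberately skewed toy data (not my real notes).
toy = (
    [(f"lesson note {i}", "lesson") for i in range(40)]
    + [(f"design note {i}", "design_decision") for i in range(12)]
    + [(f"misc note {i}", "none_of_the_above") for i in range(8)]
)
train_set, eval_set = stratified_split(toy)
print(len(train_set), len(eval_set))
```

Fixing the seed matters more than it looks: the before and after numbers are only comparable if both are measured on the same evaluation split.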

The iron rule: measure “where you are” before training

Once the data is ready, you do not start training right away. If you are going to improve something, you first need the number from before the improvement. Without it you cannot claim “it got better” or “it got worse.”

So I had the general-purpose large model classify the evaluation split as-is, and measured its accuracy. The result was about 48% overall. Breaking it down by class, I also found that it almost never recognized one kind of note. “There is plenty of room to grow” became visible as a number, not an impression, and that alone keeps every later decision from wobbling.
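The measurement itself is small. Below is a sketch of how overall accuracy and a per-class breakdown can be computed from two lists of labels. The label values are made up for illustration, and the function name is my own; the real evaluation split is not shown here.

```python
from collections import Counter


def accuracy_and_recall(gold, predicted):
    """Return overall accuracy and a {label: recall} dict for a classifier."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and the same length")
    pairs = list(zip(gold, predicted))
    totals = Counter(gold)                          # how many items each class really has
    hits = Counter(g for g, p in pairs if g == p)   # how many of those were found
    recall = {label: hits[label] / totals[label] for label in totals}
    accuracy = sum(hits.values()) / len(gold)
    return accuracy, recall


# Hypothetical labels for illustration only.
gold = ["lesson", "lesson", "knowledge", "value", "value", "design"]
pred = ["lesson", "knowledge", "knowledge", "lesson", "knowledge", "design"]

overall, per_class = accuracy_and_recall(gold, pred)
print(f"accuracy={overall:.2f}")
for label, value in sorted(per_class.items()):
    print(f"{label}: recall={value:.2f}")
```

The per-class view is what exposes a class the model almost never finds; a single overall number would hide it behind the well-populated classes.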

The heart of the code: the QLoRA setup

This is the core of fitting inside the 8GB limit. Compress the base to 4-bit and freeze it, and train only a small LoRA adapter.

   ┌────────────────────────────────────┐
   │  Base model — FROZEN (4-bit)  ❄     │       ┌────────────────────┐
   │  billions of weights, NOT updated   │ ────► │ LoRA adapter        │
   │                                     │       │ (trainable, tiny)   │
   └────────────────────────────────────┘       └────────────────────┘
       Train only the small adapter (~0.1–1%) → fits a modest GPU

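import torch
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base in 4-bit (NF4) = a large cut in VRAM usage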
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # Turing (RTX 2070) can't do bf16 -> fp16
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    quantization_config=bnb,
    device_map="auto",
    dtype=torch.float16,
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# Add only a small trainable adapter (LoRA) on top of the frozen base
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

The point — starting with compute_dtype — is that the numeric formats are stated explicitly from the start. Leave this vague and a GPU generation gap will trip you later (next section).
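How small is "small" for the adapter? LoRA adds two thin matrices per targeted layer, so the added weight count is rank × (input width + output width). The sketch below does that arithmetic for the settings above. The layer shapes and the rough base size are my assumptions about the Qwen2.5-3B configuration, not values read from the loaded model, so treat the result as an order-of-magnitude check.

```python
def lora_param_count(shapes, rank, num_layers):
    """LoRA adds rank * (d_in + d_out) weights for each adapted linear layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# ASSUMPTION: shapes as I understand the Qwen2.5-3B config (hidden 2048,
# MLP width 11008, 36 layers, k/v projections narrowed to 256 by grouped-query
# attention). Check model.config before relying on these numbers.
HIDDEN, KV, MLP, LAYERS = 2048, 256, 11008, 36
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

adapter = lora_param_count(shapes, rank=16, num_layers=LAYERS)
BASE_PARAMS = 3.1e9  # ASSUMPTION: rough total for a "3B" model
print(f"{adapter:,} adapter weights, about {adapter / BASE_PARAMS:.2%} of the base")
```

Once the model is wrapped with PEFT, `model.print_trainable_parameters()` reports the real figure, which is the number to trust over this back-of-the-envelope estimate.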

The climax: I fell over on the “environment,” not the code

This is where the real work begins — except what tripped me was not the algorithm but, every single time, the ground-level environment. The steps themselves are in the official docs and the write-ups. But run them on your own machine and walls appear that no textbook mentions. I list them in the order I hit them.

#	Symptom	What it really was	How I fixed it
1	Unknown type error on library import	A generation mismatch between core libraries (one too new, the type the other expected was gone)	Realigned the foundational libraries to a generation that matches the upper layer
2	Config file load dies on a decoding error	Japanese Windows’ default character encoding can’t read a UTF-8 file	Start the program in UTF-8 mode
3	Training API rejects an argument	An API change across library versions (an argument had been renamed)	Adjusted the call to the convention of the installed version
4	Stops on training step 1	On Turing (RTX 2070), the newer numeric-format (bf16) operation was unimplemented	Made the numeric handling explicit and removed the mixed-precision mechanism to avoid it
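For row 2, the underlying mechanism can be shown in a few lines. This is a sketch, not the code that failed for me: the file name and its contents are hypothetical, and the helper is my own.

```python
import sys
import tempfile
from pathlib import Path

# With no encoding argument, open() uses the locale's preferred encoding
# (cp932 on Japanese Windows) unless Python is running in UTF-8 mode.
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))


def read_config_text(path):
    """Read a config file as UTF-8 regardless of the OS locale."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical config content with non-ASCII text, written to a temp file.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.yaml"
    cfg.write_text("label: 教訓\n", encoding="utf-8")
    loaded = read_config_text(cfg)

print(loaded.strip())
```

Passing `encoding="utf-8"` only helps for files your own code opens. When the failing read happens inside a library, switching the whole process to UTF-8 mode (`python -X utf8`, or the `PYTHONUTF8=1` environment variable) is the fix that reaches it, which is why I went that way.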

Number 4 was especially stubborn: stating the dtype explicitly was not enough, and in the end turning off mixed precision (the speed-up mechanism) itself was what finally got the training loop moving. I learned only after stepping on it that this is a classic Turing-generation pitfall.
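One cheap guard I would add next time is to decide the dtype from the GPU generation instead of from habit. The sketch below is a plain function over the CUDA compute capability; the function name is hypothetical, and the rule it encodes is only the hardware fact that bf16 support arrived with Ampere (capability 8.0), while Turing reports 7.5.

```python
def pick_compute_dtype(capability):
    """Choose a dtype name from a CUDA compute capability (major, minor)."""
    major, _minor = capability
    # bf16 is supported in hardware from Ampere (8.0) onward; older cards get fp16.
    return "bfloat16" if major >= 8 else "float16"


# Turing cards such as the RTX 2070 report compute capability 7.5.
print(pick_compute_dtype((7, 5)))
# An Ampere-generation card reports 8.x.
print(pick_compute_dtype((8, 6)))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns that tuple and `getattr(torch, name)` turns the name into a dtype. As described above, this alone did not get my loop running; it only removes one source of ambiguity before the mixed-precision question comes up.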

As code, the decision is just two flags. But being able to say “why I turn this off here” is the difference between copying and building it yourself.

from trl import SFTConfig

args = SFTConfig(
    per_device_train_batch_size=1,   # smallest, to fit 8GB; earn effective batch via accumulation
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=False, bf16=False,          # <- decision: on Turing, mixed precision (fp16) trips on the gradient.
                                     #    give up speed, prioritize "it runs reliably" first
    optim="paged_adamw_8bit",        # memory-frugal optimizer
    gradient_checkpointing=True,     # redo part of the compute to save activation memory
)
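Why a batch size of 1 with 8 accumulation steps still behaves like a batch of 8: for a loss that is averaged over examples, summing the scaled gradients of the micro-batches gives the same gradient as one pass over the full batch. The toy model below (a single weight, squared error, invented data) is only there to show that equality; it is not the trainer's actual code path.

```python
import math


def grad_mse(w, batch):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical toy data: eight (x, y) pairs.
data = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]
w = 0.5

# One pass over the full batch of 8.
full_batch_grad = grad_mse(w, data)

# Eight micro-batches of 1, each scaled by 1/8 and summed before a single update.
accum_steps = 8
accumulated = 0.0
for example in data:
    accumulated += grad_mse(w, [example]) / accum_steps

print(full_batch_grad, accumulated)
assert math.isclose(full_batch_grad, accumulated)
```

The trade is memory for wall-clock time: only one example's activations are held at once, but each weight update now costs eight forward and backward passes.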

Clear one and the next appears. It can wear you down. But writing down each spot where I got stuck and “why I fixed it that way” turned it into an asset that does not disappear. For the next person to attempt the same setup — me a few months from now, for instance — this record itself becomes a map.

Training ran. But reality was “time”

After clearing every stumble, the training loop finally started turning. The loss (the number for how wrong the model is) was being computed properly and going down. That moment of relief is good no matter how many times you feel it.

But the last thing standing in the way was speed. On the RTX 2070 (Turing), each step took about 100 seconds, and running it to the end was estimated at roughly 6 hours. Throughout that time the GPU is occupied and the machine keeps running hot.

Here I made one realistic call. Pause for now. And on the next run, be decisive about the text length and the number of epochs, to bring it into a realistic amount of time. Rather than forcing a 6-hour run out of perfectionism, “make one lap first and get a result, then run the real thing if it looks good” — that, I judged, is the faster way to learn.
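The estimate is simple arithmetic, and writing it down makes the effect of cutting epochs visible before committing GPU hours. In the sketch below the example count of 576 is a hypothetical figure, chosen only because it reproduces the 6-hour estimate; it is not my actual dataset size. I also assume the 100 seconds is per optimizer step (after accumulation), and that the trainer rounds up partial steps, which may differ by library version.

```python
import math


def estimate_run(num_examples, batch_size, accum_steps, epochs, sec_per_step):
    """Return (optimizer steps, hours) for a training run."""
    examples_per_step = batch_size * accum_steps
    steps = math.ceil(num_examples / examples_per_step) * epochs
    return steps, steps * sec_per_step / 3600


# ASSUMPTION: 576 examples is hypothetical; the other values come from the config above.
steps, hours = estimate_run(576, batch_size=1, accum_steps=8, epochs=3, sec_per_step=100)
print(f"3 epochs: {steps} steps, {hours:.1f} h")

# Same data, one epoch instead of three.
steps_short, hours_short = estimate_run(576, batch_size=1, accum_steps=8, epochs=1, sec_per_step=100)
print(f"1 epoch:  {steps_short} steps, {hours_short:.1f} h")
```

Epochs scale the total linearly. Shortening the text length works differently: it lowers the seconds per step rather than the step count, so its effect has to be measured rather than computed.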

Where I stand right now (in progress)

✅ Built the training environment separately, without touching the production RAG (GPU recognized too)
✅ Shaped the training data and split it for training and evaluation
✅ Measured the before number (about 48% overall)
✅ Confirmed the training loop runs (cleared every stumble)
⏳ Run the training to completion → produce the after number and compare ← next, from here

So it is still “mid-way.” But I have crossed the steepest mountain — the environment setup. What remains is to secure the time to run it to the end, and to measure the result honestly.

In closing: what you can copy, and what you can’t

“With AI, anyone can do it easily,” people often say. Half of it is true. The code can indeed be copied. The official steps and the samples are all there if you search.

But there is something you cannot copy. Where it got stuck, what I threw away, and why I chose those settings — the judgment. This 6-hour reality, and how you decide to cut it down in the face of it. What happens on a Turing-generation GPU. How to dodge the character-encoding trap. These “things you only understand after stepping on them” are not written in the manuals.

The value always remains there. So I record it, stumbles and all. When I finish the run, I will write the continuation, together with the after numbers.

The conceptual sorting-out — what is the difference between prompt tuning and fine-tuning — is in a separate article.

🔗 Concept edition: What is the difference between “prompt tuning” and “fine-tuning”?

Glossary (words used in this article)

Term	Meaning
Fine-tuning	Update the model’s weights with examples to train its behavior
Personal RAG	A self-built long-term memory system that searches accumulated knowledge and passes it to the AI as context
Vector search / hybrid search / re-ranking	Search documents by both meaning (vectors) and words (keywords), then reorder the top results before passing them
Training data	Examples of “input → correct answer” for learning; a few hundred items in this article
QLoRA	A memory-frugal method: quantize the base to 4-bit and freeze it, train only a small LoRA
Quantization / 4-bit / NF4	Representing weights in fewer bits to save VRAM (NF4, short for 4-bit NormalFloat, is the 4-bit format QLoRA uses)
VRAM	GPU memory; where the model and computation live. The GPU here has 8GB
LoRA adapter	A pair of small trainable low-rank matrices added alongside each targeted layer of the frozen base
Frozen	Not updated during training (saves VRAM and compute)
Loss / gradient	A number for how wrong it is / how to move each weight
Mixed precision (fp16 / bf16)	Lowering numeric precision to speed up and save memory; bf16 needs hardware support that arrived with newer GPU generations (Turing does not have it)
Gradient accumulation / gradient checkpointing	Bundle small batches to earn an effective batch / redo part of the compute to save activation memory
Optimizer (paged_adamw_8bit)	The mechanism that updates the weights; the 8-bit version is memory-frugal
Baseline (before)	The reference value before improvement (here, the general model’s accuracy, ~48%)
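recall / accuracy	Evaluation metrics for classification: accuracy is the share of all items classified correctly, recall is the share of one class’s items that were classified correctly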
Pipeline (Data prep → Baseline → Train → Evaluate → Compare)	The flow: data prep → pre-measurement → training → evaluation → comparison