Finetuning LLMs for classification
- Different LLM Architectures
- Methods to adapt LLMs for classification
- Tiny mental model (why they feel different)
- Finetuning Approaches for Classification
- My experience with the Kaggle competition
- Lessons from the Leaderboard
Most of us think of large language models (LLMs) as tools for generation: they write essays, answer questions, and spin up entire conversations. But what happens when you ask them to do something more structured, like classification? That was the question I wanted to explore when I joined the recent Kaggle competition on finetuning LLMs for classification.
The task looked simple at first glance: given a piece of text, predict a label. But doing this well with LLMs isn't as straightforward as dropping in a prompt. You have to decide how to adapt a generative model into a classifier, which brings its own set of questions:
- Do you fully finetune the model, or use parameter-efficient methods like LoRA/QLoRA?
- How do you handle long sequences without blowing up GPU memory?
- Which architectures actually strike the right balance between leaderboard performance and training cost?
Over the course of the competition, I tried out different approaches, from starting with DeBERTa baselines to experimenting with preference-pair setups and LoRA adapters, and learned what really works (and what doesn't) when you push LLMs into classification territory.
This blog is my attempt to document that journey. I'll cover the core concepts of finetuning LLMs for classification, walk through the trade-offs I faced, and share practical lessons you can apply if you're looking to move beyond "prompting" and actually train LLMs for structured decision-making.
Different LLM Architectures
At a high level, Transformer models come in three flavors. Knowing which one you're holding helps you decide how to turn it into a classifier.
Encoder-only (BERT / RoBERTa / DeBERTa)
- Pretraining objective: Masked Language Modeling (MLM), which gives bidirectional context.
- Strength: Strong text understanding; compact, fast; great when you just need a vector and a head.
- How to classify: Take the [CLS] token (or pooled embedding) and feed it to a small MLP classification head.
- Pros: Efficient, stable training, great for short/medium sequences.
- Cons: Usually smaller context windows; less natural for generation.
Decoder-only (GPT, LLaMA, Gemma, Mistral)
- Pretraining objective: Causal LM (next-token prediction), which gives left-to-right context.
- Strength: Great at generation and following instructions after SFT/RLHF.
- How to classify (two common ways): 1) Label tokens: prompt the model and force the next token(s) to be a label (e.g., "positive"/"negative"/"neutral"); a sketch of this follows after this list. 2) Head on hidden states: use the final hidden representation (e.g., of the last token) and add a classification head.
- Pros: Leverages instruction-following; easy to deploy one model for both gen + classify.
- Cons: Heavier; careful prompt/label design or head wiring needed for stable accuracy.
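To make the label-token route concrete, here is a minimal sketch of option 1, assuming a Hugging Face causal LM. The checkpoint (gpt2), prompt, and label set are only illustrative; in practice you would swap in a larger instruction-tuned model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 keeps the sketch small and runnable; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

labels = [" positive", " negative", " neutral"]  # leading space matters for BPE tokenizers
prompt = "Review: This movie was great! Sentiment:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the token after the prompt

# Score each label by the logit of its first sub-token and pick the highest.
label_ids = [tokenizer(lab, add_special_tokens=False).input_ids[0] for lab in labels]
scores = next_token_logits[label_ids]
print(labels[scores.argmax().item()].strip())
```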
Encoder–Decoder (T5 / FLAN-T5 / UL2)
- Pretraining objective: Denoising ("span corruption"), mapping input text to output text.
- Strength: Natural sequence-to-sequence framing; robust instruction-tuned checkpoints (FLAN).
- How to classify: Text-to-text: the model emits the label as text (e.g., "entailment"/"neutral"/"contradiction"). Optionally constrain the decoder to the label set; a sketch follows after this list.
- Pros: Clean task formulation; strong few-shot behavior after instruction tuning.
- Cons: Two-pass compute (encode + decode); can be slower than encoder-only for pure classification.
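As a rough illustration of the text-to-text framing with the output constrained to the label set, here is a sketch assuming FLAN-T5 from Hugging Face transformers. The NLI prompt wording and labels are placeholders, not a tuned recipe.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Score each candidate label string and keep the most likely one.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()

premise = "A man is playing a guitar."
hypothesis = "A person is making music."
prompt = f"premise: {premise} hypothesis: {hypothesis} Does the premise entail the hypothesis?"
labels = ["entailment", "neutral", "contradiction"]

enc = tokenizer(prompt, return_tensors="pt")
scores = []
for label in labels:
    target = tokenizer(label, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**enc, labels=target).loss  # mean NLL of the label tokens
    scores.append(-loss.item())                  # higher = more likely

print(labels[max(range(len(labels)), key=lambda i: scores[i])])
```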
Methods to adapt LLMs for classification
Encoder-Only Models (BERT, RoBERTa, DeBERTa)
Architecture: Transformer encoders process the input sequence bidirectionally and output contextual token embeddings.
Adaptation
- Append a classification head (linear layer + softmax) on top of the [CLS] token or pooled representation.
- Train end-to-end with cross-entropy loss.
Pros
- Efficient and lightweight for short/medium texts.
- Pre-training objectives (MLM) align well with classification.
Cons
- Limited to classification/embedding tasks (no generation).
Example: Fine-tuning DeBERTa-v3 with a small feed-forward head for sentiment analysis.
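A minimal sketch of that setup with the Hugging Face Trainer; the dataset (IMDB), hyperparameters, and output path are placeholders rather than tuned values.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Any binary sentiment dataset works; IMDB is used here as a stand-in.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="deberta-sentiment",
                           per_device_train_batch_size=16,
                           num_train_epochs=2,
                           learning_rate=2e-5),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```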
Decoder-Only Models (GPT, LLaMA, Mistral)
Architecture: Causal language models trained for left-to-right generation.
Adaptation Approaches
- Prompt + Label Tokens
Example: "Review: This movie was great! Sentiment:" → "Positive"
- Classes represented as natural language tokens.
- Softmax Head on Hidden States
- Add a classification head on the final hidden state (similar to encoder-only).
- Parameter-Efficient Fine-Tuning (PEFT)
- LoRA/QLoRA inserts small trainable matrices into attention layers.
- Updates <1% of parameters.
Pros
- Same model can handle both generation and classification.
- Naturally aligns with instruction-style prompting.
Cons
- Heavier inference cost compared to encoder-only models.
- Requires careful design of label tokens.
Example: Fine-tuning LLaMA-2-7B with QLoRA for toxicity classification.
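A hedged sketch of what that QLoRA setup can look like with transformers + peft + bitsandbytes. The checkpoint is gated, the dataset and training loop are omitted, and the LoRA hyperparameters are illustrative, so treat this as a scaffold rather than an exact recipe.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          BitsAndBytesConfig)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; any causal LM works here

# Load the backbone in 4-bit (QLoRA) with a sequence-classification head on top.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, quantization_config=bnb_config, device_map="auto")
model.config.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

# Freeze the 4-bit backbone and train only small LoRA matrices in attention layers.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# From here, training proceeds with the usual Trainer + cross-entropy loop.
```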
Preference-Based & Pairwise Approaches
Overview: These methods are useful when classification is framed not as predicting a single categorical label, but as deciding which option is preferred between alternatives.
Examples
- DPO (Direct Preference Optimization)
- Trains directly on preference pairs instead of categorical labels.
- Pairwise Classification
- Input two candidate responses and predict which one is preferred (a sketch follows at the end of this subsection).
Use Cases
- Particularly effective for tasks where human judgment matters.
- Example: Kaggle LLM Classification Fine-Tuning competition (predicting which response a user would prefer).
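For the pairwise framing, one common recipe is to pack the prompt and both candidate responses into a single sequence and train an ordinary classification head on the preference label. A minimal sketch, assuming a DeBERTa backbone and illustrative field names; the real label scheme may also include a tie class depending on the dataset.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Backbone and texts are illustrative; 0 = response A preferred, 1 = response B preferred.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2)

def encode_pair(prompt, response_a, response_b):
    # Pack everything into one sequence so the model can compare the responses.
    text = f"Prompt: {prompt}\nResponse A: {response_a}\nResponse B: {response_b}"
    return tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

batch = encode_pair(
    "Explain overfitting in one sentence.",
    "Overfitting is when a model memorizes training noise instead of general patterns.",
    "Overfitting is a type of GPU error.",
)
logits = model(**batch).logits  # trained with cross-entropy on the preference labels
print(logits)
```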
Zero-Shot & Few-Shot Prompting
Overview: Instead of fine-tuning, these approaches rely on prompt engineering to guide the model.
Types
- Zero-Shot: "Classify the following review as Positive or Negative: ..."
- Few-Shot: Provide 2–3 examples inline within the prompt to guide the model (a sketch follows at the end of this subsection).
Pros
- No training required, only inference.
- Directly leverages massive pretraining knowledge.
Cons
- Performance can be unstable and sensitive to exact prompt wording.
- Generally underperforms fine-tuned models on benchmarks.
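A small sketch of few-shot prompting with the transformers text-generation pipeline. The gpt2 checkpoint is used only to keep it runnable; a small base model like this will be far less reliable than an instruction-tuned LLM, and the reviews are made up.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Two in-context examples followed by the query we want labeled.
few_shot_prompt = (
    "Review: The plot was dull and predictable. Sentiment: Negative\n"
    "Review: Absolutely loved the soundtrack! Sentiment: Positive\n"
    "Review: I would watch this again tomorrow. Sentiment:"
)
completion = generator(few_shot_prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"]
print(completion[len(few_shot_prompt):].strip())  # hopefully "Positive"
```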
Retrieval-Augmented Classification
Overview: For tasks requiring external context or knowledge grounding, models can incorporate retrieved documents into the classification pipeline.
Approach
- Retrieve relevant documents from a knowledge base or corpus.
- Concatenate them with the input text.
- Pass the combined input to the classifier for prediction.
Example: Classifying legal case outcomes using retrieved precedents for context (sketched after the notes below).
Notes
- Often combines encoder-based retrievers (e.g., dual-encoders, dense retrieval) with decoder-based classifiers.
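A rough sketch of such a pipeline, assuming a sentence-transformers dense retriever and an encoder classifier. The corpus, model names, and top-k value are illustrative, and the classifier would need to be fine-tuned before its predictions mean anything.

```python
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Dense retriever over a toy "precedent" corpus.
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = [
    "Precedent A: the contract was voided because delivery terms were breached.",
    "Precedent B: damages were awarded when the supplier missed the deadline.",
]
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2)

def classify_with_context(text, k=1):
    # 1) retrieve, 2) concatenate retrieved context with the input, 3) classify.
    query_emb = retriever.encode(text, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    context = " ".join(corpus[h["corpus_id"]] for h in hits)
    inputs = tokenizer(context, text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return classifier(**inputs).logits.argmax(-1).item()

print(classify_with_context("The supplier failed to deliver on time."))
```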
Tiny mental model (why they feel different)
Encoder-only models read the whole input at once and hand you a summary vector; decoder-only models write one token at a time; encoder-decoder models read first, then write. That difference is why the same classification task gets wired up so differently across the three families.
Finetuning Approaches for Classification
Once you decide to turn an LLM into a classifier, the next question is: how much of the model should you actually train? There's no single answer; different approaches balance compute cost, memory usage, and performance.
Here are the main strategies I explored (and struggled with) during the competition:
Approach | Technique | How it works | Pros | Cons |
---|---|---|---|---|
Prompting | Zero-shot / few-shot | Ask the model to output a label using a carefully designed prompt. No training needed. | Fast & cheap; no training pipeline | Prompt-brittle; inconsistent on niche data; RAG cannot be used |
Instruction tuning | SFT on labeled instructions | Supervised finetuning on instruction-style examples so the model follows label prompts reliably. | Easy to prototype; stronger than raw prompting | Less control than task-specific heads; may plateau |
PEFT | LoRA / QLoRA / Adapters | Freeze the backbone and train small adapter parameters (LoRA ranks). QLoRA adds 4-bit quantization. | Great cost-to-performance ratio; fits larger backbones on a single GPU | Integration overhead; hyperparameters matter |
Full finetuning | Update all parameters | Train every weight end-to-end with a classifier head or label tokens. | Highest ceiling; maximum specialization | GPU/time expensive; overfitting risk on small data |
My experience with the Kaggle competition
When I entered the LLM Classification Fine-Tuning competition, I wanted to explore how different approaches beyond "just fine-tune a transformer" could help in practice. Over the course of the competition, I tried three distinct strategies:
Test-Time Inference Tricks
Notebook: Link to notebook
Score: 1.101
I first experimented with test-time inference adjustments, where I modified how predictions were aggregated or sampled.
- Methods included temperature scaling, probability smoothing, and different ways of combining logits (sketched below).
- These tweaks gave marginal improvements, but couldn't outperform stronger training-time strategies.
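For reference, the kinds of adjustments I mean look roughly like this; the temperature, clipping epsilon, and logit values are illustrative, not the exact settings from my notebook.

```python
import numpy as np

def scaled_probs(logits, temperature=1.5):
    # T > 1 softens the distribution, T < 1 sharpens it.
    z = logits / temperature
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def smooth(probs, eps=1e-3):
    # Clip away overconfident zeros to avoid log-loss blow-ups on mistakes.
    p = np.clip(probs, eps, 1.0)
    return p / p.sum(axis=-1, keepdims=True)

# Combine logits from multiple models by averaging their calibrated probabilities.
model_logits = [np.array([[2.1, 0.3, -1.0]]), np.array([[1.4, 0.9, -0.2]])]
blended = smooth(np.mean([scaled_probs(l) for l in model_logits], axis=0))
print(blended)
```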
Teacher–Student Distillation
Notebook: Link to notebook
Score: 1.09
My next attempt was to distill a larger teacher model into a smaller student.
- The teacher's softened probability distributions guided the student toward better generalization (see the loss sketch below).
- The student model trained faster and was more lightweight, but suffered a drop in accuracy compared to directly fine-tuning a strong base model.
- Still, it offered insights into trade-offs between efficiency and leaderboard performance.
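The objective behind this is the standard distillation loss: a KL term on temperature-softened distributions mixed with the usual cross-entropy on hard labels. A sketch follows; alpha and T are hyperparameters, not the values I used.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL between temperature-softened teacher and student distributions.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    # Ordinary cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```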
Hybrid: XGBoost + TF-IDF Ranking
Notebook: Link to notebook
Score: 1.07228
For variety, I built a feature-based pipeline: extracting TF-IDF features and training an XGBoost classifier on top (a sketch follows below).
- This was surprisingly competitive on smaller validation splits.
- However, it lacked the robustness and semantic depth of transformer-based models.
- It served as a good sanity check against the neural approaches, and showed how far a "classical" ML method could still go.
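The pipeline itself is short; here is a sketch with toy data and placeholder hyperparameters rather than my exact competition settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# Toy preference texts and labels purely for illustration.
texts = [
    "response a explains the concept clearly; response b misses the question",
    "response a rambles; response b gives a precise, sourced answer",
    "response a cites evidence; response b is vague",
    "response a is off topic; response b answers directly",
]
labels = [0, 1, 0, 1]  # 0 = response A preferred, 1 = response B preferred

clf = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), sublinear_tf=True),
    XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05,
                  eval_metric="logloss"),
)
clf.fit(texts, labels)
print(clf.predict_proba(["response b answers directly with evidence"]))
```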
Lessons from the Leaderboard
While my individual methods had mixed success, I noticed that the best-scoring notebooks on Kaggle all used some form of ensembling. Ensembling multiple models, whether different architectures, seeds, or training strategies, consistently pushed results higher than any single approach.
This was a humbling reminder that in practical ML competitions, combining diverse strengths often beats finding the "perfect" single model.
Here is a table for quick comparison:
Method | Idea | Strengths | Weaknesses |
---|---|---|---|
Test-Time Inference Tricks | Adjust prediction sampling & scaling | Easy to implement, fast | Marginal gains only |
Teacher–Student Distillation | Distill knowledge from larger teacher | Efficient, smaller student models | Accuracy drop vs. direct fine-tuning |
XGBoost + TF-IDF Ranking | Classical ML on top of TF-IDF features | Competitive on small splits, interpretable | Weak semantic understanding, less robust |
Ensembling (observed) | Combine multiple models/strategies | Consistently strong results | More compute, harder to manage |
⨠In the end, this competition wasnât just about climbing the leaderboard for me, it was about experimenting with different paradigms of model training and seeing how they compared. Each approach taught me something unique about trade-offs in LLM fine-tuning, and those lessons are what I carry forward.