2026-06-29

it took four SFT rounds to teach IDK-1 to count to three

IDK-1 is now IDK-1-Instruct. it's on HuggingFace. it took four rounds of supervised fine-tuning to get there.

quick recap if you haven't been following: IDK-1 is a 106M parameter indonesian small language model trained from scratch — LLaMA-style architecture, 40K BPE vocab, GQA, RoPE theta 500K, logit soft-capping. pre-trained on Wikipedia ID + CulturaX ID, ~2.64B tokens, Kaggle free tier. SFT is the next step: teach it to follow instructions instead of just completing text.

the format is ChatML. every training example is a user message and an assistant response wrapped in special tokens. the model learns: when it sees the user's question, generate this kind of answer. 4810 pairs total. written, augmented, and verified across several sessions.

round 1: 1390 pairs, val loss 3.0506. the model understood it was supposed to do something different. it didn't understand what. it followed the format loosely but drifted immediately — wrong topics, noise from the pre-training data bleeding through, no real instruction following.

round 2: 3010 pairs, val loss 2.1709. actual improvement. it started following format. factual questions about indonesia came out mostly correct. count instructions didn't work — 'sebutkan 3 hal' would give you 4, 5, 7 items. it knew it was supposed to list things. it didn't know how to stop.

round 3: 3810 pairs (added 500 count-following pairs + 300 short factuals), val loss 2.0808. count-following got better. not fixed. content was mostly right but the count was still wrong. went from 'completely ignoring count' to 'trying and usually failing'.

round 4: 4810 pairs (added 1000 strict count-following pairs — every single one verified to have the exact right number of items), val loss 1.3670. this is the one that worked. 'sebutkan 3' gives exactly 3. every time. not 2, not 4. the model finally learned that the number in the prompt is a hard constraint.

the jump from 2.0808 to 1.3670 is not subtle. four rounds of continual fine-tuning, each starting from the previous best checkpoint with a lower learning rate (2e-5 → 3e-5 → 1e-5 → 5e-6). the data did the work — not hyperparameter magic.

what works well: count-following instructions, short factual Q&A in indonesian, practical tips, basic conversation. what doesn't: open-ended reasoning on complex topics. ask it about IoT technology or artificial intelligence in depth and it'll drift. this is expected — 106M params with noisy pre-training has a ceiling. SFT can reshape the output format, it can't inject knowledge that was never learned during pre-training.

the model and dataset are live. ripkiiiii/IDK-1-Instruct on HuggingFace, with the 4810-pair SFT dataset at ripkiiiii/IDK-1-Instruct-Data. Apache 2.0, free to use.

next on the roadmap: DPO alignment pass (200-500 preference pairs, no reward model needed), the indonesian mini benchmark (200 questions, 5 categories), and eventually a paper. IDK-1 was always meant to be a full cycle — pre-train, SFT, DPO, eval, publish. we're past the halfway point.

← back