Long Horizon · Substrate

TamilLM

The Tamil substrate. A bilingual foundation model on Mamba 2/3 - built so that everything else Tamil at Murai Labs has a real model to stand on.

Why this exists

Tamil has 80M+ speakers worldwide and a literary tradition going back 2,000 years. It also has no foundation model of its own.

Every "Tamil AI" today is an English-thinking model in Tamil clothing - bolted on, lossy, and culturally tone-deaf. TamilLM is the substrate I'm building so that this stops being true.

Civilizational memory deserves better infrastructure than a translation API.

Approach

Why Mamba 2/3, not Transformers.

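Inference cost grows linearly with sequence length and decoding uses constant memory, which suits Tamil's agglutinative morphology and long classical documents. Transformer attention grows quadratically with sequence length, and its key-value cache keeps growing during decoding.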
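To make the memory argument concrete, here is a minimal sketch. It is a toy diagonal linear recurrence, not the Mamba 2/3 kernel, and every name and default in it (ssm_decode, attention_cache_entries, a, b, c, state_dim) is a hypothetical chosen for illustration. It shows a recurrent state that stays the same size at every step, next to a count of the key and value entries an attention cache would hold after the same number of tokens.

```python
# Illustrative only: a toy recurrence, not TamilLM's actual architecture.

def ssm_decode(inputs, a=0.9, b=0.1, c=1.0, state_dim=4):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * sum(h_t), one token at a time.

    Returns the last output and the size of the state. The state is
    overwritten at every step, so its size does not depend on input length.
    """
    h = [0.0] * state_dim
    y = 0.0
    for x in inputs:
        h = [a * h_i + b * x for h_i in h]  # fixed-size state update
        y = c * sum(h)
    return y, len(h)


def attention_cache_entries(num_tokens, num_layers, num_heads, head_dim):
    """Count the key and value scalars cached after num_tokens decoding steps."""
    return 2 * num_tokens * num_layers * num_heads * head_dim


if __name__ == "__main__":
    for n in (1_000, 100_000):
        _, state_size = ssm_decode([1.0] * n)
        cache = attention_cache_entries(n, num_layers=2, num_heads=2, head_dim=8)
        print(f"tokens={n} recurrent_state={state_size} attention_cache={cache}")
```

The recurrent state stays at four values for both input lengths, while the cache count grows in proportion to the number of tokens. Real models add many layers and much larger states, but the shape of the comparison is the same.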

Why native tokenization matters.

A 32K-vocabulary tokenizer designed for Tamil orthography across registers: classical, devotional, literary, formal, colloquial, Tanglish, and English. Existing English-biased BPE tokenizers fragment Tamil word formation.
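A rough sketch of why byte-oriented vocabularies are a poor fit. It is not the TamilLM tokenizer: the helper orthographic_units is hypothetical, and its rule (attach every combining mark to the preceding base character) is a simplification of full Unicode grapheme segmentation that does not handle conjuncts. It compares the UTF-8 byte count, the code point count, and the number of written units for one Tamil word.

```python
import unicodedata


def orthographic_units(text):
    """Group each base character with the combining marks that follow it.

    Simplified assumption: any character whose Unicode category starts
    with "M" (vowel signs, the pulli) belongs to the previous unit.
    """
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# The word "Tamil" in Tamil script, written with escapes:
# TA, MA, vowel sign I, LLLA, pulli.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

if __name__ == "__main__":
    units = orthographic_units(word)
    print("utf-8 bytes:", len(word.encode("utf-8")))
    print("code points:", len(word))
    print("orthographic units:", len(units))
```

The word has three written units, five code points, and fifteen UTF-8 bytes. A vocabulary whose merges were learned mostly from English text tends to leave such words close to the byte level, so token boundaries rarely line up with the units a Tamil reader sees.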

Why edge-first.

A Tamil model behind an API isn't cultural infrastructure. Quantized for Jetson Orin and Thor, it can live in homes, schools, and institutions - locally, offline, with no per-token cost.
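As a back-of-the-envelope illustration of why quantization matters on edge devices, the sketch below estimates weight storage at different precisions. The function name weight_footprint_gib is hypothetical, and the 1B figure assumes the 1B scaling target refers to parameter count. The estimate covers weights only and ignores activations, recurrent state, quantization scales, and runtime overhead.

```python
def weight_footprint_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB for a given parameter count and precision."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    params = 1_000_000_000  # assumed: the 1B target read as parameters
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_footprint_gib(params, bits):.2f} GiB")
```

Going from 16-bit to 4-bit weights cuts the weight footprint by a factor of four, which is the kind of headroom that decides whether a model fits comfortably beside other workloads on an embedded board.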

NVIDIA stack

NeMo Framework for training
TensorRT-LLM for inference
Triton Inference Server
RAPIDS for corpus preprocessing
DGX Spark and RTX-class pre-training
Jetson Orin and Thor for edge deployment
NIM microservices for packaged endpoints

Status

167M curated Tamil tokens in the corpus so far.
Models trained at 100M and 300M scale on RTX 5090.
Mamba 2/3 architecture comparison work complete.
Scaling to 1B parameters, with a target corpus of 200B+ tokens.

Related work

UYIR - evolutionary LoRA adapter research, foundation of the MARMAM direction.