The Problem No One Is Solving
Open Google and search for a restaurant in Pune. You might type:
"best misal pav near me"
But your friend might search:
"मस्त मिसळ पाव कुठे मिळेल Pune मध्ये?"
And your cousin might type:
"pune me accha misal kahan milega"
Three queries. The same intent. Three completely different surface forms (English, Marathi, Hinglish). And most search systems fail on two of the three.
This is the reality of search in India. 500 million+ people code-switch between their native language and English in the same sentence. They don't think about it — it's just how they communicate.
Yet almost every search system, every embedding model, every NLP tool assumes text is in one language. This is broken.
What I Built
I trained and open-sourced two models to fix this:
Marathlish MiniLM
A semantic search model for Marathi-English code-mixed text. Fine-tuned on curated pairs of mixed-language queries.
Hinglish MiniLM
A semantic search model for Hindi-English (Hinglish) code-mixed text. Handles both Devanagari and Romanized Hindi.
Both are based on all-MiniLM-L6-v2 — lightweight enough to run on a CPU, strong enough for production search.
Why MiniLM?
I specifically chose MiniLM over larger models because:
- Speed: 384-dimensional embeddings encode in ~5ms per query
- Size: 80MB model file — deploys anywhere
- Cost: Runs on CPU — no GPU bill
- Quality: MiniLM-L6-v2 is already one of the best general-purpose embedding models
The goal was never to build the biggest model. It was to build the most useful one.
The Training Process
Step 1: Data Collection
This was the hardest part. There's no existing dataset of Marathi-English code-mixed text pairs with semantic similarity labels. I had to create one.
Sources:
- Social media posts (Twitter, Reddit)
- Forum discussions in regional subreddits
- Customer support conversations
- Manually written query pairs
For each source, I created positive pairs (semantically similar) and hard negatives (lexically similar but semantically different).
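As a concrete illustration of that pair structure (the rows below are invented examples in the spirit of the queries above, not entries from the actual dataset): a positive pair is semantically similar despite different scripts or languages, while a hard negative shares surface words with the anchor but carries a different intent.

```python
# Hypothetical training rows illustrating positive pairs vs. hard negatives.
# Positive pairs: same intent, different language/script.
positive_pairs = [
    ("pune me accha misal kahan milega",   # Romanized Hindi
     "best misal pav near me"),            # English, same intent
    ("माझा phone कुठे आहे?",                 # Marathi-English mix
     "where is my phone"),
]

# Hard negatives: lexically close to the anchor, semantically different
# (looking for a restaurant vs. looking for a recipe).
hard_negatives = [
    ("best misal pav near me",
     "best recipe for misal pav at home"),
]

print(len(positive_pairs), len(hard_negatives))  # -> 2 1
```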
Step 2: Data Cleaning
Code-mixed text is messy. The same word can be spelled five ways:
- "अच्छा" (Devanagari for "good") / "accha" / "acha" / "achha" / "achaa"
I built normalization pipelines but was careful not to over-normalize — preserving the natural variation is part of what makes the model robust.
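The post doesn't show the pipeline itself, but a minimal sketch of this kind of light-touch normalization might collapse only casing and exaggerated character repetition, deliberately leaving ordinary spelling variants untouched:

```python
import re

def light_normalize(text: str) -> str:
    """Lowercase and collapse runs of 3+ repeated characters down to 2.

    Deliberately conservative: "achaa" stays "achaa" (distinct from "acha"),
    so natural spelling variation survives into training.
    """
    text = text.lower().strip()
    # Collapse e.g. "achhhha" -> "achha", but leave double letters alone
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(light_normalize("Achhhha"))  # -> "achha"
print(light_normalize("achaa"))   # -> "achaa" (variant preserved)
```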
Step 3: Fine-Tuning
Used Sentence Transformers with MultipleNegativesRankingLoss:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# train_examples: a list of sentence_transformers InputExample pairs
# (anchor, positive). MNRL treats the other in-batch positives as negatives.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=100,
)
```

The key insight: you don't need millions of examples. With high-quality pairs and hard negatives, even a few thousand examples produce strong results.
Step 4: Evaluation
I evaluated on a held-out set of real user queries:
| Model | Accuracy (Top-5) | Avg Latency |
|---|---|---|
| all-MiniLM-L6-v2 (base) | 62% | 4ms |
| Marathlish MiniLM | 84% | 5ms |
| mBERT | 71% | 28ms |
The fine-tuned model significantly outperforms both the base model and multilingual BERT, while being 5x faster than mBERT.
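The evaluation script isn't included in the post, but a Top-5 accuracy check of the kind reported above can be sketched as follows, assuming each held-out query has one known-relevant document (the function and variable names here are illustrative, not from the original code):

```python
import numpy as np

def top5_accuracy(query_embs: np.ndarray, doc_embs: np.ndarray,
                  relevant_idx: list) -> float:
    """Fraction of queries whose relevant doc ranks in the top 5 by cosine similarity."""
    # L2-normalize rows so the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                           # (num_queries, num_docs)
    top5 = np.argsort(-sims, axis=1)[:, :5]  # indices of the 5 most similar docs
    hits = [rel in row for rel, row in zip(relevant_idx, top5)]
    return sum(hits) / len(hits)

# Sanity check with toy embeddings: each query IS its own relevant doc
rng = np.random.default_rng(0)
docs = rng.normal(size=(20, 8))
print(top5_accuracy(docs, docs, list(range(20))))  # -> 1.0
```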
How To Use Them

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("anuragwagh0/marathlish-minilm")

queries = [
    "माझा phone कुठे आहे?",   # Marathi-English mix: "where is my phone?"
    "where is my phone",
    "mera phone kahan hai",   # Romanized Hindi: "where is my phone?"
]

embeddings = model.encode(queries)

# All three queries will have high pairwise cosine similarity
print(util.cos_sim(embeddings, embeddings))
```

Real-World Applications
These models unlock use cases that were previously impractical:
- Regional e-commerce search: Let users search in their natural language
- Customer support routing: Understand mixed-language support tickets
- Content recommendation: Match content across language boundaries
- RAG systems: Combine with Sarvam-M to build truly multilingual retrieval-augmented generation
I wrote a full codelab showing how to combine these embedding models with Sarvam-M for a complete multilingual RAG pipeline.
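Independent of the codelab, the retrieval half of such a pipeline reduces to nearest-neighbor search over document embeddings. A model-agnostic sketch (the toy vectors stand in for `model.encode()` output; the retrieved passages would then be passed to the LLM as context):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, doc_embs, k=3):
    """Return indices of the k documents most similar to the query embedding."""
    ranked = sorted(range(len(doc_embs)),
                    key=lambda i: cosine(query_emb, doc_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d vectors standing in for real 384-d MiniLM embeddings
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.05]
print(retrieve(query, docs, k=2))  # -> [0, 1]
```

In production you would replace the linear scan with a vector index (e.g. FAISS) once the corpus grows, but the ranking logic is the same.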
What's Next
I'm working on:
- Expanding to more Indic languages (Tamil-English, Bengali-English)
- Fine-tuning for specific domains (e-commerce, healthcare)
- Building a benchmark for code-mixed semantic similarity
The models are open source. Use them, break them, improve them.
Both models are available on Hugging Face: Marathlish MiniLM | Hinglish MiniLM