The Problem No One Is Solving

Open Google and search for a restaurant in Pune. You might type:

"best misal pav near me"

But your friend might search:

"मस्त मिसळ पाव कुठे मिळेल Pune मध्ये?" ("where can I get great misal pav in Pune?")

And your cousin might type:

"pune me accha misal kahan milega" ("where will I get good misal in Pune?")

Three queries. Same intent. Three completely different ways of writing it: English, Marathi in Devanagari, and Romanized Hinglish. And most search systems fail at two of the three.

This is the reality of search in India. 500 million+ people code-switch between their native language and English in the same sentence. They don't think about it — it's just how they communicate.

Yet almost every search system, every embedding model, every NLP tool assumes text is in one language. This is broken.

What I Built

I trained and open-sourced two models to fix this:

Marathlish MiniLM

A semantic search model for Marathi-English code-mixed text. Fine-tuned on curated pairs of mixed-language queries.

Hinglish MiniLM

A semantic search model for Hindi-English (Hinglish) code-mixed text. Handles both Devanagari and Romanized Hindi.

Both are based on all-MiniLM-L6-v2 — lightweight enough to run on a CPU, strong enough for production search.

Why MiniLM?

I specifically chose MiniLM over larger models because:

  1. Speed: 384-dimensional embeddings encode in ~5ms per query
  2. Size: 80MB model file — deploys anywhere
  3. Cost: Runs on CPU — no GPU bill
  4. Quality: MiniLM-L6-v2 is already one of the best general-purpose embedding models

The goal was never to build the biggest model. It was to build the most useful one.

The Training Process

Step 1: Data Collection

This was the hardest part. There's no existing dataset of Marathi-English code-mixed text pairs with semantic similarity labels. I had to create one.

Sources:

  • Social media posts (Twitter, Reddit)
  • Forum discussions in regional subreddits
  • Customer support conversations
  • Manually written query pairs

For each source, I created positive pairs (semantically similar) and hard negatives (lexically similar but semantically different).
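The resulting records can be pictured like this (the examples below are illustrative, not from the actual dataset): each anchor gets a positive that shares its meaning and a hard negative that shares its words.

```python
# Illustrative (made-up) training records. A hard negative is lexically
# close to the anchor but carries a different intent.
training_pairs = [
    {
        "anchor": "pune me accha misal kahan milega",
        "positive": "best misal pav near me",
        "hard_negative": "misal pav recipe in marathi",  # same words, recipe intent
    },
    {
        "anchor": "mera phone kahan hai",
        "positive": "where is my phone",
        "hard_negative": "naya phone kharidna hai",  # "want to buy a new phone"
    },
]

print(len(training_pairs), training_pairs[0]["positive"])
```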

Step 2: Data Cleaning

Code-mixed text is messy. The same word can be spelled five ways:

  • "अच्छा" / "accha" / "acha" / "achha" / "achaa"

I built normalization pipelines but was careful not to over-normalize — preserving the natural variation is part of what makes the model robust.
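The actual normalization rules aren't reproduced here, but as an illustration, a deliberately light pipeline might collapse only repeated letters on the Romanized side, leaving everything else untouched:

```python
import re

def light_normalize(token: str) -> str:
    """Hypothetical light normalization: lowercase and collapse repeated
    letters in Romanized text. Devanagari text passes through unchanged.
    Deliberately does NOT try to unify every spelling."""
    t = token.lower()
    t = re.sub(r"(.)\1+", r"\1", t)  # "achaa" -> "acha", "achha" -> "acha"
    return t

variants = ["accha", "acha", "achha", "achaa"]
print({light_normalize(v) for v in variants})
```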

Step 3: Fine-Tuning

Used Sentence Transformers with MultipleNegativesRankingLoss:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Each example pairs a query with a semantically similar text (illustrative
# pairs shown); the loss treats every other positive in the batch as a
# negative, so curation quality matters more than raw volume.
train_examples = [
    InputExample(texts=["pune me accha misal kahan milega",
                        "best misal pav near me"]),
    # ... a few thousand curated pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

model = SentenceTransformer("all-MiniLM-L6-v2")
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=100,
)

The key insight: you don't need millions of examples. With high-quality pairs and hard negatives, even a few thousand examples produce strong results.
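Why does a small, well-curated set go so far? MultipleNegativesRankingLoss scores each query against every positive in the batch, so every batch member doubles as a negative for all the others. A simplified numpy sketch of the loss (the real implementation runs in torch with a numerically stable log-softmax):

```python
import numpy as np

def mnr_loss(query_embs, pos_embs, scale=20.0):
    """Simplified MultipleNegativesRankingLoss: each query's own positive
    (the diagonal) must out-score every other in-batch positive, via
    cross-entropy over scaled cosine similarities."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pos_embs / np.linalg.norm(pos_embs, axis=1, keepdims=True)
    scores = scale * (q @ p.T)  # (batch, batch) similarity matrix
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Perfectly separated pairs -> near-zero loss
print(mnr_loss(np.eye(4), np.eye(4)))
```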

Step 4: Evaluation

I evaluated on a held-out set of real user queries:

Model | Top-5 Accuracy | Avg Latency
all-MiniLM-L6-v2 (base) | 62% | 4ms
Marathlish MiniLM | 84% | 5ms
mBERT | 71% | 28ms

The fine-tuned model significantly outperforms both the base model and multilingual BERT, while encoding more than 5x faster than mBERT.
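A Top-5 retrieval accuracy like the one above can be computed with nothing but cosine similarity; a sketch (the evaluation set itself isn't shown here, so the function takes precomputed embeddings and a query-to-relevant-document mapping):

```python
import numpy as np

def top_k_accuracy(query_embs, doc_embs, relevant_idx, k=5):
    """Fraction of queries whose relevant document appears among the
    top-k most cosine-similar documents."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                            # (queries, docs)
    topk = np.argsort(-sims, axis=1)[:, :k]   # best-first document indices
    hits = [rel in row for rel, row in zip(relevant_idx, topk)]
    return sum(hits) / len(hits)

# Toy check: each query's embedding matches exactly one document.
print(top_k_accuracy(np.eye(3), np.eye(3), [0, 1, 2], k=1))
```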

How To Use Them

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("anuragwagh0/marathlish-minilm")

queries = [
    "माझा phone कुठे आहे?",   # "where is my phone?" — Marathi + English
    "where is my phone",
    "mera phone kahan hai"
]

embeddings = model.encode(queries)
print(util.cos_sim(embeddings, embeddings))
# All three queries will have high cosine similarity ✓

Real-World Applications

These models unlock use cases that were previously impractical:

  • Regional e-commerce search: Let users search in their natural language
  • Customer support routing: Understand mixed-language support tickets
  • Content recommendation: Match content across language boundaries
  • RAG systems: Combine with Sarvam-M for truly multilingual retrieval-augmented generation

I wrote a full codelab showing how to combine these embedding models with Sarvam-M for a complete multilingual RAG pipeline.

What's Next

I'm working on:

  • Expanding to more Indic languages (Tamil-English, Bengali-English)
  • Fine-tuning for specific domains (e-commerce, healthcare)
  • Building a benchmark for code-mixed semantic similarity

The models are open source. Use them, break them, improve them.


Both models are available on Hugging Face: Marathlish MiniLM | Hinglish MiniLM