Architecture Decisions

Recommendation System Patterns That Work at Scale (And What Startups Can Steal)

After years of building recommendation systems serving millions of users, here are the five architectural patterns that consistently drive results — distilled for teams with 1% of the resources.

Manmeet Singh · Founder, AIshar Labs · Ex-Apple, Ex-Instacart, Ex-Adobe
15 min read · April 2026

I've spent a significant part of my career building recommendation systems — recipe discovery, product search, feed ranking, autocomplete — at companies serving millions of users. I've filed multiple patents covering ML approaches in this space and seen firsthand what works in production versus what only works in research papers.

I'm not going to describe any specific company's proprietary systems. What I will share are the general architectural patterns that consistently drive results in production recommendation systems. These patterns aren't secrets — they're established ML techniques published in academic literature, used across the industry. But they're dramatically underused by startups that could benefit enormously from them.

Here are the five patterns that matter most, distilled for teams with smaller datasets, smaller budgets, and smaller engineering teams.

Pattern 1: Embeddings are your foundation

The single most impactful architectural decision in any recommendation system is whether to represent your items, users, and queries as embeddings — dense vector representations that capture semantic meaning.

The approach works like this: you train a model to produce vectors for your items (products, recipes, articles, whatever you're recommending) such that similar items are close together in vector space. Then recommendation becomes a nearest-neighbor search: given a user's history or query, find the items closest to what they're looking for.

This sounds straightforward, and the basic version is. But the difference between a mediocre embedding-based system and a great one lies in three decisions:

What "similar" means. Do you define similarity by content (items that look alike), behavior (items that users interact with together), or both? Pure content similarity gives you recommendations that feel redundant — "you bought milk, here's more milk." Pure behavioral similarity gives you recommendations that feel random — "people who bought milk also bought batteries." The best systems combine both signals, typically through multi-task learning where the embedding model is trained on both content features and behavioral signals simultaneously.

How you train. The training data isn't your item catalog — it's your users' behavior. What did they search for, what did they click, what did they buy, what did they skip? The implicit signals in user behavior are more valuable than any explicit item metadata. The model learns that items A and B are similar not because their descriptions match, but because users who engaged with A also engaged with B.

How you retrieve. Once you have embeddings, you need fast nearest-neighbor search. At startup scale, exact search works fine — you can brute-force search through 100K items in milliseconds. At larger scale, you need approximate nearest-neighbor (ANN) algorithms that trade a small amount of accuracy for dramatic speed improvements. Libraries like FAISS, ScaNN, or vector databases like Pinecone and Weaviate handle this well. The approach is well-documented in academic literature — the original papers on locality-sensitive hashing and product quantization date back over a decade.

What startups can steal: You don't need massive datasets to use embeddings. Start with a pre-trained embedding model (sentence-transformers is a good starting point for text-based items), fine-tune it on whatever behavioral data you have (even a few thousand interactions), and build a simple nearest-neighbor retrieval system. The first version of this takes days to build, not months, and it will outperform any rule-based recommendation system.
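To make the retrieval step concrete, here is a minimal brute-force nearest-neighbor sketch in NumPy. The random vectors stand in for embeddings a trained model (e.g. a fine-tuned sentence-transformers model) would produce, and the catalog size, dimensions, and item indices are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors stand in for embeddings a trained model would produce
# (hypothetical catalog: 10,000 items, 128-dim vectors).
item_embeddings = rng.normal(size=(10_000, 128)).astype(np.float32)
# Normalize rows so a dot product equals cosine similarity.
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def retrieve(query_vec: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact brute-force nearest-neighbor search: one matrix-vector product."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = item_embeddings @ q
    return np.argsort(-scores)[:k]  # indices of the k most similar items

# A simple "taste vector": the mean embedding of items the user engaged with.
user_history = item_embeddings[[12, 345, 6789]].mean(axis=0)
top_items = retrieve(user_history, k=10)
```

At this scale the whole search is a single matrix-vector product, which is why exact search is fine until your catalog grows into the millions.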

Pattern 2: Two-stage retrieval and ranking

A recommendation request needs to evaluate potentially millions of items and return the top 10-20. Doing this in a single pass is computationally infeasible for any model complex enough to be accurate. The standard solution — described extensively in academic research and in industry papers from teams at Google, YouTube, Pinterest, and others — is a two-stage approach:

Stage 1 — Retrieval: A fast, lightweight model narrows millions of candidates to hundreds. This is where embeddings shine. The retrieval model is optimized for recall — it should surface every potentially relevant item, even at the cost of including some irrelevant ones. Speed is critical here; the retrieval step needs to complete in single-digit milliseconds.

Stage 2 — Ranking: A more sophisticated model re-ranks the hundreds of candidates from retrieval down to the final top results. This model can afford to be slower and more complex because it's only scoring hundreds of items, not millions. The ranking model is optimized for precision — it needs to get the top results exactly right.

This architecture, originally popularized in the YouTube recommendations paper (Covington et al., 2016), lets you use simple, fast models for retrieval and complex, accurate models for ranking. Each stage is optimized for its specific job rather than trying to do everything at once.

What startups can steal: Even at small scale, the two-stage pattern improves results. Your retrieval stage can be an embedding-based nearest-neighbor search. Your ranking stage can be a simple model that scores based on features the retrieval stage doesn't consider — recency, diversity, user-specific preferences. The ranking model can be as simple as a gradient-boosted tree with a handful of features. The architectural pattern matters more than the model complexity.
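The two stages can be sketched in a few lines. Everything here is illustrative: random vectors stand in for learned embeddings, a recency feature stands in for whatever features your ranker would use, and a linear blend stands in for a gradient-boosted tree:

```python
import numpy as np

rng = np.random.default_rng(1)

N_ITEMS, DIM = 5_000, 32
item_vecs = rng.normal(size=(N_ITEMS, DIM)).astype(np.float32)
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)
item_recency = rng.random(N_ITEMS)  # hypothetical feature: 1.0 = newest

def retrieve(user_vec, k=200):
    """Stage 1: cheap and recall-oriented -- narrow thousands to hundreds."""
    scores = item_vecs @ (user_vec / np.linalg.norm(user_vec))
    return np.argsort(-scores)[:k]

def rank(candidates, user_vec, k=10):
    """Stage 2: precision-oriented -- re-score the short list with richer
    features. A linear blend stands in for a gradient-boosted tree."""
    sim = item_vecs[candidates] @ (user_vec / np.linalg.norm(user_vec))
    score = 0.8 * sim + 0.2 * item_recency[candidates]  # illustrative weights
    return candidates[np.argsort(-score)[:k]]

user_vec = rng.normal(size=DIM).astype(np.float32)
candidates = retrieve(user_vec)     # hundreds of candidates, one matrix product
final = rank(candidates, user_vec)  # top 10 after feature-aware re-ranking
```

The ranking stage only ever touches the candidate set, which is what lets it afford features the retrieval stage cannot.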

Pattern 3: Contextual bandits for exploration

The cold-start problem — how do you recommend new items that have no behavioral data? — is one of the hardest challenges in recommendation systems. If you only recommend items with strong historical data, you create a feedback loop where popular items get more popular and new items never get discovered. This is the "exploitation-exploration tradeoff," a fundamental problem in decision theory studied for decades.

Contextual bandits provide an elegant solution. Instead of always showing the items the model thinks are best (exploitation), you occasionally show items the model is uncertain about (exploration). The system tracks whether users engage with these exploration items, and uses that data to improve its understanding of the item.

The key insight — well-established in the bandit literature (see Li et al., 2010 on LinUCB, and Agrawal & Goyal, 2013 on Thompson Sampling) — is that exploration should be context-dependent. You don't explore randomly — you explore items that the model believes could be relevant given the user's context, but doesn't have enough data to be confident. This is "optimism under uncertainty": the system gives the benefit of the doubt to items it hasn't learned about yet.

What startups can steal: You don't need a sophisticated bandit algorithm to get started. Begin with epsilon-greedy: 90% of the time, show the model's best recommendations. 10% of the time, show items that are plausibly relevant but underexposed. Track the results. Adjust the epsilon over time. This simple approach — which you can implement in an afternoon — solves the majority of cold-start problems and takes you surprisingly far before you need anything more complex.
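The epsilon-greedy version really is an afternoon of work. A minimal sketch, with invented item lists and an illustrative way of mixing explored items into the slate:

```python
import random

random.seed(7)
EPSILON = 0.1  # exploration rate from the text: 10% of requests

def recommend(best_items, underexposed_items, k=10):
    """90% of the time serve the model's best list; 10% of the time swap
    a few underexposed items into the slate so they can accumulate data."""
    if random.random() < EPSILON:
        explore = random.sample(underexposed_items, 3)
        return explore + best_items[: k - 3]
    return best_items[:k]

best = [f"item_{i}" for i in range(20)]      # model's ranked list (invented)
cold = [f"new_item_{i}" for i in range(50)]  # items with little data (invented)
slate = recommend(best, cold)
```

Log which slate each user saw alongside their engagement, and you have the data both to evaluate epsilon and to feed back into the model.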

Pattern 4: Personalization is a spectrum, not a switch

"Personalized recommendations" doesn't mean "every user sees a completely different result." In practice, personalization exists on a spectrum, and the most effective systems I've worked on use all levels of that spectrum:

Population-level: The same recommendations for everyone, based on aggregate popularity. This is your baseline and it's surprisingly effective. Popular items are popular for a reason. Research consistently shows that simple popularity-based recommendations outperform poorly-tuned ML models.

Segment-level: Different recommendations for different user segments. New users see popular items. Returning users see items related to their history. High-value users see premium items. This requires no ML — just business logic and basic analytics.

Individual-level: Recommendations tailored to each user's specific history, preferences, and context. This is where embeddings and ML models earn their keep — but only if you have enough behavioral data per user to make individual predictions meaningful.

The mistake startups make is jumping to individual-level personalization before they've exhausted the value of population-level and segment-level approaches. A well-tuned popularity-based system with segment-level adjustments often outperforms a poorly-tuned ML personalization system, and it costs almost nothing to maintain.

What startups can steal: Start with popularity (what's trending globally), add segmentation (new vs. returning users, high-intent vs. browsing), then add ML personalization. Measure the incremental value at each step. If segment-level personalization gets you 80% of the way there, the marginal value of individual-level ML personalization might not justify the engineering cost at your current scale.
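Segment-level personalization really is just business logic. A hypothetical router, where each input list comes from basic analytics rather than a model:

```python
# Hypothetical segment-level router: no ML, just rules over analytics.
def recommendations_for(user, trending, related_to_history, premium):
    """Pick a strategy by segment; every input list comes from simple
    analytics queries (top sellers, co-viewed items, premium catalog)."""
    if user.get("num_sessions", 0) == 0:
        return trending                       # new users: pure popularity
    if user.get("is_high_value"):
        return premium + trending[:5]         # high-value users: premium first
    return related_to_history + trending[:3]  # returning: history + popular

user = {"num_sessions": 12, "is_high_value": False}
recs = recommendations_for(
    user,
    trending=["t1", "t2", "t3", "t4", "t5"],
    related_to_history=["h1", "h2"],
    premium=["p1"],
)
```

Each branch is independently measurable, which sets up exactly the incremental-value comparison described above.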

Pattern 5: Experimentation infrastructure is non-negotiable

You can't improve what you can't measure, and in recommendation systems, intuition is unreliable. I've seen changes that "obviously" improve recommendations decrease engagement. I've seen changes that seem trivial drive significant business impact. The only way to know is to test.

Every recommendation system needs the ability to A/B test changes and measure their impact on the metrics that matter. Not just click-through rate — but downstream metrics like conversion, retention, and revenue per user. Click-through rate is a vanity metric for recommendations; it tells you people clicked, not that they found value.

The academic literature on online experimentation is extensive (see Kohavi et al., "Trustworthy Online Controlled Experiments"), and the principles apply regardless of scale.

What startups can steal: You don't need a sophisticated experimentation platform. A feature flag system that lets you show different recommendation logic to different user segments, combined with your existing analytics, is enough to start. The key is the discipline to test changes rather than shipping them based on intuition.
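The core of a minimal feature-flag system is deterministic assignment: hash the user into a bucket so the same user always sees the same variant, and your existing analytics can segment on the arm. A sketch (function and experiment names are made up):

```python
import hashlib

def bucket(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically assign a user to control or treatment: hash the
    (experiment, user) pair into 0..99 and compare against the split.
    Keying on the experiment name de-correlates arms across experiments."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < treatment_pct else "control"

# Same user always lands in the same arm, so metrics stay consistent.
arm = bucket("user_123", "new_ranker_v2")
```

Route the recommendation logic on the returned arm, tag every logged event with it, and you can read out the experiment from the analytics you already have.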

Even at 1,000 daily active users, you can run meaningful A/B tests if your primary metric has enough variance. Run a power analysis (there are free online calculators) to determine how long your test needs to run. The math often works out to smaller sample sizes than people assume. You won't detect a 1% improvement, but you will detect a 10% improvement — and at the startup stage, the changes that matter are the big ones anyway.
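The power analysis itself is a few lines if you'd rather not use an online calculator. This uses the standard two-proportion z-test approximation; the 5% base rate and the lift values are illustrative:

```python
from statistics import NormalDist

def sample_size_per_arm(base_rate, min_rel_lift, alpha=0.05, power=0.8):
    """Approximate users per arm needed to detect a relative lift in a
    conversion rate with a two-sided z-test (standard approximation)."""
    p1 = base_rate
    p2 = base_rate * (1 + min_rel_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = ((z_a + z_b) ** 2 * 2 * p_bar * (1 - p_bar)) / (p2 - p1) ** 2
    return int(n) + 1

# Detecting a 10% relative lift on a 5% conversion rate takes roughly
# 100x fewer users per arm than detecting a 1% lift.
n_big_effect = sample_size_per_arm(0.05, 0.10)
n_small_effect = sample_size_per_arm(0.05, 0.01)
```

The quadratic dependence on effect size is exactly why big changes are testable at startup scale and 1% tweaks are not.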

Start with what you have

These five patterns — embeddings, two-stage retrieval, exploration, progressive personalization, and measurement — represent the architectural foundation of production recommendation systems across the industry. They're described in published papers, taught in ML courses, and used by companies of every size.

A startup with 10,000 users and 5,000 items can build a production recommendation system using these patterns in two to three weeks. It won't be as sophisticated as what you'd find at a company with hundreds of ML engineers — but it will be architecturally sound, and it will improve as your data and resources grow.

The companies that build great recommendation systems aren't the ones with the most data or the biggest ML teams. They're the ones that get the architecture right from the start and iterate relentlessly based on measurements. These patterns give you that foundation.

Want help with your AI stack?

If this post matches problems you're seeing, we can map the fastest path from architecture decisions to production outcomes.

Architecture Decisions · AI Engineering · Production ML
Manmeet Singh

Founder & CEO, AIshar Labs · Ex-Apple, Ex-Instacart · 15 AI Patents

Built ML systems at Apple (Search: Maps, Safari, Spotlight) and Instacart (Search, Recommendations, Ranking). Writes about production AI tradeoffs and system design.

