A Guide to Smarter Ranking Systems

Which product would you trust more? One with a 5-star rating but only one review, or one with a 4.8-star rating and 1000+ reviews? Most would prefer the latter. Yet many applications still rank the former first.

So how would you rank a list of items so that the most relevant appear first?

Common Mistakes

In 2009, Evan Miller published a great article on the subject, How Not To Sort By Average Rating, describing the pitfalls of what may seem like an easy problem at first. He gave two examples of common mistakes:

Ranking by average (positive ratings divided by total)

5 ★★★★★ (1) | 4.8 ★★★★★ (1024)

Ranking by average alone is misleading because items with only a few perfect ratings can outrank popular items with thousands of ratings and slightly lower averages. Small sample sizes make those high averages statistically unreliable and leave them prone to bot manipulation.

Ranking by difference (positive ratings minus negative ratings)

1024 ▲ 800 ▼ | 120 ▲ 0 ▼

Ranking by difference score is misleading because items with many ratings tend to rank higher simply due to volume, even if their like-to-dislike ratio is mediocre. This favors click-baity and controversial content.


Almost two decades later, these mistakes unfortunately remain common.

Let’s revisit the problem and explore better solutions with a fresh 2026 perspective.

Adjusting for Uncertainty: Wilson Score

This is the solution originally proposed by Miller.

The lower bound of Wilson score confidence interval introduces the idea of scoring confidence. Items with fewer votes get pulled down because we’re less certain about them. Items with many votes stabilize near their average.

In other words, the more samples n we have, the more confident we get that the average is trustworthy, as shown in the following graph comparing different sample sizes:

Wilson Score Lower Bound

The bigger the sample size, the closer we get to certainty. Also note that the more the ratings agree with one another (either all ratings are negative, or all ratings are positive), the higher the confidence gets.

The lower bound of Wilson score confidence interval is given by:

\frac{1}{1 + \frac{z^2}{n}} \left( \hat{p} + \frac{z^2}{2n} - \frac{z}{2n}\sqrt{4n\hat{p}(1-\hat{p}) + z^2} \right)

where:

- p̂ is the observed fraction of positive ratings
- n is the total number of ratings
- z is the z-score for the chosen confidence level (1.96 for 95%)

In plain terms, this formula says “we’re 95% confident the true rating is at least this high.”
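
Here's the formula as a minimal Python sketch (the function name and the 95% default are my own choices, not Miller's):

```python
import math

def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a fraction of positive votes.

    positive -- number of positive ratings (e.g. upvotes)
    total    -- total number of ratings
    z        -- z-score for the confidence level (1.96 ~= 95%)
    """
    if total == 0:
        return 0.0  # no data, no confidence
    n, p_hat = total, positive / total
    return (
        p_hat
        + z**2 / (2 * n)
        - z / (2 * n) * math.sqrt(4 * n * p_hat * (1 - p_hat) + z**2)
    ) / (1 + z**2 / n)

# Revisiting the intro example:
print(wilson_lower_bound(1, 1))       # ~0.21: one perfect vote proves little
print(wilson_lower_bound(973, 1024))  # ~0.94: ~95% positive over 1024 votes
```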

When to use it: Binary feedback, e.g. Reddit-style upvote/downvote systems.

Bayesian Averaging (Star Ratings)

The Wilson score works great for binary ratings, but what about 5-star ratings?

Bayesian averaging achieves a similar effect by taking prior knowledge into account. The idea is that we know what the typical rating should be on average by looking at our entire dataset of items; we use this global average to mitigate the cold-start problem of not having enough ratings to accurately assess the true value of our item. We artificially pull our item average toward that prior mean until we amass enough ratings.

The Bayesian average is given by:

\frac{w \cdot \mu + \sum_{i=1}^n r_i}{w + n}

where:

- μ is the prior mean, i.e. the global average rating across all items
- w is the weight of the prior, expressed as a number of "virtual" ratings
- r_i are the item's individual ratings
- n is the item's number of ratings

The following graph compares the Bayesian average to a simple average as the sample size increases (both use the same ratings data):

Bayesian average

The simple average wildly fluctuates when there are few ratings before it stabilizes around a 3.0 rating. The Bayesian average smooths out the spikes and gravitates toward the prior mean before eventually blending with the simple average.

Choosing w is more art than science. Too low, and newcomers dominate. Too high, and popular items never budge. Testing with real data is key.
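
As a quick sketch (the prior mean and weight below are made-up, illustrative values):

```python
def bayesian_average(ratings: list[float], prior_mean: float, weight: float) -> float:
    """Blend an item's ratings with the global mean, weighted as if the prior
    were `weight` virtual ratings, so sparse items stay close to the prior."""
    return (weight * prior_mean + sum(ratings)) / (weight + len(ratings))

# Global mean of 3.5 stars, prior worth 10 virtual ratings:
print(bayesian_average([5.0], prior_mean=3.5, weight=10))        # ~3.64, not 5.0
print(bayesian_average([5.0] * 200, prior_mean=3.5, weight=10))  # ~4.93, data wins
```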

When to use it: Star rating systems (products, restaurants, movies), ordinal scales, scenarios where you want a tuneable balance between new and established items.

IMDB’s Top 250 uses a variant of Bayesian averaging, as explained in their FAQ.

Favoring Freshness with Time Decay

Both previous methods treat all ratings equally, regardless of when they were submitted. But what if quality changes over time? A restaurant’s new chef, a software update, a product redesign… Old reviews become less relevant as time passes.

Time decay addresses this by making recent ratings count more than old ones. A common approach is exponential decay, where the weight of a rating is given by:

w_t = e^{-\lambda t}

where:

- w_t is the weight given to a rating of age t
- λ is the decay rate
- t is the time elapsed since the rating was submitted

Choosing λ depends on your domain's volatility. I find it more intuitive to reason in terms of half-life, the time required for a rating's weight to decay to half its initial value. Given a half-life t_{1/2}, we can calculate the corresponding λ:

\lambda = \frac{\ln 2}{t_{1/2}}

A higher λ causes the weight to decay much faster:

Exponential time decay

News and social content might use a half-life of a few days, while restaurants might use a longer half-life on the order of weeks.
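
Here's a small Python sketch combining the two formulas above (the half-life and the sample ages are made up for illustration):

```python
import math

def decay_weight(age_days: float, half_life_days: float) -> float:
    """Weight of a rating that is age_days old: lambda = ln(2) / t_half."""
    return math.exp(-math.log(2) / half_life_days * age_days)

def time_weighted_average(ratings: list[tuple[float, float]], half_life_days: float) -> float:
    """Average of (rating, age_days) pairs, with recent ratings counting more."""
    weights = [decay_weight(age, half_life_days) for _, age in ratings]
    return sum(w * r for w, (r, _) in zip(weights, ratings)) / sum(weights)

# Five glowing but stale reviews vs. two fresh bad ones, 30-day half-life:
ratings = [(5.0, 180)] * 5 + [(2.0, 1)] * 2
print(time_weighted_average(ratings, half_life_days=30))  # ~2.1: recency dominates
```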

Time decay plays nicely with others. You can use time-weighted ratings as input to Bayesian averaging, or compute Wilson scores on recent votes only.

When to use it: Content feeds, news aggregators, business review websites - any domain where quality or relevance changes over time.

Explore vs Exploit

Previous methods rely on observed ratings. But what about items with no data? They might be hidden gems, just as they might be junk. How can we balance showing proven champions with exploring unknowns?

This problem is known as the exploration-exploitation dilemma.

One strategy is to use a slightly modified version of the Upper Confidence Bound (UCB) bandit algorithm, which adds an “exploration bonus” to little-known items:

\bar{r} + c \cdot \sqrt{\frac{\ln{N}}{n}}

where:

- r̄ is the item's average rating
- c is a constant controlling the strength of exploration
- N is the total number of ratings across all items
- n is the number of ratings for this particular item

The exploration bonus favors lesser-known items, leading to serendipity. The fewer ratings an item has relative to the total, the more it is boosted, as shown in the graph below. Note that the graph uses logarithmic scaling, so smaller values appear more spread out.

Upper Confidence Bound - Exploration Bonus
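
A sketch of the scoring rule (the exploration constant c and the cold-start fallback are illustrative choices):

```python
import math

def ucb_score(avg_rating: float, n: int, total_n: int, c: float = 1.0) -> float:
    """Average rating plus an exploration bonus that shrinks as n grows."""
    if n == 0:
        return float("inf")  # unrated items get surfaced at least once
    return avg_rating + c * math.sqrt(math.log(total_n) / n)

# 100,000 ratings sitewide: the obscure item temporarily outranks the popular one.
print(ucb_score(4.0, n=5, total_n=100_000))      # ~5.52
print(ucb_score(4.5, n=2_000, total_n=100_000))  # ~4.58
```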

It is worth mentioning that there exists a similar but probabilistic strategy called Thompson Sampling: for each item, we generate a random number using its average rating as the mean and its uncertainty 1/√n as the standard deviation:

\mathcal{N}(\bar{r}, \frac{1}{\sqrt{n}})

Items with higher sampled ratings rank better.
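
A sketch of this sampling step, assuming every item has at least one rating (unrated items would need a fallback such as the UCB bonus above):

```python
import math
import random

def thompson_score(avg_rating: float, n: int) -> float:
    """Sample a plausible rating from N(avg, 1/sqrt(n)); rank by the sample.
    Few ratings means a wide spread, so uncertain items occasionally rank high."""
    return random.gauss(avg_rating, 1 / math.sqrt(n))  # assumes n >= 1

# Re-rank on every request: the uncertain item bounces around, the proven one doesn't.
print(thompson_score(4.0, n=4))    # roughly 4.0 +/- 0.5 most of the time
print(thompson_score(4.2, n=400))  # tightly clustered around 4.2
```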

When to use it: News aggregators, content recommendation systems, A/B testing platforms, anywhere you want to systematically explore uncertain items while still favoring good ones.

YouTube runs a similar approach to balance showing popular videos with discovering potentially viral content.

Engagement Metrics Beyond Ratings

Modern ranking systems increasingly incorporate behavioral signals to complement explicit feedback: click-through rate, dwell time, completion rate, shares, purchases, and so on.

These can be combined into composite scores, for example using a simple weighted approach \sum_{i=1}^n w_i \cdot metric_i, where weights should be refined depending on the domain.
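
As an illustration, here is a sketch assuming each metric has already been normalized to a comparable 0-to-1 scale (the metric names and weights are hypothetical):

```python
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of behavioral metrics, each pre-normalized to [0, 1]."""
    return sum(weights[name] * value for name, value in metrics.items())

# Hypothetical weights; in practice, tune them per domain (e.g. via A/B tests).
weights = {"click_through": 0.2, "dwell_time": 0.5, "conversion": 0.3}
metrics = {"click_through": 0.6, "dwell_time": 0.8, "conversion": 0.1}
print(composite_score(metrics, weights))  # 0.55
```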

When to use it: Platforms with rich behavioral data or scenarios where explicit ratings are sparse or unreliable.

Note that unlike explicit ratings, behavioral signals are collected passively. Users may not realize their browsing patterns influence what they see.

Personalization

Beauty is in the eye of the beholder, and ratings are in the eye of the rater.

All previous strategies produce a single global ranking, but different users have different preferences.

Personalization is a great way to cater to different tastes. Collaborative filtering predicts “what would this particular user rate this item?” based on similar users’ behavior.

The idea is to find similar users based on rating history, and weigh items by how those similar users rated them.

How can we tell whether two users are similar? Perhaps the most straightforward similarity measure is the Pearson correlation coefficient, which gives us a value between -1 (opposite tastes) and 1 (perfectly aligned tastes), with 0 meaning no relationship.

Let’s build up to the coefficient formula using an example. Imagine two movie reviewers, Alice and Bob. The Pearson correlation asks: “Do they tend to agree on which movies are good or bad?”

Step 1: Find Each User’s Baseline

First, we figure out each person’s average rating, r̄_A and r̄_B, only considering movies both have rated. Alice might average 3.5 stars, Bob might average 4.0 stars.

Step 2: Look at Deviations From Baselines

For each movie, we check:

- whether Alice rated it above or below her own average: r_{A,i} - r̄_A
- whether Bob rated it above or below his: r_{B,i} - r̄_B

Step 3: Check if They “Move” Together

For each movie, multiply their deviations: if both rated it above (or both below) their respective baselines, the product is positive; if they moved in opposite directions, it is negative.

Add these products up: that’s the covariance cov(A, B). A positive sum means they generally agree.

\sum_{i=1}^n(r_{A,i}-\bar{r}_A) \cdot (r_{B,i}-\bar{r}_B)

Step 4: Normalization

The magnitude of the covariance is not that interesting in and of itself, because it depends on how much people’s ratings vary.

We fix this by dividing by the product of the standard deviations σ_A · σ_B (which measure how spread out each person’s ratings are). This scales everything to the -1 to +1 range.

\sigma_A = \sqrt{\sum_{i=1}^n(r_{A,i} - \bar{r}_A)^2}

(σ_B is defined analogously)

All in all, the coefficient is given by:

\frac{cov(A, B)}{\sigma_A \cdot \sigma_B}

Now that we have established how closely Alice’s tastes align with Bob’s, we can predict whether Bob will enjoy a movie he hasn’t seen by combining Alice’s rating of that movie with the degree to which Bob typically agrees with her preferences (and vice versa).
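
Putting the four steps together, here is a minimal sketch (the reviewers’ names and sample ratings are, of course, made up):

```python
import math

def pearson_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Pearson correlation between two users, over the movies both have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0  # not enough overlap to compare tastes
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    sigma_a = math.sqrt(sum((a[i] - mean_a) ** 2 for i in common))
    sigma_b = math.sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    if sigma_a == 0 or sigma_b == 0:
        return 0.0  # a user who rates everything identically carries no signal
    return cov / (sigma_a * sigma_b)

alice = {"Alien": 5.0, "Brazil": 3.0, "Casablanca": 4.0, "Dune": 2.0}
bob = {"Alien": 4.5, "Brazil": 2.5, "Casablanca": 4.0, "Dune": 1.5}
print(pearson_similarity(alice, bob))  # ~0.98: their tastes move together
```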

This topic belongs to the larger class of recommender systems, which I’ll cover in a future article.

When to use it: Platforms with sufficient user data (typically 100+ ratings per user), scenarios where personalization matters (streaming, e-commerce). More generally: when one-size-fits-all rankings perform poorly.

While powerful, the main drawback of this approach is the cold-start problem: new users have no rating history to compare against. In that case, we can always fall back to a global ranking strategy.

Closing Thoughts

Evan Miller’s Wilson score article demonstrated that uncertainty in data mattered just as much as the data itself.

This insight can be generalized: good ranking must reflect the multi-dimensionality of the underlying data. It must take into account uncertainty, freshness, serendipity, implicit feedback, and personalization.

In practice, robust ranking systems combine several of the strategies we presented. There’s no single “best” ranking algorithm. The right choice depends on the nature of the data and the domain. Most importantly, good ranking systems are built and refined empirically.