A Skill Ranking System for Natural Selection 2

Natural Selection 2 is my favorite game, without compare. I really can’t say enough good things about it, go watch the last big tournament if you want a taste. It combines everything that’s great about the FPS and RTS genres, and requires very high levels of FPS skill, RTS tactics, and team coordination to play well. It’s the only video game I’ve played where a person’s leadership skills can mean the difference between victory and defeat. When someone leads their team to victory in NS2, it doesn’t mean they were the one with the most kills. It’s a very rewarding social experience, with all the highs and lows of a team sport, but you get to shoot aliens.

It is however, a difficult game to learn, and an even more difficult game to play well. Because of this, games in public servers often have unbalanced teams, leading to dissatisfying games for both sides.

To help fix this problem, I designed a player ranking system that was implemented in the most recent patch, Build 267. It’s based on skill systems like ELO but tweaked for the unique properties of NS2.


A useful skill system should be able to predict the outcome of a game using the skills of the players on each team. Rather than trying to figure out which other statistics about a player (kill/death etc) indicate that a player is good, we instead try to infer it the skill levels just from our original objective: predicting the outcome of games.

Designing a skill system like this is different from many other statistics problems, because the world changes in response to the system. It’s not enough to predict the outcome of games, you have to be sure that you can still predict the outcome of games even when people know how your system works. On top of that, you have to ensure that your system incentivizes behaviors that are good for the game.


  • p is the probability of a victory by team 1.
  • G is the outcome of the game, 1 if team 1 wins, 0 if team 1 loses.
  • s_i is the skill of the ith player.
  • t_i is the time spent in the round, an exponential weighting of each second so that being in the round at the beginning counts more than being in the round at the end. Integrals over exponents are easy, so you can implement this as t_i = a^{-\text{entrance time}} - a^{-\text{exit time}}. Set a to something like 2^{2/r} where r is the median round length. This means that playing until the middle of an average round counts more than playing from the middle to the end, no matter how long the round takes.
  • T_i is an indicator variable for the team of the ith player. Its value is 1 if the ithplayer is on team 1, -1 if the ith player is on team 2

The Basic Model

To compute the skill values, the first task is to predict the probability of team 1 winning the round as a function of those skill values.

\displaystyle \log \left(\frac{p}{1-p}\right) = \frac{\sum_i T_i t_i s_i}{\sum_i t_i}

\displaystyle p = \frac{1}{1 + e^{-\sum_i T_i t_i s_i / \sum_i t_i}}

The function \log\left(\frac{p}{1-p}\right) maps the range (0,1) to the range (-\infty,\infty) which is how we can relate a probability to a sum of different factors. (It’s a convenient function for this purpose because it’s deeply related to how conditionally independent components combine into a probability. See Naive Bayes, Logistic Regression for more info)

We’d like to come up with the values for s_i that cause the model to predict the outcome correctly for as many games as possible. I’m skipping a bunch of steps where we maximize the log likelihood of the data. We’ll be optimizing the model using stochastic gradient descent. A pdf for further reading. Basically, whenever the model is wrong, we figure out the direction in which it was wrong, and we change the model to move it in the opposite direction.

\displaystyle s_i\leftarrow s_i + T_i(G-p)\frac{t_i}{\sum t_i}

We predict the outcome of the game based on the players’ skill and how long they each played in the game. After the game, we update each player’s skill by the product of what fraction of the game they played, and the difference between our prediction and what actually happened. 

Properties of the Basic Model

  1. If one team smashes another, but the teams were stacked to make that outcome almost certain, nobody’s skill level changes at the end of the round. Only unexpected victories change skill levels. 
  2. Nothing you do during the round other than winning the game has any effect on your skill level. This means that there’s no incentive to rack up kills at the end of the game rather than finishing it, or to play skulk instead of gorge, or to go offense rather than build. 
  3. The effect on your score is determined by the time you spent playing the round rather than your points, with the beginning of the game weighted much higher than the end. This means that it doesn’t harm you to join a game that is already lost, because you’ll have played for a very small fraction of the weighted time spent in the game.
  4. If the two teams have a different number of players, this is compensated for automatically by normalizing by the total time across both teams rather than just within a team, and the model predicts that the team with more players will win.
  5. Larger games contribute less to each player’s skill level because the presumption is that each player has a smaller effect on the outcome of the game.
  6. If you are a highly skilled player on a low-skill server, the best thing you can do for your skill ranking is to join the underdog team, and lead them to victory.

Accounting for imbalance in races, maps, game sizes.

The basic model assumes that the only factors that contribute to the outcome of the game are the skill of the players. Given the overall win rate of each race, this is clearly not true. To fix the model to account for this, all we have to change is our original formula for p.

\displaystyle \log \left(\frac{p}{1-p}\right) = \frac{\sum_i T_i t_i s_i}{\sum_i t_i} + F(\text{race})

We can determine the value of F for team 1’s race using the historical records for the winrate of that race. We then set F to \log\left(\frac{\text{race win rate}}{1 - \text{race win rate}}\right). This ensures that when teams are even, our prediction matches the historical probability of that race winning.

This needn’t be merely a function of the race however. It could also be a function of the map, the game size, or any other relevant feature that is independent of the players. All that is necessary to set its value is to measure the historical winrate for the team in that set of circumstances (for instance, aliens on ns2_summit with game size > 16), and put it through the inverse logit function as above.

\displaystyle \log \left(\frac{p}{1-p}\right) = \frac{\sum_i T_i t_i s_i}{\sum_i t_i} +F(\text{race}, \text{map},\text{game size}, ...)

Special treatment for commanders.

The basic model assumes that commanders are just like any other player, and that they have the same contribution to the outcome of the game as any other player. This isn’t necessarily a bad assumption, I’ve seen many games where an unusually vocal and motivational player other than the commander was able to call plays and lead a team to victory. 

The much worse assumption however is that the same skill sets apply for being a commander or a player. Players that can assassinate fades and dodge skulks might be useless in the command chair, so it doesn’t make much sense to use the same skill values for both. To fix this, we give each player a separate skill level that indicates their skill at commanding. To distinguish it from the other skill level, we’ll call it \chi. To indicate the time spent commanding i’ll use c the same way (and using the same formula) as for t above.

\displaystyle \log \left(\frac{p}{1-p}\right) = \frac{\sum_i T_i t_i s_i}{\sum_i t_i} +\frac{\sum_i T_i c_i \chi_i}{\sum_i c_i} + F(\text{race}, \text{map},\text{game size},...)

This makes a few questionable assumptions for simplicity. The model will still do useful things even if these assumptions are false, but they do indicate areas where the model won’t perform well.

1. The magnitude of the impact of the commander does not depend on the size of the game.

2. Playing marine commander is equivalent to playing alien commander.

3. A good commander with a bad team vs a bad commander with a good team is expected to be an even match. I suspect this isn’t true because there are few ways that a commander can influence the outcome of the game independently of their team, such that a bad team probably critically handicaps a good commander, and as such it might make sense to use quadratic terms instead. 

Faster convergence

Rather than restricting ourselves to updating the skill levels once on each game, we can optimize them iteratively using the whole history of games, which will cause them to converge much faster as we acquire more data. To do this however, it is necessary to put a prior on the skill levels of each player so that they are guaranteed to converge even when there is little data for a player. To do this, include a Gamma Distribution in the model for each player’s skill.

The gradient of the log likelihood of the gamma distribution is \frac{k-1}{s_i} - \frac{1}{\theta}.This makes the update rule for the player as follows:

\displaystyle s_i\leftarrow s_i + \frac{k-1}{s_i} - \frac{1}{\theta} + \sum_{j \in \text{games}}T_{ij}(G_j-p_j)\frac{t_{ij}}{\sum t_{ij}}

There are two differences between this formula and the update rule above. The first is that rather than just updating the score one game at a time as the games come in, we store the whole history of games the player has played, and iteratively update the player’s skill.

Secondly, on each iteration, we add (k-1)/s to the player’s skill, and subtract 1/\theta from the player’s skill. This expresses our belief that until we know more about a player, their skill probably isn’t too high or too low. The k and \theta parameters control the distribution of skill levels. As k increases, we become more skeptical of very low skill values. As \theta decreases, we become more skeptical of very high skill values.

The mean skill value will be k\theta.

To run this algorithm, we alternate between updating our prediction for the outcome of each game and updating each player’s skill level based on the new predictions for all of the games they’ve played.

Numerical Details

Gradient descent tells you which direction to go, but it doesn’t tell you how far. We have a lot of freedom in choosing how much to update each player’s score after each round to make it satisfying for the player. More precisely, we can add a coefficient \alpha that describes the step size for the updates.

\displaystyle s_i\leftarrow s_i + \alpha T_i(G-p)\frac{t_i}{\sum t_i}

To pick this, it’s useful to get a sense of what skill values actually mean. If the skill values of two players differ by 2 points, then this causes our model to predict that the better player will beat the worse player 90% of the time. (1 / (1 + e^{-2}) = 0.88) Or rather, that a team composed of clones of the better player will beat a team composed of clones of the worse player 90% of the time. So intuitively, how many fair games must a player win consecutively before we’re comfortable predicting that they will beat the people they are playing against 90% of the time?

Suppose the player is playing on 20 player servers, and they are playing fair games. That means the absolute value of the update above will be 0.5/20, or 0.025 each round. With \alpha set to 1, it would take the player 80 consecutive wins on fair games in order to be estimated to be able to beat their old self 90% of the time. This seems excessive! So we should set \alpha to something larger, maybe 8?

This gives us a feel for the difference between the players scores, and how quickly we should update them, but we still don’t have a sense for the absolute magnitude of the player scores. How far should they all be from 0? If the teams have the same number of players, than all that matters is the difference between the scores. We can add any constant, and it cancels out. However, it does effect the model when the teams have different numbers of players. The team with more players will be expected to win.

players 3dThis plot shows the relationship between the average value of the player skill, x, the number of players in the game, n, and the prediction for the outcome of a game that is otherwise fair.

So if the skill values are close to 0, then we predict the team with more players to have no advantage, they win 50% of the time. If the skill values are all around 40, then we predict the team with more players to win over 90% of the time in a 13 player game. This makes it important that we don’t set the starting absolute value of the skill levels to be too high or too low.

To initialize the scores, a reasonable choice is to set them to some multiple of the log of the player’s playtime.

Aside: Why the logit and not erf?

This formulation of a skill system differs from many others in that it uses the logistic distribution’s cdf to model win probabilities rather than the gaussian distribution’s cdf. There are two reasons I chose this.

  1. NS2 allows for frequent upsets late in the game if one team is particularly cunning, and the other is inattentive. This makes a distribution with fatter tails more appropriate.
  2. Using the logit allows us to correct for various factors that cause imbalance in the game without any complicated optimization. Due to the independence assumption, we can just measure winrates empirically in different circumstances, and we don’t have to optimize these coefficients jointly with the player skill.

How do we balance the teams in a real game?

NS2 had a feature where players could vote to “Force Even Teams” but it was generally unpopular. The skill values weren’t very accurate, and the strategy used to assign players had some undesirable properties. It used a simple heuristic to assign the players since the general problem is NP-complete. Because it reassigned all players, they would find themselves playing aliens when they wanted to play marines, or vice versa, and would switch back right after the vote.  To fix this, we designed an algorithm that minimizes the number of swaps that occur to reasonably balance the teams.

  1. Optimize the team assignments of the players who have not yet joined a team. You can do this using whatever method you wish.
  2. Consider each pair of players.  Find the pair where swapping them minimizes the absolute value of the skill difference. Iterate until there is no pair that reduces the absolute value of the skill difference.

While this greedy heuristic of swapping only the best pairs one at a time may be a poor solution for coming up with the optimal balanced teams, it is great for keeping as many players as possible happy with the team assignments. Because the swap chosen is the one that minimizes the objective function the most, we can get quickly to a local optima by swapping very few players.


One comment

  1. Gibberish

    You have made a (good) assumption that there are at least two skills that a player may have (FPS vs Commander).

    The problem is I don’t think two skills are sufficient to model player ability.

    For new players there is a clear bias towards marines (most new players can shoot better than they can wall jump). Even for with skilled players there can be a clear bias (although for some players it is for aliens).

    Additionally there are combinations, for example I played on a server where 2 players were very vocal about going “lerk pack” (flying round together) as soon at they had res.

    By itself it may not have been too difficult to counter, but it acted as a rallying cry for the rest of the team to follow them in, hence with both those players on the same team there was a significant chance of a win.

    Your point about only using the winning/losing to determine the team balance was a good one.
    Perhaps there is a simple solution:

    Each time a new round start evenly divide up the winning team between aliens and marines.
    I realize that that will cause contention (players that want to play a particular side), therefore it might be worth having a 30 second team pick period at the start of each round where everyone can specify:
    Marines, Aliens or Random.

    Critically the vote is not disclosed to other players until the 30 second period is up.
    If the winning team is “jumbled” okay everyone can have their ideal pick.

    Otherwise the server will need to force assign some players at random.

    I realize even this proposal can be “gamed”, but if you allow players to pick there side of choice its near impossible to prevent a small group (2-3) of very highly skilled players from stacking a server.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: