The goal is to come up with a smooth expression that is proportional to the probability that a player will have a good game on a server. For a good game, all conditions should be simultaneously true, so we express this as a product. Each component is defined below.
This function determines the viability of a game from the number of players. Games start to become viable as they pass 12 players.
We don’t just want to put the player on the biggest game available, though; we also want to maintain a healthy server pool with many viable games. To do this, we help seed servers by favoring the server where a joining player will increase the viability of as many player slots as possible, as much as possible. Concretely, we give servers a bonus based on the derivative of viability.
We only care about increasing the viability of the server if there are open slots, so in the combined function we scale the effect of the derivative of viability by the number of open player slots. Seeding servers is also only a secondary goal. The primary goal must be giving the player a good game, or players will just ignore the ranking or “play now” button, so seeding is given a lower weight in the combined function.
The combined function ensures that servers with almost enough players for a good game get priority over servers that are already healthy, but that otherwise players will end up on a server with a healthy number of players.
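The combination described above can be sketched as follows. This is only an illustrative model: the logistic midpoint (12 players), steepness, and `seed_weight` are my assumptions, not the shipped constants.

```python
import math

def viability(players):
    # Hypothetical logistic viability curve: games become viable as they
    # pass ~12 players. Midpoint and steepness are illustrative guesses.
    return 1.0 / (1.0 + math.exp(-(players - 12) / 2.0))

def viability_slope(players, eps=1e-3):
    # Numerical derivative of viability with respect to player count.
    return (viability(players + eps) - viability(players - eps)) / (2 * eps)

def player_count_score(players, max_players, seed_weight=0.5):
    # Primary term: viability of the game the joining player gets.
    # Secondary term: seeding bonus based on the derivative of viability,
    # scaled by the open slots and down-weighted so it never outranks
    # actually giving the player a good game.
    open_slots = max(max_players - players, 0)
    return viability(players) + seed_weight * open_slots * viability_slope(players)
```

With these made-up constants, a server that is almost viable (12 of 20 slots) outscores one that is already healthy (18 of 20), and both outscore a nearly empty one.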
The player’s latency to the server is as important a factor as whether the server has enough players. Once it crosses 100 milliseconds, it is a key factor.
All other things being equal, we’d like to put the player on a server with similarly skilled players. However, when a player joins a server, they will change the mean score of the server, so the amount we care about the server’s average skill level should increase as the number of players on the server increases. We use a statistic inspired by the z score. (The mean skill is available to us, but the variance is not, so this assumes a default.) Rather than doing something expensive with erf as would be customary to convert a z score into something like a probability, we use a simple exponential. Aside from being a cheaper function, it prevents the ranking from prioritizing skill over ping even when the skill difference is tremendous.
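The skill component might look something like this sketch. The assumed standard deviation, the square-root weighting by player count, and the skill scale are all my assumptions for illustration:

```python
import math

ASSUMED_SKILL_STDDEV = 500.0  # the true variance isn't available; assume one

def skill_match(player_skill, server_mean_skill, server_players,
                stddev=ASSUMED_SKILL_STDDEV):
    # A z-like statistic: distance from the server's mean skill in assumed
    # standard deviations, weighted more heavily as the server fills up
    # (a lone player barely moves the mean; a full server defines it).
    z = abs(player_skill - server_mean_skill) / stddev
    # exp(-z) is cheaper than erf and decays smoothly, so even a huge
    # skill gap can't completely dominate ping in the combined ranking.
    return math.exp(-math.sqrt(server_players) * z)
```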
Servers in NS2 are graded on their ability to keep up with the load placed on them. According to matso, you don’t want to play on a server with score less than -10, and everything above 10 is about equivalent.
We give special priority to favorites, and to servers that have open slots. For “play now,” joinable is just a filter. For server ranking, it’s somewhat useful to see servers you expect to see in the list even if you can’t join them, just so you know they still exist.
It is, however, a difficult game to learn, and an even more difficult game to play well. Because of this, games on public servers often have unbalanced teams, leading to dissatisfying games for both sides.
To help fix this problem, I designed a player ranking system that was implemented in the most recent patch, Build 267. It’s based on skill systems like Elo but tweaked for the unique properties of NS2.
A useful skill system should be able to predict the outcome of a game using the skills of the players on each team. Rather than trying to figure out which other statistics about a player (kill/death etc.) indicate that a player is good, we instead try to infer the skill levels from our original objective alone: predicting the outcome of games.
Designing a skill system like this is different from many other statistics problems, because the world changes in response to the system. It’s not enough to predict the outcome of games, you have to be sure that you can still predict the outcome of games even when people know how your system works. On top of that, you have to ensure that your system incentivizes behaviors that are good for the game.
To compute the skill values, the first task is to predict the probability of team 1 winning the round as a function of those skill values.
The logistic function maps the range (−∞, ∞) to the range (0, 1), which is how we can relate a probability to a sum of different factors. (It’s a convenient function for this purpose because it’s deeply related to how conditionally independent components combine into a probability. See Naive Bayes and Logistic Regression for more info.)
We’d like to come up with the skill values that cause the model to predict the outcome correctly for as many games as possible. I’m skipping a bunch of steps where we maximize the log likelihood of the data. We’ll be optimizing the model using stochastic gradient descent. Basically, whenever the model is wrong, we figure out the direction in which it was wrong, and we change the model to move it in the opposite direction.
We predict the outcome of the game based on the players’ skill and how long they each played in the game. After the game, we update each player’s skill by the product of what fraction of the game they played, and the difference between our prediction and what actually happened.
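The predict-then-update loop can be sketched like this. The tuple representation, learning rate, and the convention that skills live on the model’s logit scale are illustrative assumptions, not the shipped implementation:

```python
import math

def predict_team1_win(team1, team2):
    # Logistic model: P(team 1 wins) from playtime-weighted skill sums.
    # Each team is a list of (skill, fraction_of_round_played) pairs.
    diff = sum(s * t for s, t in team1) - sum(s * t for s, t in team2)
    return 1.0 / (1.0 + math.exp(-diff))

def update_skills(team1, team2, team1_won, learning_rate=0.1):
    # One stochastic-gradient step: move each player's skill by the
    # prediction error, scaled by the fraction of the round they played.
    error = (1.0 if team1_won else 0.0) - predict_team1_win(team1, team2)
    team1 = [(s + learning_rate * error * t, t) for s, t in team1]
    team2 = [(s - learning_rate * error * t, t) for s, t in team2]
    return team1, team2
```

If the model already predicted the winner confidently, the error is small and skills barely move; an upset produces a large correction.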
The basic model assumes that the only factors that contribute to the outcome of the game are the skills of the players. Given the overall win rate of each race, this is clearly not true. To fix the model to account for this, all we have to change is our original formula for p.
We can determine the value of the bias term for team 1’s race using the historical records for the win rate of that race. We set it to the log-odds of that win rate (the inverse of the logistic function). This ensures that when teams are even, our prediction matches the historical probability of that race winning.
This needn’t be merely a function of the race, however. It could also be a function of the map, the game size, or any other relevant feature that is independent of the players. All that is necessary to set its value is to measure the historical win rate for the team in that set of circumstances (for instance, aliens on ns2_summit with game size > 16), and put it through the inverse logit function as above.
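Numerically, the bias term is just the log-odds of the historical win rate, so the prediction falls back to that rate when teams are even:

```python
import math

def bias_from_winrate(historical_winrate):
    # Log-odds of the historical win rate for this context
    # (race, or race + map + game size, etc.)
    return math.log(historical_winrate / (1.0 - historical_winrate))

def predict_with_bias(skill_diff, historical_winrate):
    # With skill_diff == 0 (perfectly even teams), this returns exactly
    # the historical win rate for the context.
    b = bias_from_winrate(historical_winrate)
    return 1.0 / (1.0 + math.exp(-(b + skill_diff)))
```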
Gradient descent tells you which direction to go, but it doesn’t tell you how far. We have a lot of freedom in choosing how much to update each player’s score after each round to make it satisfying for the player. In addition, the optimization algorithm we are using is a special case of “Stochastic Gradient Descent,” which has attracted a flurry of research interest in the last few years due to its popularity in Deep Learning. Thanks to this, there’s a lot of research on how to get it to converge quickly. Our case is special in several ways. Our problem is sparse: only a small fraction of players will be updated on each game. The size of the gradients is bounded and stable. Our underlying parameters are non-stationary, but change at a relatively low rate.
To accommodate these properties, we use a blend of an Adagrad based learning rate (to achieve fast convergence) and a constant learning rate (to accommodate non-stationarity.) Despite the intimidating paper, Adagrad has a very simple formulation which has added to its popularity. Roughly speaking, the update to a player’s skill will be inversely proportional to the square root of the number of games they have played plus some constant.
In these formulas there are three tunable parameters: two determine the relative sizes of the Adagrad and constant factors, and the third determines how large the update will be for the player’s first game. The algorithm is simple enough to implement in a spreadsheet that simulates its behavior, so you can use this to choose the parameters.
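A sketch of the blended learning rate; the parameter names and default values here are placeholders, not the tuned values from the live system:

```python
import math

def blended_learning_rate(games_played, adagrad_weight=1.0,
                          constant_weight=0.01, initial_count=1.0):
    # Adagrad-like term: roughly inversely proportional to the square root
    # of the number of games played plus a constant, so early updates are
    # large and later ones shrink. The constant term keeps the rate from
    # ever reaching zero, so the system can track non-stationary skill.
    adagrad = adagrad_weight / math.sqrt(games_played + initial_count)
    return adagrad + constant_weight
```

The two weights set the relative size of the factors, and `initial_count` sets the size of the very first update.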
The above system uses one value for each player’s skill. When players are better on one team than the other, this leads to a number of issues:
We could fix these by just separating skill into two separate values, but we’d also like our solution to have the following properties.
The proposal that solves the majority of these issues, rather than learning two separate skills, is to learn an average and a difference. In addition to the common skill value used above, we’ll learn an offset that turns the average into their skill for each team when added or subtracted.
The learning rate of the skill value is the sum of two components, the AdaGrad learning rate, and the constant learning rate. To learn the correct value for this offset, while ensuring the system converges at the same rate, we will not apply the AdaGrad learning rate to the offset, only the constant learning rate. This will cause a player’s initial games to affect both skill values equally, and later games to only affect the skill for the team they played on.
The skill values this learns without the offset are backwards compatible, so balancing algorithms will not need to be modified. Despite the fact that each player will still only have one skill value for balancing, there is still a benefit to doing this. A player’s skill will no longer randomly walk based on their team assignment. It will always be the average of their marine and alien skill, rather than being effectively a weighted average of their skill for the teams they’ve played on recently. Additionally, by collecting this data, and returning it from hive, we can open it up to modders to balance teams more aggressively if they desire.
To compute this, we need to modify the skill value used in the game prediction to be the player’s common skill plus the offset when they play one team, and minus the offset when they play the other.
The gradient computation is identical to the skill gradient, except multiplied by the sign of the team they played on. Winning on marines and losing on aliens contribute the same gradient to the offset.
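A sketch of the combined update; the ±1 team encoding and the rate values are my assumptions for illustration:

```python
def update_skill_and_offset(skill, offset, team_sign, error,
                            adagrad_rate, constant_rate):
    # team_sign is +1 for marines, -1 for aliens; the effective skill used
    # in prediction is skill + team_sign * offset.
    # The common skill gets both learning-rate components; the offset gets
    # only the constant rate, so early games move both team skills equally
    # and later games specialize them.
    skill += (adagrad_rate + constant_rate) * error
    offset += constant_rate * team_sign * error
    marine_skill = skill + offset
    alien_skill = skill - offset
    return skill, offset, marine_skill, alien_skill
```

Note that winning on marines (positive sign, positive error) and losing on aliens (negative sign, negative error) push the offset the same way.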
This formulation of a skill system differs from many others in that it uses the logistic distribution’s cdf to model win probabilities rather than the gaussian distribution’s cdf. There are two reasons I chose this.
Rather than restricting ourselves to updating the skill levels once on each game, we can optimize them iteratively using the whole history of games, which will cause them to converge much faster as we acquire more data. To do this, however, it is necessary to put a prior on the skill levels of each player so that they are guaranteed to converge even when there is little data for a player. To do so, include a Gamma Distribution in the model for each player’s skill.
The gradient of the log likelihood of the gamma distribution with shape k and scale θ, with respect to the skill x, is (k − 1)/x − 1/θ. This makes the update rule for the player as follows:
There are two differences between this formula and the update rule above. The first is that rather than just updating the score one game at a time as the games come in, we store the whole history of games the player has played, and iteratively update the player’s skill.
Secondly, on each iteration, we add the (k − 1)/x term to the player’s skill, and subtract the 1/θ term from it. This expresses our belief that until we know more about a player, their skill probably isn’t too high or too low. The k and θ parameters control the distribution of skill levels. As k increases, we become more skeptical of very low skill values. As θ decreases, we become more skeptical of very high skill values.
The mean skill value will be kθ.
To run this algorithm, we alternate between updating our prediction for the outcome of each game and updating each player’s skill level based on the new predictions for all of the games they’ve played.
The basic model assumes that commanders are just like any other player, and that they have the same contribution to the outcome of the game as any other player. This isn’t necessarily a bad assumption: I’ve seen many games where an unusually vocal and motivational player other than the commander was able to call plays and lead a team to victory.
The much worse assumption, however, is that the same skill sets apply to being a commander and a player. Players that can assassinate fades and dodge skulks might be useless in the command chair, so it doesn’t make much sense to use the same skill values for both. To fix this, we give each player a separate skill level that indicates their skill at commanding, distinct from their field-player skill. To indicate the time spent commanding, I’ll use the same fraction-of-game measure (computed with the same formula) as for playtime above.
This makes a few questionable assumptions for simplicity. The model will still do useful things even if these assumptions are false, but they do indicate areas where the model won’t perform well.
1. The magnitude of the impact of the commander does not depend on the size of the game.
2. Playing marine commander is equivalent to playing alien commander.
3. A good commander with a bad team vs. a bad commander with a good team is expected to be an even match. I suspect this isn’t true: there are few ways a commander can influence the outcome of the game independently of their team, so a bad team probably critically handicaps a good commander. It might make sense to use quadratic terms instead.
NS2 had a feature where players could vote to “Force Even Teams” but it was generally unpopular. The skill values weren’t very accurate, and the strategy used to assign players had some undesirable properties. It used a simple heuristic to assign the players since the general problem is NP-complete. Because it reassigned all players, they would find themselves playing aliens when they wanted to play marines, or vice versa, and would switch back right after the vote. To fix this, we designed an algorithm that minimizes the number of swaps that occur to reasonably balance the teams.
While this greedy heuristic of swapping only the best pair at a time may be a poor way to arrive at optimally balanced teams, it is great for keeping as many players as possible happy with the team assignments. Because each swap chosen is the one that reduces the objective function the most, we can get to a local optimum quickly by swapping very few players.
The image I’ve picked to work with has a lot of issues. The paper isn’t evenly lit, and the lights shining on it have different temperatures. It was taken with a tiny point-and-shoot camera. Nevertheless, with only a few steps, we’ll get a workable image out of it. Our strategy is to produce an image that contains only the tone of the paper, using a few filters as shortcuts and a little elbow grease. Then, we’ll use layer modes to remove the effect of the paper tone from the image.
First, find the darkest region of bare paper on the image, and use the eyedropper to get its color. Remember the average value of the three color channels. In this image, the darkest bare paper has a value of about 110. We’re going to use this value to try to remove all of the lines from the image.
Duplicate the image on a new layer. Then run Filters > Enhance > Despeckle with the following settings: turn off “Adaptive,” turn off “Recursive,” set the radius as high as it will go, set Black level to the value of the darkest bare paper (in this case 110), and White level to 256.
What we’re doing here is asking Gimp to look at every 30×30 region, and find the median of all the pixels that are brighter than 110 and dimmer than 256. Since we determined that the dimmest area of bare paper is 110, by specifying this range we are getting the full range of the paper tone, but as little as possible of the lines in the drawing, and hopefully, getting just the color of the paper out of the filter over most of the image.
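The per-pixel operation the filter performs can be sketched roughly like this (a simplification: the real filter slides the window over the whole image):

```python
import statistics

def despeckle_pixel(window, black_level=110, white_level=256):
    # Median of the window's pixels strictly inside (black_level,
    # white_level). If no pixel in the window qualifies, the filter
    # leaves the original pixel alone (returned here as None).
    candidates = [v for v in window if black_level < v < white_level]
    if not candidates:
        return None
    return statistics.median(candidates)
```

Line pixels below the black level are excluded from the median, so the output over most of the image is just the paper tone.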
It worked for some of the image, but not all of it. In some regions of the hair you can see that there weren’t any pixels that met the range constraints of 110 to 256, so it left the image alone. You can still see the shadow of the original drawing in areas that were particularly dark. We can get a little further by running it again.
From here we have to clean up manually. I do this using the smudge tool and paintbrush to pull the color in from the regions I trust into the regions that have been tainted by the drawing. I recommend duplicating the layer before doing this.
This final image should be our best guess at what the paper would look like without a drawing. Keep referring to the original drawing as you tweak it. Once you’re satisfied that it looks like a blank version of the original drawing, set the mode of the layer to “Divide.”
Why does this work? Physics tells us that the color of the light multiplies with the color of the surface (the albedo) to produce the color that we see. The areas of blank paper show us the color of the light, since the paper itself is white. To recover the original drawing, all we have to do is invert that multiplication (divide) to recover the color of the surface. After removing any remaining hue and cropping, we’ve got an image we can work with.
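For a single channel, the “Divide” layer mode amounts to something like this sketch (operating on flat lists of 8-bit values for simplicity):

```python
def divide_mode(drawing, paper):
    # Per-pixel Divide: albedo = observed / light, rescaled back to the
    # 0-255 range and clipped. Where drawing == paper (bare paper), the
    # result is pure white.
    out = []
    for observed, light in zip(drawing, paper):
        value = int(round(255.0 * observed / max(light, 1)))
        out.append(min(value, 255))
    return out
```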
To describe one fraction, use a binomial. To describe your uncertainty about that fraction, use a beta. To describe a collection of related fractions, use a beta-binomial. To fit a beta-binomial, use this code.
Imagine you’re in the middle of building something awesome, when you realize that you have a problem that sounds something like one of the following:
A reddit user has submitted a bunch of URLs. Each one has a different number of upvotes and downvotes. How good is this user?
They’ve just posted a new link. That link has one downvote. Do we show it to more people or let it die in obscurity?
An author on a news site has written a collection of articles, each with a different bounce rate. How effective is this writer at driving traffic to the rest of the site?
They’ve just posted a new article. Should we promote it?
A supplier produces a bunch of different products sold on your website. Each one has a different average star rating. How good are the products from this supplier?
They’ve just added a new product to their catalog. How should we rank this against items from other suppliers that already have ratings?
In all of these situations you’ve measured a whole bunch of fractions. You think that these fractions have something in common because the same entity is responsible for them, but you don’t think they are exactly the same. Even if you had an infinite amount of data, each item would still be different. The collection of things made by each user or author or supplier has a range of quality. How can we accurately describe this range? Once we’ve described it, how can we use this to guess the quality of the next thing this entity does?
Henceforth, I’m going to use the reddit example, but it applies to a ton of different situations. For instance, it is one of the most common statistical problems we encounter in search.
We have a whole bunch of fractions. We want to summarize them in some way to describe each user. Aggregating them seems like the reasonable thing to do. How could we aggregate them?
Add all the numerators, add all the denominators, and divide?
Consider the case of a notorious spammer who gets downvoted every time he posts, except one time he gets lucky, and a post takes off. The average for this spammer would be high, even though almost all of his submissions are horrible.
Count each fraction as 1 and take the average of all of them?
Consider a user who regularly makes fantastic comments with lots of upvotes, but periodically gets into a flame war with a single user who downvotes every comment in the thread. Those comments with a single downvote would get the same weight as the great comments, even though they represent the opinions of a very small number of people.
Weight each fraction by its denominator (equivalent to #1) except cap the denominator at something reasonable so that the average isn’t dominated by a single fraction?
Now we’re getting somewhere. This is how the system we’re going to derive behaves. It figures out the cap automatically, and it’s not quite a cap, but if you want to implement the simplest approximation of the Right Way, weight each fraction by n/(n + C) (where C is some arbitrary constant, and n is the denominator) and call it a day. However, the Right Way is much more powerful than this, so read on!
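The simple approximation looks like this; the choice of C is arbitrary, as noted above:

```python
def user_quality(fractions, cap_constant=10.0):
    # Simplest approximation of the Right Way: weight each fraction
    # (k, n) by n / (n + C), so no single high-n post dominates, and
    # low-n posts can't swing the average either.
    num = 0.0
    den = 0.0
    for k, n in fractions:
        w = n / (n + cap_constant)
        num += w * (k / n)
        den += w
    return num / den
```

The notorious spammer's one lucky post no longer dominates: 99 all-downvote posts outweigh a single viral one.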
Reddit’s karma system currently accumulates the total number of upvotes minus downvotes. This is good for encouraging users to post, but not at predicting the quality of a user’s posts. We can do better.
Before we work on inferring something from a collection of fractions, let’s figure out how to deal with one fraction.
Any time you are measuring a fraction and describing it as a probability, you are talking about a binomial distribution. If you test something n times and have k positive results, then the distribution of outcomes follows P(k | n, p) = (n choose k) · p^k · (1 − p)^(n − k).
In the reddit example, n would be the total number of votes on an item, and k would be the number of upvotes. If we want to describe the quality of a post, really we want to know the value of p for this post. That is, if we collect an infinite number of votes on an item, what fraction of them will be upvotes?
What can we conclude about p from the information we have about a post so far? Let’s use Bayes’ Rule to change the conditioning.
The denominator, P(k | n), is integrated over all of the possible values of p, so it doesn’t depend on any particular value of p (it’s just a normalization). For simplicity right now we’ll assume that our prior on p is flat. This means that if we temporarily switch from = to ∝, we can drop both terms. A common trick when working with complicated distributions is to say “We know this integrates to 1, so let’s drop all of the normalization terms and figure out what they have to be once we know the form of the distribution.”
The difference between this formula and the one we started with is that p is now the variable, and the other terms are fixed. This makes it a different distribution, the beta distribution, because it is now a continuous distribution over p instead of a discrete distribution over k.
We’ll get back to the beta distribution in a second. For now we’re just concerned with p. For the data we’ve observed so far, what’s the most likely value of p? Let’s maximize its log with respect to p. (Since log is monotonic, maximizing the log of a positive function is the same as maximizing the function, and it almost always makes for easier math.)
Wow, that’s about as intuitive as it gets. When you estimate the probability of an event as the fraction of times it happened (k/n), you are really choosing the most likely parameter of the binomial distribution, even though you probably didn’t know it.
Now we know how to deal with one fraction, and how to guess its p. Each fraction for a user will have a different p, but we think they are still related somehow. What describes the relationship between the different p’s?
The beta distribution is a smooth distribution over the interval (0, 1). It looks like a bell curve, but the way a bell curve would look if you squished it into a box. α and β are its two parameters: α/(α + β) describes the center of the distribution, and α + β describes how tight it is.
Notice that the beta distribution has the same form as the binomial distribution, but with α − 1 and β − 1 exchanged for k and n − k, continuous versions of factorial (the gamma function), and p as the random variable instead of a parameter. This fact is why the beta distribution is important.
When we assume we have no information about the distribution of p, then observe samples from the binomial, our uncertainty about p follows a beta distribution. What happens if instead of a flat prior we assume a beta distribution for the prior of p? The combined distribution of p and k becomes:
As above, we’ll change the conditioning by noting that the prior probability of the data doesn’t depend on the individual value of p, and dropping the normalization terms.
From the knowledge that this is a probability distribution and thus integrates to 1, we can put in gamma terms that we know will normalize the distribution, and change ∝ back to =.
Look how nicely that collapsed! We end up with another beta distribution for p. For this reason, the beta distribution is the “Conjugate Prior” of the binomial. If your prior knowledge about p is a beta, your posterior after observing some binomial samples is also a beta.
This makes the beta distribution an extremely convenient choice for representing the range of quality of each user! If we express that range of quality as a beta distribution, then we can easily make good guesses about the quality of a post no matter how many votes it has by maximizing the above formula! Our best guess as derived in the previous section becomes (k + α − 1)/(n + α + β − 2), which reduces to k/n when the prior is flat (α = β = 1).
So if we have a beta distribution for each user, doing the right thing to rank their posts is now exceedingly easy. Notice how α and β are effectively fictitious counts, as if we observed some additional data. These are called “pseudocounts,” and they crop up frequently in the math of combining priors with evidence. Later on we’re going to work out in great detail how to fit the beta distribution for each user, but despite sounding complicated, it ultimately has a very simple interpretation. Each user’s posts will start out their life with α upvotes and β downvotes, determined by the history of votes they’ve gotten on their other posts.
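In code, the pseudocount interpretation is a one-liner. (Here I use the posterior mean rather than the mode, for simplicity; either works as a ranking score.)

```python
def post_score(upvotes, downvotes, alpha, beta):
    # Posterior mean of p: the post starts life with alpha pseudo-upvotes
    # and beta pseudo-downvotes inherited from the user's other posts,
    # and real votes gradually overwhelm the prior.
    n = upvotes + downvotes
    return (upvotes + alpha) / (n + alpha + beta)
```

A new post with one downvote from a historically great user still scores well; the same post from a historically bad user does not.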
To figure out the beta distribution for each user, somehow we have to infer it from the collection of numerators and denominators of their posts. We’ll do this using the combined distribution.
This combined distribution of binomial samples generated from a beta is called the Beta-Binomial.
This section is considerably more advanced than the previous sections, so if you are convinced of the usefulness of the math and just want to use it, download the C++ code. Otherwise, read on!
The joint probability of the data and the values of p takes the following form.
If we knew the values of p, we could maximize this w.r.t. α and β, but we don’t know the values of p. If we knew the values of α and β, we could maximize this w.r.t. p and find the value of p for each binomial (as we derived above), but we don’t know the values of α and β.
This is a perfect opportunity to use Expectation Maximization, an optimization technique that lets us alternate between fitting p and fitting α and β, and converges to a locally optimal fit. Instead of integrating out p and maximizing the resulting log likelihood, we’re iteratively maximizing the average value of the log likelihood under our current guess of the distribution of p. (Integrating p out is possible, but doing it this way instead has other benefits that I’ll mention later.)
Using the “Soft EM” version of the algorithm, we’ll be iteratively maximizing
where α and β are the values we are fitting, and α′ and β′ are the values from the previous round. That’s a mouthful of an expression if you aren’t familiar with EM. What’s going on here?
As we discussed above, from a binomial observation (k, n) and a beta distribution prior (α′, β′), we can come up with a beta distribution for the value of p. Let’s rewrite the expectation.
The first term is a data-dependent constant that doesn’t depend on α and β, so we can drop it from the optimization.
If you know a bit about information theory, this term should look familiar. It’s the negative of the cross entropy between each sample’s posterior beta and the beta we’re fitting. The distribution that minimizes the cross entropy with a target distribution is that distribution itself. What this means intuitively is that we’re trying to pick α and β to make the prior distribution as similar as possible to each posterior. Since each binomial is different, and thus has a different posterior, these are competing objectives. This is true of EM in general, but is particularly useful here. We’ll come back to this later.
Let’s break apart the Beta inside the log.
These expectations of the logs are nicely derived in the Wikipedia article on the beta distribution, in the section on geometric means.
ψ is the digamma function. In the limit, it behaves like log x. The main difference between the two functions for our purposes is where x is small, close to 0, where ψ heads to −∞ much more quickly.
We’re ready to take some derivatives.
When we set the partials to 0 and solve we get particularly beautiful relationships.
Unfortunately, solving this involves inverting the digamma function, which everyone seems to cite someone else for. I’m no exception. At this point I’ll refer you to Thomas Minka’s Estimating a Dirichlet Distribution, where he works out both fixed-point (slow) and Newton iteration (fast) methods to solve for alpha and beta from this set of equations.
Without worrying about how to solve the terms on the left side, there’s a lot we can infer just by looking at the expressions. Notice how the right side is just the average of a statistic collected over all of the samples. ψ(x) behaves similarly to log(x) when the values are large, which lets us simplify the equations for the purposes of understanding them. (Note that this simplification does not produce good numerical results; this digression is just to help understand what this is doing.)
This tells us that the mean of the beta will be approximately the geometric mean of the smoothed values of p for each binomial.
Let’s see what else we can figure out from these expressions. Consider the case where we’ve only measured 1 binomial sample, or all the binomials we’ve measured have the same k and n. In that case, the above equations are only solved as α and β go to infinity. Uh oh. We’re running this iteratively, so both α and β will diverge! What’s going on?
We didn’t include a prior for α and β. We’re computing the maximum likelihood estimate, instead of the maximum a posteriori estimate. This means that the beta distribution can become as implausible as it wants to, so long as being implausible makes the data more likely. In this case, the beta distribution is turning into a delta function at the k/n value of the single binomial.
It’s doing the right thing; there isn’t a bug in our math. The delta function at that value is the maximum likelihood estimate when all of the binomials have the same k and n, but it’s clearly not what we want. Note too that this isn’t much of an edge case. It can easily happen just by chance that all of our measured binomials are the same. It would be really bad if something so simple caused our code to run forever. We need either a prior, or some regularization.
Unfortunately, the beta distribution doesn’t have a convenient conjugate prior for us to use. Instead, let’s borrow the idea of “pseudocounts” from the binomial example. Our fit is expressed as a nice summary statistic averaged over all the binomials, and we showed how these represent cross entropies from the beta implied by each binomial. Let’s just add some fake ones. We’ll make up an example beta and give it a weight relative to the other samples. Formally speaking, we’re adding a regularization term to the optimization that penalizes the beta’s divergence from that example beta, as below.
These collapse nicely with our previous derivation, and turn into an equally convenient update rule. These terms force convergence, because they don’t update on each iteration.
The nice thing about these expressions is how intuitive they are. The difference of digammas may seem like a strange statistic, but in the end you’re just averaging this function over all of your data and then inverting it, just like you would average the values to compute a mean, or average the squared errors to compute a variance.
Because this fitting process is expressed in terms of a summary statistic, we can easily implement a streaming version of this algorithm without any extra derivation. We buffer some data, fit the beta, then assume that these summary statistics of the difference of digammas are good enough to describe the points we’ve seen so far. We then discard those points, and keep only the current beta and a weight to reflect how much data we’ve already processed. This beta and weight posterior for our current fit become the prior for fitting the next buffer of data.
This is the advantage of this approach over integrating out p and optimizing the likelihood function directly. It allows you to express an intuitive (approximate) prior for what the beta should look like, and permits a streaming version by describing your current fit in the same form as that approximate prior.
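The whole regularized EM fit described above can be sketched as follows. This is not the C++ code linked above: the digamma and trigamma here are crude finite differences of `lgamma` (a stand-in for a proper implementation), and the `inv_digamma` initialization follows the scheme in Minka’s paper.

```python
import math

H = 1e-5

def digamma(x):
    # Finite-difference derivative of log-gamma; fine for a sketch.
    return (math.lgamma(x + H) - math.lgamma(x - H)) / (2 * H)

def trigamma(x):
    return (math.lgamma(x + H) - 2 * math.lgamma(x) + math.lgamma(x - H)) / (H * H)

def inv_digamma(y):
    # Newton iteration; initialization as in Minka's appendix.
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649)
    for _ in range(20):
        x -= (digamma(x) - y) / trigamma(x)
    return x

def fit_beta_binomial(data, prior_alpha=1.0, prior_beta=1.0,
                      prior_weight=1.0, iterations=100):
    # data: list of (k, n) pairs. prior_* define the fake "example beta"
    # pseudo-sample that regularizes the fit and forces convergence even
    # when every observed binomial is identical.
    a, b = prior_alpha, prior_beta
    for _ in range(iterations):
        # E step: average the difference-of-digammas statistic over the
        # posterior beta of each sample, plus the fake sample.
        s1 = prior_weight * (digamma(prior_alpha) - digamma(prior_alpha + prior_beta))
        s2 = prior_weight * (digamma(prior_beta) - digamma(prior_alpha + prior_beta))
        for k, n in data:
            s1 += digamma(k + a) - digamma(n + a + b)
            s2 += digamma(n - k + b) - digamma(n + a + b)
        total = len(data) + prior_weight
        s1 /= total
        s2 /= total
        # M step: solve digamma(a) - digamma(a + b) = s1 and
        #         digamma(b) - digamma(a + b) = s2 by fixed point.
        for _ in range(5):
            a = inv_digamma(s1 + digamma(a + b))
            b = inv_digamma(s2 + digamma(a + b))
    return a, b
```

Note how the degenerate case (all samples identical) now stays finite: the fake sample keeps the average statistic away from the value that would send α and β to infinity.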
In these charts I’ve simulated what real data might look like. In order to see the difference between them, there are two sets of plots, one for a beta-binomial, and one for a binomial.
Beta-Binomial
Binomial
This is repeated 10,000 times. Both distributions have the same mean. The interesting pattern is in their spreads.
In the Beta-Binomial, the distribution continues to spread out as $n$ increases. In the binomial case, it stays tight around the mean.
As $n$ increases for a beta-binomial, the distribution of $k/n$ converges to the underlying beta. As $n$ increases for a binomial, $k/n$ converges to a point at $p$.
These graphs are here solely to demonstrate the futility of understanding these distributions through a histogram. It is a very poor tool for understanding a fraction computed over varying $n$.
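The contrast is easy to reproduce in simulation. The parameters below are illustrative, not necessarily what the charts used: a Beta(2, 5) mixing distribution and a fixed $p = 2/7$, chosen only so both samplers share the same mean.

```python
import random


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def binomial_ratio(n, p=2 / 7):
    # k/n with a fixed success probability p
    return sum(random.random() < p for _ in range(n)) / n


def beta_binomial_ratio(n, a=2.0, b=5.0):
    # draw p from the underlying beta first, then flip n coins with it
    p = random.betavariate(a, b)
    return sum(random.random() < p for _ in range(n)) / n


def spread(sampler, n, trials=4000):
    # variance of k/n across many simulated runs
    return variance([sampler(n) for _ in range(trials)])
```

As $n$ grows, the binomial’s spread of $k/n$ shrinks toward zero while the beta-binomial’s settles at the variance of the underlying beta.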
Worry more about what you are measuring than how you are measuring it. Getting the math right is a secondary consideration. You do this stuff after you have something simple that basically works. Break out the big guns to make it better, not in the hopes that it will make something work if the stupid version doesn’t already do pretty well.
For instance, in this article I’ve assumed that the ratio of upvotes to votes is the relevant measure of quality, but it may not be the best. It might be more interesting to measure the ratio of upvotes to impressions. Or maybe to measure independently the ratio of upvotes to all votes, and the ratio of all votes to impressions. There are many aspects of a user that you could (and should) attempt to describe in a complete system.
However, whatever aspect you are measuring, if you can express it as a fraction, chances are this is the technique you’ll want to use to describe it.
Guess which of these graphs shows the normal/gaussian distribution.
The correct answer is A. The rest are the logistic, Cauchy, and beta distributions, respectively.
Even if you picked it out correctly, it isn’t easy to figure out. All of them look like “bell curves,” they are all symmetric, they all taper off at what looks like a similar rate. In that case, why do we care about the difference between them? Why don’t we just pick whichever one is most mathematically convenient and not worry about it?
The difference lies in their tails, the parts of the graph where the plot disappears into the x axis. That difference, imperceptible in these graphs, makes these distributions behave incredibly differently, as I’ll show below.
Probabilities are very rarely added together, and probability distributions even more rarely. The basic operation of probability is multiplication. This arises from Bayes’ Rule, which describes the relationship between the joint and conditional distributions of random variables: $P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$.
This makes it important to understand how each distribution combines via Bayes’ Rule with other distributions; that is, how it multiplies.
Before you look at the graphs below, try to guess what each one will look like. I’ve made two copies of each distribution, shifted their centers to -4 and +4, and multiplied the two together. This simulates having two sources of evidence about the same variable that substantially disagree.
Despite the fact that these distributions had very similar looking shapes, their products are entirely different. The distributions are shifted so that the center of one is 8 standard deviations from the center of the other, well out in the range where the plots are indistinguishable from the x axis. Clearly there’s a lot going on in this invisible part of the graph!
What should we plot to give us an intuition about this? The basic operation shouldn’t be so surprising!
The log function goes to negative infinity as its argument approaches 0. This makes it a perfect choice for visualizing very small values of a function by expanding their range. These plots show the log of each probability distribution, and they are all easily distinguishable at a glance. They also make it plausible to predict how the distributions will multiply together. Since multiplying densities is equivalent to adding their logs, all we have to do is figure out what happens when you add two of these curves to each other.
And as you can see, the logs of the products of the distributions make a lot more sense.
These shapes still aren’t immediately obvious, but you’ve got a fair shot at guessing them from the log plots alone. If all you had was the probability density plots, you wouldn’t have any recourse but to go immediately to the math.
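You can check the multiplication numerically without any plots. This sketch adds two shifted log densities (standard normal and standard Cauchy, centers at ±4 as above) and looks for local maxima on a grid.

```python
import math


def log_normal(x, mu):
    # log density of a unit-variance gaussian centered at mu
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)


def log_cauchy(x, mu):
    # log density of a standard Cauchy centered at mu
    return -math.log(math.pi) - math.log(1 + (x - mu) ** 2)


def product_peaks(logpdf, centers=(-4, 4)):
    # multiplying densities = adding log densities; find local maxima on a grid
    grid = [i / 100 for i in range(-800, 801)]
    vals = [sum(logpdf(x, mu) for mu in centers) for x in grid]
    return [grid[i] for i in range(1, len(vals) - 1)
            if vals[i - 1] < vals[i] > vals[i + 1]]
```

The gaussian product has a single peak midway between the centers, while the Cauchy product stays bimodal, with a peak just inside each original center.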
Log probability distributions also show up when you are fitting a model to data. Regardless of whether you are using Bayesian or Frequentist methods, fitting a distribution to a set of data is going to involve maximizing a likelihood function (potentially with some additional multiplied terms.) You’ll be selecting the distribution by maximizing $P(x \mid \theta)\,r(\theta)$, where $x$ is the vector of data, $\theta$ is the vector of parameters, and $r(\theta)$ is a regularization term, typically a prior distribution for $\theta$. Again we find that we’re dealing with a product of probabilities.
The first step of maximizing a function is typically to take its derivative. To make the derivative simpler, it’s easier to work with a sum of many terms than a product of many terms. Since log is monotonic, maximizing a nonnegative function is equivalent to maximizing its log, so we can instead maximize the log probability and turn the product into a sum. We’ll be maximizing $\log P(x \mid \theta) + \log r(\theta) = \sum_i \log P(x_i \mid \theta) + \log r(\theta)$. The $\log P(x_i \mid \theta)$ term is exactly what we’ve been plotting. The log probability density arises for both mathematical convenience and intuitive convenience!
To determine what impact each sample will have on the likelihood, we can just read off the value from the log plot. Consider an outlier that is well away from the rest of the distribution. For a gaussian, the impact on the log likelihood will be proportional to the square of the outlier’s distance from the center. For a logistic distribution, the impact will be roughly proportional to the distance itself. This is what is meant when the gaussian is described as “sensitive to outliers.” Samples far away from the mean will yank the whole fit around because they have such a large impact on the likelihood.
Log probabilities pervade Information Theory as well. The negative log probability is called the “surprisal,” and every other quantity is defined in terms of an expected surprisal.
How much all this matters depends on what type of question you want to answer with your model. If you just want to know where the bulk of the data is, then your choice of distribution doesn’t matter that much. If you are trying to determine the probability of exceptional events, events far away from the mean, then the size of the tails is the only thing that matters. An event six standard deviations from the center of a logistic distribution is several thousand times more likely than an event six standard deviations from the center of a gaussian distribution.
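The several-thousand-times claim is easy to verify numerically. Putting both densities on a unit-variance scale and comparing them six standard deviations out:

```python
import math


def normal_pdf(x):
    # unit-variance gaussian density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def logistic_pdf(x):
    s = math.sqrt(3) / math.pi  # scale that gives the logistic unit variance
    e = math.exp(-x / s)
    return e / (s * (1 + e) ** 2)


ratio = logistic_pdf(6) / normal_pdf(6)
```

The ratio comes out in the mid thousands, matching the claim above.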
Here are some real world examples where answering an important question requires estimating the probability of an exceptional event.
A professional chess player has an off game and loses to an amateur. How much should their rating decrease?
The United States Chess Federation has switched from the gaussian distribution to the logistic in order to make ratings more stable.
A widely sold security is backed by a large number of mortgages with a low rate of default. What rating should this security receive?
The financial crisis that brought the world economy to its knees was caused largely by bad statistics. Analysts assumed that mortgage defaults are independent events, and thus that the total number of defaults in a collection of mortgages is normally distributed, with very small tails. In reality, the state of the housing market and the overall economy ties defaults together, so large numbers of them defaulting at once is much more likely than a normal distribution would predict. This left the world’s financial institutions completely unprepared when large numbers of them did default at once.
What are the chances of a magnitude 9.5 earthquake hitting San Francisco in the next 10 years?
The Gutenberg-Richter law describes the probability distribution of earthquakes of different magnitudes. The log probability of an earthquake is linear in its magnitude. Disagreements about the slope of that line make an exponential difference in the likelihood of a large earthquake.
A climate model predicts a 1°C increase in global mean temperature. If the climate were to get several standard deviations hotter than that, the soil would lose the ability to hold moisture, and terrestrial plant life would end. How likely is the end of the world?
Unfortunately for us, climate outcomes are fat-tailed.
Suppose someone tells you that the sun will come up tomorrow. Unless you’re in the depths of depression, this probably isn’t surprising. You already knew the sun would come up tomorrow, so they didn’t really give you much information. Now suppose someone tells you that aliens have just landed on earth and have picked you to be the spokesperson for the human race. This is probably pretty surprising, and they’ve given you a lot of information. You’d be pretty pissed off if nobody told you. The rarer something is, the more you’ve learned if you discover that it happened.
Suppose you have an eight-sided die with the sides labelled A B C D E F G H. Suppose you want to record a series of rolls on a computer. How many bits are required to encode each roll? Well, there are 8 possibilities that are all equally likely, and with $n$ bits you can encode $2^n$ possibilities, so this requires $\log_2 8 = 3$ bits per roll.
Now, suppose someone tells you that they rolled a vowel. How many bits does it take to encode that roll, now that you already know that the roll was a vowel? There are only two possibilities, A and E, so you can store that roll in $\log_2 2 = 1$ bit. They’ve just saved you 2 bits by giving you some information.
This is the central thing to understand about information theory. There’s some randomness whose outcome is encoded in bits. Someone tells you something about an outcome, and as a result you can store it in fewer bits. The difference in the number of bits is the amount of information you learned. In this case, knowing that they rolled a vowel saves you two bits, so that’s how much information they gave you.
Now what happens if someone tells you that they didn’t roll a vowel? Now there are 6 possible outcomes, which isn’t a power of 2, so we can’t trivially map it to some number of bits and we have to start doing math.
Suppose we’re encoding rolls of a die with many sides, and each possible roll is no longer equally likely. How many bits should we use to store each roll? Intuitively we’d like to make the common cases shorter and the rarer cases longer, so that on average, the messages are shorter. More formally, let $L(x)$ be the length of the string of bits we use to encode outcome $x$. We would like to minimize the expected value of $L$, or $E[L] = \sum_x p(x)\,L(x)$.
What are the constraints on how short our encodings can be? We certainly can’t have more than two outcomes encoded in only 1 bit, but how can we generalize this? Think of this in terms of the fraction of the “namespace” that each outcome uses. If an outcome is encoded as 1, then you’ve already used half of the namespace. No other outcome can start its encoding with 1. Suppose you’re encoding an outcome as 11. You’re now using half-of-the half-of-the namespace. No other outcome can start with 11, but they are free to start with 0 or 10. Formally, this means that if we encode an outcome using $L$ bits, it uses $2^{-L}$ of the namespace. We now have the nice constraint that $\sum_x 2^{-L(x)} \le 1$.
Minimizing $E[L]$ under this constraint gives[1] you the optimal encoding length for each outcome $x$ as $L(x) = -\log_2 p(x)$.
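As a tiny concrete check, here is a distribution whose probabilities are exact powers of two, the lengths $-\log_2 p(x)$ they imply, and a prefix code (made up for illustration) that achieves them; the namespace constraint is met with equality:

```python
import math

# Illustrative distribution with power-of-two probabilities.
probs = {"A": 1 / 2, "B": 1 / 4, "C": 1 / 8, "D": 1 / 8}

# Optimal lengths L(x) = -log2 p(x); exact integers for these probabilities.
lengths = {s: int(-math.log2(p)) for s, p in probs.items()}

# One prefix-free code realizing those lengths: no codeword is a
# prefix of another, so each claims its own slice of the namespace.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

# Fraction of the namespace consumed: sum of 2^-L over all outcomes.
namespace_used = sum(2 ** -L for L in lengths.values())
```

Here `namespace_used` is exactly 1: the code wastes nothing, which is only possible because every probability is a power of two.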
This behaves in the way we would expect. The rarer an outcome, the longer its encoding, and if all $n$ of the outcomes are equally likely, we give each one an encoding of length $\log_2 n$ just like we’re used to. Now that we know the optimal length of each encoded outcome, what’s the expected encoding length for an event? It’s $E[L] = -\sum_x p(x) \log_2 p(x)$.
We call this quantity the “entropy” of the distribution $P$, denoted $H(P)$. Mathematically it is identical to the quantity of the same name from thermodynamics, and you can think of it as a measure of the “spread” or “disorder” of a probability distribution. It has its peak value when all outcomes are equally likely, and its minimum value when there is only one possible outcome. Careful readers will notice that $p \log_2 p$ is undefined when $p = 0$. Thankfully though, the limit at 0 is 0. This leads to the interesting result that if there is one outcome that is absolutely certain to happen, you can encode it in 0 bits. There are some things so obvious (certain) that they aren’t worth saying at all!
Back now to our eight-sided die. The entropy of the original die roll is 3, and the entropy of the die roll if we know that the letter rolled is a vowel is 1. What’s the entropy of the die roll if we know the letter rolled is not a vowel? With six equally likely outcomes, it comes out to $\log_2 6 \approx 2.58$ bits.
Now we have something whose unit is “bits” but whose value includes fractions of a bit. What can we do with this? After all, if we’re only storing one roll, we still need 3 bits to store 6 possibilities. The trick is that we can use fewer bits if we are storing more rolls at once. There are 2 wasted possibilities in those 3 bits we used for the first roll, and if you’re clever, you can use those to encode some information about the next roll. If we’re clever enough, and storing enough rolls at once, 2.58… bits per roll is the lower bound that an optimal compression scheme will converge to.
So if someone tells us that the roll isn’t a vowel, they’ve given us 3 − 2.585 ≈ 0.415 bits of information. Consider that if they ruled out half of the possibilities, they’d have given us 1 bit. Since they ruled out less than half of the possibilities, it makes sense that they gave us less than one bit.
Suppose now that someone has agreed to tell us whether or not the roll is a vowel, but we don’t know in advance which it will be. What’s the expected value of the information they will give us? The roll will be a vowel 2/8 of the time, and not-a-vowel 6/8 of the time. So, take the expectation of how much information they give us in each case: $\frac{2}{8} \cdot 2 + \frac{6}{8} \cdot 0.415 \approx 0.81$. On average, they will give us about 0.8 bits of information about each roll. This quantity we’ve just computed is called the “information gain” of knowing whether the roll is a vowel.
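The die arithmetic above can be sketched in a few lines, using the standard entropy formula:

```python
import math


def entropy(probs):
    # H = -sum p log2 p, with p == 0 terms contributing 0
    return -sum(p * math.log2(p) for p in probs if p > 0)


H_roll = entropy([1 / 8] * 8)      # 3 bits: the unconditioned roll
H_if_vowel = entropy([1 / 2] * 2)  # 1 bit: only A or E remain
H_if_not = entropy([1 / 6] * 6)    # log2(6) ~ 2.58 bits: B C D F G H

# Expected information from learning whether the roll is a vowel.
info_gain = H_roll - (2 / 8) * H_if_vowel - (6 / 8) * H_if_not
```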
So more formally, if $R$ is the random variable indicating our roll and $V$ indicates whether it was a vowel, we’ve computed $H(R) = 3$. It takes 3 bits to store each roll if we know nothing in advance. We’ve also computed $H(R \mid V{=}\text{true}) = 1$ and $H(R \mid V{=}\text{false}) \approx 2.58$; the entropy of the roll given that we know the roll was a vowel, and the entropy of the roll given that we know the roll wasn’t a vowel, respectively. What we’ve computed above is $H(R) - P(V{=}\text{true})\,H(R \mid V{=}\text{true}) - P(V{=}\text{false})\,H(R \mid V{=}\text{false})$.
This formula is the form of information gain for any two random variables; just sum it over all values of the variable you are conditioning on (in this case, $V$ is either true or false.) Now, let’s take the information gain, distribute the probabilities, and rearrange a little bit.
We have the original entropy of $R$ minus the expected value of the entropy of $R$ given that we know the value of $V$. This second term we call the “conditional entropy,” and it is denoted $H(R \mid V)$. You can think of conditional entropy as the expected number of bits that are “left” in $R$ once you know $V$.
Information gain is an extremely useful quantity in machine learning. It tells you how much value your classifier could possibly extract out of using a given feature by itself, and is commonly used for feature selection. Anytime you need to sort anything, sorting by information gain/KL-divergence or G-test score will almost certainly give you great results.
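For two discrete random variables in general, information gain is short to implement. Here is a sketch that estimates it from paired samples (the function names are mine, not from any particular library):

```python
import math
from collections import Counter


def entropy_of_counts(counts):
    # entropy of the empirical distribution implied by a list of counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(pairs):
    """IG(R, V) = H(R) - sum_v P(V=v) H(R | V=v), from (r, v) samples."""
    pairs = list(pairs)
    n = len(pairs)
    h_r = entropy_of_counts(Counter(r for r, _ in pairs).values())
    h_cond = 0.0
    for v, n_v in Counter(v for _, v in pairs).items():
        sub = Counter(r for r, vv in pairs if vv == v).values()
        h_cond += (n_v / n) * entropy_of_counts(sub)
    return h_r - h_cond
```

Running it on the eight-letter die with an is-it-a-vowel feature recovers the roughly 0.81 bits computed earlier.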
Shall I walk, or shall I ride?
“Ride,” Pleasure said.
“Walk,” Joy replied.
-W.H. Davies
I’ve been backpacking for most of my life. I started with my family when I was 8, and I’ve learned a lot over the years about how to do things.
The purpose of this document is to be a collection of everything I think it is useful to know when you are backpacking. You do not need to know all of this. However, the suggestions in here will allow you to be safer, more comfortable, and hopefully to have more fun. If this is your first time backpacking and all this information is intimidating, stick to the one-line summaries, the gear list, and the things in bold. Most mistakes are more likely to build a little character than to put you in real danger. You’ll be fine.
I’ve tried in this guide to provide low cost options throughout. Backpacking doesn’t have to be and shouldn’t be an expensive hobby. Once you have the essential gear, backpacking is the cheapest way of vacationing that I’m aware of.
Over the years I’ve found that backpacking is one of the most enjoyable things that I do. I count several of my backpacking trips among the best experiences of my life. When you go backpacking you’ll get a lot of exercise, see some amazing things, share a lot of great experiences, and make a lot of memories. I hope this document will help people to learn to love it as much as I do.
One line summary: Don’t wear cotton if you can help it. Get good boots that fit well and break them in. Wear wool socks and polypropylene liners. For warmth, dress in layers. Bring a knit or fleece hat and something that will keep you dry.
When allocating your backpacking budget, boots should be your first priority. A good pair of boots is expensive, but there are a few reasons that I advise spending money on them. First off, a good pair of boots is usable in many situations other than backpacking. If you decide at some point that backpacking isn’t for you, you can still use them for tramping through snow to work or school, yard work, wood working, construction, camping, hiking, or any time your feet need protecting. A good pair of boots allows you to do a wide range of things safely, and it’s good to have one around. Secondly, if you take care of them and don’t lose them, a good pair of boots will last you 10+ years. The pair of boots I bought when I was 13 lasted me through my first year of college (when I lost them.) Thirdly, no matter where or under what conditions you are hiking, the quality of your boots will have a big impact on how much you enjoy yourself. If your boots don’t work well for you, you’re likely to be wincing along with blistered heels, soggy feet, and in danger of spraining your ankles. If your feet are still growing I’d recommend borrowing a pair, or buying them off someone who has grown out of them. There are other things where it is worth it to spend money, but boots are the least borrowable.
Everyone has their own personal taste in boots, but I’ll tell you the essential baseline of things to look for and then my own preferences. The most important quality of a backpacking boot is that it fit well. When walking downhill your feet shouldn’t slide forward much. If they do, the tips of your toes are going to be hurting from banging into the front of your boots. When walking uphill, your heels shouldn’t slide up much. If they do, you are likely to end up with blistered heels from the rubbing back and forth. Secondly, the boots should cover your ankles with stiff material. You should still be able to make circles with your foot, but it should be nearly impossible for you to roll your ankles. Some people prefer a lower top boot, but unless you are also investing in trekking poles I don’t think this is safe. Thirdly, the boot should have a thick sole that will keep you from feeling bumps underneath it and will grip rocks. Some of these recommendations are controversial, particularly among a movement known as ultralight backpacking. Briefly, the central idea of the movement is that if you are rigorous about minimizing pack weight, you can get by with much lighter shoes, even tennis shoes or sandals. I’ve chosen not to cover that in this document because I think it is very difficult for a beginner to do inexpensively and safely. It is also not my preferred way of backpacking because you end up sacrificing a lot of comfort and convenience in order to reduce the weight of your pack. However, some people find that it makes their experience a lot more enjoyable, and it is worth learning about once you have been on a few trips. Eric’s Ultralight Backpacking Page is a good resource on the subject.
These days when I look for boots, the primary thing I look for is a boot with very few seams. This is for two reasons, waterproofing and durability. Treated leather repels water well on its own, but more seams make more opportunities for water to get in, and make it more difficult to waterproof the boot. You’ll inevitably slip into a stream at some point, and a well waxed boot with a high top will minimize the amount of water sloshing around your feet for the rest of the day. The two things that wear out first on a boot are the seams and the soles. Boots can be resoled, and seams can be glued, but the boot will last longer if there are fewer seams to begin with. Leather can take a lot more abuse than whatever they sew it up with. There are a lot of boots out there that are made of porous fabrics, and advertise all sorts of features. These are expensive, and frankly I don’t think they last. Your boots are going to get scraped across rocks, soaked in water, caked with dust and mud, pummeled, squished, frozen, and potentially even eaten by wild animals (no joke.) Fabric that advertises being “breathable” just isn’t going to cut it long term, and won’t buy you much comfort after its first few days on the trail. So, get the sturdiest stuff available. Gore-tex can help on the inside, but make sure the outside is leather. I wear a pair of Lowa Banffs that I bought a few years ago after losing my previous pair. My mom wears a pair of Gore-tex lined Vasque Skywalker leather boots that are over ten years old (similar to these.)
If you can’t afford expensive boots, don’t sweat it. If your boots fit well you’ll be fine.
When you buy a new pair of boots, don’t immediately take them out on the trail. The boots will take some time to conform to your feet even if they fit well, so break them in ahead of time. Wear them instead of your normal shoes for the week prior to the trip. If you are going on a long trip with them, take them out for a weekend trip beforehand. This helps to prevent blisters and can also catch any manufacturing flaws early. Six months before my first 10 day backpacking trip I bought a really nice pair of boots, and went on a weekend hike to prepare for the trip. After a few miles of hiking one of the grommets tore out of the leather. I got through the weekend ok, but it would have been much harder to deal with it for 9 days of hiking. (My friends got a hearty laugh out of this, because I was actually bragging about how awesome my new boots were when they broke and sent me flying flat on my face.)
To care for your boots, wipe the dirt off with a wet towel and oil them with a product such as Nikwax.
Wear a pair of wool socks over a pair of polypropylene liner socks. Don’t wear cotton. The wool and polypropylene will keep your feet dry and cushioned. The liners will absorb most of the friction of your foot moving around and help prevent blisters. You’ll need 2 pairs each of these for a weekend, 3-4 for a longer trip. If you want to spend money get SmartWools. Otherwise you can find wool socks that aren’t as comfy but also aren’t as expensive. If you hike a day in cotton socks they’ll end up as a moist, pulpy, off-white, smelly mass, and you’ll probably have blisters.
Dress for the season here, but don’t wear cotton if you can help it. The evenings always get much colder in the countryside than in the cities, and much colder than that when you are up in the mountains. As such, it’s good to have a pair of long pants with you even in the middle of summer if you are going to be at any elevation. If you want to spend money, I recommend getting a pair of quick-drying zip-off nylon pants from REI or similar. Otherwise, you can often find pants that are part polyester at cheap clothing stores such as Walmart or Value City. It is convenient that polyester is considered a “lesser” material to make things out of, because it makes durable faster-drying pants cheap to buy. Unfortunately it seems that polyester is now too cheap even for Walmart, and it’s been harder to find them. If this happens to you, try a thrift store. Polyester dress pants work better than you would expect. For shorts, a pair of mesh running shorts will do. Shorts are there mostly to provide a little bit of sun protection. Don’t bring jeans. They are heavy, bulky, and they will never dry.
Dress for the season, but remember that no matter where you are, it will probably be colder at night than you expect. Bring at least a long-sleeved shirt. Avoid cotton. It dries slowly, and is useless for warmth when it is wet. Fleece, wool, polypropylene, or any synthetic fabric work great. To find non-cotton you can buy expensive stuff at REI or Patagonia, or cheap stuff at Value City or Walmart. For colder weather, the key for warmth is to dress in layers. This allows you to regulate your temperature over the course of the day. The clothes that make you comfortable when you are standing still will be too warm when you are walking. The clothes that keep you comfortable at camp in the afternoon will be too cold once the sun goes down. When I’m backpacking in the winter, I wear a polypropylene T-shirt, a long-sleeved acrylic shirt that I bought at Value City, a fleece vest, a thin fleece coat, and a nylon windbreaker/rain jacket. I usually end up taking off the windbreaker and outer layer of fleece just before I start hiking, and the vest later on.
The most economical option here is a nylon poncho. They are effective, lightweight, and very cheap. For summer backpacking, I wouldn’t use anything else. They are less useful in wind, so find one with snaps along the side. For winter backpacking you’re better off with a raincoat and rainpants. With a poncho your arms and calves will get wet, and this is more of an issue in the winter.
Packs are expensive. For your first few trips, borrow one from a friend, and have them help you adjust it for your body. If you are looking to buy one I can’t give much advice, because I’ve been using the same dilapidated pack for 11 years now.
When the pack has a load in it, nearly all of the weight should be on your hips and sacrum. The shoulder straps are mostly there to keep the pack vertical and close to you. If the pack doesn’t need anything moved around to accommodate you, you can fit it pretty well by just tightening the straps in the right order. Start by putting some weight in the pack and tighten down the pack around the load so that it isn’t moving around. There are three main points of adjustment, hip belt, shoulder straps, and load lifters. The load lifters are the two straps that run from the shoulders up to the top of the pack. Loosen up all of these straps so that the pack hangs limply from your shoulders. Lift your shoulders so that the hip belt is at the level of your hips, fasten it, and tighten it around your hip bones. Next, tighten the shoulder straps so that they touch your shoulders solidly around all sides. Your shoulders shouldn’t be carrying any of the weight. Finally, tighten down the load lifters until the pack presses up against your back.
For anything other than middle of summer camping at low elevation, buy or borrow a down-filled mummy bag. It should be rated for at least 30°F. Mummy bags have a hood that extends up over your head and allow you to close off the bag such that nothing but your nose and mouth is exposed. Synthetic fillers work fine but I’ve found them to deteriorate over time much more than down does. If the night time temperature will be warmer than 60°F you can bring just sheets or a blanket. The tent will keep you pretty warm on its own. When you are packing your sleeping bag, line your stuff sack with a garbage bag. Stuff in your sleeping bag and then fold over the garbage bag before you tighten down the drawstring. This is very important for keeping your sleeping bag dry.
A sleeping pad will keep you from feeling the rocks underneath your tent, and will insulate you from the cold ground. It is essential if it will be chilly at night. Your sleeping bag’s filler will be compacted under your weight so it will only protect the top of you. If you have the money to spare, therm-a-rest self inflating pads are great. If you want something cheaper but not as comfortable, ridge rests aren’t bad. You can also just try to find a piece of soft eggshell foam from a packing crate and use that.
Tents are also expensive, but very borrowable. Make sure you’ve got a rain fly and a ground cloth. Generally you want something small and lightweight. You don’t need more floor space than your bodies will take up for sleeping since you’ll keep your pack outside. If you are going somewhere rocky try to bring a tent that stands up on its own when it isn’t staked in. It will often be hard to find good places to put stakes, and you don’t want to have to rely on them if you can help it. I’ve been using the Half Dome 2 HC tent from REI for the last two years, and found it to work really well. Eventually you can get more adventurous.
There are a few things that you should have with you every time you are miles out in the woods, whether you are backpacking or not.
In addition to the outdoor essentials above, you’ll need some additional items.
Pack your clothes inside of a garbage bag with the top folded over. Put the heaviest items closer to the small of your back. Keep your rain gear, snacks, and whistle easily available. Use ziplocks and stuff sacks liberally. Keep all of your smellable things in a small number of places so you don’t miss any when it’s time to pack up things in the bear keg/bag.
Once all of this gear is in your pack, and you’ve filled up your water bottles, it should definitely weigh less than 1/3 of your body weight, and hopefully less than 1/4.
Calories calories calories! The general principle here is that you want the most calories per pound spread through a variety of food groups. Calories per unit volume is also a concern. Eat a lot of protein and carbohydrates, and make sure that something you are eating regularly has salt in it. You’ll sweat a lot of it out. Most food packaging is bulkier than it needs to be, and less waterproof than you want it to be. Before you go, divide up your food and repackage it into zip-lock bags.
It’s nice to cook group meals for dinner because it means fewer dishes to do and generally less weight to carry. My general strategy for good food on the trail is to have the bulk of the meal be lightweight and calorie dense, then have some heavier but fresher ingredient to add. All of the “one big pot” meals can be improved by adding a can of chicken, tuna, or corn. Also consider adding sun-dried tomatoes, mushrooms, etc. Most dishes require you to heat up a big pot of water, so this is a good opportunity to boil untreated water to sterilize it rather than filtering it. Bring it to a rolling boil for 1 minute. [1]
Never step on anything you can step over. Never step over anything you can step around.
When crossing a stream, unfasten your hip belt. If you fall in, you don’t want your pack to drag you downstream.
If you’ve got a heavy pair of backpacking boots, you don’t need to worry about your ankles. The thick pieces of leather on either side of them will keep them from rolling, so you can mistreat them without maiming yourself. This allows you to walk in different and more efficient ways than you otherwise would, particularly when going downhill. You can effectively just let yourself fall forward and catch yourself with a bouncy step. It’s hard on your knees, but very fast, and very energy efficient. If you want to avoid joint trouble or feel off balance with 40 pounds on your back, buy a trekking pole (expensive) or find a sturdy stick in the woods (cheap) and take the downhills slower.
I’ve found that 1 mile per hour is a consistently good estimate of how long it takes to get somewhere with a group of mixed experience, including breaks and meals. If there are first-time backpackers in your group, don’t count on going faster.
When you get tired, eat and drink.
Having a map and compass won’t do you much good unless you know how to use them. Your compass should have three parts: a rectangular base, a turnable circular plastic housing, and a magnetic needle inside the housing. The rectangular base should have an arrow on it in the center of one edge that is parallel to the sides. The circular housing should be marked with degrees. I’ll give you a quick run-through of how to orient your map.
First, learn the magnetic declination for your region of the country. True north differs from magnetic north, so to compensate for this you have to offset your compass by a certain number of degrees. Take the circular plastic housing and rotate it so that the arrow on the rectangular base lines up with your magnetic declination. I’m in the Bay Area in California at the moment, so when I want to compensate for the declination, I rotate the housing so that the arrow on the base lines up with 344° (360° – 16°). When I’m in Cincinnati, Ohio I rotate it so that the arrow lines up with the 3° mark.
On the circular plastic housing you should see a hollow arrow that often looks like a house. Now that the declination is set, hold your compass level and rotate it in your hand until the red part of the needle floats inside of that house. The arrow on the rectangular base is now pointing north. Use this to orient your map. Figure out which direction on the map is north (usually up) and line that up with the arrow on the base of your compass, keeping the needle inside of the house. Look at the features on the map and line them up with the features you can see around you to figure out where you are and where you need to go.
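The declination arithmetic above is easy to get backwards, so here is a tiny sketch of it in code. The function name and the sign convention (positive degrees for east declination, negative for west) are my own choices, not anything standard from the text:

```python
# Toy sketch of the declination offset described above.
# Assumed convention: declination in degrees, positive = east, negative = west.

def dial_setting(declination_east: float) -> float:
    """Degree mark to line the base arrow up with so it points true north."""
    return (-declination_east) % 360

print(dial_setting(16))   # Bay Area, ~16° east -> 344
print(dial_setting(-3))   # Cincinnati, ~3° west -> 3
```

The modulo just wraps the answer back into the 0–360° range of the housing, matching the 344° and 3° settings worked out above.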
Set up your tent at least 15 feet from the campfire, and 100 feet from water and the trail. Find a level spot and look at the dirt to make sure it isn’t a place where water pools or flows through when it rains. Areas of open ground surrounded by deposits of buoyant leaves and bark are a tell-tale sign. (I’ve woken up in a pool of water before, and it isn’t fun.)
Make absolutely sure that your ground-cloth is covered by your tent and no part of it is sticking out from under it. This is a very common mistake. If rain hits your ground-cloth it will pool under your tent.
Staying warm at night is a fine art with some counter-intuitive properties, even with a good sleeping bag. The key thing to understand is this: the best insulator is dry, immobilized air. Your sleeping bag will only insulate you in the places where it is fluffy because the stationary air inside of the filler is what is doing the work. To use it most effectively, try to lie as straight and pencil-thin as you can and let it fluff up around you. If part of you is pressing up against the side of the bag, it will compact the filler in that spot leaving just the two thin layers of nylon, and the heat will leak out. It may feel warmest at first to curl up in a ball, but you’ll end up colder in the middle of the night than if you’d laid out flat and suffered through the cold for the first few minutes. The increased fluffiness of your bag will make up for your increased surface area. Equally counter-intuitive is the effect of wearing clothes to bed. You’ll feel like bundling up in your thickest coat, but if you have a good sleeping bag you’ll find that you are warmest wearing a hat, socks, and as little else as is appropriate. The cause of this is dampness. You’ll sweat throughout the night, and your sleeping bag breathes better than your clothes, letting it evaporate. Trust your sleeping-bag and you’ll wake up dry, warm, and comfortable instead of damp, cold, and sticky.
Find wood of all different sizes, from stuff the size of matchsticks to stuff the size of your arms. Generally the driest wood will be up off of the ground. If it is down amongst the leaves it is likely to be rotting and soggy. Look for dead branches of trees or fallen trees that have branches still sticking up in the air. Large pieces of wetter wood are ok, but it is essential that the small stuff be as dry as you can find. The big pieces can dry out on the fire once you are confident that it won’t go out. Also, you’ll need something to start your fire with. The most reliable thing I’ve found for this is dry grass. It lights immediately, and burns just long enough to get your matchstick sized pieces of wood lit.
Build fires only in an existing fire ring if you can find one. A fire kills the soil underneath so that nothing can grow there. Outside of a fire ring, a fire can smoulder underground when you think you have put it out, and become a potential forest fire in dry weather. Pay attention to posted guidelines about what kinds of fires are allowed during the summer. In many areas, only campstove fires are permitted.
I’m going to tell you how to build a “lean-to” style of fire here, because I’ve found those to be the most reliable. Find a dry log 2-4 inches in diameter and lay this in your fire ring. Take your dry grass and crush it into a loose ball about the size of your fist or a little smaller. Place the ball of dry grass next to the log. Now take a few of your driest and thinnest pieces of wood and lean them up against your log with the grass underneath. Make sure there is plenty of room for air to flow through, and for you to fit your match in. Have many more pieces of dry thin wood available because you will need to throw them on quickly.
Now you are ready to light your fire. Light your match and cup your other hand around it to protect it from the wind. Hold it at a slight downward angle for a second or two so the flame can travel up the matchstick. Don’t move the match until at least a quarter inch of the matchstick is on fire. Bring the match down to your ball of dry grass, and slide it underneath. The grass should light immediately, and you’ll need to act quickly now to take advantage of it. Throw small pieces of dry wood on the areas with the most flame, being careful not to smother the fire. Once your wood has reliably caught fire, put on progressively larger pieces. You’ll get an intuitive sense for the rate to do this pretty quickly. Pretty soon you’ll have a roaring little fire, and you can start being more careless with what you put on it.
Before you go to bed make sure that your fire is completely out. Pour water on it, and stir the wet coals. Turn off your flashlights to look for coals that are still burning.
“Take nothing but pictures. Leave nothing but footprints.”
When you use the bathroom, do it 100 feet from your campsite, and 100 feet from water. Dig a hole at least six inches deep to poop in and bury it when you are done. Pack out any trash or extra food, even biodegradable things like apple cores or banana peels. These things attract animals, and you don’t want them around your campsite.
If you are camping for more than a weekend, you’ll need to thoroughly clean your dishes to avoid getting sick. Start by licking them as clean as you possibly can. Next, pour water in your bowl, and swirl it around to pick up as much of the remaining food as possible, then drink the water. (This is kinda disgusting to do, so lick your dishes well. It’s only really necessary in high traffic areas, but is a good thing to know how to do regardless.) Heat up a pot of water with a few drops of Camp Suds in it and scrub your dishes with the slightly soapy water. Pour the waste water through a strainer 100 feet away from the campsite. You’ll be packing out whatever is left in the strainer inside of zip-lock bags. Finally, dip your dishes in boiling water to sterilize them and remove the soap residue.
Before you leave your campsite in the morning, line up and walk from one end of the campsite to the other looking for trash or any forgotten gear.
You can happily camp without knowing any knots, but they are an essential skill for pitching a tarp, wind-proofing a tent, hanging bearbags, or setting up clotheslines.
If you are going to learn one knot, learn the taut-line hitch. This knot is good for any time you need an adjustable loop. You can slide this knot to increase the size of the loop, but it will hold under tension. This is a great knot to tie at the end of any rope you are putting a stake through. You can put the stake in the ground wherever you can find a soft spot, and then slide this knot up the rope towards your tent to tighten it against the stake.
Photograph courtesy of David J. Fred
If you are going to learn two knots, also learn the bowline. This is used for putting a fixed loop at the end of a line. To make this knot, create a small loop in the rope. Take the free end of the rope and thread it through the hole. Wrap it around the fixed bit of rope, and put it back through the hole. To remember it, think of this: “The rabbit comes out of the hole, around the tree, and back into the hole.”
If you are going to learn three knots, learn the clove hitch. A clove hitch is great any time you need a fast knot to secure a rope to a post or larger rounded object. You can use this to tie the top of a bearbag shut, and to tie the other end of the rope off to a tree.
If you’re interested in the subject and would like to learn more knots, here’s a great resource on essential knots.
One line summary: Leave notice. Stick together. Pay attention to your body. Don’t get in over your head.
Make sure friends who aren’t with you know where you are going and when they can expect you back. Leave a note on your car at the trailhead that contains emergency contact information and the date you expect to return. Get any backcountry permits you need from the local ranger station and let them know when you’ll be back and where you’ll be going.
At the very least, stay within earshot of the person before and behind you, and always stop at every intersection no matter how obvious it is which way to go. If someone gets hurt, you want to be able to come to their aid quickly, and it is surprisingly easy to get lost in your thoughts and walk off down the wrong trail without noticing it.
If you are separating from the group to go explore, try to stay in a group of four people or more. If something happens to one of you, this leaves one person to stay with the injured person and two people to go for help.
In hot dry weather, try to drink a quart of water an hour. This may seem excessive, and it probably is, but it ensures that you will never get dehydrated. If you are thirsty, you are already dehydrated. If your urine is yellow and not clear, you are already dehydrated. “Yellow and stinky, take a drinky.” No, I did not come up with that, and yes, it makes me cringe too. If the air is dry you won’t notice how much you are sweating, so it’s especially important to drink a lot of water.
Pack your clothes and your sleeping bag inside of garbage bags. If you get wet, find shelter and change into dry clothes. Wear a hat.
The sun is your enemy, especially if you are melanin-challenged like me. Put on sunscreen every day, and again at lunch if necessary. Wear a sun hat if you don’t want to put it on your face. When you are up high there’s less air between you and the sun. Sunburn sucks, and with dehydration can lead to heat exhaustion.
Or travel with someone who does. You should know how to treat hypothermia and heat exhaustion, how to treat cuts, blisters, and burns, how to stop bleeding, how to make a splint or a sling, and how to treat for shock.
When doing risky things, set up rules ahead of time and stick to them. When you are cold, tired, and at high elevation, you won’t make good decisions. At the top of Mount Everest you have the mental capacity of an 8-year-old, but you don’t have to be nearly that high to notice the effects of low oxygen. Take some time while you are warm, safe, and well-rested to make some rules about when to turn back, when to find a campsite, or whatever it is that will keep you safe. I’ll give you two examples from my trip to Hawaii that illustrate this.
In the middle of our trip, Rolli and I decided to try to hike up to the top of Mauna Loa. We slept at sea level that morning in a hostel, and by noon we had driven up 11,000 feet to the trailhead. I was already feeling the effects of the altitude when we arrived. I wasn’t very well prepared for the hike either. All I was wearing were shorts, a long-sleeved acrylic shirt, and a poncho. Within a half hour of hiking, it started to snow. We made two rules about what we were going to do, one when we started the hike, and one when we saw the first snow start to fall. First off, we agreed that no matter how far we’d gotten, we’d turn back at 2:00. This would give us enough time to get back to the car well before nightfall, even if we took longer on the return trip. When it started to snow we quickly saw that visibility would be an issue. The top of Mauna Loa is miles and miles of lava fields with few distinguishing features. The trail is marked only with big stacks of rocks called cairns. We agreed that if at any point we couldn’t see the next 3 cairns we’d head back. We didn’t end up getting to the summit, but we got to the edge of the caldera when the first rule kicked in and we turned around.
Later on in the trip, we had a 3 day hike planned to and from Waimanu Valley. We’d obtained a permit to go into the valley, but the permit was canceled due to danger of flash flooding. The campsites in Waimanu Valley are on a sandbar that separates a swamp from the ocean, and this is one of the rainiest places on earth. It hadn’t rained in two days when we arrived in the area, however, and the park service wouldn’t make it out to the valley to test the ground water until after we’d be back on the plane to Pittsburgh. We checked the stream in nearby Waipi’o Valley, and found it fordable, so we decided to go for it. We agreed beforehand that if we saw any precipitation at all, we’d each sleep half the night while the other watched for rain or rising water, and head to higher ground if necessary. It turned out that our fears were unfounded. We passed a bunch of local people on the hike in who I’m sure never bother to get permits, and there wasn’t a drop of rain while we were there. Still, deciding ahead of time what to do kept us from having to quickly make the right decision in a dangerous situation.
If there’s something that you can’t do or aren’t comfortable doing, let people know. Don’t suffer in silence, and don’t worry about holding people back. Safety is more important.
See previous section.
This is the Boy Scout motto for a reason. Know where you are going, what the weather will be like, where you can find water, and approximately where you will be camping. Double check all of your essential gear. Break in your boots ahead of time.
You’ve got wet boots, blistered feet, a sunburned face, achy joints, cold fingers, a headache, your lungs feel like they’re 2 sizes too small, and you’re the last person up every hill. Still, keep your head up. Your fellow hikers will admire you for your attitude more than anything else.
If you are hoofing it all day and getting into camp just before dark you won’t be having a lot of fun. There will be magical moments. Take the time to enjoy them. Bring a sketchpad. Keep a journal. Bask in the sun. Plan your trip so that you have time to do these things. If there are people new to backpacking in your group, don’t count on going faster than 1 mile per hour. If you are going for a week, pick your favorite campsite and plan a layover day. Leave your troubles at the trailhead and live by the sun.
You will remember things better if you have names for them. Learn the names of the peaks and the names of the animals and plants. Bring an Audubon Society Field Guide. The extra weight will be well worth it.
Hoot and Holler! Do a victory dance. Take pictures of your own grinning face in front of the vista you’ve just reached. Take off your sweaty shirt and swing it over your head. Go skinny dipping. Sunbathe (briefly). Test for echoes off of cliffs. Slide down snow patches on your butt. Climb stuff. Explore. Or, if it’s more your style, sit quietly and soak in the marvel of what’s around you.
One of the great things about college a cappella groups is that they attract people who have a love of music regardless of whether they have any formal training. Some of the people I’ve most enjoyed listening to have done most of their singing in the shower. Inevitably though, written music is essential for communicating musical ideas, and so a natural conflict arises between people’s desire to contribute creatively and their ability to coordinate the group to sing what they are thinking.
Thankfully a little guidance and music notation software puts arranging within the reach of even those who are musically illiterate. Having an instant playback of whatever you put down is invaluable. I know enough notation to get by, but I’ve never been good at reading it. In this text I intend to write up everything that worked for me over the years producing arrangements that we enjoyed singing. I originally wrote this for my group at Carnegie Mellon, so excuse the physics diversions and the in-jokes here and there.
It’s a great and soulful experience to sing with a group of friends, and it’s even better if they are singing something you wrote. Don’t be intimidated! Dive in!
Picking out notes is hard and a lot of the difficulty comes from how our ears and brains hear pitch. This will get a little technical, but knowing this will help you understand the ways that you can mishear a note, and how to correct for it.
Pitch can be defined as the sensation that our brains create when interpreting repeating patterns of sound. Contrary to what you might think, there is no physical phenomenon that directly maps to pitch. There is a rough correspondence between the frequency of a sound wave and pitch, which is what most people associate with it. Higher frequencies sound like higher pitches, and lower frequencies sound like lower pitches. The A above middle C, for instance, corresponds to a pattern of sound that repeats at 440 Hz. However, if you hear a person sing an A, there is much more to the sound than just a pure tone at 440 Hz. The folds of your throat and the shape of your mouth create resonances at other frequencies that are called harmonics. This is what allows us to hear the difference between a human, and say, a flute, even if they are playing the same note perfectly in tune. The shapes and materials of the instruments color the sound by adding harmonics to it. These effects are referred to as the “timbre” of the note.
Interestingly, studies have shown that these harmonics have more to do with our perception of the pitch of a sound than does the presence of the root frequency of the note. It’s possible to correctly hear a note as the A above middle C even when the 440 Hz frequency isn’t in the sound at all! When an instrument is resonating at a certain frequency, the harmonics that it produces have a consistent relationship with that frequency. As a result, even if you can’t hear the original frequency in the sound, your brain can still determine the pitch of the note. In a sense, the harmonics “point” to a certain root frequency that produced them. What we perceive as pitch is our brain’s guess as to the frequency that best explains the pattern of harmonics. When we hear more than one note at a time, the ways in which these harmonics interact can play with our perception of pitch in counterintuitive ways.
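The way harmonics “point” back to a root can be sketched with a toy calculation (the numbers below are my own illustration, not from the original text): strip the 440 Hz component out entirely, and the spacing of the remaining harmonics still implies a 440 Hz fundamental.

```python
# Toy sketch of the "missing fundamental": the harmonics of A440,
# with the 440 Hz root itself absent, still share 440 Hz as their
# common divisor, which is roughly the root your brain infers.
from functools import reduce
from math import gcd

harmonics = [880, 1320, 1760]  # 2nd, 3rd, and 4th harmonics of A440; no 440 Hz present
implied_root = reduce(gcd, harmonics)
print(implied_root)  # 440
```

Real pitch perception is far messier than a greatest common divisor, but it captures the idea that the pattern of harmonics, not the root frequency itself, is what determines the note you hear.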
When chords are perfectly in tune, it is difficult to mentally separate them into distinct notes. When two notes are in tune, their harmonics are also in tune, and this can confuse your brain’s pitch detection apparatus by pointing to a different note, or causing your brain to assume that one note is just a resonance of the other. This is particularly true when the notes are played by the same instrument, or when they are in a pop song where everything is blurred together into a robust sound soup. In particular, it is very easy to confuse a note with one that is an octave or a fifth apart from it. Sometimes, you can even convince yourself that what you are hearing is a single note instead of a chord. If the notes you are writing aren’t sounding enough like the original song, look out for this. Make sure that everything you thought was a single note actually was, and make sure that the notes you wrote aren’t a fifth or a third off from the actual notes in the song.
You can observe this effect quite a bit in hard rock or heavy metal music. The most commonly used chords in this type of music are known as “power chords,” which consist of a root note, the fifth above it, and the root note again an octave up. When you put these chords through some heavy distortion, the result ends up sounding like a beefier version of the root note instead of 3 distinct notes. Listen to Nirvana’s “Smells Like Teen Spirit” and pay attention to the parts of the song where the distortion is on strong. You can hear the 3 distinct notes in each chord when he plays them at the very beginning without distortion, but after he turns on his fuzz box they blend into a single beefy note.
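Here is a rough numerical illustration of why a power chord fuses so readily (my own toy example, using just-intonation ratios rather than anything from the song): the frequencies of all three notes sit on the harmonic series of a single lower frequency, so the combined spectrum reads like one note.

```python
# Toy sketch: a power chord on A, using the simple just-intonation
# ratios of a perfect fifth (3/2) and an octave (2/1), in Hz.
from functools import reduce
from math import gcd

root = 440
chord = [root, root * 3 // 2, root * 2]   # [440, 660, 880]

# Every harmonic of every note in the chord is a multiple of this
# frequency, so the chord's spectrum looks like one harmonic series.
virtual_fundamental = reduce(gcd, chord)
print(virtual_fundamental)  # 220, an octave below the root
```

Distortion reinforces the effect by adding extra harmonics that fall on the same series, which is part of why the fuzzed chord sounds like one beefy note.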
When you hear a loud sound, your ability to hear soft sounds happening during the loud sound is diminished. This effect is exploited by audio formats like mp3 to achieve their small file size. When a loud sound is occurring, the encoder stores the quieter sounds with less precision, because when you are listening to it, you won’t be able to tell the difference. However, this makes compressed audio formats worse to use for arranging because it can become difficult to distinguish the background melodies that make up the body of the music.
It is harder to determine the pitch of short notes than long notes. This is actually a fundamental property of all waves as any quantum physicist will tell you, but it has particular implications when arranging. On your first attempt at transcribing a part it is easy to put in transitional notes incorrectly. Because the notes are short, there is a good chance you won’t notice the problem when you listen to the arrangement. You may find yourself thinking that something doesn’t sound right but you can’t figure out exactly what is wrong. If this happens, double check those notes.
If you have a real knowledge of music theory, ignore this part and do it however it is really supposed to be done, or if the key signature is immediately obvious, put it in. If you are like me however, and don’t have any real knowledge of music theory, keep reading.
When you are arranging a song you have a minimum of two goals: you want the song to be fun to sing, and you want the song to be fun to listen to. These goals are more independent of one another than you would think. It’s possible to write an arrangement that sounds great to the audience but has individual parts with no continuity or direction. It is a bit harder to write a piece that is fun to sing but sounds bad to the audience, because fortunately we can hear ourselves.