Saturday, March 23, 2013

Random Expectations

A mathematician, a physicist, and an engineer were traveling through Scotland when they saw a black sheep through the window of the train. "Aha," says the engineer, "I see that Scottish sheep are black." "Hmm," says the physicist, "You mean that some Scottish sheep are black." "No," says the mathematician, "All we know is that there is at least one sheep in Scotland, and at least one side of that sheep is black."

It makes me a little sad that the engineer is usually the stupid guy in the joke.

Mathematicians often seem like they have a little trouble interfacing with the rest of the world, you know, because they're odd. I have been known to not get a joke because I over-analyzed it, or to just miss something because I am being too technical, and I'm not even a real mathematician. When I was living in Brazil, I had a tiny umbrella that I carried with me until the first storm completely destroyed it. One day, someone I knew saw it and started teasing me, calling my undersized umbrella "unisex." That probably connotes femininity in Brazilian culture, as opposed to masculinity, but I was confused. I thought that most umbrellas could be used by anyone, the exception being those with more feminine prints, like flowers. So, unisex must be better than the alternative, which would be girly. By the time I had deduced this incorrect interpretation of "unisex", the moment had passed and I had missed the opportunity to be ridiculed in good humor.

Some words have an everyday meaning and a scientific meaning, like work or power. Random is one of those words, and it is often misunderstood. We say random to mean unexpected. It is funny, then, that we have some definite expectations about what randomness looks like. If I asked someone to put random dots on paper, the result would probably look like the chart below.

Figure 1: "Random" Data

The thing is, that chart is not random. With a grid, you can see that the dots are spaced fairly evenly, about one dot per grid square.

Figure 2: "Not Actually Random" Data

Here is a plot of uniform random data I generated in Excel. It doesn't look very much like the "random" data at all. As my old boss was fond of saying, "Randomness tends to be clumpy."

Figure 3: True Random Data (as random as the number generator in Excel is, anyway)
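
If you would rather check the clumpiness than take a chart's word for it, here is a minimal Python sketch. The 10-by-10 grid, the 100 dots, and the seed are assumptions chosen for illustration, not the settings behind the Excel plot. It drops uniform random dots on a unit square and counts how many grid squares end up empty or crowded.

```python
import random
from collections import Counter

def cell_counts(n_points, grid, seed=0):
    """Drop n_points uniform random dots on a unit square and count dots per grid cell."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    counts = Counter()
    for _ in range(n_points):
        x, y = rng.random(), rng.random()  # each coordinate is uniform on [0, 1)
        counts[(int(x * grid), int(y * grid))] += 1
    return counts

grid = 10                      # hypothetical 10-by-10 grid, 100 squares
n_points = grid * grid         # one dot per square on average
counts = cell_counts(n_points, grid)

empty = grid * grid - len(counts)                     # squares with no dot at all
crowded = sum(1 for c in counts.values() if c >= 2)   # squares with two or more dots

print("empty squares:  ", empty)
print("crowded squares:", crowded)
```

With one dot per square on average, evenly spaced "random" dots would leave almost no square empty. Uniform random dots typically leave roughly a third of the squares empty, with the missing dots piled up in the crowded ones.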

We have expectations about lists of numbers too. How often would you expect entries on a list of random data to begin with 1? About 10% of the time? In many cases, that is pretty far off.

If the random data spans a few orders of magnitude (powers of 10) or more, it generally follows Benford's Law. The number 1 shows up as a first digit about 30% of the time, and each higher digit shows up less frequently, until 9 appears less than 5% of the time, as seen in Figure 4. The greater the range of the data, the more closely the data follows the Law. One of the classic examples is the length of rivers. Interestingly, it doesn't matter what units are used in the measurements. You could measure rivers in miles, inches, centimeters, or furlongs, and the same basic pattern shows up.

Figure 4: Probability of a data point starting with a number
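
The probabilities in a chart like this come from the standard statement of Benford's Law: the chance that the first digit is d equals log10(1 + 1/d). The post does not spell the formula out, so treat the sketch below as a supplement; the function name `benford` is just a label picked for this example.

```python
import math

def benford(d):
    """Probability that the first digit is d (1 through 9) under Benford's Law."""
    return math.log10(1 + 1 / d)

# Print the probability for each possible first digit.
for d in range(1, 10):
    print(d, f"{benford(d):.1%}")
```

The first line printed is 30.1% for the digit 1 and the last is 4.6% for the digit 9, which matches the "about 30%" and "less than 5%" figures above.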

The fact that the units don't matter is a clue to an explanation. Converting from one unit to another is just multiplying by a constant, so multiplying the whole data set by any number, say 2, will result in a different data set that also follows Benford's Law. About 30% of the numbers will start with 1, even though it is not the same group of rivers whose lengths started with 1 before.

That seems kind of strange. We might expect that the 30% pattern would shift around as we multiply. But if we think about it, only some of the rivers that start with 1 will start with 2 after multiplying by 2. The ones that start with a 1 followed by a 5 or higher will now start with 3. Now think about the numbers in the new group that will start with 1. Everything that started with a 5, 6, 7, 8, or 9 before multiplying by 2 will start with 1 after multiplying by 2. If we add up the probabilities of starting with 5, 6, 7, 8, and 9, we get the same 30%.
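
The doubling argument can be tried out on made-up data. The sketch below is not real river data: it assumes "lengths" spread evenly across five powers of 10 (a log-uniform distribution), which is an idealized stand-in for data that follows Benford's Law. The helper names and the seed are likewise only for illustration.

```python
import random
from collections import Counter

def first_digit(x):
    """Leading digit of a positive number, read off its scientific notation."""
    return int(f"{x:.17e}"[0])  # e.g. 370.0 formats as '3.70...e+02'

def digit_shares(values):
    """Fraction of values starting with each digit 1 through 9."""
    counts = Counter(first_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(1)  # fixed seed (an assumption) so the run is repeatable
# Hypothetical "river lengths" spread evenly over five orders of magnitude.
lengths = [10 ** rng.uniform(0, 5) for _ in range(100_000)]

before = digit_shares(lengths)
after = digit_shares([2 * v for v in lengths])  # same rivers, a "unit" twice as small

print("share starting with 1 before doubling:", round(before[1], 3))
print("share starting with 1 after doubling: ", round(after[1], 3))
print("share starting with 5 to 9 before:    ", round(sum(before[d] for d in range(5, 10)), 3))
```

All three printed shares should land close to 30%. The last two are the same group of numbers counted two ways, since anything that started with 5 through 9 starts with 1 once it is doubled.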

The data set needs to cover a few orders of magnitude so that there is data to start with all the numbers. It wouldn't work with people's heights because they don't vary enough. Most heights measured in inches would fall in the 60s and 70s, and there wouldn't be any that start with 1. But if the larger data points are 100, 1,000, or, better, 10,000 times larger than the smallest data points, there should be enough that start with each number to make the pattern work out.
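
Here is a rough sketch of that contrast in Python. Both data sets are invented: the heights assume a bell curve centered on 67 inches with a spread of 4 inches, and the wide-range data assumes values spread evenly across four powers of 10. Neither comes from real measurements.

```python
import random
from collections import Counter

def first_digit(x):
    """Leading digit of a positive number, read off its scientific notation."""
    return int(f"{x:.17e}"[0])

def digit_shares(values):
    """Fraction of values starting with each digit 1 through 9."""
    counts = Counter(first_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(2)  # fixed seed (an assumption) so the run is repeatable

# Hypothetical heights in inches: a narrow range, nowhere near a factor of 10.
heights = [rng.gauss(67, 4) for _ in range(10_000)]
# Hypothetical data whose largest values are about 10,000 times the smallest.
wide = [10 ** rng.uniform(0, 4) for _ in range(10_000)]

height_shares = digit_shares(heights)
wide_shares = digit_shares(wide)

print("digit  heights  wide-range")
for d in range(1, 10):
    print(f"{d:>5}  {height_shares[d]:>7.1%}  {wide_shares[d]:>10.1%}")
```

The heights column should be almost entirely 6s and 7s with nothing starting with 1, while the wide-range column has entries for every digit and falls off from 1 to 9 the way Benford's Law says it should.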

It may strike you as funny that we have expectations about the "unexpected", and that those expectations are often wrong. Randomness is clumpier than we feel it should be, and random things are surprisingly predictable when considered in groups. That is actually what random means, technically: occurring with a certain probability. Predictable patterns emerge when random things are grouped. Kind of like people.

For more on randomness or Benford's Law, check out:
Scishow on randomness: http://www.youtube.com/watch?v=LElyagQ0n_g
Numberphile on Benford's Law: http://www.youtube.com/watch?annotation_id=annotation_143101&feature=iv&src_vid=VbtNy54ya9A&v=XXjlR2OK1kM
More Benford's Law: http://www.youtube.com/watch?v=vIsDjbhbADY
