Swedish Locality Areas

This is the third installment in a series exploring leading ones and Benford’s law. The first post was an introduction; please have a look at it to see what I mean by leading ones. The second post looked at leading ones in Swedish population numbers, and how growth and decline lead to Benford’s law.

In this post we take a look at leading ones in the distribution of Swedish locality areas. Locality is the translation of the Swedish word tätort used by Statistiska Centralbyrån (Statistics Sweden), the government agency responsible for producing official statistics. I have used their numbers as of 31 December 2017. Areas are given in hectares (ha), where 1 ha is most easily visualized as a square with 100 m sides. For comparison, a football pitch is typically 0.7 ha.

The histogram below shows the distribution of Swedish locality areas. As described in my first post on leading ones, numbers sort into “bins” according to both their leading digit and order of magnitude. Each vertical bar represents one bin. Note the distinct deviation from Benford’s law among the smallest areas. This is most likely because the data set does not include localities with fewer than 200 inhabitants, and it becomes impractical for 200 or more people to squeeze into a very small area. The smallest locality by area in Sweden is Johannesudd, where 290 people live on 16 ha (roughly 23 football pitches).

What Is Going On?

Disregarding the smallest localities, we see that leading ones stick out once again. Why does Benford return here? Why would there be a preference for leading ones in locality area sizes? Administrative areas do not change size by growth, but typically at the stroke of a pen. So, have people in power deliberately been choosing numbers favoring Benford’s law? No, of course not.

Visual Aid From Pascal’s Triangle

I will investigate the area distribution by asking the question “in how many ways can a region be subdivided?” To answer that I will be using a tool called binomial coefficients. This is classical mathematics territory, but probably less familiar to the layman. However, there is a certain visual aspect here that makes the discussion easier to follow. Binomial coefficients are often presented in the shape of Pascal’s triangle.

The binomial coefficients are infinite in number, and so Pascal’s triangle goes on forever. An illustration must therefore be cut short at some point, and the figure below shows the first 5 rows. To make a real world connection, assume that there are $$n$$ balls in front of you. The number of different ways you can pick up $$k$$ of those balls is then given by a binomial coefficient. For example, there are 3 different ways you can pick up 1 out of 3 balls. To read this from the triangle, find the number where the row for $$n=3$$ intersects the column for $$k=1$$ and you will find the number 3. The short form notation for binomial coefficients is $$\large{{n}\choose{k}}$$. At this point we do not need to dig deeper into the concept. But remember my question at the start of this section? It is the kind of question where you expect to find binomial coefficients in some form. Therefore we take Pascal’s triangle with us to the next section.
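For readers who prefer to experiment, here is a minimal Python sketch of the triangle. It assumes the rows are numbered from $$n=0$$, which may differ from how the figure is drawn, and the helper name `pascal_rows` is my own.

```python
from math import comb

def pascal_rows(num_rows):
    """Return the first num_rows rows of Pascal's triangle.

    Assumption: rows are numbered n = 0, 1, 2, ... and row n holds
    the binomial coefficients C(n, k) for k = 0 .. n.
    """
    return [[comb(n, k) for k in range(n + 1)] for n in range(num_rows)]

for n, row in enumerate(pascal_rows(5)):
    print(f"n={n}:", *row)

# Number of ways to pick up k=1 ball out of n=3 balls.
print(comb(3, 1))  # 3
```

`math.comb` requires Python 3.8 or later.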

The Number of Possibilities

Let us start simple and represent the area of a country, or other administrative region, by a row of squares. Each square is indivisible and they all have the same area, which we set equal to 1 (hectare, if you want). The figure below illustrates a region with area 5. Like impatient central planners we want to subdivide this region. With 5 indivisible squares we can divide the region into 2, 3, 4 or 5 subregions. The figure below illustrates these subregions using different colors, and highlights the respective area sizes in white numbers. Stacked on top of each other are the different ways to divide the region into the same number of subregions.

To characterize these subdivisions mathematically we use three parameters: $$n$$ is the number of squares making up the region, $$s$$ is the desired number of subregions, and $$i$$ is the subregion area whose appearances we want to count. In our example above, dividing into $$s=3$$ subregions (the second stack from the left), an area $$i=1$$ appears 9 times, an area $$i=2$$ appears 6 times, and an area $$i=3$$ appears 3 times.
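The hand count can be reproduced by brute force. The sketch below is only an illustration under one assumption: subdivisions that differ in the order of their areas along the row count as different, which is what reproduces the 9 and 6 quoted for $$s=3$$. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def subdivisions(n, s):
    """Yield every ordered way to split a row of n unit squares into s subregions."""
    # Choosing s-1 of the n-1 gaps between squares fixes all the borders.
    for cuts in combinations(range(1, n), s - 1):
        borders = (0, *cuts, n)
        yield tuple(b - a for a, b in zip(borders, borders[1:]))

def area_counts(n, s):
    """Count how often each subregion area appears over all subdivisions."""
    counts = Counter()
    for parts in subdivisions(n, s):
        counts.update(parts)
    return dict(sorted(counts.items()))

for s in range(2, 6):
    print(f"s={s}:", area_counts(5, s))
# The s=3 line reads {1: 9, 2: 6, 3: 3}
```

Enumerating every subdivision is fine for a handful of squares, but the number of subdivisions grows very quickly with $$n$$, so this approach does not scale.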

Counting the number of times an area $$i$$ appears allows us to also check how often the different leading digits show up. But we need a more efficient way to check this than to count areas by hand.

Finding Pascal’s Triangle

The table below shows the resulting counts for the different numbers of subregions $$s$$ when $$n=5$$. This is easily checked against the figure above. Now, clearly the numbers themselves do not make up Pascal’s triangle. But see what happens when each number is divided by the value of $$s$$ for the column it belongs to. The new numbers, in red next to the originals, form an upside down Pascal’s triangle. With a minimum of effort we can generalize this into a simple formula using the binomial coefficients. The number of times $$N$$ that an area $$i$$ appears when dividing a region of area $$n$$ into $$s$$ subregions is

$$N(i)=s\times\large{{n-i-1}\choose{s-2}}\tag{1}$$

This result makes it easy to extend the investigation to much larger regions. We can calculate the number of times an area shows up using equation (1) instead of painstakingly counting by hand.
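As a sanity check, equation (1) can be written as a small Python function and evaluated for the $$n=5$$ example. This is a sketch, not a proof. The function name `appearances` is my own, and the guard for impossible areas is an added assumption: every subregion needs at least one square, so no area can exceed $$n-s+1$$.

```python
from math import comb

def appearances(n, s, i):
    """Equation (1): times area i appears when area n is split into s subregions."""
    if s < 2 or i < 1 or i > n - s + 1:
        return 0  # area i cannot occur in any subdivision
    return s * comb(n - i - 1, s - 2)

n = 5
for s in range(2, n + 1):
    row = {i: appearances(n, s, i) for i in range(1, n - s + 2)}
    # There are comb(n-1, s-1) subdivisions and each one contributes s areas,
    # so the counts in a row must add up to s * comb(n-1, s-1).
    assert sum(row.values()) == s * comb(n - 1, s - 1)
    print(f"s={s}:", row)
```

For $$s=3$$ the function returns 9 for $$i=1$$ and 6 for $$i=2$$, matching the hand count.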

The Model Results

As an example, consider a region of area $$n=500$$ divided into $$s=20$$ subregions. Use the formula above to make a list of how many times each possible subregion area $$i$$ appears. The histogram below shows the count of the leading digits in that list. Clearly, the distribution closely follows that of Benford. The histogram represents the fact that more subregion areas have small leading digits than large leading digits. This means that when randomly dividing a region into subregions, you are most likely to end up with areas having small leading digits.
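A calculation along these lines can be sketched in a few lines of Python. I am assuming here that each area $$i$$ is weighted by its count $$N(i)$$ from equation (1) and then sorted by the leading digit of $$i$$; the exact binning behind the histogram may differ. The Benford reference values use the standard formula $$\log_{10}(1+1/d)$$.

```python
from math import comb, log10

def leading_digit(x):
    """Leading (most significant) decimal digit of a positive integer."""
    return int(str(x)[0])

def leading_digit_shares(n, s):
    """Share of each leading digit among all subregion areas, weighted by equation (1)."""
    counts = {d: 0 for d in range(1, 10)}
    for i in range(1, n - s + 2):  # possible areas run from 1 to n-s+1
        counts[leading_digit(i)] += s * comb(n - i - 1, s - 2)
    total = sum(counts.values())
    return {d: counts[d] / total for d in counts}

shares = leading_digit_shares(500, 20)
for d in range(1, 10):
    benford = log10(1 + 1 / d)  # Benford's law for leading digit d
    print(f"{d}: model {shares[d]:.3f}   Benford {benford:.3f}")
```

The counts are exact integers, so the only rounding happens in the final division.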

The next question is then whether it is reasonable to assume that creating administrative regions is a random process. Well, boundaries are likely to follow from demography, which has developed over a long time and by natural causes. So yes, I think it is fair to describe that process as effectively random.

It should be said that the shape of the histogram above is typical for large values of $$n$$ (hundreds or more). Applying the process to Sweden, we can assume the smallest indivisible area used to divide up localities to be 1 ha. As of 31 December 2017 Sweden had $$s = 1979$$ localities with a total area $$n = 617352$$ ha. These are large numbers, so a randomly selected subdivision is likely to follow the Benford distribution.
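For numbers of this size the binomial coefficients in equation (1) become enormous, so a direct evaluation is slow. One possible workaround, sketched below, is to track only the share of each area and step from one area to the next using the ratio $$N(i+1)/N(i) = (n-i-s+1)/(n-i-1)$$, which follows from equation (1). The starting value $$(s-1)/(n-1)$$ is the share of $$i=1$$. This is my own shortcut for illustration and it evaluates the model only, not the real locality data.

```python
def leading_digit_shares(n, s):
    """Model share of each leading digit 1..9 (index 0 is unused)."""
    last = n - s + 1           # largest possible subregion area
    p = (s - 1) / (n - 1)      # share of area i = 1 among all subregion areas
    shares = [0.0] * 10
    for i in range(1, last + 1):
        shares[int(str(i)[0])] += p
        if i < last:
            # Step to the next area using the ratio N(i+1) / N(i).
            p *= (n - i - s + 1) / (n - i - 1)
    return shares

# Figures quoted in the text: 1979 localities, 617352 ha in total.
sweden = leading_digit_shares(617352, 1979)
for d in range(1, 10):
    print(f"{d}: {sweden[d]:.3f}")
```

The shares are floating point approximations, which is more than enough for comparing the bars of a histogram.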

All in all, the results shown point to the relevance of the “number of possible subdivisions” argument in explaining leading ones in administrative area sizes.