The Guardian's reporting of the recording and analysing of voluminous data included a few screen shots of the maps used to show countries' exposure to surveillance. Here's one for the purposes of discussion:
The colours range from green, for countries with the fewest items of surveillance, to red, for those with the most. Putting aside the fact that the colour scheme is not easy to interpret (is light green least...or is it dark green?) and that the use of a Mercator projection distorts the size of areas, giving false prominence to more northerly and southerly latitudes...let's focus on the data.
So there have been 2,892,343,446 pieces of intelligence data gathered from US computer networks over a 30-day period ending in March 2013. The US is shown the same as Germany, Saudi Arabia, Kenya and Iraq. The map quite clearly leaves readers with the impression that surveillance is pretty similar across these countries.
Let's assume that each of these countries has 3 billion pieces of information collected (a fair assumption, given that the map of totals above shades them all the same). If we calculate the number of pieces of surveillance data per capita we get an entirely different picture:
USA, 313 million people = an average 9.5 pieces of surveillance data per person
Saudi Arabia, 28 million people = an average 106.8 pieces of surveillance data per person
Kenya, 42 million people = an average 72 pieces of surveillance data per person
Iraq, 33 million people = an average 91 pieces of surveillance data per person
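For anyone who wants to check the arithmetic, here is a minimal sketch of the per capita calculation in Python. It assumes the round 3 billion total for every country and the rounded populations quoted above, so the results come out close to, but not exactly the same as, the figures in the list. The names `ITEMS_PER_COUNTRY`, `populations` and `items_per_person` are made up for this illustration.

```python
# Assumption: every country has the same round total of 3 billion items,
# as in the simplification used in the text above.
ITEMS_PER_COUNTRY = 3_000_000_000

# Assumption: populations rounded to the nearest million, as quoted above.
populations = {
    "USA": 313_000_000,
    "Saudi Arabia": 28_000_000,
    "Kenya": 42_000_000,
    "Iraq": 33_000_000,
}


def items_per_person(total_items, population):
    """Normalise a raw total by population to get a per capita rate."""
    return total_items / population


# Same total everywhere, very different rates once we divide by population.
rates = {
    country: items_per_person(ITEMS_PER_COUNTRY, population)
    for country, population in populations.items()
}

# Print from the lowest rate to the highest.
for country, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{country}: {rate:.1f} items per person")
```

The totals are identical by construction, so any difference in the printed rates comes purely from the denominator, which is the whole point of normalising before mapping.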
Mapping these rates would give an accurate picture and one which, crucially, allows us to visually compare one country against another across the map. Without normalising our data to a consistent denominator the map is utterly useless: we can't make any sensible interpretation of the information. So, as it turns out, the level of surveillance per person in the US is roughly a tenth of that in Saudi Arabia. That is not a story the above map even vaguely illustrates. So it's worrying if these maps are genuinely being used to inform national security, don't you think?
In trying to come up with a way for mere mortals to understand, my better half Linda Beale (@lindabeale) came up with a great analogy. We spent some time over the weekend making some slides to explain the issue...using alcohol, conveniently colour coded green to match the NSA map. It struck us that if we cannot use logic to explain it then let's dumb this down to something most people can identify with...size of a drink.
So here goes with one more shot...have a look at the following picture:
Two glasses of the same size. It should be fairly easy to see that the glass on the right is holding double that of the left. In fact, the right glass is filled with two shots of Creme de Menthe and the left one has only one (or...his and hers if you will!). Now let's look at adding a third glass:
The glass on the right is a different size and shape so how much Creme de Menthe does it hold? Trickier question eh? Is it one shot, two shots or somewhere in between? In fact it's one shot but it doesn't look the same as the glass on the left and it looks significantly less than half of the glass holding the double. Now let's take an aerial view:
Hmm...they all look quite similar but one's a hexagon. Here then is the problem...a choropleth map is a container for data. Looking at a choropleth map is similar to looking down on a whole array of glasses filled with different quantities of liquid, and it's impossible to make any sense of the amount of liquid...the totals are a function of the size and shape of the vessel in which they are contained. When the sizes of areas vary substantially the problem of estimating quantities gets even harder unless you adjust for the differences caused by the containers so you can make a sensible assessment of the relative amount of liquid in each glass:
In this example, the top left glass contains one shot, top right, two shots. Both these glasses are the same size so we have a consistent basis for comparison and the darker colour suggests more liquid...a correct visual interpretation. What about the big glass? Actually it contains one shot but because it's spread out across a much wider area the colour is diluted and it appears that the glass holds far less than either of the other two. This would be an incorrect assumption. It's the same as the top left glass and for us to interpret that correctly we need to see the colours the same. In map terms, we need to show similarity of the character of areas using similar symbols so our eyes and brains interpret things properly.
Let's be clear, this isn't some sort of quirky thing that cartographers do to finesse a map, and it's not just 'best practice'. It's fundamental data analysis to support proper mapping of data. It's non-negotiable (and it's not even hard to do...). Your map will, quite simply, be utterly meaningless without your data being normalised.
Consider John Snow's famous map of the 1854 cholera outbreak in Soho, London. His map is widely regarded as a classic. He mapped deaths as dots and was able to make an inference as to the potential cause of the cholera outbreak. If he had mapped the same data as totals on a choropleth the map would have been one of the most useless maps ever made; it's unlikely he'd have made any reasonable assumption about the mode of transmission of cholera or had the foresight to trace it to the Broad Street pump. So why do so many people persist in making their maps useless? It strikes me as an odd thing to do but hopefully this little explanation will help those who can think about their mapping in a slightly different way.
Got it? Cheers, it was fun...