Heatmap of UFO sightings (1/3): Introduction

Adjusting heatmaps for population: addressing XKCD 1138

In Machine Learning for Hackers, Drew Conway and John Myles White introduce a data set of UFO sightings as an example. Googling around for other work on this data set turns up some heatmaps of the sightings. Here is one from flowingdata.com:

I was immediately reminded of the XKCD comic:

xkcd heatmap

I decided to try adjusting the UFO sightings data for population, something I've never done before. Are UFO sightings distributed according to population? Or do some areas get more than their fair share of sightings?

TL;DR

my ufo heatmap

The western US gets a lot more sightings than its population would suggest, especially near Area 51. Read on to follow my data adventure.

A little bit more

I used Python to wrangle the data and ggplot2 from R to make the plots. The actual raw data looks like this:

my raw heatmap

I explored a couple of different ways to smooth that out and arrive at the final image, and they all gave pretty much the same result. Specifically, I looked at:

  • a beta-binomial maximum likelihood approach
  • the R library lme4 for linear mixed-effects models
  • the probabilistic programming language Stan
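To give a feel for the first of these, here is a minimal sketch of beta-binomial smoothing in plain Python. It is an illustration under assumptions, not the code behind the maps: the region counts are made up, the function names are hypothetical, and a coarse grid search stands in for a proper optimizer. The idea is to treat each region's sighting rate as a draw from a shared Beta(alpha, beta) distribution, pick alpha and beta by maximum likelihood, and then replace each raw rate with its posterior mean.

```python
from math import lgamma

# Hypothetical (sightings, population) pairs for six regions; not real data.
REGIONS = [(3, 1_000), (40, 500_000), (0, 200),
           (12, 80_000), (150, 2_000_000), (1, 150)]


def log_beta(a, b):
    """Log of the Beta function, via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_likelihood(alpha, beta, regions):
    """Beta-binomial log-likelihood, dropping the binomial coefficient
    (it does not depend on alpha or beta)."""
    return sum(
        log_beta(alpha + k, beta + n - k) - log_beta(alpha, beta)
        for k, n in regions
    )


def fit(regions, steps=40):
    """Crude maximum likelihood: grid search over the prior mean and the
    prior strength (alpha + beta), both on a log scale."""
    best = None
    for i in range(steps + 1):
        mean = 10 ** (-5 + 4 * i / steps)        # 1e-5 .. 1e-1
        for j in range(steps + 1):
            strength = 10 ** (6 * j / steps)     # 1 .. 1e6
            alpha, beta = mean * strength, (1 - mean) * strength
            ll = log_likelihood(alpha, beta, regions)
            if best is None or ll > best[0]:
                best = (ll, alpha, beta)
    return best[1], best[2]


def smoothed_rate(k, n, alpha, beta):
    """Posterior mean rate: the raw rate k/n pulled toward the prior mean."""
    return (alpha + k) / (alpha + beta + n)


alpha, beta = fit(REGIONS)
for k, n in REGIONS:
    print(f"pop={n:>9,}  raw={k / n:.6f}  "
          f"smoothed={smoothed_rate(k, n, alpha, beta):.6f}")
```

The smoothed rate is a weighted average of the raw rate and the fitted prior mean, with weight n / (alpha + beta + n) on the raw rate. So the smaller a region's population, the more its rate is pulled toward the overall mean, which is what tames the noisy low-population areas in a raw per-capita map.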

Up Next

Obtaining and cleaning the data

This details the tedious process I went through to arrive at the 'raw' map above.

Refining the results

Here I show how I came up with the smoother map.