Heatmap of UFO sightings (2/3): Obtaining and cleaning the data

This is the unpleasant part of data science that makes up around 80% of the work. I recommend skipping to part 3 if you don't have much patience. I'll be using Python and the pandas data analysis library.

The original data can be downloaded from Infochimps.

Cleaning steps

After a lengthy process, I arrive at a dataframe that looks like this:

state_county      population  num_sightings  raw_rate
alabama autauga        55246              2  0.000036
alabama baldwin       195540             32  0.000164
alabama barbour        27076              0  0.000000
alabama bibb           22512              1  0.000044
alabama blount         57872              0  0.000000

Plotting those raw rates on a map looks like this:

[figure: my raw heatmap of sightings per capita by county]

Below are the steps I took (a condensed sketch of the main filters follows the table):

step  action                                                               rows remaining
1     convert Avro format to pandas DataFrame                                       61067
2     remove rows with ill-formed dates                                             60802
3     remove rows before 1990-08-30                                                 54444
4     remove rows with location outside the contiguous 48 states plus DC            45983
5     make location all uppercase and remove parenthetical comments                 45983
6     get county, latitude, longitude from the MapQuest API                         42998
7     convert counties to those used in ggplot2                                     42624
8     get county populations from US census data                                    42624
9     convert the census counties to those used in ggplot2                          42624
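
For orientation, steps 2 through 4 boil down to a few pandas filters. Here is a condensed sketch, not the exact cells below (which handle more edge cases); 'states' is the set of contiguous-US abbreviations defined later on:

import pandas as pd

def quick_clean(data, states):
    ''' condensed sketch of steps 2-4; see the cells below for the real versions '''
    # step 2: drop dates ending in '00' (bad day or all-zero date) and pre-1900 dates
    data = data[[d[-2:] != '00' for d in data.sighted_at]]
    data = data[data.sighted_at >= '19000101']
    data.sighted_at = pd.to_datetime(data.sighted_at)
    # step 3: keep the twenty-year window ending at the last sighting
    data = data[data.sighted_at >= '1990-08-30']
    # step 4: keep locations ending in a contiguous-US state abbreviation
    return data[[loc[-2:].upper() in states for loc in data.location]]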

Continue reading if you're interested in the nitty-gritty.

Getting it into a pandas dataframe

The data comes in a few formats, none of which I could get pandas to read correctly. I resorted to reading the Avro format with a separate library (which required Python 2.x):

convert avro to dataframe (download)
#!/usr/bin/python2.7

from avro.datafile import DataFileReader
from avro.io import DatumReader
from pandas import DataFrame

avro_file = '/Users/srharnett/Downloads/ufo_data/ufo_awesome.avro'
# Avro is a binary format, so open the file in binary mode
reader = DataFileReader(open(avro_file, 'rb'), DatumReader())
# each Avro record becomes one row of the DataFrame
df = DataFrame(list(reader))
df.to_pickle('ufo_data.pkl')
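
For reference, in Python 3 the same conversion could be done with the third-party fastavro package (an assumption on my part; it's not what I used here):

# hypothetical Python 3 version using the fastavro package
import fastavro
from pandas import DataFrame

with open('ufo_awesome.avro', 'rb') as f:
    df = DataFrame(list(fastavro.reader(f)))
df.to_pickle('ufo_data.pkl')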

In [1]:
import numpy as np
import pandas as pd
%load_ext rmagic
%R library(ggplot2)
%R theme_set(theme_bw());
In [67]:
!./convert_data.py
data = pd.read_pickle('ufo_data.pkl')
print(len(data))
data.head()
61067
Out[67]:
description duration location reported_at shape sighted_at
0 Man repts. witnessing "flash, followed by... Iowa City, IA 19951009 19951009
1 Man on Hwy 43 SW of Milwaukee sees large, bri... 2 min. Milwaukee, WI 19951011 19951010
2 Telephoned Report:CA woman visiting daughter w... Shelton, WA 19950103 19950101
3 Man repts. son's bizarre sighting of smal... 2 min. Columbia, MO 19950510 19950510
4 Anonymous caller repts. sighting 4 ufo's ... Seattle, WA 19950614 19950611

5 rows × 6 columns

A couple hundred of the 'sighted_at' dates are 0000, and a dozen or so are pre-1900. A few others have malformed dates ending in a day of '00'. I remove these rows and convert the strings to a datetime data type.

In [68]:
data = data[[x[-2:] != '00' for x in data.sighted_at]]
data = data[data.sighted_at >= '19000101']
print(len(data))
data.reported_at = pd.to_datetime(data.reported_at)
data.sighted_at = pd.to_datetime(data.sighted_at)
60802

90% of the sightings occurred in the twenty-year span from 1990-08-30 to 2010-08-30, so I restrict the data to this range. This way we can roughly interpret the counts as a rate of sightings over time.

In [4]:
%%R -i data -h200
seconds_in_a_year <- 3.15569e7
binwidth <- 5*seconds_in_a_year
ggplot(data, aes(x=sighted_at)) + geom_histogram(binwidth=binwidth)
In [5]:
data.sighted_at.max()
Out[5]:
Timestamp('2010-08-30 00:00:00', tz=None)
In [6]:
(data.sighted_at >= '1990-08-30').mean()
Out[6]:
0.89543107134633726
In [69]:
data = data[data.sighted_at >= '1990-08-30']
len(data)
Out[69]:
54444

Cleaning up the location column

The location column is a bit messy, so I cleaned it up similarly to Machine Learning for Hackers.

In [70]:
# 48 contiguous states plus Washington, D.C.
states = 'AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY'
states = set(states.split(','))
has_state = np.array([x[-2:].upper() in states for x in data.location])
data = data[has_state]
len(data)
Out[70]:
45983

There are a lot of little comments inside parentheses that we'd like to get rid of:

In [9]:
data.location[[x.find('(') >= 0 for x in data.location]].tail()
Out[9]:
60885                                Huntsville (near), TX
60890                               Albuquerque (near), NM
60950                             Washington (west of), IA
60975     Laurie (5 mi. W of; 52 mile marker on Osage), MO
61039                                Washington (near), PA
Name: location, dtype: object
In [71]:
import re
pattern = r'(.*)\(.*\)(.*)' # want to ignore anything inside parentheses
r = re.compile(pattern)
def clean_location(loc):
    m = r.search(loc)
    cleaned = m.group(1).strip() + m.group(2).strip() if m is not None else loc.strip()
    return cleaned.upper()
In [74]:
for loc in data.location.tail():
    print('%50s %s %s' % (loc, '-->', clean_location(loc)))
for loc in data.location[[x.find('(') >= 0 for x in data.location]].tail():
    print('%50s %s %s' % (loc, '-->', clean_location(loc)))
                                   Los Angeles, CA --> LOS ANGELES, CA
                                      Hartwell, GA --> HARTWELL, GA
                               Franklin Square, NY --> FRANKLIN SQUARE, NY
                                      Brighton, CO --> BRIGHTON, CO
                                     Fort Knox, KY --> FORT KNOX, KY
                             Huntsville (near), TX --> HUNTSVILLE, TX
                            Albuquerque (near), NM --> ALBUQUERQUE, NM
                          Washington (west of), IA --> WASHINGTON, IA
  Laurie (5 mi. W of; 52 mile marker on Osage), MO --> LAURIE, MO
                             Washington (near), PA --> WASHINGTON, PA
In [12]:
data['clean_location'] = [clean_location(x) for x in data.location]
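
One caveat: the pattern above strips only a single parenthetical group. None of the locations seem to have more than one, but if they did, re.sub would be a simpler catch-all (a variant, not what I actually ran):

def clean_location_all(loc):
    # strip every non-nested parenthetical group, plus any whitespace before it
    return re.sub(r'\s*\([^)]*\)', '', loc).strip().upper()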

Getting county information -- MapQuest API

Now I'll use the MapQuest geocoding API to get county and latitude/longitude information for each location.

In [13]:
import requests
from itertools import zip_longest, islice
In [14]:
MAPQUEST_API_KEY = r'???'
params = dict(key=MAPQUEST_API_KEY, thumbMaps=False)
batch_url = 'http://www.mapquestapi.com/geocoding/v1/batch'

def extract_county(api_result):
    best_guess = api_result['locations'][0]
    location = api_result['providedLocation']['location']
    county = best_guess['adminArea4']
    return location, county

def extract_lat_lng(api_result):
    best_guess = api_result['locations'][0]
    lat, lng = best_guess['latLng']['lat'], best_guess['latLng']['lng']
    return lat, lng

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

def get_mapquest_results(locations):
    ''' splits the list of locations into chunks of 100 and submits each batch to the MapQuest API;
        returns two dictionaries: location-->county and location-->(lat, lng) '''
    if MAPQUEST_API_KEY == '???':
        raise ValueError('get your own API key')
    location_to_county = {}
    location_to_latlng = {}
    for chunk in grouper(locations, 100):
        # the last chunk may be padded with None values; requests omits None from the query string
        params['location'] = chunk
        r = requests.get(batch_url, params=params)
        for api_result in r.json()['results']:
            location, county = extract_county(api_result)
            lat, lng = extract_lat_lng(api_result)
            location_to_county[location] = county
            location_to_latlng[location] = (lat, lng)
    return location_to_county, location_to_latlng

You'll need a MapQuest API key to actually run this.

In [15]:
location_to_county, location_to_latlng = get_mapquest_results(data.clean_location.head())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-7d1e181c7a2e> in <module>()
----> 1 location_to_county, location_to_latlng = get_mapquest_results(data.clean_location.head())

<ipython-input-14-5ee03ed65190> in get_mapquest_results(locations)
     24         returns dictionary location-->county '''
     25     if MAPQUEST_API_KEY == '???':
---> 26         raise ValueError('get your own API key')
     27     location_to_county = {}
     28     location_to_latlng = {}

ValueError: get your own API key
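
Since I can't publish my key, I cached the geocoding results to disk. Presumably something like this produced the pickles loaded in the next cell (run once with a real key):

import pickle
location_to_county, location_to_latlng = get_mapquest_results(data.clean_location.unique())
with open('city_to_county.pickle', 'wb') as f:
    pickle.dump(location_to_county, f)
with open('coordinates.pickle', 'wb') as f:
    pickle.dump(location_to_latlng, f)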
In [16]:
import pickle
location_to_county = pickle.load(open('city_to_county.pickle', 'rb'))
location_to_latlng = pickle.load(open('coordinates.pickle', 'rb'))
In [17]:
for i, (k,v) in enumerate(location_to_county.items()):
    print(k, '|', v)
    if i > 5:
        break
print()
for i, (k,v) in enumerate(location_to_latlng.items()):
    print(k, '|', v)
    if i > 5:
        break
OKLAHOMA COUNTY, OK | Oklahoma County
TUNKHANNOCK, PA | Wyoming County
AIKEN, SC | Aiken County
INTERSTATE 80/25, WY | 
U. S. AIR FORCE BASE, FL | 
HENIKER, NH | Merrimack County
CARLBAD, NM | 

OKLAHOMA COUNTY, OK | (35.468494, -97.521264)
TUNKHANNOCK, PA | (41.540464, -75.945621)
AIKEN, SC | (33.563871, -81.71615)
INTERSTATE 80/25, WY | (42.999627, -107.55145)
U. S. AIR FORCE BASE, FL | (28.260731, -82.420978)
HENIKER, NH | (43.186268, -71.826307)
CARLBAD, NM | (34.421369, -106.108388)
In [18]:
data['latlng'] = [location_to_latlng.get(x) for x in data.clean_location]
data['county'] = [location_to_county.get(x) for x in data.clean_location]

Here I'm dropping rows where MapQuest couldn't find a county, or where the location is outside the contiguous United States.

In [19]:
print(len(data))
data = data[['latlng', 'county', 'clean_location']].dropna()
print(len(data))
data = data[data.county != '']
print(len(data))
data['lat'] = [x[0] for x in data.latlng]
data['lng'] = [x[1] for x in data.latlng]
# rough bounding box of the contiguous United States
data = data.query('25.12993 <= lat <= 49.38323 and -124.6813 <= lng <= -67.00742')
print(len(data))
data['state'] = [x[-2:] for x in data.clean_location]
data = data[['lat', 'lng', 'county', 'state']]
data.head()
45983
45983
43025
42998
Out[19]:
lat lng county state
0 41.661240 -91.530128 Johnson County IA
1 43.041072 -87.909421 Milwaukee County WI
2 47.213267 -123.106122 Mason County WA
3 38.951551 -92.328594 Boone County MO
4 47.603229 -122.330280 King County WA

5 rows × 4 columns

Mapping it out, the result looks similar to the FlowingData and xkcd maps, i.e. essentially a map of population density.

In [20]:
%%R -w800 -i data
usa <- map_data("usa")
p <- ggplot(usa, aes(x=long, y=lat)) 
p <- p + geom_polygon(alpha=.1) 
p <- p + geom_point(aes(x=lng, y=lat), data=data, alpha=.01, color='blue', size=3) 
p + coord_map("conic", lat0=30)

Binning by county

To bin by county, I'm going to use the US county map from ggplot2. Unfortunately the county names from the MapQuest API don't match these exactly, so we're going to have to do a little more munging.

In [42]:
%%R -o county_sightings
county_sightings <- map_data("county")
head(county_sightings)
       long      lat group order  region subregion
1 -86.50517 32.34920     1     1 alabama   autauga
2 -86.53382 32.35493     1     2 alabama   autauga
3 -86.54527 32.36639     1     3 alabama   autauga
4 -86.55673 32.37785     1     4 alabama   autauga
5 -86.57966 32.38357     1     5 alabama   autauga
6 -86.59111 32.37785     1     6 alabama   autauga
In [43]:
county_sightings = pd.DataFrame(dict(state=county_sightings[4], county=county_sightings[5]))  # columns 4 and 5 are region (state) and subregion (county)
county_sightings['num_sightings'] = 0
county_sightings = county_sightings.drop_duplicates().set_index(['state', 'county'])
county_sightings.head()
Out[43]:
num_sightings
state county
alabama autauga 0
baldwin 0
barbour 0
bibb 0
blount 0

5 rows × 1 columns

Our data uses state abbreviations (e.g. AL for Alabama), so we'll need to convert these.

In [44]:
state_map = {'alabama': 'AL', 'arizona': 'AZ', 'arkansas': 'AR', 'california': 'CA', 'colorado': 'CO',
       'connecticut': 'CT', 'delaware': 'DE', 'district of columbia': 'DC', 'florida': 'FL',
       'georgia': 'GA', 'idaho': 'ID', 'illinois': 'IL', 'indiana': 'IN', 'iowa': 'IA', 'kansas': 'KS',
       'kentucky': 'KY', 'louisiana': 'LA', 'maine': 'ME', 'maryland': 'MD', 'massachusetts': 'MA',
       'michigan': 'MI', 'minnesota': 'MN', 'mississippi': 'MS', 'missouri': 'MO', 'montana': 'MT',
       'nebraska': 'NE', 'nevada': 'NV', 'new hampshire': 'NH', 'new jersey': 'NJ', 'new mexico': 'NM',
       'new york': 'NY', 'north carolina': 'NC', 'north dakota': 'ND', 'ohio': 'OH', 'oklahoma': 'OK',
       'oregon': 'OR', 'pennsylvania': 'PA', 'rhode island': 'RI', 'south carolina': 'SC',
       'south dakota': 'SD', 'tennessee': 'TN', 'texas': 'TX', 'utah': 'UT', 'vermont': 'VT', 'virginia': 'VA',
       'washington': 'WA', 'west virginia': 'WV', 'wisconsin': 'WI', 'wyoming': 'WY'}
state_map_rev = {v: k for k,v in state_map.items()}
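
Typing that dictionary out by hand is error-prone. For what it's worth, the third-party us package (assuming it's installed; I didn't use it here) can generate the same mapping:

# hypothetical alternative using the third-party 'us' package
import us
state_map = {s.name.lower(): s.abbr for s in us.states.STATES
             if s.abbr in states}           # 'states' is the abbreviation set from earlier
state_map['district of columbia'] = 'DC'    # DC isn't in us.states.STATES
state_map_rev = {v: k for k, v in state_map.items()}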
In [24]:
from collections import defaultdict, Counter

The mapping of counties in our data set to the ggplot2 ones is a pretty painful manual process. I gave up after getting the majority of them.

In [45]:
%%time
missing_counties = defaultdict(Counter) # counties that failed to map to a county in the ggplot set

def convert_to_ggplot_style(x):
    ''' takes a row from our main dataframe 'data' and updates the appropriate county in the new dataframe 
        'county_sightings' if it can find the county, otherwise updates missing_counties '''
    state = state_map_rev[x['state']]
    county = x['county'].lower().replace("'", '').replace('.', '')
    if county.split()[-1] in {'county', 'parish'}:
        county = county[:-7]
    known_fixes = {'dupage': 'du page', 
                   'district of columbia': 'washington', 
                   'desoto': 'de soto',
                   'dekalb': 'de kalb'}
    county = known_fixes.get(county, county)
    try:
        county_sightings.ix[state, county] += 1
    except KeyError:
        missing_counties[state][county] += 1
data.apply(convert_to_ggplot_style, axis=1)
print(county_sightings.head())
                 num_sightings
state   county                
alabama autauga              2
        baldwin             32
        barbour              0
        bibb                 1
        blount               0

[5 rows x 1 columns]
CPU times: user 23.6 s, sys: 120 ms, total: 23.8 s
Wall time: 23.8 s
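
Twenty-odd seconds for a row-wise apply is sluggish. A vectorized version (a sketch I didn't run; it drops the missing_counties bookkeeping and assumes known_fixes is defined at top level) would look something like:

# hypothetical vectorized version of the counting above
county = (data.county.str.lower()
              .str.replace("'", '', regex=False)
              .str.replace('.', '', regex=False)
              .str.replace(r'\s+(?:county|parish)$', '', regex=True)
              .replace(known_fixes))
state = data.state.map(state_map_rev)
counts = pd.DataFrame({'state': state, 'county': county}).groupby(['state', 'county']).size()
county_sightings['num_sightings'] = counts.reindex(county_sightings.index, fill_value=0)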

We could add more entries to the 'known fixes' dictionary to catch more of these, but there aren't too many, and Virginia is a little weird when it comes to counties (its independent cities don't belong to any county).

In [46]:
# number of sightings lost in the conversion
sum(count for counter in missing_counties.values() for count in counter.values())
Out[46]:
374
In [27]:
for state, counter in missing_counties.items():
    for county, count in counter.most_common():
        if count >= 6:
            print(state, county, count)
nevada vernon 7
virginia alexandria 27
virginia portsmouth 20
virginia williamsburg 14
virginia fredericksburg 13
virginia manassas 12
virginia charlottesville 11
virginia lynchburg 10
virginia danville 9
virginia winchester 8
virginia falls church 6
new york clatsop 6
In [47]:
county_sightings = county_sightings.reset_index()
county_sightings['state_county'] = county_sightings.apply(lambda x: x.state + ' ' + x.county, axis=1)
county_sightings = county_sightings[['state_county', 'num_sightings']]
county_sightings.head()
Out[47]:
state_county num_sightings
0 alabama autauga 2
1 alabama baldwin 32
2 alabama barbour 0
3 alabama bibb 1
4 alabama blount 0

5 rows × 2 columns

In [29]:
county_sightings.num_sightings.sum()
Out[29]:
42624

Counties in the west are larger, so this doesn't look quite the same as the FlowingData and xkcd maps.

In [48]:
%%R -i county_sightings -w900
counties <- map_data("county")
counties$region <- do.call(paste, c(counties[c("region", "subregion")], sep = " "))

p <- ggplot(county_sightings, aes(map_id=state_county))
p <- p + geom_map(aes(fill=num_sightings), map=counties)
p <- p + expand_limits(x=counties$long, y=counties$lat)
p <- p + coord_map("conic", lat0=30) + scale_fill_gradientn(trans="sqrt", colours=c('white', '#640000'))
p + ggtitle('UFO sightings by county') + xlab('long') + ylab('lat')

Get county populations from census data

I used a data set found here:

http://www.census.gov/popest/data/counties/totals/2013/CO-EST2013-alldata.html

In [31]:
populations = pd.read_csv('CO-EST2013-Alldata.csv', encoding='ISO-8859-1')
populations.head()
Out[31]:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 NPOPCHG_2010 NPOPCHG_2011 NPOPCHG_2012 NPOPCHG_2013 BIRTHS2010 BIRTHS2011 BIRTHS2012
0 40 3 6 1 0 Alabama Alabama 4779736 4779758 4785570 4801627 4817528 4833722 5812 16057 15901 16194 14966 59700 59356 ...
1 50 3 6 1 1 Alabama Autauga County 54571 54571 54613 55278 55265 55246 42 665 -13 -19 170 636 607 ...
2 50 3 6 1 3 Alabama Baldwin County 182265 182265 183223 186727 190675 195540 958 3504 3948 4865 549 2187 2127 ...
3 50 3 6 1 5 Alabama Barbour County 27457 27457 27341 27301 27232 27076 -116 -40 -69 -156 69 334 277 ...
4 50 3 6 1 7 Alabama Bibb County 22915 22919 22883 22770 22662 22512 -36 -113 -108 -150 57 266 242 ...

5 rows × 68 columns

In [49]:
county_populations = populations.query('SUMLEV == 50')[['STNAME', 'CTYNAME', 'POPESTIMATE2013']]
county_populations.columns = ['state', 'county', 'population']
county_populations.state = [x.lower() for x in county_populations.state]
county_populations.county = [x.lower().replace("'", '').replace('.', '') for x in county_populations.county]
county_populations.head()
Out[49]:
state county population
1 alabama autauga county 55246
2 alabama baldwin county 195540
3 alabama barbour county 27076
4 alabama bibb county 22512
5 alabama blount county 57872

5 rows × 3 columns

And now again, the painful process of converting this to the ggplot states and counties.

In [50]:
def convert_to_ggplot_style(x):
    state = x.state
    county = x.county
    if county.split()[-1] in {'county', 'parish'}:
        county = county[:-7]
    if county.split()[-1] == 'city':
        county = county[:-5]
    known_fixes = {'dupage': 'du page', 
                   'district of columbia': 'washington', 
                   'desoto': 'de soto',
                   'dekalb': 'de kalb',
                   'lasalle': 'la salle',
                   'laporte': 'la porte',
                   'dewitt': 'de witt',
                   'lamoure': 'la moure',
                   'doña ana': 'dona ana'}
    county = known_fixes.get(county, county)
    if state == 'nevada' and county == 'carson':
        county = 'carson city'
    return county
county_populations.county = county_populations.apply(convert_to_ggplot_style, axis=1)
county_populations['state_county'] = county_populations.apply(lambda x: x.state + ' ' + x.county, axis=1)
county_populations = county_populations[['state_county', 'population']]
county_populations.head()
Out[50]:
state_county population
1 alabama autauga 55246
2 alabama baldwin 195540
3 alabama barbour 27076
4 alabama bibb 22512
5 alabama blount 57872

5 rows × 2 columns

Counties in Alaska and Hawaii, as well as some places in Virginia, are in the population data but not the filtered UFO sighting data. There are also a couple of places where a city and a county share the same name, so I'll combine those.

In [51]:
print(len(county_populations), len(county_populations.state_county.unique()))
3143 3136
In [52]:
county_populations = county_populations.groupby('state_county').sum().reset_index()
print(len(county_populations))
county_populations.head()
3136
Out[52]:
state_county population
0 alabama autauga 55246
1 alabama baldwin 195540
2 alabama barbour 27076
3 alabama bibb 22512
4 alabama blount 57872

5 rows × 2 columns

In [53]:
final_data = pd.merge(county_populations, county_sightings, on='state_county', how='left')
print(len(county_populations), len(county_sightings), len(final_data))
final_data = final_data.fillna(0)
final_data.head()
3136 3076 3136
Out[53]:
state_county population num_sightings
0 alabama autauga 55246 2
1 alabama baldwin 195540 32
2 alabama barbour 27076 0
3 alabama bibb 22512 1
4 alabama blount 57872 0

5 rows × 3 columns

In [54]:
final_data.num_sightings.sum()
Out[54]:
42624.0
In [55]:
final_data.to_csv('ufo_data.csv')

I've made this cleaned-up data available here.
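
To pick the data back up (in part 3 or anywhere else), read it with the index column that to_csv wrote:

final_data = pd.read_csv('ufo_data.csv', index_col=0)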

UFO sightings normalized by population, raw

Finally, a first answer to the original question.

In [56]:
# eval returns a new DataFrame unless inplace=True, so assign the result back
final_data = final_data.eval('raw_rate = num_sightings/population')
In [57]:
%%R -i final_data -w900
counties <- map_data("county")
counties$region <- do.call(paste, c(counties[c("region", "subregion")], sep = " "))

p <- ggplot(final_data, aes(map_id=state_county))
p <- p + geom_map(aes(fill=raw_rate), map=counties)
p <- p + expand_limits(x=counties$long, y=counties$lat)
p <- p + coord_map("conic", lat0=30) + scale_fill_gradientn(trans="sqrt", colours=c('white', '#640000'))
p + ggtitle('UFO sightings adjusted for population (raw)') + xlab('long') + ylab('lat')

In part 3 I explore a few ways to smooth this out so that the county-level estimates are more reasonable.

In [75]:
!date
Sat May 24 09:22:28 EDT 2014