We've launched a number of improvements today that help solve two problems: ungeocoded data and inaccurately geocoded data. Some of this is unavoidably technical, but I'll try my best to explain it without getting too geeky.
A quick geocoding primer
Geocoding is the process of converting an address into a longitude/latitude pair. For example, the address 53 W. Jackson, Chicago, IL, corresponds to the longitude/latitude point (-87.6296034, 41.87813100). You can geocode any U.S. address for free using tools such as geocoder.us — enter an address there, and the site will tell you its longitude/latitude.
Geocoding is essential to what we do at EveryBlock, because it makes geographic querying possible. Once you have a set of longitude/latitude points, a computer can place them on a map, calculate which points are within a given area and determine which points are near a given block.
But geocoding isn't perfect. To get the longitude/latitude for an address, geocoding software relies on large databases that map streets to points — and these databases inevitably omit certain addresses or streets. As new buildings and streets are built, these databases can be slow to catch up. (If you've ever searched MapQuest or Google Maps for a valid address that it couldn't find, you've come across this issue.)
Another problem is that of misspellings and ambiguity. If you make a typo in a street name, an overly strict geocoder may not be able to correct the misspelling. An address like "800 Halsted" in Chicago can't be geocoded with certainty because both 800 N. Halsted and 800 S. Halsted exist as distinct places.
Finally, a fundamental problem with geocoding is that it's usually interpolative. A standard geocoding database, like the one we use at EveryBlock, doesn't map every address to a longitude/latitude — it only maps address ranges to longitude/latitude lines. A geocoder then works by searching the database for the appropriate address range, then doing a bit of arithmetic to figure out where on the line the exact address is. (The logic goes something like this: "If the addresses in this particular range go from 128 to 156 Main St., then the address 142 Main St. is exactly between those numbers, which implies the longitude/latitude point is exactly in the middle of that line.") The problem here is that the technique assumes addresses/houses are arranged uniformly on a given block, but different houses and city lots have different sizes — and, hence, the resulting longitude/latitude may be a couple of meters off. However, this interpolative approach is an acceptable compromise given the lack of availability of more granular geographic data. If you look closely, you'll notice this problem on even the largest mapping sites like Google Maps.
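To make the arithmetic concrete, here is a minimal Python sketch of that interpolation. It is an illustration under simple assumptions, not EveryBlock's actual geocoder: the function name, its parameters and the endpoint coordinates are all made up for this example, and it ignores details a real geocoder handles, such as odd and even sides of the street.

```python
def interpolate_address(number, range_start, range_end, line_start, line_end):
    """Estimate a (longitude, latitude) for a house number on a street segment.

    range_start and range_end are the house numbers at the two ends of the
    segment; line_start and line_end are the matching (longitude, latitude)
    endpoints. All names here are hypothetical.
    """
    low, high = min(range_start, range_end), max(range_start, range_end)
    if not low <= number <= high:
        raise ValueError("house number is outside this address range")
    if range_start == range_end:
        fraction = 0.0  # single-address range: use the start of the line
    else:
        # How far along the range the house number sits, from 0.0 to 1.0.
        fraction = (number - range_start) / (range_end - range_start)
    lng = line_start[0] + fraction * (line_end[0] - line_start[0])
    lat = line_start[1] + fraction * (line_end[1] - line_start[1])
    return (lng, lat)


# 142 is halfway between 128 and 156, so the estimate is the line's midpoint.
# The endpoint coordinates below are invented for illustration.
point = interpolate_address(142, 128, 156, (-87.630, 41.878), (-87.628, 41.878))
```

Because the fraction depends only on the house number, any block whose lots are unevenly sized will produce an estimate that is slightly off, which is the limitation described above.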
On EveryBlock, if we cannot geocode a given piece of news or data, we're not able to map it, and it won't show up on block, neighborhood or ZIP pages. It'll only be accessible if you browse by type of information. For example, this restaurant inspection in New York City has an address of "000 TERMINAL 5, QUEENS," which our system couldn't geocode. (Looks like it's a food establishment in JFK Airport.) If you live near this establishment, you won't see it on your EveryBlock neighborhood page, because it hasn't been geocoded and, hence, the system doesn't know that it's geographically near you. However, it's still accessible if you search for restaurant inspections by name or drill down by date.
With this background information in mind, here are the two improvements we've made today:
Improvement #1: Handling inaccurately geocoded data
In some cases, our data sources provide us with the raw longitude/latitude point for each piece of data. For example, the Los Angeles Police Department includes longitude/latitude points in its crime data, which we publish on our L.A. site. We like it when this is the case, because it saves us from having to geocode the addresses on our end, and because we've found from experience that government agencies' geocoding systems are generally higher quality than our own in-house geocoder. But we've recently learned that our data sources' geocoding results aren't always accurate.
The Los Angeles Times got in touch with us on Friday afternoon to point out that some of our L.A. crime data was incorrectly geocoded to a point near City Hall. We confirmed that our source for the data, the LAPD's crime Web site, was also incorrectly geocoding the same crimes to the same incorrect point — so although we weren't originating the error, we were perpetuating it. On Sunday, the Times wrote an article about the LAPD's map glitch and called us out on it.
Thanks to this heads-up from the Times, we've now improved our system to fix the problem. From now on, rather than relying blindly on our data sources' longitude/latitude points, we cross-check those points with our own geocoding of the address provided. If the LAPD's geocoding for a particular crime is significantly off from our own geocoder's results, then we won't map that crime at all, and we'll publish a note on the crime page that explains why a map isn't available. (If you're curious, we're using 375 meters as our threshold. That is, if our own geocoder comes up with a point more than 375 meters away from the point that the LAPD provides, then we won't place the crime on a map, or on block/neighborhood pages.)
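For the curious, the cross-check can be sketched in a few lines of Python. This is only an illustration: the 375-meter threshold comes from the description above, but the function names, the use of the haversine great-circle formula and the sample coordinates are assumptions made for this example, not a description of our production code.

```python
import math

EARTH_RADIUS_METERS = 6371000  # mean Earth radius
THRESHOLD_METERS = 375         # the threshold mentioned above


def distance_meters(point_a, point_b):
    """Great-circle (haversine) distance between two (longitude, latitude) points."""
    lng1, lat1 = point_a
    lng2, lat2 = point_b
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_METERS * math.asin(math.sqrt(a))


def should_map(source_point, our_point, threshold=THRESHOLD_METERS):
    """Return True only if both geocoders agree to within the threshold.

    our_point is None when our own geocoder could not geocode the address;
    in that case the record is not mapped.
    """
    if our_point is None:
        return False
    return distance_meters(source_point, our_point) <= threshold


# Hypothetical points about 110 meters apart: close enough to map.
near = should_map((-118.3000, 34.0500), (-118.3000, 34.0510))
# Hypothetical points several kilometers apart: too far, so not mapped.
far = should_map((-118.2400, 34.0500), (-118.3100, 34.0500))
```

A record is mapped only when `should_map` returns True; anything else gets the explanatory note instead of a map.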
For example, this crime from Jan. 1, 2009, was reported at Western Ave. and Pico Blvd., but the LAPD's geocoder placed it near City Hall — more than three miles away from that intersection. Because of this large discrepancy, we're not mapping the crime, and we're displaying a message on the crime's page that explains this.
We were curious about the LAPD system's geocoding accuracy and did a bit of number crunching. Here's a map of L.A. crimes where the LAPD system's longitude/latitude was more than one kilometer away from the result of our own geocoder. (This does not include any crimes whose addresses could not be geocoded by our own geocoder.) The numbers on the map represent the number of crimes in each cluster, and we used the longitude/latitude points from the LAPD's system.
Starting today, none of these crimes will be mapped on EveryBlock, because the data is too inconsistent to be trustworthy.
This change goes a long way toward preventing further inaccuracies of this sort. Now, for an L.A. crime to be mapped, it must not only have an LAPD longitude/latitude; our own geocoder must also be able to geocode its address, and the two results must fall within our distance threshold of each other.
Improvement #2: Surfacing ungeocoded data
The second change we've made has to do with records on our site whose addresses could not be geocoded (and, hence, are not mapped). We want to be clearer about how many records can't be geocoded, so that EveryBlock users examining our aggregate statistics can keep that caveat in mind.
Starting today, wherever we have aggregate charts by neighborhood, ZIP or other boundary, we include the number and percentage of records that couldn't be geocoded. Each location chart has a new "Unknown" row that provides these figures.
Note that technically this figure includes more than nongeocodable records: it also includes any records that were successfully geocoded but don't lie in any neighborhood. For example, in our Philadelphia crime section, you can see that 1 percent of crime reports in the last 30 days are in an "Unknown" neighborhood; this means those 35 records either couldn't be geocoded or lie outside any of the Philadelphia neighborhood boundaries that we've compiled.
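As a rough illustration of how an "Unknown" row like this can be tallied, here is a short Python sketch. It assumes each record has already been assigned a neighborhood name, or None when it couldn't be geocoded or fell outside every boundary; the function name and the sample data are hypothetical.

```python
from collections import Counter


def neighborhood_counts(neighborhoods):
    """Return {name: (count, percentage)} for a list of neighborhood names.

    A value of None means the record could not be geocoded or did not fall
    inside any known neighborhood; those records are grouped as "Unknown".
    """
    counts = Counter(
        name if name is not None else "Unknown" for name in neighborhoods
    )
    total = sum(counts.values())
    return {
        name: (count, 100.0 * count / total)
        for name, count in counts.items()
    }


# Made-up sample: three records in one neighborhood, one with no neighborhood.
rows = neighborhood_counts(["Fishtown", "Fishtown", "Fishtown", None])
```

Both kinds of record end up in the same row, which is why the "Unknown" figure is an upper bound on the number of records that truly couldn't be geocoded.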
We hope this article, along with these new site changes, helps you get a better understanding of data caveats throughout our site. Let us know if you have any other suggestions by e-mailing feedback at everyblock.com.