Speaking at the Business of APIs Conference

November 12th, 2009
by Jeremy Thomas

I’m happy to announce that I’m one of the featured speakers at the Business of APIs Conference in NYC on 16 November. I’ve been leading the charge to open our data at Active.com, and we’ve started a slow rollout of our API. I’ll be talking about the journey we’ve taken to get to where we are today with our API. We’ve still got a long way to go.

If you’re in NYC on Monday and are interested in APIs, come by and check it out!

Data. Data. Data.

December 29th, 2008
by Jeremy Thomas

Something I learned while working with the Information Management group at BearingPoint down in Australia continues to resonate with me at my “Web 2.0-ish” job in San Diego, CA. Data integrity is king but is bloody hard to maintain. Consider a data warehouse, where information about information is stored, often for reporting purposes. Data warehouses can be used to answer the question “how many customers do I have?”, or more specifically, “how many residential customers do I have?”. Seems simple enough.

But data, dare I say “truth”, is federated.  And each member of the federation has its own vernacular.

For example, the residential loan processing system might call a customer a “customer”, while the commercial loan processing system calls a customer a “client”. At the core these are the same entities, with “residential” or “commercial” being a modifier (as an adjective is to a noun). So a data warehousing solution would apply its central vernacular to these entities, allowing the question “how many customers do I have?” to be answered even though the answer is informed by two sources of truth.
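
As a rough illustration of that idea, here is a minimal Python sketch. The record layouts, the field names (`customer_id`, `client_id`, `segment`) and the helper `to_customer` are all made up for the example; a real warehouse would do this in its load process, not in a script like this.

```python
from collections import Counter

# Hypothetical extracts from two source systems, each with its own vernacular.
residential_loans = [
    {"customer_id": "R-1", "name": "Ann"},
    {"customer_id": "R-2", "name": "Bob"},
]
commercial_loans = [
    {"client_id": "C-1", "name": "Acme Pty Ltd"},
]


def to_customer(record, id_field, segment):
    """Translate a source record into the central 'customer' shape.

    The segment ("residential" or "commercial") is kept as a modifier
    on the shared entity rather than as a separate kind of entity.
    """
    return {
        "customer_id": record[id_field],
        "name": record["name"],
        "segment": segment,
    }


customers = (
    [to_customer(r, "customer_id", "residential") for r in residential_loans]
    + [to_customer(r, "client_id", "commercial") for r in commercial_loans]
)

# "How many customers do I have?"
total_customers = len(customers)

# "How many residential customers do I have?"
customers_by_segment = Counter(c["segment"] for c in customers)
```

Once both sources speak the same vocabulary, the general question and the more specific one are answered from the same list.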

Data transformation and categorization work moderately well when an organization has control over its data sources (and has, therefore, a limited number of vernaculars). But consider the La Jolla, CA, page on Yelp, http://www.yelp.com/la-jolla-ca, which claims that La Jolla has 1028 restaurants worth reviewing. Most of this data is user-submitted. And how does a user classify Starbucks? “Food”? “Restaurants”? And what about subcategories? “Coffee and Tea”? “Desserts”? Some users might choose to use some of these categories, while others might use all. And it’s consistency that lies at the heart of the issue of maintaining data integrity. A user should have access to all restaurants when browsing by “Restaurants”.

If information is consistently categorized, even incorrectly, we can get accurate answers to our queries. But if it’s inconsistently categorized, our answers will not be comprehensive.
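
A small Python sketch of why that is. The listings, the `browse` helper and the normalization rule are invented for illustration and say nothing about how Yelp actually stores or cleans its data.

```python
# Hypothetical user-submitted listings. Two users tagged similar coffee
# shops differently: one used "Restaurants", the other used "Food".
listings = [
    {"name": "Cafe A", "categories": {"Restaurants", "Coffee and Tea"}},
    {"name": "Cafe B", "categories": {"Food", "Coffee and Tea"}},
    {"name": "Bistro C", "categories": {"Restaurants"}},
]


def browse(records, category):
    """Return the names of all records tagged with the given category."""
    return sorted(r["name"] for r in records if category in r["categories"])


# Browsing the raw data silently drops Cafe B.
inconsistent = browse(listings, "Restaurants")


def normalize(record):
    """Apply one uniform (and debatable) rule to every record:
    anything tagged "Coffee and Tea" also counts as "Restaurants"."""
    categories = set(record["categories"])
    if "Coffee and Tea" in categories:
        categories.add("Restaurants")
    return {**record, "categories": categories}


# After the same rule is applied everywhere, the browse is complete.
consistent = browse([normalize(r) for r in listings], "Restaurants")
```

Whether a coffee shop really is a restaurant is arguable, but because the rule is applied to every record, browsing by “Restaurants” no longer misses anything.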

So how, then, do websites like yelp.com deliver meaningful, consistently categorized results when they’re reliant on crowdsourcing? Are there really only 1028 review-worthy restaurants in La Jolla? And what of those restaurants that are mistakenly subcategorized as “Turkish” when they’re actually “Lebanese”?

Manual Labor is the answer.

I suspect sites like Yelp.com leverage services like Mechanical Turk to comb through the thousands of user-submitted records and apply a more uniform categorization scheme. And this is why data integrity is bloody hard to maintain: there is so much manual labor involved. I question the sustainability of such a model, especially as a site grows and gathers more data.

But what I can say is that it is more important for data to be correctly categorized than it is for it to be mostly correctly categorized. If users on Yelp search for “Automotive” assets and are shown beauty salons, they will leave. Data integrity is king.