Data. Data. Data.
December 29th, 2008 by Jeremy Thomas
Something I learned while working with the Information Management group at BearingPoint down in Australia continues to resonate for me at my “Web 2.0-ish” job in San Diego, CA. Data integrity is king but is bloody hard to maintain. Consider a data warehouse, where information drawn from other systems is consolidated, often for reporting purposes. Data warehouses can be used to answer the question “how many customers do I have?”, or more specifically, “how many residential customers do I have?”. Seems simple enough.
But data, dare I say “truth”, is federated. And each member of the federation has its own vernacular.
For example, the residential loan processing system might call a customer a “customer”, while the commercial loan processing system calls a customer a “client”. At the core these are the same entities, with “residential” or “commercial” being a modifier (as an adjective is to a noun). So a data warehousing solution would apply its central vernacular to these entities, allowing the question “how many customers do I have?” to be answered even though the answer is informed by two sources of truth.
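To make the idea concrete, here is a minimal sketch of applying a central vernacular to two sources. The record layouts, the names, and the `to_customers` helper are all hypothetical; a real warehouse load would be far more involved.

```python
from collections import Counter

# Hypothetical records from two source systems, each with its own vernacular.
residential_loans = [{"customer": "Alice"}, {"customer": "Bob"}]
commercial_loans = [{"client": "Acme Pty Ltd"}]


def to_customers(records, name_field, segment):
    """Translate one source's term into the warehouse's central schema."""
    return [{"name": r[name_field], "segment": segment} for r in records]


# Both sources land in one table that only knows the word "customer".
warehouse = (
    to_customers(residential_loans, "customer", "residential")
    + to_customers(commercial_loans, "client", "commercial")
)

# "How many customers do I have?"
total_customers = len(warehouse)

# "How many residential customers do I have?"
by_segment = Counter(row["segment"] for row in warehouse)
residential_customers = by_segment["residential"]
```

The segment is kept as a modifier on the row rather than as a separate entity, which is what lets both questions be answered from the same table.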
Data transformation and categorization work moderately well when an organization has control over its data sources (and has, therefore, a limited number of vernaculars). But consider the La Jolla, CA, page on Yelp, http://www.yelp.com/la-jolla-ca, which claims that La Jolla has 1028 restaurants worth reviewing. Most of this data is user-submitted. And how does a user classify Starbucks? “Food”? “Restaurants”? And what about subcategories? “Coffee and Tea”? “Desserts”? Some users might choose to use some of these categories, while others might use all. And it’s consistency that lies at the heart of the issue of maintaining data integrity. A user should have access to all restaurants when browsing by “Restaurants”.
If information is consistently categorized, even incorrectly, we can get accurate answers to our queries. But if it’s inconsistently categorized our answers will not be comprehensive.
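A small sketch shows how inconsistent categorization makes a query incomplete. The listings, the `PARENT` mapping, and both helper functions are invented for illustration and say nothing about how Yelp actually stores or cleans its data.

```python
# Hypothetical user-submitted listings; each user picked categories differently.
listings = [
    {"name": "Cafe A", "categories": {"Restaurants", "Coffee and Tea"}},
    {"name": "Cafe B", "categories": {"Food"}},
    {"name": "Bistro C", "categories": {"Restaurants"}},
]


def browse(items, category):
    """Return the names of all listings tagged with the given category."""
    return [item["name"] for item in items if category in item["categories"]]


# Browsing the raw data misses Cafe B, which was only tagged "Food".
inconsistent = browse(listings, "Restaurants")

# Assumed cleanup rule: these categories always imply "Restaurants".
PARENT = {"Food": "Restaurants", "Coffee and Tea": "Restaurants"}


def normalize(categories):
    """Add the implied parent category for every category that has one."""
    return categories | {PARENT[c] for c in categories if c in PARENT}


normalized = [
    {**item, "categories": normalize(item["categories"])} for item in listings
]

# After normalization the same query is comprehensive.
consistent = browse(normalized, "Restaurants")
```

Whether “Food” really ought to imply “Restaurants” is debatable. The point is only that once a single rule is applied everywhere, browsing by “Restaurants” returns every listing.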
So how, then, do websites like yelp.com deliver meaningful, consistently categorized results when they’re reliant on crowdsourcing? Are there really only 1028 review-worthy restaurants in La Jolla? And what of those restaurants that are mistakenly subcategorized as “Turkish” when they’re actually “Lebanese”?
Manual Labor is the answer.
I suspect sites like Yelp.com leverage services like Mechanical Turk to comb through the thousands of user-submitted records and apply a more uniform categorization scheme. And this is why data integrity is bloody hard to maintain: there is so much manual labor involved. I question the sustainability of such a model, especially as a site grows and gathers more data.
But what I can say is that it matters more for data to be correctly categorized than for it to be merely mostly correctly categorized. If users on Yelp search for “Automotive” assets and are shown beauty salons, they will leave. Data integrity is king.
Co-Author
December 31st, 2008 at 3:46 pm
But 1028 is just a scent: “there are many possibilities here”. Coffee and Tea or Dessert may both take you to Starbucks, but neither is wrong. They are just different paths for different users.
As for beauty salons listed as Automotive… couldn’t the same crowd that created the listings police them with feedback forms and the like? That way, while it would still be a manual process to fix, it would be targeted and manageable.