A New Approach to Tagging

August 30th, 2007
by Jeremy Thomas

I’d like to draw your attention to the Tagline Generator - an innovative approach to tag generation based on the Porter Stemming Algorithm. The idea is simple: provide text input to the generator (e.g. content from HTML pages, Word documents, or PowerPoint presentations), and the generator makes a list of all the unique words that have been used and counts how many times each word is used. Next, it identifies the different variations of words and combines them under the most common variation using the Porter Stemming Algorithm. For example, “promised”, “promises”, “promising”, and “promise” might be grouped under “promises”.
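
The counting and grouping steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the generator's actual code: the names `crude_stem` and `generate_tags` are hypothetical, and the suffix stripper is a deliberately crude stand-in for the real Porter Stemming Algorithm, which applies many more rules.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny suffix list standing in for the Porter Stemming Algorithm.
# The real algorithm applies several ordered rule phases; this does not.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def generate_tags(text, top_n=10):
    """Return (tag, count) pairs, most frequent first."""
    # Step 1: list the unique words and count how often each is used.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)

    # Step 2: group word variations that share a stem.
    groups = defaultdict(list)
    for word, count in counts.items():
        groups[crude_stem(word)].append((word, count))

    # Step 3: label each group with its most common variation
    # and sum the counts of all variations in the group.
    tags = []
    for variants in groups.values():
        label = max(variants, key=lambda pair: pair[1])[0]
        total = sum(count for _, count in variants)
        tags.append((label, total))

    # Sort by descending count, then alphabetically for stable output.
    tags.sort(key=lambda tag: (-tag[1], tag[0]))
    return tags[:top_n]


sample = "He promised. She promises, and promises again. Promising to promise."
print(generate_tags(sample, top_n=3))
```

With this sample, the four variations of “promise” collapse into a single tag labeled “promises”, because that is the variation used most often. A production version would swap `crude_stem` for a proper Porter stemmer and would likely filter out common words such as “and” and “to”.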

The relevance of this tag generator to Enterprise 2.0 is significant. Imagine algorithmically combing through enterprise content and then using this generator to create associations between content items. We could then create a more cohesive, uniform tag cloud without losing the spirit of tagging (i.e. the user's perspective), as tags are generated from content authored by knowledge workers.

Craig Mehta, the author of the generator, provides several demo sites where it has been used, including one generated from John Adams's inaugural address of 1797.


3 Responses to “A New Approach to Tagging”

  1. Peter Says:

    Interesting direction you are heading in, Jeremy. Going down this automated content discovery path opens doors that extend far beyond mere tagging. Imagine being able to create different tagging views of the content within an organization.

    Using techniques based upon Bayesian classifiers (http://en.wikipedia.org/wiki/Naive_Bayes_classifier), it could be possible to generate somewhat recursive tag clouds that allow a user to start at a high level and drill down into a knowledge base built upon the related semantics of documents.

    Tapping into the meaning of content, enabling contextual and whole-document tagging/searching (e.g. a user submits an entire document to find similar content based upon meaning), would increase the value and usability of E-2.0 for the organization.

  2. Elisabeth Freeman Says:

    How is this significantly different from exposing the internals of what some search engines do through tags? My impression is that search engine products that offer “clustering” are essentially doing this (without the semantic smarts and without preserving a real or generated hierarchical structure as the previous commenter suggests).

  3. Jeremy Thomas Says:

    Peter - I like this idea, especially the thought around contextual similarities. I think search engines are evolving in this contextual direction as well (e.g. Powerset), which leads to Elisabeth’s point.

    There are indeed a lot of similarities between how a search engine clusters content and how this tool generates tags. I think the revelation here is that this “clustering” logic is now surfaced and visible to the user (and is perhaps much more aggregated) whereas search engines store this information internally.
