A New Approach to Tagging
August 30th, 2007 by Jeremy Thomas
I’d like to draw your attention to the Tagline Generator, an innovative approach to tag generation based on the Porter Stemming Algorithm. The idea is simple: provide text input to the generator (e.g. content from HTML pages, Word documents, or PowerPoint presentations), and the generator makes a list of all the unique words that have been used and counts how many times each word is used. Next, it identifies the different variations of each word and combines them under the most common variation using the Porter Stemming Algorithm. For example, “promised”, “promises”, “promising”, and “promise” might be grouped under “promises”.
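To make the steps above concrete, here is a minimal sketch in Python. It is not the Tagline Generator's actual code, which I haven't seen. In particular, `crude_stem` is a hypothetical, deliberately simplistic suffix stripper standing in for the real Porter Stemming Algorithm, and the `generate_tags` name is my own invention for illustration.

```python
import re
from collections import Counter, defaultdict

# ASSUMPTION: a crude stand-in for the Porter Stemming Algorithm.
# It only strips a handful of common English suffixes.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def generate_tags(text):
    """Count unique words, then merge variants under the most common one."""
    # Step 1: list the unique words and count how often each is used.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))

    # Step 2: group word variants that share the same stem.
    groups = defaultdict(Counter)
    for word, n in counts.items():
        groups[crude_stem(word)][word] = n

    # Step 3: label each group with its most frequent variant
    # and give it the combined count of all variants.
    tags = {}
    for variants in groups.values():
        label, _ = variants.most_common(1)[0]
        tags[label] = sum(variants.values())
    return tags


sample = "He promised. She promises; they made promises while promising a promise."
tags = generate_tags(sample)
print(tags["promises"])  # 5: all four variants are counted under "promises"
```

A real implementation would swap `crude_stem` for a proper Porter stemmer and would probably filter out stop words such as “a” and “the” before building the tag cloud.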
The relevance of this tag generator to Enterprise 2.0 is significant. Imagine algorithmically combing through enterprise content, then creating associations between content items using this generator. We could then create a more cohesive, uniform tag cloud without losing the spirit of tagging (i.e. user perspective), because the tags are generated from content authored by knowledge workers.
Craig Mehta, the author of the generator, provides us with several demo sites where this generator has been used, including one generated from John Adams’s inaugural address in 1797.




September 4th, 2007 at 11:06 pm
Interesting direction you are heading in, Jeremy. Going down this automated content discovery path opens doors that extend far beyond mere tagging. Imagine being able to create different tagging views of the content within an organization.
Using techniques based upon Bayesian classifiers (http://en.wikipedia.org/wiki/Naive_Bayes_classifier), it could be possible to generate somewhat recursive tag clouds that allow a user to start at a high level and drill down into a knowledge base according to the related semantics of documents.
Tapping into the meaning of content, enabling contextual and whole-document tagging/searching (e.g. a user submits an entire document to find similar content based upon meaning), would increase the value and usability of E-2.0 for the organization.
September 5th, 2007 at 4:32 am
How is this significantly different from exposing the internals of what some search engines do through tags? My impression is that search engine products that offer “clustering” are essentially doing this (without the semantic smarts and without preserving a real or generated hierarchical structure as the previous commenter suggests).
September 5th, 2007 at 2:03 pm
Peter - I like this idea, especially the thought around contextual similarities. I think search engines are evolving in this contextual direction as well (e.g. Powerset), which leads to Elisabeth’s point.
There are indeed a lot of similarities between how a search engine clusters content and how this tool generates tags. I think the revelation here is that this “clustering” logic is now surfaced and visible to the user (and is perhaps much more aggregated) whereas search engines store this information internally.