The terms cluster, group, subject, category, topic, and folder all mean roughly the same thing on this page.
In This Page:
What is the Point of Topic Clustering?
What is Clustering?
How does Topic Clustering Work?
Specific Search Tools
What is the Point of Topic Clustering?
Directories are edited and organized by humans, and all websites are listed in a hierarchy, grouped according
to their topic, with subtopics listed within topics. Search engines' databases have no such structure to help focus in on a topic.
Many words and phrases have multiple meanings, and this is not a problem for directories, but it is for search engines. Homer was a Greek writer and
is also a character on The Simpsons. To solve this problem, many search engines are enhanced by also using a directory. For example, a search for Homer
on Google will have two categories listed, Homer the Greek and
Homer Simpson (I have shortened the directory folder names).
However, when you browse these categories, you are just using a directory and have lost the benefits of a search engine. The solution to all this is topic clustering.
What is Clustering?
Clustering just means grouping. Grouping results makes them easier to read or follow. Many search engines, including Google, use site clustering.
When Google finds more than one result on the same website, it shows the second result indented from the first and hides the rest, so the results aren't filled with listings from
a single website. Since most websites are generally about one topic, site clustering is one step towards topic clustering. Topic clustering, or grouping results into topics/subjects, is a good
tool to help refine searches; it is no surprise that the two newest search engines (Teoma and WiseNut) both use it.
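Site clustering as described above can be sketched in a few lines of Python. This is only an illustration, not Google's actual code: the function name `site_cluster`, the `shown_per_site` parameter, and the example URLs are all made up for this sketch, and "same website" is assumed to mean "same host name".

```python
from urllib.parse import urlparse


def site_cluster(urls, shown_per_site=2):
    """Group ranked result URLs by host, showing at most `shown_per_site` each.

    Returns a list of (host, shown_urls, hidden_count) in first-seen order.
    Assumption: two results are on the same website if their hosts match.
    """
    groups = {}
    for url in urls:
        host = urlparse(url).netloc.lower()
        groups.setdefault(host, []).append(url)
    return [
        (host, hits[:shown_per_site], max(0, len(hits) - shown_per_site))
        for host, hits in groups.items()
    ]


# Hypothetical ranked results for a search on "Homer".
results = [
    "http://example.com/homer",
    "http://example.org/simpsons",
    "http://example.com/iliad",
    "http://example.com/odyssey",
]
clustered = site_cluster(results)
for host, shown, hidden in clustered:
    print(host, shown, f"(+{hidden} hidden)")
```

The first result from each site keeps its rank, the second would be shown indented under it, and the remainder are only counted.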
How does Topic Clustering Work?
Different tools cluster using different methods. One of the more common methods is to look for phrases which appear in multiple listings. All pages that contain a certain phrase are listed in one
cluster, whose name is that phrase. Teoma's method is more high-tech; see below for more information.
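The phrase method can be illustrated with a minimal Python sketch. It assumes that a "phrase" is simply a run of two consecutive words and that a phrase must appear in at least two listings to form a cluster; real tools almost certainly use more refined phrase extraction. The names `phrase_clusters`, `phrase_len`, and `min_listings`, and the sample listings, are invented for this example.

```python
import re
from collections import defaultdict


def phrase_clusters(listings, phrase_len=2, min_listings=2):
    """Map each shared phrase to the indexes of the listings containing it."""
    clusters = defaultdict(set)
    for index, text in enumerate(listings):
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Slide a window of `phrase_len` words across the listing text.
        for start in range(len(words) - phrase_len + 1):
            phrase = " ".join(words[start:start + phrase_len])
            clusters[phrase].add(index)
    # Keep only phrases shared by enough listings; the phrase names the cluster.
    return {
        phrase: sorted(members)
        for phrase, members in clusters.items()
        if len(members) >= min_listings
    }


# Hypothetical result titles for a search on "Homer".
listings = [
    "Homer Simpson episode guide",
    "The Iliad by Homer, Greek poet",
    "Homer Simpson quotes and sounds",
    "Greek poet Homer and the Odyssey",
]
clusters = phrase_clusters(listings)
for name, members in clusters.items():
    print(name, members)
```

With these sample titles the two meanings of "Homer" separate into a "homer simpson" cluster and a "greek poet" cluster.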
Northern Light note: Northern Light has ended its free search, but you can still use it here
On the left of the page on Northern Light results pages are Custom Search Folders in blue.
The first one is always "Special Collection Documents" which Northern Light charges you to view, but the rest are topic clusters. Northern Light seems to do a pretty good job of this, and the folders
contain more listings than any of the other tools mentioned on this page.
When you click on a folder, the right-hand page shows a link to the previous level and a listing of folders within the folder you chose. You can go down as many levels as you want;
however, when you get very specific, the folder names are mostly domain names of sites in that folder. I was able to go about seven levels down before it wouldn't let me go any further. If there
are many folders, the last will be labeled "all others...". According to Northern Light, their folders are clusters not only of topic, but also of type (press releases, product reviews, etc.),
source (commercial Web sites, personal pages, magazines, encyclopedias, databases, etc.), and language.
UC Berkeley Library says that the folders are generated from a controlled vocabulary assigned to each web document and article. More information about Northern Light's folders is available
in this 1998 article by Greg Notess.
Vivísimo's clustering is entirely dynamic; the folder names are not predefined. Vivísimo is actually a meta search engine, so the text that it uses to cluster consists of website titles and descriptions.
The folders are displayed in a frame on the left in "Windows Explorer-style" (a folder tree). View Vivísimo's information about it. They
say they use "conceptual clustering," so if a cluster can't be defined (given a title) well, it is rejected. If a web page fits into more than one cluster, then it will be listed in more than
one cluster. The following is an extract from Vivísimo's website about how their technology differs from Northern Light's: "Northern Light classifies each document within an entire source
collection into pre-defined subjects and then, at query time, selects those subjects that best match the search results. Vivísimo does not use pre-defined subjects; its annotations are created
WiseNut's topic clusters, called WiseGuide Categories, are made by using the semantic relationships with words in your query. See their information about WiseGuide.
They are shown in the black area above search results. Beside each cluster is a "search this" link
which is equivalent to searching the whole web for the name of that cluster. They still need much improvement to be as good as the other tools mentioned on this page. For some reason, most
clusters are small, containing 2-25 listings and often fewer than ten. WiseNut's programming seems to have some bugs; about half the time when you click on a cluster, you are taken to a different
one, or even one which does not exist. Clusters rarely contain other clusters, but when they do they are denoted by a "plus sign." This is seen most often beside "others." The sub-cluster is
shown in a gray box area under the black one, but still above the search results.
Teoma's topic clusters are shown as folders above the results, labeled "Web Pages Grouped by Topic." The folders are named fairly well, usually with phrases rather than keywords.
Once you click on a folder, you can't go further down; clicking on the names again will only add them to your query. Teoma says that they group the top results, and this seems
to range widely, from about thirty to three hundred. Teoma, which is still in beta (not complete), is now owned by Ask Jeeves, who intend to make use of its technology.
Teoma's clustering is not done by finding common phrases; instead Teoma finds communities, clusters of web pages which link to each other, and then looks for words or phrases to
name the cluster.
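The general idea of grouping by links and then naming the group can be sketched as follows. Teoma has not published its algorithm in the material cited here, so this is a deliberately simple stand-in: a "community" is assumed to be a connected group of pages in the link graph, and the name is assumed to be the word that appears in the most member titles. The functions `link_communities` and `name_community`, the stop-word list, and the sample pages are all hypothetical.

```python
from collections import Counter


def link_communities(links):
    """Return sets of pages joined by links (links treated as two-way)."""
    neighbours = {}
    for source, target in links:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)
    seen, communities = set(), []
    for page in neighbours:
        if page in seen:
            continue
        # Depth-first walk collecting every page reachable from `page`.
        stack, community = [page], set()
        while stack:
            current = stack.pop()
            if current in community:
                continue
            community.add(current)
            stack.extend(neighbours[current] - community)
        seen |= community
        communities.append(community)
    return communities


def name_community(community, titles, stop_words=frozenset({"the", "of", "and"})):
    """Name a community after the word found in the most member titles."""
    counts = Counter()
    for page in community:
        counts.update(set(titles[page].lower().split()) - stop_words)
    return counts.most_common(1)[0][0]


# Hypothetical pages (a-e), their titles, and the links between them.
titles = {
    "a": "Simpsons episode guide",
    "b": "Simpsons quotes",
    "c": "Simpsons fan art",
    "d": "Iliad translation",
    "e": "Iliad study notes",
}
links = [("a", "b"), ("b", "c"), ("d", "e")]
communities = link_communities(links)
names = [name_community(c, titles) for c in communities]
print(list(zip(names, communities)))
```

Note that the grouping step never looks at the page text; only the naming step does, which is the difference from the phrase-based method.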
AllTheWeb's clusters, which they call FAST Topics, are still in beta (not complete). See their information.
While topic clustering bridges the gap between search engines and directories, AllTheWeb also bridges the gap between directories and topic clustering. The backbone of AllTheWeb's topics is the
Open Directory. AllTheWeb has given every single category in the directory a shorter name, which is quite a lot of work. FAST Topics are generated from the directory and also dynamically.
Looking at the names of the topics you can tell which they are: directory topics have hierarchical names (like Food > Cheese), and dynamically created topics are named with keywords and key phrases (like food, cheese).
When you go to any of the folders, the results displayed are a subset of your original search (as with the other search tools on this page); however, each has a link to "more results like these." The results displayed after
clicking on this link may not have been found with your original search, but they are similar or related to results on the topic that you chose. AllTheWeb's topics tend to have more results each than any of the other tools
mentioned on this page. AllTheWeb's use of the Open Directory definitely makes it one of the better topic clusterers; unfortunately they only cluster the top 200 results and do not have subtopics.
Like Vivísimo, iBoogie is a meta search engine, displays topic clusters in a frame, and has subtopics. iBoogie is currently not one of the better topic clusterers. Most folders contain fewer
than five sites, and often folders will contain only one other folder. Still, the folder tree is an alternative way to browse search results. iBoogie also has its own version of the Open Directory,
which can also be navigated with a folder tree. These are the default folders displayed on iBoogie before you conduct a search.
Query Server is operated by Open Text (see search tool graveyard). On the front page there is an option for no clustering, clustering by site, clustering by content (topic),
and clustering by both. If you are looking for clustering, we recommend both. If you do select clustering, then the results page has them clustered already. At the top of the page is a list of the clusters, which are links to lower down on the page
where the sites in that cluster are listed. Query Server determines its clusters based on a phrase extraction algorithm (see above). More details about the clustering are available on
Free Pint. Overall, Query Server's clustering looks pretty good.
Turbo10 is a meta search engine. When you perform a search, there is a selection box at the top center of the page with a list of clusters and how many sites each cluster contains. When you select a cluster, you are shown
only those results. The topic clustering is probably a form of phrase-extraction, and it seems to work well. There are no sub-clusters.
Google's News Search and the news headlines on Google's news page use some topic clustering. Each headline is accompanied by up to four related articles. This is done automatically with no human intervention, and the
software probably groups results based on the articles using many of the same words. According to an article on ABC News, Google had a
prototype for topic clustering "long ago," but user testing found that it wasn't worth the "screen real estate." Too bad; they would probably have done a good job and made the best search engine even better.
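Grouping articles that use many of the same words can be sketched with a simple word-overlap score. This is a guess at the general idea only, not Google's method: it assumes similarity is the share of distinct words two headlines have in common (Jaccard similarity), that a headline joins the first lead story it resembles, and that each lead takes at most four related articles as described above. The function names, the `threshold` value, and the sample headlines are invented for the example.

```python
def word_set(text):
    """Lower-cased set of the words in a headline."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of distinct words that two word sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_headlines(headlines, threshold=0.3, max_related=4):
    """Greedily attach each headline to the first lead story it resembles."""
    groups = []  # list of (lead_headline, [related headlines])
    for headline in headlines:
        words = word_set(headline)
        for lead, related in groups:
            similar = jaccard(words, word_set(lead)) >= threshold
            if similar and len(related) < max_related:
                related.append(headline)
                break
        else:
            # No existing lead was similar enough, so start a new group.
            groups.append((headline, []))
    return groups


# Hypothetical headlines.
headlines = [
    "Search engine Teoma launches new beta",
    "Teoma search engine beta launches today",
    "Cheese prices rise in Europe",
    "Europe cheese prices rise again",
]
groups = group_headlines(headlines)
for lead, related in groups:
    print(lead, "->", related)
```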
Features Chart (see details below chart)
* iBoogie (without searching) uses predefined folders from the Open Directory (see above)
||Predefined or Dynamic
||Framed Folder Tree
||Engine or Meta Engine
||# of Clusters Shown on Default
||Locations per Listing
||Shows # in Cluster
||# of Listings Clustered
||aspects of both
||13, click on "all others..."
||meta search engine
||10, click on "more"
||varies, click on "others"
||8, click on "show all"
||9, click the arrow
||meta search engine
||about 16, click on arrow
||varies (shows all)
Predefined or Dynamic - either the clusters are made before you search (predefined) or after (created dynamically)
Sub Clusters - if the clusters contain sub clusters, and those contain sub clusters, etc
Framed Folder Tree - if the clusters and sub clusters are displayed as a folder tree ("Windows Explorer-style"), in a frame
Engine or Meta Engine - search engines have the entire page's text, while meta search engines only have the description to use.
This is important if the clusters are determined by the text on the page (usually "phrase extraction").
# of Clusters Shown on Default - the number of clusters which are displayed without clicking on "more"
Locations per Listing - some web pages fit into more than one cluster. Some tools will only put them in one, others in many
Shows # in Cluster - whether or not the clusters say beside them how many web pages they contain
# of Listings Clustered - many search tools only determine clusters for a certain number of results, others use all
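Three of the chart's columns (Locations per Listing, Shows # in Cluster, and # of Listings Clustered) amount to settings that a clustering tool could expose. The sketch below shows how they might interact; it is not taken from any of the tools reviewed. Matching a listing to a cluster by checking whether the cluster name appears in the listing text is an assumption made to keep the example short, and `assign_to_clusters`, `top_n`, and `allow_multiple` are hypothetical names.

```python
def assign_to_clusters(listings, cluster_names, top_n=None, allow_multiple=True):
    """Place listings into clusters whose name appears in the listing text.

    top_n          -- only the first `top_n` listings are clustered (None = all)
    allow_multiple -- if False, a listing goes only into its first matching cluster
    """
    considered = listings if top_n is None else listings[:top_n]
    clusters = {name: [] for name in cluster_names}
    for listing in considered:
        text = listing.lower()
        for name in cluster_names:
            if name.lower() in text:
                clusters[name].append(listing)
                if not allow_multiple:
                    break
    return clusters


# Hypothetical listings and cluster names.
listings = ["Greek cheese recipes", "Greek history", "Cheese shop"]
names = ["greek", "cheese"]

many = assign_to_clusters(listings, names)
single = assign_to_clusters(listings, names, allow_multiple=False)
top_two = assign_to_clusters(listings, names, top_n=2)

# "Shows # in Cluster" is just the size of each cluster.
for name, members in many.items():
    print(f"{name} ({len(members)})")
```

With `allow_multiple=False`, "Greek cheese recipes" appears only under "greek"; with `top_n=2`, "Cheese shop" is never clustered at all.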
Topic clustering is definitely useful in improving search results. This page has shown that there are many methods of implementing the concept, all of which have
pros and cons, and that topic clustering is still evolving. Northern Light, AllTheWeb, and Query Server list more results in clusters than the others.
Vivísimo's folder-tree display is very quick and easy to use; their clustering is also good. And AllTheWeb's combination of predefined and dynamic clusters is an
innovative way of improving the clusters' usefulness.
This page was last updated on March 17, 2002.