Clustering web documents using co‐citation, coupling, incoming, and outgoing hyperlinks: a comparative performance analysis of algorithms

Publisher: Emerald Publishing
Copyright: © 2006 Emerald Group Publishing Limited. All rights reserved.
ISSN: 1744-0084
DOI: 10.1108/17440080680000102

Abstract

Querying search engines with the keyword “jaguars” returns results as diverse as web sites about cars, computer games, attack planes, American football, and animals. More and more search engines offer options to organize query results by categories or, given a document, to return a list of links to topically related documents. While information retrieval traditionally defines similarity of documents in terms of contents, it seems natural to expect that the very structure of the Web carries important information about the topical similarity of documents. Here we study the role of a matrix constructed from weighted co‐citations (documents referenced by the same document), weighted couplings (documents referencing the same document), incoming, and outgoing links for the clustering of documents on the Web. We present and discuss three clustering methods built on this matrix, using the K‐means, Markov, and Maximum Spanning Tree algorithms, respectively. Our main contribution is a clustering technique based on the Maximum Spanning Tree and an evaluation of its effectiveness compared with the two most robust alternatives: K‐means and Markov clustering.
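
The matrix construction and the spanning-tree idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the weighting scheme or the tree-cutting rule, so the weights `w_cocite`, `w_couple`, and `w_link`, the function names, and the "stop when k components remain" rule are all assumptions made for illustration. The sketch scores each pair of documents by counting shared citing documents (co-citation), shared cited documents (coupling), and direct links in either direction, then groups documents by running Kruskal's algorithm in descending weight order.

```python
from itertools import combinations


def similarity_matrix(links, n, w_cocite=1.0, w_couple=1.0, w_link=1.0):
    """Build a symmetric link-based similarity matrix (illustrative weights).

    links: iterable of (source, target) hyperlinks over documents 0..n-1.
    """
    out = [set() for _ in range(n)]   # out[d]: documents that d links to
    inc = [set() for _ in range(n)]   # inc[d]: documents that link to d
    for s, t in links:
        out[s].add(t)
        inc[t].add(s)

    sim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        cocite = len(inc[i] & inc[j])            # referenced by the same document
        couple = len(out[i] & out[j])            # referencing the same document
        direct = (j in out[i]) + (i in out[j])   # outgoing plus incoming link
        sim[i][j] = sim[j][i] = (
            w_cocite * cocite + w_couple * couple + w_link * direct
        )
    return sim


def mst_clusters(sim, k):
    """Split documents into k groups using a maximum spanning tree.

    Kruskal's algorithm with edges in descending weight order, stopped when
    k components remain. Ties aside, this matches building the full maximum
    spanning tree and removing its k - 1 lightest edges.
    """
    n = len(sim)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        ((sim[i][j], i, j) for i, j in combinations(range(n), 2)),
        reverse=True,
    )
    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    groups = {}
    for d in range(n):
        groups.setdefault(find(d), []).append(d)
    return sorted(groups.values())


# Toy web graph: two tightly linked triples joined by one bridging link (2 -> 3).
links = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
sim = similarity_matrix(links, 6)
clusters = mst_clusters(sim, 2)
print(clusters)
```

On this toy graph the bridging link between documents 2 and 3 gets a lower score than the links inside each triple, so it is the edge left out when two groups are requested. A real evaluation would need the paper's actual weights and cut criterion.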

Journal

International Journal of Web Information Systems, Emerald Publishing

Published: May 1, 2006

Keywords: Search engines; Clustering; Co‐citation; Coupling; Hyperlinks
