
A structural, content‐similarity measure for detecting spam documents on the web





Publisher: Emerald Publishing
Copyright: © 2009 Emerald Group Publishing Limited. All rights reserved.
ISSN: 1744-0084
DOI: 10.1108/17440080911006207

Abstract

Purpose – The web provides its users with abundant information. Unfortunately, when a web search is performed, both users and search engines must deal with an annoying problem: the presence of spam documents ranked among legitimate ones. The mixed results degrade the performance of search engines and frustrate users, who must filter out useless information. To improve the quality of web searches, the number of spam documents on the web must be reduced, if they cannot be eradicated entirely. This paper aims to present a novel approach for identifying spam web documents, which have mismatched titles and bodies and/or a low percentage of hidden content in the markup data structure.

Design/methodology/approach – The paper shows that by considering the degree of similarity among the words in the title and body of a web document D, computed using their word‐correlation factors; using the percentage of hidden content in the markup data structure within D; and/or considering the bigram or trigram phrase‐similarity values of D, it is possible to determine whether D is spam with high accuracy.

Findings – By considering the content and markup of web documents, the paper develops a spam‐detection tool that is: reliable, since it can accurately detect 84.5 percent of spam/legitimate web documents; and computationally inexpensive, since the word‐correlation factors used for content analysis are pre‐computed.

Research limitations/implications – Since the bigram‐correlation values employed in the spam‐detection approach are computed from the unigram‐correlation factors, they impose additional computational time during the spam‐detection process and could yield a higher number of misclassified spam web documents.

Originality/value – The paper verifies that the spam‐detection approach outperforms existing anti‐spam methods by at least 3 percent in terms of F‐measure.
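The two detection criteria described in the abstract lend themselves to a compact sketch. The Python fragment below is a minimal illustration, not the authors' implementation: it assumes the pre‐computed word‐correlation factors are available as a dictionary keyed by word pairs, and the names and threshold values (`sim_threshold`, `hidden_threshold`) are hypothetical placeholders rather than the tuned parameters from the paper. The bigram/trigram phrase‐similarity check is omitted.

```python
# Hypothetical sketch of the title/body similarity and hidden-content checks.
# `correlation` maps (word, word) pairs to pre-computed word-correlation
# factors in [0, 1]; the real factors come from the paper's correlation
# matrix, which is not reproduced here.

def title_body_similarity(title_words, body_words, correlation):
    """Average, over title words, of each word's best correlation factor
    against any body word. Identical words default to a factor of 1.0."""
    if not title_words or not body_words:
        return 0.0
    best_per_title_word = [
        max(correlation.get((t, b), 1.0 if t == b else 0.0) for b in body_words)
        for t in title_words
    ]
    return sum(best_per_title_word) / len(best_per_title_word)

def hidden_content_percentage(visible_chars, total_chars):
    """Share of a document's markup content that is not rendered visibly."""
    return 0.0 if total_chars == 0 else 100.0 * (1 - visible_chars / total_chars)

def looks_like_spam(title_words, body_words, correlation,
                    visible_chars, total_chars,
                    sim_threshold=0.2, hidden_threshold=5.0):
    # Hypothetical cut-offs: flag a document whose title poorly matches its
    # body, or whose hidden-content percentage is low, per the abstract.
    mismatched = title_body_similarity(title_words, body_words,
                                       correlation) < sim_threshold
    low_hidden = hidden_content_percentage(visible_chars,
                                           total_chars) < hidden_threshold
    return mismatched or low_hidden
```

Because the correlation factors are looked up rather than computed per document, the per‐document cost of this check is linear in the number of title/body word pairs examined, which is consistent with the abstract's claim that pre‐computation keeps the approach computationally inexpensive.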

Journal

International Journal of Web Information Systems, Emerald Publishing

Published: Nov 20, 2009

Keywords: Accuracy; Error analysis; Data structures
