Web Page Classification: Features and Algorithms

XIAOGUANG QI and BRIAN D. DAVISON
Lehigh University

Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.

Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation; I.5.4 [Pattern Recognition]: Applications - Text processing; I.2.6 [Artificial Intelligence]: Learning; H.2.8 [Database Management]: Database Applications - Data Mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Performance, Design

Additional Key Words and Phrases: Categorization, Web mining

ACM Reference Format:
Qi, X. and Davison, B. D. 2009. Web page classification: Features and algorithms. ACM Comput. Surv. 41, 2, Article 12 (February 2009), 31 pages. DOI = 10.1145/1459352.1459357 http://doi.acm.org/10.1145/1459352.1459357

1. INTRODUCTION