Methodologies for crawler based Web surveys



Internet Research, Volume 12 (2): 15 – May 1, 2002

Publisher: Emerald Publishing
Copyright: © 2002 MCB UP Ltd. All rights reserved.
ISSN: 1066-2243
DOI: 10.1108/10662240210422503

Abstract

There have been many attempts to study the content of the Web, through either human or automatic agents. This paper describes five previously used Web survey methodologies, each justifiable in its own right, and presents a simple experiment that demonstrates concrete differences between them. The concept of crawling the Web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of duplicate pages. These issues will be well known to many computer scientists but, with the increasing use of crawlers and search engines in other disciplines, they now require public discussion in the wider research community. The paper concludes that any scientific attempt to crawl the Web must make available the parameters under which it operates, so that researchers can, in principle, replicate experiments or be aware of and take into account differences between methodologies. It also introduces a new hybrid random page selection methodology.
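The abstract does not specify the paper's duplicate-identification algorithm, but a common baseline technique is to fingerprint each fetched page by hashing its normalized content, so that near-trivial copies (differing only in whitespace or letter case) collide. The sketch below is purely illustrative, not the paper's method; all function names and URLs are hypothetical.

```python
import hashlib
import re

def normalize(html: str) -> str:
    """Collapse runs of whitespace and lowercase the text, so trivially
    reformatted copies of a page produce the same string."""
    return re.sub(r"\s+", " ", html).strip().lower()

def page_fingerprint(html: str) -> str:
    """Hash the normalized page content; equal fingerprints flag duplicates."""
    return hashlib.sha256(normalize(html).encode("utf-8")).hexdigest()

# Fingerprint -> URL of the first page seen with that content.
seen: dict[str, str] = {}

def is_duplicate(url: str, html: str) -> bool:
    """Return True if an identical (after normalization) page was
    already crawled; otherwise record this page and return False."""
    fp = page_fingerprint(html)
    if fp in seen:
        return True
    seen[fp] = url
    return False
```

Exact-hash matching only catches byte-identical normalized pages; surveys that need to detect near-duplicates (e.g. mirrored pages with different headers) typically move to shingling or similarity hashing instead, which is one of the methodological parameters the paper argues crawl studies should report.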

Journal

Internet Research, Emerald Publishing

Published: May 1, 2002

Keywords: Surveys; Indexes

