News analysis through text mining: a case study
I.C. Mogotsi
Department of Library and Information Studies, University of Botswana,
Gaborone, Botswana
Abstract
Purpose – This paper seeks to provide a tangible example of the use of text-mining techniques in a
real-world setting, i.e. using real, as opposed to test, data.
Design/methodology/approach – News stories are modeled using the vector space model, with the
similarity between documents quantified using the cosine measure. For data analysis, three clustering
algorithms are used, and the results from the best-performing algorithm retained.
Findings – Agglomerative clustering performed poorly, while direct k-way clustering and k-way
clustering through repeated bisections yielded similar results, with the former performing marginally
better in terms of external isolation and internal cohesion of the clusters produced. A number of
themes that dominated news coverage during the period under consideration were identified, some of
which were noticeably topical only during certain parts of the year.
Research limitations/implications – Text mining holds much promise for businesses,
particularly if integrated into a well-orchestrated competitive intelligence function. However, more
publicly accessible studies need to be undertaken if businesses are to derive maximum value from it.
Originality/value – There is a growing body of literature devoted to both data and text mining.
However, much of this literature focuses on the development of new algorithms, with scant attention
paid to the practical application of these techniques in business settings, possibly because of the
strategic sensitivity of project findings. This study helps fill this yawning void.
Keywords Text retrieval, Knowledge management systems
Paper type Case study
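As an illustration of the modelling approach summarised in the abstract (documents represented in the vector space model and compared using the cosine measure), the following is a minimal sketch only. It assumes raw term-frequency weights and whitespace tokenisation, neither of which is specified in the abstract. The function names and the sample headlines are hypothetical, and the sketch is not the implementation used in the study.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a document to a sparse term-frequency vector.

    Assumption: lower-casing and whitespace tokenisation; the abstract
    does not state which preprocessing or term weighting was applied.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical headlines, invented purely for illustration.
docs = [
    "bank raises interest rates",
    "central bank raises rates again",
    "football team wins cup final",
]
vectors = [term_vector(d) for d in docs]

sim_related = cosine_similarity(vectors[0], vectors[1])    # shares three terms
sim_unrelated = cosine_similarity(vectors[0], vectors[2])  # shares no terms
```

In this sketch the two banking headlines score well above the unrelated pair, which is the property a clustering algorithm relies on when grouping stories into themes.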
1. Introduction
The relentless drop in the cost of computer processing power and disk storage, and the
concomitant explosion in the quantity of data stored by individuals and organisations,
have fuelled an ever-growing interest in automatic and semi-automatic approaches to
extracting meaningful information from large bodies of data, commonly referred to as
data mining (Two Crows Corporation, 1999; Dilly, 1995; Hand et al., 2001). Although
initial interest was in mining structured data – such as that found in relational
databases – it has been estimated that as much as 80 per cent of company information
typically exists in textual form (Delphi Group, cited in Yu et al., 2005), which in turn
has fuelled interest in text data mining. Much of the literature on both data and text
mining, however, is concerned with the development of new algorithms, rather than
with business applications of these technologies. This may, at least in part, be due to
The research reported here was undertaken in partial fulfilment of the requirements of the degree
of MPhil (Information and Knowledge Management) of the University of Stellenbosch, South
Africa. The assistance and encouragement of Dr Martin van der Walt, who supervised the thesis
work from which this paper emanates, are gratefully acknowledged. Funding for the study was
graciously provided by the University of Botswana (Training and Development).
VINE: The journal of information and knowledge management systems
Vol. 37 No. 4, 2007, pp. 516-531
© Emerald Group Publishing Limited, 0305-5728
DOI 10.1108/03055720710838560