A Sequential Algorithm for Training Text Classifiers: Corrigendum and Additional Data

David D. Lewis
AT&T Bell Laboratories
Murray Hill, NJ 07974 USA
lewis@research.att.com

Introduction

At ACM SIGIR '94, I compared the effectiveness of uncertainty sampling with that of random sampling and relevance sampling in choosing training data for a text categorization data set [1]. (Relevance sampling is the application of relevance feedback [3] to producing a training sample.) I have discovered a bug in my experimental software which caused the relevance sampling results reported in the SIGIR '94 paper to be incorrect. (The uncertainty sampling and random sampling results in that paper were correct.) I have since fixed the bug and rerun the experiments. This note presents the corrected results, along with additional data supporting the original claim that uncertainty sampling has an advantage over relevance sampling in most training situations.

Methods

The SIGIR '94 experiment, and the experiments reported here, proceeded as follows. (See the original paper [1] for full details.) The experimental variable was the method used to choose training samples for a text categorization problem. Uncertainty sampling, relevance sampling,