Some Thoughts on Similarity Measures

Robert Korfhage
Dept. of Information Science
University of Pittsburgh
Pittsburgh, PA 15260
korfhage@lis.pitt.edu

Here are three thoughts on similarity measures for the vector model of retrieval. They are meant primarily to open discussion, and perhaps to stimulate some thinking and research.

Intrinsic versus extrinsic measures. The most commonly used similarity measures are probably the cosine measure and a variety of measures based on a distance calculation. Putting aside questions of orthogonality (which may be important), it seems that these two kinds of measures have very different characteristics. The cosine measure is an "extrinsic" measure: it measures the similarity of documents by the angle between their vectors. But the vectors are drawn from the origin, <0,0,...,0>, of the document space. That is, the measure makes use of an external reference point, and when that point is changed, the measure of document similarity changes, even though the documents in question have not. A distance-based measure, however, refers only to the two document points in the space. It is thus "intrinsic," with a value that is independent of where the origin or any other point in the space is located.

It is clear that the subspace consisting of documents "close" to a given document is very different for the two kinds of measures. Whether this is significant, and if so, which type of measure is appropriate in a given situation, is not known.

Mixed measures. Because the cosine measure is a monotonic function of the absolute value of the angle (at least within the range of angles used in information retrieval), there is really little reason to use the cosine rather than the angular measure itself.
As a measure, the angle is a pseudometric, that is, it satisfies all of the metric axioms except that the angle between two document vectors can be zero even if the documents are not identical. It is easy to show that the sum of a metric and a pseudometric is a metric. Hence we could consider (distance + angle)(D1, D2) as a measure of similarity between D1 and D2. This is interesting: it is a distance measure, hence is in some sense "intrinsic" as defined above. Yet a portion of it is also "extrinsic," depending on the point from which the angle is measured. It appears that for documents far from the origin the distance component of this measure is dominant, while for documents close to the origin the angular component is the main similarity measure. How does this relate to retrieval?

Angles and reference points. Traditionally the cosine measure is developed with respect to the origin of the document space. Now suppose that we have a query, Q, and an additional reference point, R, perhaps a known document. Consider a document D. We could measure the angles between the document's vector and the Q vector, and between the document's vector and the R vector (using cosines if we wish), then figure out some way to combine these two measures. Alternatively, we could measure the angle between the Q-D vector and the Q-R vector, and use that as a partial basis for retrieval. It is only partial because this says nothing about the similarity of the document to the query. For that we could use a distance measure, or perhaps the usual cosine measure. This certainly changes the basis for retrieval decisions, and may possibly improve retrieval.
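
The three thoughts above can be tried out numerically. What follows is only a minimal sketch, not a proposed implementation: the function names (distance, angle, mixed), the optional ref argument standing for the reference point, and the two-dimensional sample points are all assumptions made for illustration. The distance is taken to be Euclidean and the mixed measure is the unweighted sum (distance + angle), with the angle in radians.

```python
import math


def sub(a, b):
    """Componentwise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in a))


def distance(a, b):
    """Euclidean distance: depends only on the two points ("intrinsic")."""
    return norm(sub(a, b))


def angle(a, b, ref=None):
    """Angle in radians between a and b as seen from ref ("extrinsic").

    ref defaults to the origin. Raises ZeroDivisionError if a or b
    coincides with the reference point, where the angle is undefined.
    """
    if ref is not None:
        a, b = sub(a, ref), sub(b, ref)
    dot = sum(x * y for x, y in zip(a, b))
    cosine = dot / (norm(a) * norm(b))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cosine)))


def mixed(a, b, ref=None):
    """Assumed mixed measure: metric + pseudometric, unweighted."""
    return distance(a, b) + angle(a, b, ref)


if __name__ == "__main__":
    d1, d2 = [1.0, 0.0], [0.0, 1.0]

    # Moving the reference point changes the angle but not the distance.
    print("angle from origin:  ", angle(d1, d2))
    print("angle from (-1,-1): ", angle(d1, d2, ref=[-1.0, -1.0]))
    print("distance:           ", distance(d1, d2))

    # Far from the origin the distance term dominates the mixed measure;
    # close to the origin the angular term does.
    print("mixed, far pair:    ", mixed([100.0, 0.0], [0.0, 100.0]))
    print("mixed, near pair:   ", mixed([0.01, 0.0], [0.0, 0.01]))

    # Angle at the query Q between the Q-D and Q-R vectors.
    q, r, d = [1.0, 1.0], [3.0, 1.0], [1.0, 4.0]
    print("angle at Q:         ", angle(d, r, ref=q))
```

Passing the query as ref gives the angle between the Q-D and Q-R vectors, since negating both difference vectors leaves the angle unchanged. In practice the two terms of the mixed measure would probably need to be weighted or rescaled, because the distance is unbounded while the angle never exceeds pi; how to do so is left open here, as it is in the discussion above.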