Masaki Eto
IPSJ Transactions on Databases, 49(7) 1-15, Mar, 2008 Peer-reviewed
The co-citation measure is widely used to retrieve similar documents. The conventional measure rests on the premise that every pair of papers co-cited in a single citing paper has an equally weighted degree of similarity. It is also binary: the calculation considers only whether two papers are "co-cited" or "not co-cited". To estimate similarities between co-cited papers more precisely, the author proposes a new co-citation measure based on the structure of the citing paper, that is, on the structural distance between the positions where the two co-cited papers appear in the citing paper. Under the proposed measure, each co-citation is classified into one of four types: "different paragraph", "same paragraph", "same sentence" and "enumeration" (i.e., a set of references to papers is included in a single sentence of the citing paper). To evaluate the effectiveness of the proposed measure, five typical similarities between the co-cited papers found under each of the four co-citation types were calculated and compared. In the experiment, the calculated similarities became gradually higher as the structural distance became shorter; the highest was for "enumeration" and the lowest was for "different paragraph". The proposed co-citation measure was thus shown to estimate similarities between co-cited papers more precisely.
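
The four-way classification described above can be illustrated with a minimal Python sketch. It is not the author's implementation. The `Citation` record, its fields, and the function names are hypothetical, and the sketch rests on two assumptions that the abstract does not state: "enumeration" is read as two references sharing one citation group (for example, one bracket such as [1, 2]) within a sentence, and when a paper is cited more than once, the structurally closest occurrence decides the type.

```python
from dataclasses import dataclass
from itertools import combinations

# The four co-citation types, ordered from longest to shortest structural distance.
TYPES = ["different paragraph", "same paragraph", "same sentence", "enumeration"]


@dataclass(frozen=True)
class Citation:
    """One in-text citation in a citing paper (hypothetical record layout)."""
    ref: str        # identifier of the cited paper
    paragraph: int  # paragraph index in the citing paper
    sentence: int   # sentence index within that paragraph
    group: int      # citation-group index within that sentence (assumption)


def classify(a: Citation, b: Citation) -> str:
    """Return the co-citation type for two in-text citations."""
    if a.paragraph != b.paragraph:
        return "different paragraph"
    if a.sentence != b.sentence:
        return "same paragraph"
    if a.group != b.group:
        return "same sentence"
    return "enumeration"


def co_citation_types(citations: list[Citation]) -> dict[frozenset, str]:
    """Map each pair of co-cited papers to its closest co-citation type.

    Keeping the closest occurrence is an assumption made for this sketch.
    """
    best: dict[frozenset, str] = {}
    for a, b in combinations(citations, 2):
        if a.ref == b.ref:
            continue  # the same paper cited twice is not a co-citation
        pair = frozenset((a.ref, b.ref))
        kind = classify(a, b)
        if pair not in best or TYPES.index(kind) > TYPES.index(best[pair]):
            best[pair] = kind
    return best


if __name__ == "__main__":
    example = [
        Citation("A", 0, 0, 0),
        Citation("B", 0, 0, 0),
        Citation("C", 0, 0, 1),
        Citation("D", 0, 1, 0),
        Citation("E", 1, 0, 0),
    ]
    for pair, kind in co_citation_types(example).items():
        print(sorted(pair), kind)
```

The abstract gives no numeric weights for the four types, so the sketch stops at classification; any weighting applied to the resulting types would have to come from the paper itself.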