江藤正己
IPSJ Transactions on Databases (情報処理学会論文誌データベース, TOD) 49(7) 1-15, March 2008, peer-reviewed
Using co-citation relations is one of the representative approaches to retrieving similar papers. Conventional methods, however, assume that all cited papers appearing in a single citing paper are equally similar to one another, regardless of the content of the citing paper's body text; differences in how the papers are cited in the text are not taken into account. In addition, the conventional measure is binary: only whether two papers are "co-cited" or "not co-cited" enters the calculation. To estimate the strength of similarity between co-cited papers more precisely, the author proposes a co-citation measure based on the structure of the citing paper, which captures the distance between the positions where two co-cited papers are cited in terms of the paper's structural units. The measure classifies each co-citation into four types: "different paragraph", "same paragraph", "same sentence", and "enumeration" (i.e., a set of references to papers is included in a single sentence of the citing paper), with the distance regarded as becoming shorter in this order. To evaluate the measure, co-citations were classified with it, and five typical similarities between co-cited papers were calculated and compared for each of the four types. In the experiment, the calculated similarity rose stepwise as the structural distance became shorter: it was highest for "enumeration" and lowest for "different paragraph". The proposed measure was thus shown to be appropriate for estimating similarities between co-cited papers more precisely.
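The four-way classification described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the `Citation` record, its `paragraph`, `sentence`, and `group` fields, and the rule of keeping the closest type when a paper is cited more than once are all assumptions made for illustration. In particular, "enumeration" is modelled here as two references sharing one bracketed reference group such as "[1, 2]", which is one possible reading of the abstract's definition.

```python
from dataclasses import dataclass
from itertools import combinations

# Types ordered from the longest to the shortest structural distance,
# following the order given in the abstract.
TYPES = ["different paragraph", "same paragraph", "same sentence", "enumeration"]


@dataclass(frozen=True)
class Citation:
    """One in-text citation (hypothetical record, not from the paper)."""
    cited_id: str   # identifier of the cited paper
    paragraph: int  # index of the paragraph in the citing paper
    sentence: int   # index of the sentence within that paragraph
    group: int      # index of the reference group within that sentence


def classify_pair(a: Citation, b: Citation) -> str:
    """Return the co-citation type for two in-text citations."""
    if a.paragraph != b.paragraph:
        return "different paragraph"
    if a.sentence != b.sentence:
        return "same paragraph"
    if a.group != b.group:
        return "same sentence"
    return "enumeration"


def classify_cocitations(citations: list[Citation]) -> dict[tuple[str, str], str]:
    """Classify every pair of distinct cited papers in one citing paper.

    Assumption: if a pair co-occurs several times, the type with the
    shortest structural distance is kept.
    """
    best: dict[tuple[str, str], str] = {}
    for a, b in combinations(citations, 2):
        if a.cited_id == b.cited_id:
            continue  # the same paper cited twice is not a co-citation
        key = tuple(sorted((a.cited_id, b.cited_id)))
        kind = classify_pair(a, b)
        if key not in best or TYPES.index(kind) > TYPES.index(best[key]):
            best[key] = kind
    return best


if __name__ == "__main__":
    example = [
        Citation("A", 0, 0, 0),
        Citation("B", 0, 0, 0),  # listed together with A
        Citation("C", 0, 0, 1),  # same sentence as A, separate group
        Citation("D", 0, 1, 0),  # same paragraph, next sentence
        Citation("E", 1, 0, 0),  # another paragraph
    ]
    for pair, kind in sorted(classify_cocitations(example).items()):
        print(pair, kind)
```

A weighted similarity could then be derived by assigning each type a weight, but the abstract does not state any weights, so none are suggested here.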