A Unifying Semantic Distance Model for Determining the Similarity of Attribute Values

Roddick, J.F., Hornsby, K. and de Vries, D.

    The relative difference between two data values is of interest in a number of application domains including temporal and spatial applications, schema versioning, data warehousing (particularly data preparation), internet searching, validation and error correction, and data mining. Moreover, consistency across systems in determining such distances and the robustness of such calculations are essential in some domains and useful in many. Despite this, there is no generally adopted approach to determining such distances and no accommodation of distance within SQL or any commercially available DBMS. For non-numeric data values, calculating the difference between values often requires application-specific support, but even for numeric values the practical distance between two values may not simply be their numeric difference or Euclidean distance. In this paper, a model of semantic distance is developed in which a graph-based approach is used to quantify the distance between two data values. The approach supports a notion of distance both as a simple traversal distance and via weighted arcs. Transition costs, as an additional expense of passing through a node, are also accommodated. Furthermore, multiple distance measures can be incorporated, and a method of 'localisation' is discussed which allows relevant information to take precedence over less relevant information. Some results from our investigations, including our SQL-based implementation, are presented.
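    As an illustration only, the graph-based idea in the abstract can be sketched as a cheapest-path search in which arcs carry weights and intermediate nodes may carry a transition cost. This is a minimal sketch under stated assumptions, not the implementation described in the paper: the function name `semantic_distance`, the dictionary layout, the use of Dijkstra's algorithm, and the colour values are all hypothetical choices made for this example.

```python
import heapq


def semantic_distance(arcs, source, target, node_cost=None):
    """Return the cheapest path cost between two values (hypothetical helper).

    arcs:      dict mapping a value to {neighbouring value: arc weight}
    node_cost: dict mapping a value to its transition cost, charged only
               when a path passes through that value (never at endpoints)
    Assumes all weights and costs are non-negative (Dijkstra's algorithm).
    """
    node_cost = node_cost or {}
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        # Leaving an intermediate node incurs its transition cost.
        through = node_cost.get(node, 0.0) if node != source else 0.0
        for neighbour, weight in arcs.get(node, {}).items():
            candidate = dist + through + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")  # no path: the values are unrelated in this graph


# Made-up example graph of colour values; each arc is listed in both directions.
arcs = {
    "red": {"orange": 1.0, "crimson": 0.5},
    "orange": {"red": 1.0, "yellow": 1.0},
    "yellow": {"orange": 1.0},
    "crimson": {"red": 0.5},
}

plain = semantic_distance(arcs, "red", "yellow")
with_cost = semantic_distance(arcs, "red", "yellow", node_cost={"orange": 0.5})
```

    With unit arc weights the result reduces to a simple traversal distance (the number of arcs crossed); supplying `node_cost` adds the extra expense of passing through a value, which is one possible reading of the transition costs mentioned above.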
Cite as: Roddick, J.F., Hornsby, K. and de Vries, D. (2003). A Unifying Semantic Distance Model for Determining the Similarity of Attribute Values. In Proc. Twenty-Sixth Australasian Computer Science Conference (ACSC2003), Adelaide, Australia. CRPIT, 16. Oudshoorn, M. J., Ed. ACS. 111-118.