This paper proposes a clustering approach that explores both the content and the structure of XML
documents for determining similarity among them.
Assuming that the content and the structure of XML
documents differ in role and importance depending on the use and purpose of a dataset, the
content and structure information of the documents
are handled using two different similarity measuring
methods. The similarity values produced from these
two methods are then combined with weightings to
measure the overall document similarity. The effect
of structure similarity and content similarity on the
clustering solution is thoroughly analysed. The experiments show that, in a homogeneous environment
(documents derived from a single structural definition), clustering text-centric XML documents on
content-only information produces a better solution; however, in a heterogeneous environment
(documents derived from two or more structural definitions), clustering text-centric
XML documents produces a better result when the
structure and the content similarities of the documents are combined with different strengths.
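The weighted combination described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the linear form below, the function name `combined_similarity`, and the parameter `struct_weight` are assumptions made for illustration only, not the authors' method.

```python
def combined_similarity(struct_sim, content_sim, struct_weight=0.5):
    """Blend two similarity scores into one overall document similarity.

    Assumption (not from the paper): a linear combination in which the
    structure weight and the content weight sum to 1.
    """
    if not 0.0 <= struct_weight <= 1.0:
        raise ValueError("struct_weight must lie in [0, 1]")
    content_weight = 1.0 - struct_weight
    return struct_weight * struct_sim + content_weight * content_sim


# struct_weight=0.0 reduces to content-only similarity, the setting the
# abstract reports as better for homogeneous collections.
content_only = combined_similarity(0.8, 0.4, struct_weight=0.0)

# A non-zero structure weight mixes both signals, as in the heterogeneous case.
mixed = combined_similarity(0.8, 0.4, struct_weight=0.5)
```

Under this assumed form, varying `struct_weight` between 0 and 1 is one way to study how much each kind of similarity contributes to the clustering solution.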
Cite as: Tran, T., Nayak, R. and Bruza, P. (2008). Combining Structure and Content Similarities for XML Document Clustering. In Proc. Seventh Australasian Data Mining Conference (AusDM 2008), Glenelg, South Australia. CRPIT, 87. Roddick, J. F., Li, J., Christen, P. and Kennedy, P. J., Eds. ACS. 219-226.