We introduce K-tree in an information retrieval context. It is an efficient
approximation of the k-means clustering algorithm. Unlike k-means it forms a
hierarchy of clusters. It has been extended to address issues with sparse
representations. We compare performance and quality to CLUTO using document
collections. The K-tree has a low time complexity that is suitable for large
document collections. This tree structure allows for efficient disk based
implementations where space requirements exceed that of main memory.
This paper describes the approach taken to the XML Mining track at INEX 2008
by a group at the Queensland University of Technology. We introduce the K-tree
clustering algorithm in an Information Retrieval context by adapting it for
document clustering. Many large scale problems exist in document clustering.
K-tree scales well with large inputs due to its low complexity. It offers
promising results both in terms of efficiency and quality. Document
classification was completed using Support Vector Machines.
Random Indexing (RI) K-tree is the combination of two algorithms for
clustering. Many large scale problems exist in document clustering. RI K-tree
scales well with large inputs due to its low complexity. It also exhibits
features that are useful for managing a changing collection. Furthermore, it
solves previous issues with sparse document vectors when using K-tree. The
algorithms and data structures are defined, explained and motivated. Specific
modifications to K-tree are made for use with RI. Experiments have been
executed to measure quality.