An Improvised Sub-Document Based Framework for Efficient Document Clustering

Muhammad Qasim Memon,
Jingsha He,
Yu Lu,
Nafei Zhu,
Aasma Memon

Abstract


Document clustering, which is used for topic discovery and similarity computation, has received a great deal of attention in text data management. Traditional clustering methods are not viable for multi-topic documents, because content that is distinguished by the subtopical structure may not be pertinent across an entire document. In this paper, a sub-document based framework for clustering multiple documents is proposed in which LDA is used for document segmentation. The proposed improvised framework addresses the clustering problem in two ways. First, instead of applying a clustering algorithm to entire data sets, documents are partitioned into cohesive sub-documents along topic boundaries through text segmentation to establish a two-level representation of text data, i.e., topics and words. Second, the proposed framework is compared to existing clustering methods, both traditional and segment-based, across different clustering algorithms using the F-measure as the evaluation metric. In addition, various real-time data sets that contain multi-topic documents are used to validate the clustering algorithms under the proposed sub-document based framework. The sub-documents are first clustered within each document, and the resulting clusters are then clustered across documents. Experimental results show that the proposed framework outperforms existing clustering approaches in terms of both the F-measure and efficiency, by at least 73% with LDA segmentation and bisecting LDA in comparison to TextTiling.
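
The abstract names the F-measure as the evaluation metric but does not state a formula. The sketch below assumes the class-weighted, best-match variant that is common in document clustering evaluation: each reference class is matched to the cluster that gives it the highest F-score, and the scores are averaged with weights proportional to class size. The paper may use a different variant. The function name and the toy labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def clustering_f_measure(labels_true, labels_pred):
    """Class-weighted best-match F-measure for a flat clustering.

    Assumed variant (not confirmed by the abstract): for every reference
    class, take the best F-score over all clusters, then average the
    per-class scores weighted by class size.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n == 0:
        raise ValueError("label lists must not be empty")

    class_sizes = Counter(labels_true)              # items per reference class
    cluster_sizes = Counter(labels_pred)            # items per predicted cluster
    overlap = Counter(zip(labels_true, labels_pred))  # items shared by (class, cluster)

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue  # no shared items, F-score is 0
            precision = n_both / n_clu
            recall = n_both / n_cls
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
        total += (n_cls / n) * best
    return total


# Toy example: four sub-documents, two reference topics.
perfect = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
merged = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 0])
```

With this definition a clustering that reproduces the reference topics scores 1.0, while lumping everything into one cluster is penalized through precision.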


Citation Format:
Muhammad Qasim Memon, Jingsha He, Yu Lu, Nafei Zhu, Aasma Memon, "An Improvised Sub-Document Based Framework for Efficient Document Clustering," Journal of Internet Technology, vol. 20, no. 4, pp. 1191-1203, Jul. 2019.

Published by Executive Committee, Taiwan Academic Network, Ministry of Education, Taipei, Taiwan, R.O.C
JIT Editorial Office, Library and Information Center, National Dong Hwa University
No. 1, Sec. 2, Da Hsueh Rd. Shoufeng, Hualien 97401, Taiwan, R.O.C.
Tel: +886-3-931-7017  E-mail: jit.editorial@gmail.com