Publication: Enhanced approach for non-negative matrix factorization (NMF) based summarization using conditional random fields (CRF) segmentation
Abstract
Automatic Text Summarization (ATS) is the complex task of having a computer generate a summary of one or more documents that is shorter than the source while preserving its information content. Because ATS is a promising way to address the information overload problem, it has attracted considerable attention recently. This thesis proposes an enhanced approach to ATS that addresses the feature extraction problem in existing ATS approaches based on the algebraic reduction method Non-negative Matrix Factorization (NMF). The central task of any extractive ATS system is to identify the most important sentences in the given text, which is possible only when the semantics, or features, of the sentences are identified correctly. When NMF is applied to ATS, information cannot be transformed precisely from the input sentences into features, because NMF has no intrinsic domain knowledge of the source to be summarized. The main issue with existing NMF-based ATS is therefore proper feature extraction from the source text. Moreover, because NMF is fundamentally an approximation algorithm and was not designed for feature extraction, better performance can be achieved only after NMF is enhanced or tuned for ATS. This work therefore proposes an enhanced, supervised, domain-based extractive approach to ATS using NMF. Two important inputs to the NMF process are the initialization and the sparseness measure, both of which strongly influence the output that NMF produces. Neither has been considered in the existing literature on NMF-based ATS. In existing NMF-based ATS, the W and H matrices are seeded with random values or zeroes, without considering the features of the source text.
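To make the role of the initial seeds concrete, the following is a minimal sketch of NMF on a toy term-by-sentence matrix, assuming the standard Lee-Seung multiplicative updates for the squared Frobenius error. The abstract does not state which NMF variant the thesis uses, so the update rule, the toy matrix `A`, and the helper names (`matmul`, `nmf`, `random_seeds`) are illustrative assumptions rather than the author's implementation. The point is only that `W` and `H` are explicit inputs: the random baseline below could be swapped for seeds built from features of the source text.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(A, W, H):
    """Squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - p) ** 2 for ra, rp in zip(A, WH) for a, p in zip(ra, rp))

def nmf(A, W, H, iterations=200, eps=1e-9):
    """Multiplicative updates (assumed variant), starting from the seeds W and H."""
    W = [row[:] for row in W]
    H = [row[:] for row in H]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

def random_seeds(m, n, r, seed=0):
    """Baseline: random positive seeds that ignore the content of the text."""
    rng = random.Random(seed)
    W = [[rng.random() + 1e-3 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 1e-3 for _ in range(n)] for _ in range(r)]
    return W, H

# Hypothetical term-by-sentence matrix: 4 terms (rows) x 3 sentences (columns).
A = [[2.0, 0.0, 1.0],
     [1.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [0.0, 1.0, 1.0]]

W0, H0 = random_seeds(4, 3, r=2)
initial_error = frobenius_error(A, W0, H0)
W, H = nmf(A, W0, H0)
final_error = frobenius_error(A, W, H)
```

Under these particular update rules, an entry seeded with exactly zero stays zero, because each update multiplies the current value by a ratio. This is one reason the choice of seed can matter for the factors that are eventually found.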
To address this issue, the proposed approach applies Conditional Random Fields (CRF) to construct the initial seeds of W and H from the features available in the source text, with the aim of achieving better performance. The other parameter, the sparseness of the W and H matrices, keeps only a few elements of W and H active so that features are extracted more accurately; it is not used in existing NMF-based ATS approaches. The proposed work aims to achieve better performance by using a sparse representation of the W and H matrices. This work thus proposes an extended approach that aims to enhance the performance of NMF when applied to ATS by treating the initialization and sparseness parameters of NMF, and it studies the impact of this treatment on the quality of the generated summary. The proposed methodology is tested on two domains, legal and scientific documents. Experimental results show that the proposed method performs better when the W and H matrices are given proper initialization seeds. For the sparseness treatment, however, the results show that tuning the sparseness did not improve performance. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics are used to evaluate the proposed method.
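As a rough illustration of the two quantities mentioned above, the sketch below computes a sparseness score for a vector and a ROUGE-N recall score for a candidate summary. The abstract does not say which sparseness measure the thesis uses; Hoyer's measure is assumed here only because it is a common choice for NMF. The ROUGE function is a simplified whitespace-tokenized recall, not the official ROUGE toolkit, and the example sentences are invented.

```python
import math
from collections import Counter

def hoyer_sparseness(x):
    """Hoyer's sparseness (assumed measure): 1.0 for a single active element,
    0.0 when all elements have equal magnitude."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    if n < 2 or l2 == 0:
        raise ValueError("need at least two elements and a non-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall: clipped n-gram overlap / reference n-gram count."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

# One column of a hypothetical W matrix with a single active feature weight.
sparse_column = [0.0, 0.0, 0.0, 5.0]
dense_column = [1.0, 1.0, 1.0, 1.0]

# Invented candidate and reference summaries.
candidate = "the court dismissed the appeal"
reference = "the court dismissed the appeal on merits"
recall = rouge_n_recall(candidate, reference, n=1)
```

Here the reference has seven unigrams, five of which are matched by the candidate, so the unigram recall is 5/7.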