Abstract
This paper reports on experiments investigating the combination of Part of Speech (POS) information with an improved Longest Common Subsequence (LCS) algorithm in the analysis and calculation of similarity between texts. Each document was represented by the syntactical structure of its text. The improved LCS algorithm was applied to this representation to compare and rank documents according to the similarity of their representative strings. The approach was applied to the detection of duplicate documents within a corpus and to the filtering of search engine results. The results obtained were encouraging.
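The general idea of comparing documents through their POS sequences can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the improved LCS variant, so the classical dynamic-programming LCS is used in its place. The function names (`lcs_length`, `similarity`, `rank_by_similarity`), the Penn Treebank-style tags, and the normalisation `2 * LCS / (len(a) + len(b))` are all assumptions made for illustration, and the documents are assumed to have been POS-tagged beforehand.

```python
from typing import Dict, List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (classical DP, two rows).

    This is the textbook algorithm, not the improved variant from the paper.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalisation: 2 * LCS / (len(a) + len(b)), in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


def rank_by_similarity(
    query: Sequence[str], corpus: Dict[str, Sequence[str]]
) -> List[Tuple[str, float]]:
    """Rank documents (given as POS-tag sequences) by similarity to the query."""
    scores = [(name, similarity(query, tags)) for name, tags in corpus.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical, hand-written tag sequences standing in for tagged documents.
    query = ["DT", "NN", "VBZ", "JJ"]
    corpus = {
        "doc_a": ["DT", "NN", "VBZ", "JJ"],
        "doc_b": ["DT", "JJ", "NN", "VBZ"],
        "doc_c": ["PRP", "VBD"],
    }
    for name, score in rank_by_similarity(query, corpus):
        print(f"{name}: {score:.2f}")
```

Under these assumptions, a score near 1 between two documents would flag them as candidate duplicates, and the same ranking could be used to filter near-identical entries from a list of search results. The threshold for calling two documents duplicates is left open here, since the abstract does not state one.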
Original language | English |
---|---|
Pages (from-to) | 138-147 |
Number of pages | 10 |
Journal | International Journal of Information Processing and Management |
Volume | 1 |
Issue number | 1 |
Publication status | Published - 2010 |
Keywords
- Duplication filtering
- Longest common subsequence
- POS
- Syntactical structure
ASJC Scopus subject areas
- Computer Science (all)
- Information Systems and Management