Detection of duplication in documents and webpages based documents syntactical structures through an improved longest common subsequence

Mohamed Elhadi*, Amjad Al-Tobi

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

5 Citations (Scopus)

Abstract

This paper reports on experiments investigating the combined use of Part of Speech (POS) tags and an improved Longest Common Subsequence (LCS) algorithm to analyze and calculate the similarity between texts. Each document was represented by the syntactical structures of its text. The improved LCS algorithm was applied to this representation to compare the documents and rank them by the similarity of their representative strings. The approach was applied to detecting duplicate documents within a corpus and to filtering search engine results. The results obtained were encouraging.
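As a rough illustration of the idea in the abstract, the sketch below compares documents by the LCS of their POS tag sequences and ranks them by a similarity score. It is not the paper's method: it uses the textbook dynamic-programming LCS rather than the improved variant, the tag sequences are hand-written (Penn Treebank-style tags are assumed, and no tagger is run), and the normalization 2 * LCS / (len(a) + len(b)) and all function names are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (textbook DP, two rows)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence ending before both symbols.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one symbol from either sequence, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: 2 * LCS / (len(a) + len(b)), in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def rank_by_similarity(
    query: Sequence[str], corpus: Dict[str, Sequence[str]]
) -> List[Tuple[str, float]]:
    """Return (document name, score) pairs, most similar first."""
    scored = [(name, similarity(query, tags)) for name, tags in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical, hand-written POS tag sequences standing in for tagged documents.
query_tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
corpus_tags = {
    "doc_b": ["DT", "JJ", "NN", "VBD", "RB"],
    "doc_c": ["PRP", "VBP", "IN"],
    "doc_a": ["DT", "NN", "VBZ", "DT", "JJ", "NN"],
}

ranking = rank_by_similarity(query_tags, corpus_tags)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a duplicate-detection setting, documents scoring above a chosen threshold could be flagged as likely duplicates. The threshold itself would have to be tuned on real data and is not taken from the paper.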

Original language: English
Pages (from-to): 138-147
Number of pages: 10
Journal: International Journal of Information Processing and Management
Volume: 1
Issue number: 1
Publication status: Published - 2010
Externally published: Yes

Keywords

  • Duplication filtering
  • Longest common subsequence
  • POS
  • Syntactical structure

ASJC Scopus subject areas

  • Computer Science(all)
  • Information Systems and Management
