Detection of duplication in documents and webpages based documents syntactical structures through an improved longest common subsequence

Mohamed Elhadi*, Amjad Al-Tobi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

This paper reports on experiments investigating the combined use of Part of Speech (POS) information and an improved Longest Common Subsequence (LCS) algorithm to analyse and calculate the similarity between texts. The syntactical structures of the texts were used to represent the documents. The improved LCS algorithm was applied to this representation to compare the documents and rank them according to the similarity of their representative strings. The approach was applied to the detection of duplicate documents within a corpus and to the filtering of search engine results. The results obtained were encouraging.
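The general idea described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that do not come from the paper: it uses the classic dynamic-programming LCS rather than the authors' improved variant, which the abstract does not specify; the POS tag sequences are written by hand in a Penn Treebank style instead of being produced by a tagger; and the normalisation by the longer sequence length, along with the names `lcs_length`, `similarity`, and `rank_by_similarity`, is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (classic DP, not the paper's variant)."""
    prev = [0] * (len(b) + 1)  # DP row for the previous element of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer length (an assumed normalisation)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def rank_by_similarity(
    query: Sequence[str], corpus: Dict[str, List[str]]
) -> List[Tuple[str, float]]:
    """Score every document against the query and sort from most to least similar."""
    scored = [(name, similarity(query, tags)) for name, tags in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hand-written POS tag sequences standing in for tagged documents (illustrative only).
query = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
corpus = {
    "doc_a": ["DT", "NN", "VBZ", "DT", "JJ", "NN"],  # same syntactic structure
    "doc_b": ["DT", "JJ", "NN", "VBZ", "DT", "NN"],  # partially reordered
    "doc_c": ["PRP", "VBP", "RB"],                   # unrelated structure
}

for name, score in rank_by_similarity(query, corpus):
    print(f"{name}: {score:.2f}")
```

In a sketch like this, a document would be flagged as a likely duplicate when its score exceeds a chosen threshold; the threshold value is a tuning decision that the abstract does not state.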

Original language: English
Pages (from-to): 138-147
Number of pages: 10
Journal: International Journal of Information Processing and Management
Volume: 1
Issue number: 1
Publication status: Published - 2010
Externally published: Yes

Keywords

  • Duplication filtering
  • Longest common subsequence
  • POS
  • Syntactical structure

ASJC Scopus subject areas

  • General Computer Science
  • Information Systems and Management
