Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures

Mohamed Elhadi*, Amjad Al-Tobi

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Citations (Scopus)

Abstract

This paper reports on experiments performed to investigate the combined use of Part of Speech (POS) tagging and an improved Longest Common Subsequence (LCS) algorithm in analyzing and calculating the similarity between texts. The syntactical structures of each text were used as the representation of the document. The improved LCS algorithm was applied to this representation to compare the documents and rank them according to the similarity of their representative strings. The approach was applied to detecting duplicate documents within a corpus and to filtering search engine results. The results obtained were encouraging.
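
The pipeline described in the abstract (represent each text by its syntactic structure, then compare the resulting strings with LCS) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the paper's improved LCS, so the textbook dynamic-programming LCS stands in for it, and `TOY_LEXICON`, `pos_tags`, `similarity`, and `rank_by_similarity` are hypothetical names introduced only for illustration. A real system would use a trained POS tagger rather than a lookup table, and the length normalization used here is one plausible choice among several.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy lexicon (an assumption for this sketch).
# A real system would use a trained part-of-speech tagger instead.
TOY_LEXICON: Dict[str, str] = {
    "the": "DT", "a": "DT",
    "cat": "NN", "dog": "NN", "mat": "NN", "rug": "NN",
    "sat": "VB", "slept": "VB", "runs": "VB",
    "on": "IN", "under": "IN",
    "quickly": "RB",
}


def pos_tags(text: str) -> List[str]:
    """Represent a text by its sequence of POS tags; unknown words get 'UNK'."""
    return [TOY_LEXICON.get(word, "UNK") for word in text.lower().split()]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Textbook dynamic-programming LCS length (not the paper's improved variant)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length normalized by the longer sequence (an assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def rank_by_similarity(query: str, corpus: List[str]) -> List[Tuple[float, str]]:
    """Rank corpus documents by the similarity of their tag sequence to the query's."""
    query_tags = pos_tags(query)
    scored = [(similarity(query_tags, pos_tags(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = ["the cat runs quickly", "a dog slept under the rug"]
    for score, doc in rank_by_similarity("the cat sat on the mat", docs):
        print(f"{score:.2f}  {doc}")
```

In this toy setup, two sentences that share no content words can still score highly when their tag sequences match. That is a property of comparing syntactic structure rather than surface words, and it suggests why such a score would typically be paired with a threshold or with additional lexical checks.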

Original language: English
Title of host publication: ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology
Pages: 679-684
Number of pages: 6
Publication status: Published - 2009
Event: 4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009 - Seoul, Korea, Republic of
Duration: Nov 24 2009 to Nov 26 2009

Publication series

Name: ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology

Conference

Conference: 4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009
Country/Territory: Korea, Republic of
City: Seoul
Period: 11/24/09 to 11/26/09

Keywords

  • Component: part-of-speech
  • Duplication filtering
  • Longest common subsequence
  • Syntactical structure

ASJC Scopus subject areas

  • Computer Science(all)
  • Information Systems and Management
