Use of text syntactical structures in detection of document duplicates

Mohamed Elhadi*, Amjad Al-Tobi

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

25 Citations (Scopus)

Abstract

This is the first paper on a set of experiments addressing the determination of text similarity using a combination of syntactical representation and string alignment techniques. The suggested approach takes advantage of the syntactical structure of documents, as manifested in Part of Speech (POS) tags, and uses it as a basis for further processing. Documents, including the query, are preprocessed with a POS tagger that converts each of them into a reduced string capturing some of the author's writing style and some of the semantics of the written text. This represents a document at a higher level of abstraction, one that captures the alterations that can be made to documents that are similar in origin and possibly in style. It in turn enables such documents to be processed with many of the available string manipulation and matching algorithms. The work is inspired and driven by the parallel between text processing and sequence alignment in computational biology. Sequence alignment techniques are used to analyze, and establish the utility of, the strings produced by this syntactical representation of content.
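
The pipeline described in the abstract (POS tagging, reduction to a string, then sequence alignment) can be illustrated with a minimal sketch. It is not the authors' implementation. The toy lexicon, the one-letter tag codes, the function names, the choice of Smith-Waterman local alignment, and the scoring values are all assumptions made for illustration; the abstract does not specify the tagger, the tag set, or the alignment algorithm used.

```python
# Hypothetical toy lexicon: word -> one-letter POS code. This stands in for a
# real POS tagger and tag set, which the abstract does not specify.
TOY_LEXICON = {
    "the": "D", "a": "D",
    "cat": "N", "dog": "N", "mat": "N", "rug": "N",
    "sat": "V", "lay": "V", "slept": "V",
    "on": "P", "under": "P",
    "quietly": "R",
}

# Assumed scoring scheme (not taken from the paper).
MATCH, MISMATCH, GAP = 2, -1, -1


def pos_string(text):
    """Reduce a text to a string of one-letter POS codes ('X' if unknown)."""
    words = text.lower().split()
    return "".join(TOY_LEXICON.get(w.strip(".,;:!?"), "X") for w in words)


def local_alignment_score(a, b):
    """Best Smith-Waterman local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(doc_a, doc_b):
    """Alignment score divided by the maximum score the shorter string allows."""
    sa, sb = pos_string(doc_a), pos_string(doc_b)
    shortest = min(len(sa), len(sb))
    if shortest == 0:
        return 0.0
    return local_alignment_score(sa, sb) / (MATCH * shortest)
```

In this sketch, "The cat sat on the mat." and "The dog lay on a rug" both reduce to the same tag string, so word substitutions that preserve syntactic structure leave the similarity unchanged, while an inserted adverb costs only a single gap penalty.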

Original language: English
Title of host publication: 3rd International Conference on Digital Information Management, ICDIM 2008
Pages: 520-525
Number of pages: 6
Publication status: Published - 2008
Externally published: Yes
Event: 3rd International Conference on Digital Information Management, ICDIM 2008 - London, United Kingdom
Duration: Nov 13 2008 to Nov 16 2008

Publication series

Name: 3rd International Conference on Digital Information Management, ICDIM 2008

Conference

Conference: 3rd International Conference on Digital Information Management, ICDIM 2008
Country/Territory: United Kingdom
City: London
Period: 11/13/08 to 11/16/08

ASJC Scopus subject areas

  • Information Systems
  • Information Systems and Management
