TY - GEN
T1 - Use of text syntactical structures in detection of document duplicates
AU - Elhadi, Mohamed
AU - Al-Tobi, Amjad
PY - 2008
Y1 - 2008
N2 - This is the first paper on a set of experiments addressing issues related to the determination of text similarity using a combination of syntactical representation and string alignment techniques. The suggested approach takes advantage of document syntactical structure, as manifested in part-of-speech (POS) tags, and uses it as a basis for further processing. Documents, including the query, are preprocessed using a POS tagger that converts them into a reduced string capturing some of the writing style of the authors and some of the semantics of the written text. This provides a means of representing a document at a higher level of abstraction that captures the different alterations that can be made to documents that are similar in origin and possibly in style. This in turn enables processing of such documents using many of the available string manipulation and matching algorithms. This work is inspired and driven by the parallel between text processing and sequence alignment in computational biology. Sequence alignment techniques are used to analyze and establish the utility of the strings produced by this syntactical representation of content.
AB - This is the first paper on a set of experiments addressing issues related to the determination of text similarity using a combination of syntactical representation and string alignment techniques. The suggested approach takes advantage of document syntactical structure, as manifested in part-of-speech (POS) tags, and uses it as a basis for further processing. Documents, including the query, are preprocessed using a POS tagger that converts them into a reduced string capturing some of the writing style of the authors and some of the semantics of the written text. This provides a means of representing a document at a higher level of abstraction that captures the different alterations that can be made to documents that are similar in origin and possibly in style. This in turn enables processing of such documents using many of the available string manipulation and matching algorithms. This work is inspired and driven by the parallel between text processing and sequence alignment in computational biology. Sequence alignment techniques are used to analyze and establish the utility of the strings produced by this syntactical representation of content.
UR - http://www.scopus.com/inward/record.url?scp=62949125921&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=62949125921&partnerID=8YFLogxK
U2 - 10.1109/ICDIM.2008.4746719
DO - 10.1109/ICDIM.2008.4746719
M3 - Conference contribution
AN - SCOPUS:62949125921
SN - 9781424429172
T3 - 3rd International Conference on Digital Information Management, ICDIM 2008
SP - 520
EP - 525
BT - 3rd International Conference on Digital Information Management, ICDIM 2008
T2 - 3rd International Conference on Digital Information Management, ICDIM 2008
Y2 - 13 November 2008 through 16 November 2008
ER -