Use of text syntactical structures in detection of document duplicates

Mohamed Elhadi*, Amjad Al-Tobi

*Corresponding author for this work

Research output: Conference contribution

25 Citations (Scopus)

Abstract

This is the first paper on a set of experiments addressing the determination of text similarity using a combination of syntactical representation and string alignment techniques. The suggested approach takes advantage of a document's syntactical structure, as manifested in Part of Speech (POS) tags, and uses it as a basis for further processing. Documents, including the query, are preprocessed with a POS tagger that converts each into a reduced string capturing some of the author's writing style and some of the semantics of the written text. This provides a means of representing a document at a higher level of abstraction, one that captures the different alterations that can be made to documents that are similar in origin and possibly in style. This in turn enables such documents to be processed with many of the available string manipulation and matching algorithms. The work is inspired and driven by the parallel between text processing and sequence alignment in computational biology. Sequence alignment techniques are used to analyze and establish the utility of the strings produced by this syntactical representation of content.
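The pipeline the abstract describes (reduce each document to a string of POS tags, then compare the strings with a sequence alignment technique) can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration: the abstract does not name the tagger, the tag set, or the alignment algorithm, so the sketch uses a tiny hand-written lexicon (`LEXICON`, hypothetical) in place of a real POS tagger and Smith-Waterman local alignment as one representative alignment method. The function names and scoring values are likewise hypothetical.

```python
# Toy stand-in for a real POS tagger: a hand-written word -> tag-letter lexicon.
# Hypothetical; the tagger and tag set used in the paper are not specified here.
LEXICON = {
    "the": "D", "a": "D",                            # determiners
    "cat": "N", "dog": "N", "mat": "N", "rug": "N",  # nouns
    "sat": "V", "slept": "V", "lay": "V",            # verbs
    "on": "P", "under": "P",                         # prepositions
    "quickly": "R",                                  # adverb
}


def to_tag_string(text: str) -> str:
    """Reduce a text to one tag letter per word ('X' for unknown words)."""
    words = text.lower().split()
    return "".join(LEXICON.get(w.strip(".,;:!?"), "X") for w in words)


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(doc_a: str, doc_b: str, match: int = 2) -> float:
    """Alignment score of the two tag strings, scaled to the range [0, 1]."""
    tags_a, tags_b = to_tag_string(doc_a), to_tag_string(doc_b)
    shortest = min(len(tags_a), len(tags_b))
    if shortest == 0:
        return 0.0
    # The best possible score is a gap-free match of the whole shorter string.
    return local_alignment_score(tags_a, tags_b, match=match) / (match * shortest)
```

With this toy lexicon, "The cat sat on the mat." and "A dog slept under the rug" both reduce to the tag string `DNVPDN`, so they score as identical even though they share only one word. Inserting an adverb ("The cat quickly sat on the mat") costs a single gap in the alignment rather than breaking the match, which is the kind of alteration the tag-level representation is meant to tolerate.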

Original language: English
Title of host publication: 3rd International Conference on Digital Information Management, ICDIM 2008
Pages: 520-525
Number of pages: 6
Publication status: Published - 2008
Externally published: Yes
Event: 3rd International Conference on Digital Information Management, ICDIM 2008 - London, United Kingdom
Duration: 13 Nov 2008 to 16 Nov 2008

Publication series

Name: 3rd International Conference on Digital Information Management, ICDIM 2008

Conference

Conference: 3rd International Conference on Digital Information Management, ICDIM 2008
Country/Territory: United Kingdom
City: London
Period: 13 Nov 2008 to 16 Nov 2008

ASJC Scopus subject areas

  • Information Systems (ASJC 1710)
  • Information Systems and Management (ASJC 1802)
