Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures

Mohamed Elhadi*, Amjad Al-Tobi

*Corresponding author for this work

Research output: Conference contribution

22 Citations (Scopus)

Abstract

This paper reports on experiments investigating the combined use of Part of Speech (POS) information and an improved Longest Common Subsequence (LCS) algorithm to analyze and calculate the similarity between texts. Each document was represented by its syntactical structure. The improved LCS algorithm was applied to this representation to compare and rank documents according to the similarity of their representative strings. The approach was applied to detecting duplicate documents within a corpus and to filtering search engine results. The results obtained were encouraging.
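The idea described in the abstract can be illustrated with a minimal sketch: reduce each text to a sequence of POS tags, then score two texts by the length of the longest common subsequence of their tag sequences. Everything below is an assumption made for illustration. The abstract does not specify the tagger, the improvements made to LCS, or the scoring formula, so the sketch uses a hypothetical toy lexicon (`TOY_LEXICON`) in place of a real tagger, the classic dynamic-programming LCS rather than the paper's improved variant, and a normalization by the longer sequence length.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_LEXICON: Dict[str, str] = {
    "the": "DT", "a": "DT",
    "cat": "NN", "dog": "NN", "mat": "NN", "rug": "NN",
    "sat": "VB", "slept": "VB", "quickly": "RB",
    "on": "IN", "under": "IN",
}


def pos_sequence(text: str) -> List[str]:
    """Map each whitespace-separated token to a tag; unknown words get 'UNK'."""
    return [
        TOY_LEXICON.get(token.strip(".,;:!?").lower(), "UNK")
        for token in text.split()
    ]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic dynamic-programming LCS length (not the paper's improved variant)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def rank_by_similarity(query: str, docs: List[str]) -> List[Tuple[float, str]]:
    """Return (score, document) pairs sorted from most to least similar."""
    query_tags = pos_sequence(query)
    scored = [(similarity(query_tags, pos_sequence(doc)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    ranked = rank_by_similarity(
        "The cat sat on the mat.",
        ["Quickly slept.", "A dog slept under the rug."],
    )
    for score, doc in ranked:
        print(f"{score:.2f}  {doc}")
```

Under this toy lexicon, "The cat sat on the mat." and "A dog slept under the rug." map to the same tag sequence, so they score 1.0 even though they share almost no words. That shows why a purely syntactic representation may need a threshold or additional lexical evidence before two documents are treated as duplicates.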

Original language: English
Title of host publication: ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology
Pages: 679-684
Number of pages: 6
Publication status: Published - 2009
Externally published: Yes
Event: 4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009 - Seoul, Korea, Republic of
Duration: November 24, 2009 to November 26, 2009

Publication series

Name: ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology

Conference

Conference: 4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009
Country/Territory: Korea, Republic of
City: Seoul
Period: 11/24/09 to 11/26/09

ASJC Scopus subject areas

  • General Computer Science (ASJC 1700)
  • Information Systems and Management (ASJC 1802)
