Webpage duplicate detection using combined POS and sequence alignment algorithm

Mohamed Elhadi*, Amjad Al-Tobi

*Corresponding author for this work

Research output: Conference contribution

3 Citations (Scopus)

Abstract

Combined syntactical categories and sequence alignment algorithms are implemented and used to weed out duplicate and near-duplicate webpages from search engine results. The syntactical structures, represented as POS tags, were obtained by running a POS tagger that converts parts of a webpage's text into a string of tags. The resulting string was then subjected to Longest Common Subsequence (LCS) techniques (as is commonly done in computational biology) to detect duplicate and near-duplicate webpages. Tagging and alignment were based on a set of sentences extracted from the webpage as representative of the page, with the query keywords used as the basis for sentence extraction. Experimental results show that such a combined approach can provide a promising similarity calculation and re-ranking measure, which can be used with reasonable efficiency to detect duplication in results generated by search engines such as Google. The similarity measurements obtained can further serve as a basis for text analysis of search results, allowing the detection of duplicates and near-duplicates and the clustering of documents in general.
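The pipeline described in the abstract (keyword-based sentence selection, POS tagging, LCS alignment of the tag strings) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy lexicon `TOY_LEXICON` stands in for a real POS tagger, the function names are hypothetical, and the normalisation `2 * LCS / (len(a) + len(b))` is an assumed similarity score, since the abstract does not state which formula was used.

```python
from typing import Dict, List, Sequence

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_LEXICON: Dict[str, str] = {
    "the": "DT", "a": "DT",
    "cat": "NN", "dog": "NN", "mat": "NN", "rug": "NN",
    "sat": "VBD", "slept": "VBD",
    "on": "IN", "under": "IN",
}


def select_sentences(text: str, keywords: Sequence[str]) -> List[str]:
    """Keep only sentences that contain at least one query keyword."""
    wanted = {k.lower() for k in keywords}
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if wanted & set(s.lower().split())]


def tag_sequence(sentences: Sequence[str]) -> List[str]:
    """Convert words to POS tags; unknown words get the placeholder tag 'X'."""
    return [TOY_LEXICON.get(w.lower(), "X") for s in sentences for w in s.split()]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed score in [0, 1]: twice the LCS length over the combined length."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # nothing to compare
    return 2.0 * lcs_length(a, b) / total


if __name__ == "__main__":
    query = ["cat", "dog"]
    page_a = "The cat sat on the mat. Dogs bark loudly."
    page_b = "The dog slept under the rug. Birds sing."
    tags_a = tag_sequence(select_sentences(page_a, query))
    tags_b = tag_sequence(select_sentences(page_b, query))
    print(tags_a, tags_b, similarity(tags_a, tags_b))
```

In this sketch the two pages share no content words beyond "the", yet their selected sentences produce identical tag strings, so the score measures structural rather than lexical similarity. Any threshold for declaring two pages duplicates would have to be chosen empirically.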

Original language: English
Title of host publication: 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009
Pages: 630-634
Number of pages: 5
Publication status: Published - 2009
Externally published: Yes
Event: 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009 - Los Angeles, CA, United States
Duration: March 31, 2009 to April 2, 2009

Publication series

Name: 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009
Volume: 1

Conference

Conference: 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009
Country/Territory: United States
City: Los Angeles, CA
Period: 3/31/09 to 4/2/09

ASJC Scopus subject areas

  • Computer Science Applications
  • Hardware and Architecture
  • Information Systems
  • Software
