Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures

Mohamed Elhadi; Amjad Al-Tobi

doi:10.1109/ICCIT.2009.235

Mohamed Elhadi^*, Amjad Al-Tobi

^*Corresponding author for this work

Research output: Conference contribution

22 Citations (Scopus)

Abstract

This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to such a representation to compare and rank the documents according to the similarity of their representative string. The approach was applied in detecting duplicate documents within a corpus, and in the filtering of search engine results. Results obtained were encouraging.
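
The pipeline the abstract describes (represent each text by its POS tag sequence, compare the sequences with LCS, rank by similarity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not specify the "improved" LCS, so the classical dynamic-programming LCS is used instead; `TOY_LEXICON` is a hypothetical stand-in for a real POS tagger; and normalising by the longer sequence is an assumed choice of similarity score.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy lexicon standing in for a real POS tagger (an assumption,
# not the tagger used in the paper). Unknown words fall back to "NN".
TOY_LEXICON: Dict[str, str] = {
    "the": "DT", "a": "DT",
    "cat": "NN", "dog": "NN", "mat": "NN", "rug": "NN",
    "sat": "VBD", "slept": "VBD", "barked": "VBD",
    "on": "IN", "loudly": "RB",
}


def pos_sequence(text: str) -> List[str]:
    """Map a text to its sequence of POS tags (its syntactical structure)."""
    return [TOY_LEXICON.get(word, "NN") for word in text.lower().split()]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Classical dynamic-programming LCS length, keeping only two rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length normalised by the longer sequence (assumed score), in [0, 1]."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def rank_by_similarity(query: str, corpus: Sequence[str]) -> List[Tuple[float, str]]:
    """Score every corpus text against the query and sort best-first."""
    q = pos_sequence(query)
    scored = [(similarity(q, pos_sequence(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In this sketch, `rank_by_similarity("The cat sat on the mat", corpus)` scores each text in `corpus` purely on tag order, so two sentences with different words but the same tag sequence receive a score of 1.0. A duplicate filter would then flag pairs whose score exceeds a chosen threshold; the threshold itself is not given in the abstract.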

Original language	English
Title of host publication	ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology
Pages	679-684
Number of pages	6
DOI	https://doi.org/10.1109/ICCIT.2009.235
Publication status	Published - 2009
Externally published	Yes
Event	4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009 - Seoul, Korea, Republic of. Duration: November 24, 2009 → November 26, 2009

Publication series

Name	ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology

Conference

Conference	4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009
Country/Territory	Korea, Republic of
City	Seoul
Period	11/24/09 → 11/26/09

ASJC Scopus subject areas

General Computer Science (ASJC 1700)
Information Systems and Management (ASJC 1802)

Access to document

10.1109/ICCIT.2009.235

Cite this

Elhadi, M., & Al-Tobi, A. (2009). Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. In ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology (pp. 679-684). Article 5368928 (ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology). https://doi.org/10.1109/ICCIT.2009.235

Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. / Elhadi, Mohamed; Al-Tobi, Amjad.
ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology. 2009. p. 679-684 5368928 (ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology).

Elhadi, M & Al-Tobi, A 2009, Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. in ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology., 5368928, ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology, pp. 679-684, 4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009, Seoul, Korea, Republic of, 11/24/09. https://doi.org/10.1109/ICCIT.2009.235

Elhadi M, Al-Tobi A. Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. In ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology. 2009. p. 679-684. 5368928. (ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology). doi: 10.1109/ICCIT.2009.235

Elhadi, Mohamed ; Al-Tobi, Amjad. / Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology. 2009. pp. 679-684 (ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology).

@inproceedings{8f75d042a5e84f2494c559778ed46904,

title = "Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures",

abstract = "This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to such a representation to compare and rank the documents according to the similarity of their representative string. The approach was applied in detecting duplicate documents within a corpus, and in the filtering of search engine results. Results obtained were encouraging.",

keywords = "Component: part-of-speech, Duplication filtering, Longest common subsequence, Syntactical structure",

author = "Mohamed Elhadi and Amjad Al-Tobi",

year = "2009",

doi = "10.1109/ICCIT.2009.235",

language = "English",

isbn = "9780769538969",

series = "ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology",

pages = "679--684",

booktitle = "ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology",

note = "4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009 ; Conference date: 24-11-2009 Through 26-11-2009",

}

TY - GEN

T1 - Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures

AU - Elhadi, Mohamed

AU - Al-Tobi, Amjad

PY - 2009

Y1 - 2009

N2 - This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to such a representation to compare and rank the documents according to the similarity of their representative string. The approach was applied in detecting duplicate documents within a corpus, and in the filtering of search engine results. Results obtained were encouraging.

AB - This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to such a representation to compare and rank the documents according to the similarity of their representative string. The approach was applied in detecting duplicate documents within a corpus, and in the filtering of search engine results. Results obtained were encouraging.

KW - Component: part-of-speech

KW - Duplication filtering

KW - Longest common subsequence

KW - Syntactical structure

UR - http://www.scopus.com/inward/record.url?scp=77749301855&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77749301855&partnerID=8YFLogxK

U2 - 10.1109/ICCIT.2009.235

DO - 10.1109/ICCIT.2009.235

M3 - Conference contribution

AN - SCOPUS:77749301855

SN - 9780769538969

T3 - ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology

SP - 679

EP - 684

BT - ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology

T2 - 4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009

Y2 - 24 November 2009 through 26 November 2009

ER -
