Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures

Mohamed Elhadi; Amjad Al-Tobi

doi:10.1109/ICCIT.2009.235

Mohamed Elhadi^*, Amjad Al-Tobi

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

22 Citations (Scopus)

Abstract

This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to such a representation to compare and rank the documents according to the similarity of their representative string. The approach was applied in detecting duplicate documents within a corpus, and in the filtering of search engine results. Results obtained were encouraging.
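The pipeline the abstract outlines (represent each text by its part-of-speech sequence, then compare sequences by longest common subsequence and rank by the resulting score) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the LCS was improved, so the textbook dynamic-programming LCS is used instead; `TOY_LEXICON` is a hypothetical stand-in for a real POS tagger; and normalizing by the longer sequence is an assumed choice of similarity score.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_LEXICON: Dict[str, str] = {
    "the": "DT", "a": "DT",
    "cat": "NN", "dog": "NN", "mat": "NN", "rug": "NN",
    "sat": "VB", "slept": "VB", "runs": "VB",
    "on": "IN", "under": "IN",
    "quickly": "RB",
}


def pos_sequence(text: str) -> List[str]:
    """Map each whitespace-separated token to a tag; unknown words become 'UNK'."""
    return [
        TOY_LEXICON.get(token.strip(".,;:!?").lower(), "UNK")
        for token in text.split()
    ]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Textbook dynamic-programming LCS length, keeping only two rows.

    This is the standard algorithm, not the improved variant the paper refers to.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def rank_by_similarity(query: str, corpus: List[str]) -> List[Tuple[float, str]]:
    """Score every document against the query and sort from most to least similar."""
    query_tags = pos_sequence(query)
    scored = [(similarity(query_tags, pos_sequence(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    corpus = ["The dog runs quickly", "A dog slept under a rug"]
    for score, doc in rank_by_similarity("The cat sat on the mat.", corpus):
        print(f"{score:.2f}  {doc}")
```

Because the comparison runs on tag sequences rather than words, two sentences with different vocabulary but the same syntactic pattern score as identical in this sketch. A duplicate detector built this way would presumably need a threshold or an additional lexical check, which the abstract does not describe.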

Original language	English
Title of host publication	ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology
Pages	679-684
Number of pages	6
DOIs	https://doi.org/10.1109/ICCIT.2009.235
Publication status	Published - 2009
Externally published	Yes
Event	4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009 - Seoul, Korea, Republic of
Duration	Nov 24 2009 → Nov 26 2009

Publication series

Name	ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology

Conference

Conference	4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009
Country/Territory	Korea, Republic of
City	Seoul
Period	11/24/09 → 11/26/09

Keywords

Component: part-of-speech
Duplication filtering
Longest common subsequence
Syntactical structure

ASJC Scopus subject areas

General Computer Science
Information Systems and Management

Access to Document

10.1109/ICCIT.2009.235

Cite this

Elhadi, M., & Al-Tobi, A. (2009). Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. In ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology (pp. 679-684). Article 5368928 (ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology). https://doi.org/10.1109/ICCIT.2009.235

Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. / Elhadi, Mohamed; Al-Tobi, Amjad.
ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology. 2009. p. 679-684 5368928 (ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology).

Elhadi, M & Al-Tobi, A 2009, Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. in ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology., 5368928, ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology, pp. 679-684, 4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009, Seoul, Korea, Republic of, 11/24/09. https://doi.org/10.1109/ICCIT.2009.235

Elhadi M, Al-Tobi A. Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. In ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology. 2009. p. 679-684. 5368928. (ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology). doi: 10.1109/ICCIT.2009.235

Elhadi, Mohamed ; Al-Tobi, Amjad. / Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology. 2009. pp. 679-684 (ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology).

@inproceedings{8f75d042a5e84f2494c559778ed46904,
  title = "Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures",
  abstract = "This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to such a representation to compare and rank the documents according to the similarity of their representative string. The approach was applied in detecting duplicate documents within a corpus, and in the filtering of search engine results. Results obtained were encouraging.",
  keywords = "Component: part-of-speech, Duplication filtering, Longest common subsequence, Syntactical structure",
  author = "Mohamed Elhadi and Amjad Al-Tobi",
  year = "2009",
  doi = "10.1109/ICCIT.2009.235",
  language = "English",
  isbn = "9780769538969",
  series = "ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology",
  pages = "679--684",
  booktitle = "ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology",
  note = "4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009 ; Conference date: 24-11-2009 Through 26-11-2009",
}

TY - GEN
T1 - Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures
AU - Elhadi, Mohamed
AU - Al-Tobi, Amjad
PY - 2009
Y1 - 2009
N2 - This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to such a representation to compare and rank the documents according to the similarity of their representative string. The approach was applied in detecting duplicate documents within a corpus, and in the filtering of search engine results. Results obtained were encouraging.
AB - This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to such a representation to compare and rank the documents according to the similarity of their representative string. The approach was applied in detecting duplicate documents within a corpus, and in the filtering of search engine results. Results obtained were encouraging.
KW - Component: part-of-speech
KW - Duplication filtering
KW - Longest common subsequence
KW - Syntactical structure
UR - http://www.scopus.com/inward/record.url?scp=77749301855&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77749301855&partnerID=8YFLogxK
U2 - 10.1109/ICCIT.2009.235
DO - 10.1109/ICCIT.2009.235
M3 - Conference contribution
AN - SCOPUS:77749301855
SN - 9780769538969
T3 - ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology
SP - 679
EP - 684
BT - ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology
T2 - 4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009
Y2 - 24 November 2009 through 26 November 2009
ER -
