Webpage duplicate detection using combined POS and sequence alignment algorithm

Mohamed Elhadi; Amjad Al-Tobi

doi:10.1109/CSIE.2009.771

Mohamed Elhadi^*, Amjad Al-Tobi

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Syntactic categories and a sequence alignment algorithm are combined to weed out duplicate and near-duplicate webpages from search engine results. A part-of-speech (POS) tagger converts parts of a webpage's text into a string of POS tags. The resulting string is then subjected to the Longest Common Sequence (LCS) technique, as is commonly done in computational biology, to detect duplicate and near-duplicate webpages. Tagging and alignment are based on a set of sentences extracted from the webpage as representative of the page; the query keywords are used as the basis for sentence extraction. Experiments show that this combined approach yields a similarity measure suitable for re-ranking, and that it can detect duplicates in the results of search engines such as Google with reasonable efficiency. The similarity measurements can also serve as a basis for text analysis of search results, supporting the detection of duplicates and near-duplicates and the clustering of documents in general.
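The pipeline described in the abstract (keyword-based sentence extraction, POS tagging, LCS alignment of the tag strings) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy lexicon stands in for a real POS tagger, the function names are hypothetical, and the normalisation of the LCS length into a score between 0 and 1 is an assumption, since the abstract does not say how similarity is computed from the alignment.

```python
import re

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
# Words that are not listed default to the noun tag "NN".
TOY_LEXICON = {
    "the": "DT", "a": "DT", "an": "DT",
    "is": "VBZ", "returns": "VBZ", "ranks": "VBZ",
    "of": "IN", "in": "IN", "on": "IN",
    "quickly": "RB", "duplicate": "JJ",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_sentences(text, query_keywords):
    """Keep only sentences that contain at least one query keyword."""
    keywords = {k.lower() for k in query_keywords}
    return [
        s for s in split_sentences(text)
        if keywords & set(re.findall(r"[a-z]+", s.lower()))
    ]


def toy_pos_tags(sentence):
    """Map each word to a POS tag using the toy lexicon."""
    return [TOY_LEXICON.get(w, "NN") for w in re.findall(r"[a-z]+", sentence.lower())]


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tag_similarity(text_a, text_b, query_keywords):
    """Score two pages by the LCS of their POS-tag strings.

    The normalisation 2 * LCS / (len_a + len_b) is an assumption made
    for this sketch, not a formula taken from the paper.
    """
    tags_a = [t for s in extract_sentences(text_a, query_keywords) for t in toy_pos_tags(s)]
    tags_b = [t for s in extract_sentences(text_b, query_keywords) for t in toy_pos_tags(s)]
    if not tags_a or not tags_b:
        return 0.0
    return 2 * lcs_length(tags_a, tags_b) / (len(tags_a) + len(tags_b))


if __name__ == "__main__":
    page_a = "The search engine returns duplicate pages. Cats sleep a lot."
    page_b = "A search engine ranks pages quickly. Dogs bark."
    print(tag_similarity(page_a, page_b, ["search"]))
```

Under these assumptions, a pair of result pages whose score exceeds some chosen threshold would be flagged as near-duplicates; the threshold itself would have to be tuned experimentally. Aligning tag strings rather than raw words means two pages can score highly when they share sentence structure even if individual words differ.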

Original language	English
Title of host publication	2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009
Pages	630-634
Number of pages	5
DOIs	https://doi.org/10.1109/CSIE.2009.771
Publication status	Published - 2009
Externally published	Yes
Event	2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009 - Los Angeles, CA, United States
Duration	Mar 31 2009 → Apr 2 2009

Publication series

Name	2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009
Volume	1

Conference

Conference	2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009
Country/Territory	United States
City	Los Angeles, CA
Period	3/31/09 → 4/2/09

Keywords

Copy detection
Duplicate
LCS
Longest common sequence
POS
Part-of-speech
Search engine

ASJC Scopus subject areas

Computer Science Applications
Hardware and Architecture
Information Systems
Software

Access to Document

10.1109/CSIE.2009.771

Cite this

Webpage duplicate detection using combined POS and sequence alignment algorithm. / Elhadi, Mohamed; Al-Tobi, Amjad.
2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009. 2009. p. 630-634 5171248 (2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009; Vol. 1).

Elhadi, M & Al-Tobi, A 2009, Webpage duplicate detection using combined POS and sequence alignment algorithm. in 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009., 5171248, 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009, vol. 1, pp. 630-634, 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009, Los Angeles, CA, United States, 3/31/09. https://doi.org/10.1109/CSIE.2009.771

@inproceedings{cef7be8f0b8946f88f70d8431b3a5df0,

title = "Webpage duplicate detection using combined POS and sequence alignment algorithm",

abstract = "Syntactic categories and a sequence alignment algorithm are combined to weed out duplicate and near-duplicate webpages from search engine results. A part-of-speech (POS) tagger converts parts of a webpage's text into a string of POS tags. The resulting string is then subjected to the Longest Common Sequence (LCS) technique, as is commonly done in computational biology, to detect duplicate and near-duplicate webpages. Tagging and alignment are based on a set of sentences extracted from the webpage as representative of the page; the query keywords are used as the basis for sentence extraction. Experiments show that this combined approach yields a similarity measure suitable for re-ranking, and that it can detect duplicates in the results of search engines such as Google with reasonable efficiency. The similarity measurements can also serve as a basis for text analysis of search results, supporting the detection of duplicates and near-duplicates and the clustering of documents in general.",

keywords = "Copy detection, Duplicate, LCS, Longest common sequence, POS, Part-of-speech, Search engine",

author = "Mohamed Elhadi and Amjad Al-Tobi",

year = "2009",

doi = "10.1109/CSIE.2009.771",

language = "English",

isbn = "9780769535074",

series = "2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009",

pages = "630--634",

booktitle = "2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009",

note = "2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009 ; Conference date: 31-03-2009 Through 02-04-2009",

}

TY - GEN

T1 - Webpage duplicate detection using combined POS and sequence alignment algorithm

AU - Elhadi, Mohamed

AU - Al-Tobi, Amjad

PY - 2009

Y1 - 2009

AB - Syntactic categories and a sequence alignment algorithm are combined to weed out duplicate and near-duplicate webpages from search engine results. A part-of-speech (POS) tagger converts parts of a webpage's text into a string of POS tags. The resulting string is then subjected to the Longest Common Sequence (LCS) technique, as is commonly done in computational biology, to detect duplicate and near-duplicate webpages. Tagging and alignment are based on a set of sentences extracted from the webpage as representative of the page; the query keywords are used as the basis for sentence extraction. Experiments show that this combined approach yields a similarity measure suitable for re-ranking, and that it can detect duplicates in the results of search engines such as Google with reasonable efficiency. The similarity measurements can also serve as a basis for text analysis of search results, supporting the detection of duplicates and near-duplicates and the clustering of documents in general.

KW - Copy detection

KW - Duplicate

KW - LCS

KW - Longest common sequence

KW - POS

KW - Part-of-speech

KW - Search engine

UR - http://www.scopus.com/inward/record.url?scp=70449127700&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70449127700&partnerID=8YFLogxK

U2 - 10.1109/CSIE.2009.771

DO - 10.1109/CSIE.2009.771

M3 - Conference contribution

AN - SCOPUS:70449127700

SN - 9780769535074

T3 - 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009

SP - 630

EP - 634

BT - 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009

T2 - 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009

Y2 - 31 March 2009 through 2 April 2009

ER -
