A Quantitative Comparison of Semantic Web Page Segmentation Approaches

Robert Kreuzer, Jurriaan Hage, Adrianus Feelders

Research output: Book/Report, Other report

Abstract

This paper explores the effectiveness of different semantic web page segmentation algorithms on modern websites. We compare three known algorithms, each representing a particular approach to the problem, and one self-developed algorithm, WebTerrain, which combines two of the approaches. Using our testing framework, we compared the performance of the four algorithms on a large benchmark that we constructed. We examined each algorithm under a total of eight configurations (varying the dataset, the evaluation metric, and the type of input HTML document). We found that, on average, all algorithms performed better on random pages than on popular pages, and that results were better when the algorithms ran on the HTML obtained from the DOM rather than on the plain HTML. Overall, there is much room for improvement: the best average F-score we found is 0.49, indicating that currently available algorithms are not yet of practical use for modern websites.
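
For readers unfamiliar with the F-score quoted above, the following is a minimal sketch of how such a score is commonly computed for a segmentation result. It is not the report's evaluation code: the abstract does not describe the matching rules, so the sketch assumes that a predicted block counts as correct only if it is identical to a ground-truth block, and the names `f_score` and `block_scores` are hypothetical.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def block_scores(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one page.

    Assumption for illustration only: blocks are represented as strings and
    a predicted block is a hit only if it exactly equals a ground-truth block.
    """
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical page: 4 predicted blocks, 3 ground-truth blocks, 2 in common.
p, r, f = block_scores({"header", "nav", "ad", "footer"},
                       {"header", "nav", "article"})
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Under this reading, an average F-score of 0.49 means that precision and recall together fall well short of a perfect match between the predicted and the ground-truth blocks.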
Original language: English
Publisher: Department of Information and Computing Sciences, Utrecht University
Number of pages: 12
Publication status: Published - Jun 2014

Publication series

Name: Technical Report Series
No.: UU-CS-2014-018
ISSN (Print): 0924-3275
