Abstract
This paper explores the effectiveness of different semantic web page segmentation algorithms on modern websites. We compare three known algorithms, each serving as an example of a particular approach to the problem, and one self-developed algorithm, WebTerrain, that combines two of the approaches. Using our testing framework, we compared the performance of the four algorithms on a large benchmark that we constructed. We examined each algorithm under eight different configurations (varying the dataset, the evaluation metric, and the type of input HTML document). We found that, on average, all algorithms performed better on random pages than on popular pages, and that results are better when the algorithms run on the HTML obtained from the DOM rather than on the plain HTML. Overall, there is much room for improvement: the best average F-score we found is 0.49, indicating that currently available algorithms are not yet of practical use for modern websites.
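
For readers unfamiliar with the reported measure, the following is a minimal sketch of how an F-score can be computed for one page. It assumes the standard balanced F-score (harmonic mean of precision and recall) and exact matching between predicted and ground-truth segments. The abstract does not state the matching rule used in the report, so the function names and the set-based matching here are illustrative assumptions only.

```python
from __future__ import annotations


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Score one page, assuming a segment counts as correct only on exact match.

    `predicted` and `truth` hold hypothetical segment identifiers, for
    example a normalized string of each segment's text content.
    """
    matched = len(predicted & truth)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(truth) if truth else 0.0
    return precision, recall, f_score(precision, recall)


if __name__ == "__main__":
    # Four predicted segments, three ground-truth segments, two in common.
    p, r, f = evaluate({"a", "b", "c", "d"}, {"a", "b", "e"})
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Under this definition, an average F-score of 0.49 means that precision and recall are, in the harmonic-mean sense, balanced at just under one half.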
Original language | English |
---|---|
Publisher | Department of Information and Computing Sciences, Utrecht University |
Number of pages | 12 |
Publication status | Published - Jun 2014 |
Publication series
Name | Technical Report Series |
---|---|
No. | UU-CS-2014-018 |
ISSN (Print) | 0924-3275 |