TY - GEN
T1 - A Quantitative Comparison of Semantic Web Page Segmentation Approaches
AU - Kreuzer, Robert
AU - Hage, J.
AU - Feelders, Ad
PY - 2015/6/10
Y1 - 2015/6/10
N2 - We compare three known semantic web page segmentation algorithms, each serving as an example of a particular approach to the problem, and one self-developed algorithm, WebTerrain, that combines two of the approaches. We compare the performance of the four algorithms for a large benchmark of modern websites we have constructed, examining each algorithm for a total of eight configurations. We found that all algorithms performed better on random pages on average than on popular pages, and results are better when running the algorithms on the HTML obtained from the DOM rather than on the plain HTML. Overall there is much room for improvement as we find the best average F-score to be 0.49, indicating that for modern websites currently available algorithms are not yet of practical use.
AB - We compare three known semantic web page segmentation algorithms, each serving as an example of a particular approach to the problem, and one self-developed algorithm, WebTerrain, that combines two of the approaches. We compare the performance of the four algorithms for a large benchmark of modern websites we have constructed, examining each algorithm for a total of eight configurations. We found that all algorithms performed better on random pages on average than on popular pages, and results are better when running the algorithms on the HTML obtained from the DOM rather than on the plain HTML. Overall there is much room for improvement as we find the best average F-score to be 0.49, indicating that for modern websites currently available algorithms are not yet of practical use.
U2 - 10.1007/978-3-319-19890-3_24
DO - 10.1007/978-3-319-19890-3_24
M3 - Conference contribution
SN - 9783319198897
T3 - Lecture Notes in Computer Science
SP - 374
EP - 391
BT - Engineering the Web in the Big Data Era. ICWE 2015
PB - Springer
ER -