We compare three known semantic web page segmentation algorithms, each serving as an example of a particular approach to the problem, and one self-developed algorithm, WebTerrain, that combines two of the approaches. We compare the performance of the four algorithms for a large benchmark of modern websites we have constructed, examining each algorithm for a total of eight configurations. We found that all algorithms performed better on random pages on average than on popular pages, and results are better when running the algorithms on the HTML obtained from the DOM rather than on the plain HTML. Overall there is much room for improvement as we find the best average F-score to be 0.49, indicating that for modern websites currently available algorithms are not yet of practical use.
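
For readers unfamiliar with the metric, the sketch below shows how an F-score can be computed for one page. It assumes the balanced F1 form (the harmonic mean of precision and recall) and assumes that a found block counts as correct only when it exactly matches a ground-truth block. The abstract does not state the matching criterion used in the paper, so the function names `f_score` and `score_segmentation` and the block labels are purely illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_segmentation(found: set, truth: set) -> float:
    """Score one page, assuming exact matching of block identifiers.

    `found` holds the blocks an algorithm returned and `truth` holds the
    manually marked blocks. Both are hypothetical inputs for illustration.
    """
    correct = len(found & truth)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return f_score(precision, recall)


# Hypothetical page: 3 of 4 returned blocks are correct, 3 of 5 true blocks found.
found_blocks = {"header", "nav", "article", "ad"}
true_blocks = {"header", "nav", "article", "sidebar", "footer"}
page_score = score_segmentation(found_blocks, true_blocks)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a high average requires both precision and recall to be high, which is one way to read the reported best average of 0.49.
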
|Title of host publication||Engineering the Web in the Big Data Era. ICWE 2015|
|Number of pages||18|
|Publication status||Published - 10 Jun 2015|
|Name||Lecture Notes in Computer Science|