Boilerpipe -- Boilerplate Removal and Fulltext Extraction from HTML pages boilerpipe

Group de.l3s.boilerpipe
Description The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. It ships with ready-made strategies for common tasks (for example, news article extraction) and can be extended for individual problem settings. Extraction is fast (milliseconds per document), requires only the input document (no global or site-level information), and is usually quite accurate. Boilerpipe is a Java library written by Christian Kohlschütter and released under the Apache License 2.0. Its algorithms build on and extend concepts from the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010.
Packaging jar
Size 89.87 KB
Files pom jar
URL http://code.google.com/p/boilerpipe/
Release date 2010-11-04 04:40
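
The description above refers to "shallow text features". As a rough illustration of that idea, and not boilerpipe's actual code or API, the sketch below scores text blocks by word count and link density and keeps only the blocks that look like running text. The class name, the `Block` type, and both thresholds are hypothetical and chosen only for this example; real extractors work on blocks parsed from HTML and use more elaborate rules.

```java
import java.util.ArrayList;
import java.util.List;

public class ShallowFeatureSketch {

    /** One text block with the two shallow features used in this sketch. */
    static final class Block {
        final String text;
        final int words;        // number of whitespace-separated tokens
        final int linkedWords;  // how many of those tokens sit inside links

        Block(String text, int linkedWords) {
            this.text = text;
            String trimmed = text.trim();
            this.words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            this.linkedWords = linkedWords;
        }

        /** Fraction of the block's words that are link text. */
        double linkDensity() {
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    // Hypothetical thresholds, picked for illustration only.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** A block counts as content if it is long enough and not dominated by links. */
    static boolean isContent(Block b) {
        return b.words >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Joins the text of all blocks classified as content, one per line. */
    static String extract(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (isContent(b)) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(b.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        page.add(new Block("Home News Sport Weather", 4)); // navigation bar
        page.add(new Block("The city council voted on Tuesday to approve "
                + "a new budget for public transport.", 0)); // article text
        page.add(new Block("Copyright 2010 Example Corp", 0)); // footer
        System.out.println(extract(page));
    }
}
```

Short blocks and link-heavy blocks are dropped, so in this sample only the article sentence survives. Both features can be computed from the input document alone, which matches the description's point that no global or site-level information is needed.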

dependencies

Group Artifact Version

developers

Christian Kohlschütter

licenses

Apache License 2.0
Indexed repositories
Repository Count
Central 592045
5062623