These links were fetched from List of resources: Article text extraction from HTML documents
Links
- Boilerpipe library: an open source Java library and the official implementation of the algorithm presented in the previously mentioned paper by Kohlschütter et al.
- Readability bookmarklet by arc90labs is open sourced. Originally written in JavaScript, it was also ported to other languages:
- python-readability – using BeautifulSoup (slow)
- fork of python-readability employing lxml for faster parsing
- ruby-readability
- PHP port
- Project Goose by Gravity labs
- Perl module HTML::Feature
- Webstemmer is a web crawler and page layout analyzer with a text extraction utility
- Demo of VIPS packaged in a .dll (its use is limited to research purposes only)
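To give a rough sense of the kind of problem these libraries address, here is a minimal Python sketch that uses only the standard library. It is not the algorithm used by Boilerpipe, Readability, or any other project listed above. It is a simplified heuristic that splits a page into text blocks and keeps the ones that are long enough and contain few link characters. The names `BlockCollector` and `extract_article_text`, the tag sets, and the thresholds `min_words` and `max_link_density` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an illustrative, incomplete set).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br",
              "h1", "h2", "h3", "h4", "h5", "h6",
              "article", "section", "blockquote", "pre"}
# Tags whose content is never part of the visible article text.
SKIP_TAGS = {"script", "style", "noscript"}


class BlockCollector(HTMLParser):
    """Collects (text, link_chars) pairs, one per block of visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Collapse whitespace and store the block if it is non-empty.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_article_text(html, min_words=10, max_link_density=0.3):
    """Return blocks that look like running text, joined by blank lines."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, link_chars in collector.blocks:
        link_density = link_chars / len(text)
        if len(text.split()) >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

Calling `extract_article_text(html)` on a page's markup returns the surviving blocks as plain text. Real pages usually need the more careful features and tuning that the libraries above provide, so treat this only as an illustration of the idea.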
Web APIs
After a short inquiry I came across some very decent web APIs:
- Alchemy API Web Page Cleaning – a well known commercial API with a limited free service
- ViewText.org – they’re asking you to be kind to their servers, so this is not your typical commercial service
- DiffBot API – describes itself as: “Statistical machine learning algorithms are run over all of the visual elements on the page to extract out the article text and associated metadata, such as its images, videos, and tags.”
- Purifry – promises high performance and good accuracy. It is also available as a binary.