software:scraping

Apache Tika - a content analysis toolkit

From: http://scraping.apievangelist.com/tools/

Also: http://tika.apache.org/

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file 
types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, 
making Tika useful for search engine indexing, content analysis, translation, and much more. You can 
find the latest release on the download page. Please see the Getting Started page for more information 
on how to start using Tika.
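
The "single interface" idea can be sketched as a detect-then-dispatch step. The Python below is a toy illustration only and is not Tika's API: the names `MAGIC`, `PARSERS`, `detect`, and `parse` are hypothetical, the magic-byte table is deliberately tiny, and only plain text has a parser registered.

```python
from typing import Callable, Dict, Tuple

# Hypothetical magic-byte table (an assumption for illustration);
# a real detector such as Tika's knows far more signatures.
MAGIC: Dict[bytes, str] = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}


def detect(data: bytes) -> str:
    """Guess a MIME type from leading bytes, falling back to a UTF-8 check."""
    for prefix, mime in MAGIC.items():
        if data.startswith(prefix):
            return mime
    try:
        data.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def parse_text(data: bytes) -> str:
    """Extract text from a plain-text payload."""
    return data.decode("utf-8").strip()


# One registry maps each detected type to its parser (hypothetical names).
PARSERS: Dict[str, Callable[[bytes], str]] = {
    "text/plain": parse_text,
}


def parse(data: bytes) -> Tuple[str, str]:
    """Single entry point: detect the type, then dispatch to its parser.

    Returns (mime_type, extracted_text); the text is empty when no
    parser is registered for the detected type.
    """
    mime = detect(data)
    parser = PARSERS.get(mime)
    text = parser(data) if parser else ""
    return mime, text
```

Callers only ever use `parse`; supporting another format in this sketch means adding one entry to `MAGIC` and one to `PARSERS`.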

Diffbot

From: https://www.diffbot.com/products/crawlbot/

Crawlbot is smart spidering. Crawlbot uses any Diffbot API to extract data from entire sites. Whether for product prices, historical weather, content migration, or even three years of Hacker News archives, Crawlbot creates a structured index of practically any site's data.

software/scraping.txt · Last modified: 2018/06/07 04:54 by superwizard