Give me text!

An easy-to-use, free web service for extracting text from PDFs and other documents - OCR support included!

Choose a file...

What is it?

Give Me Text is an online service for converting many complex file formats into plain text. This is particularly useful for document processing and indexing. Using the form above, you can upload any file and see what the Apache Tika software behind the site makes of it. The service also runs Optical Character Recognition (OCR) on image formats to extract text from images. The list of file formats supported is long and can be found here.


API

Behind the site is an instance of the Apache Tika Server, which takes the files and processes them with the Tika engine. The endpoint for the complete API is at http://givemetext.okfnlabs.org/tika. The endpoint for the text extraction service is at http://givemetext.okfnlabs.org/tika/tika, and there are several other useful services too. Check out the Tika Documentation for full details, and don't forget the extra 'tika' in the path compared to the documentation.
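
To make the extra 'tika' path segment concrete, here is a minimal Python sketch that maps a resource path as written in the Tika Server documentation onto this service. The helper name service_url is hypothetical, and the /meta resource is taken from the general Tika Server documentation; whether it is enabled on this instance is an assumption.

```python
# Base URL of the complete API, as quoted in the text above.
API_BASE = "http://givemetext.okfnlabs.org/tika"


def service_url(doc_path: str) -> str:
    """Map a resource path from the Tika Server docs onto this service.

    service_url is a hypothetical helper, not part of Tika itself.
    """
    # Accept both "tika" and "/tika" by stripping any leading slash.
    return API_BASE + "/" + doc_path.lstrip("/")


print(service_url("/tika"))  # http://givemetext.okfnlabs.org/tika/tika
print(service_url("/meta"))  # http://givemetext.okfnlabs.org/tika/meta
```

The first call reproduces the text extraction endpoint given above; any other documented resource should follow the same pattern.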

Here's a simple example using curl to get text from a TIF image:

curl -T my_image.tif http://beta.offenedaten.de:9998/tika

That command uses a PUT request. If you'd like to POST, for example using a web form, you can use http://givemetext.okfnlabs.org/tika/tika/form.
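
For use from a script, the same PUT upload can be sketched in Python with only the standard library. This is a rough sketch rather than an official client: it targets the text extraction endpoint named in the API section, the function names build_put_request and extract_text are made up for illustration, and the Accept: text/plain header is an assumption based on common Tika Server usage.

```python
import urllib.request

# Text extraction endpoint from the API section (note the extra 'tika').
TIKA_TEXT_URL = "http://givemetext.okfnlabs.org/tika/tika"


def build_put_request(data: bytes, url: str = TIKA_TEXT_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the raw file bytes."""
    # Assumption: asking for text/plain makes the server reply with plain text.
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_TEXT_URL, timeout: float = 60.0) -> str:
    """Upload the file at `path` and return the response body as text."""
    with open(path, "rb") as handle:
        request = build_put_request(handle.read(), url)
    # This performs the network call; it raises URLError/HTTPError on failure.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text("my_image.tif") would then mirror the curl command above. OCR on large images can be slow, so the timeout may need raising.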

Note that there is some special advice on using OCR with the Tika server here.


Plans & Get Involved

Current TODOs include:


Source Code

The source code for the Tika project is available through SVN. The Dockerfile for creating an instance can be found on GitHub, as can the web frontend. The arrangement of these three pieces is currently maintained by Matt Fullerton. Just get in touch if you'd like to help out.