Text and metadata extraction service. The extractor module exposes a JSON web service for text and metadata extraction. The supported binary formats are listed below; see each parser's description for the information it returns.
docker run -p 9091:9091 qwazr/extractor
curl -XGET http://localhost:9091/extractor
This request returns the list of available parsers:
[
"audio",
"doc",
"docx",
"eml",
"html",
"image",
"mapimsg",
"markdown",
"ocr",
"odf",
"pdfbox",
"ppt",
"pptx",
"publisher",
"rss",
"rtf",
"text",
"visio",
"xls",
"xlsx",
"wdp"
]
curl -XGET http://localhost:9091/extractor/text
This request describes the fields returned by the parser, along with the file extensions and MIME types it handles:
{
"returnedFields" : [ {
"name" : "content",
"type" : "STRING",
"description" : "The content of the document"
}, {
"name" : "lang_detection",
"type" : "STRING",
"description" : "Detection of the language"
}, {
"name" : "charset_detection",
"type" : "STRING",
"description" : "Detection of the charset"
} ],
"file_extensions" : [ "txt" ],
"mime_types" : [ "text/plain" ]
}
There are several ways to extract data from a file. Either name the parser explicitly in the URL, or let the service select one based on the provided file name:
curl -XPUT --data-binary @tutorial.pdf http://localhost:9091/extractor/pdfbox
curl -XPUT --data-binary @tutorial.pdf http://localhost:9091/extractor?name=tutorial.pdf
If the file is already available on the server, the extraction can be done by passing the path of the file:
curl -XGET http://localhost:9091/extractor/pdfbox?path=/home/manu/tutorial.pdf
curl -XGET http://localhost:9091/extractor?path=/home/manu/tutorial.pdf
The parser extracts the metadata and text information using the following JSON format:
{
"time_elapsed": 2735,
"metas": {
"number_of_pages": [7],
"producer": ["FOP 0.20.5"]
},
"documents": [
{
"content": ["Table of contents Requirements Getting Started Deleting Querying Data Sorting Text Analysis Debugging"],
"character_count":[13634],
"rotation": [ 0 ],
"lang_detection": ["en" ]
}
]
}
Writing a parser is easy: extend the abstract class ParserAbstract and implement the required method.
protected void parseContent(InputStream inputStream, String extension, String mimeType) throws Exception;
The parser must build a list of ParserDocument instances. A parser may return one or more documents (one document per page, one document per RSS item, …). A ParserDocument is a list of name/value pairs.
Have a look at the Rtf class to see a simple example.
@Override
protected void parseContent(InputStream inputStream, String extension,
String mimeType) throws Exception {
// Extract the text data
RTFEditorKit rtf = new RTFEditorKit();
Document doc = rtf.createDefaultDocument();
rtf.read(inputStream, doc, 0);
// Obtain a new parser document.
ParserDocument result = getNewParserDocument();
// Fill the fields of the ParserDocument
result.add(CONTENT, doc.getText(0, doc.getLength()));
// Apply the language detection
result.add(LANG_DETECTION, languageDetection(CONTENT, 10000));
}
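Conceptually, the ParserDocument filled above is a multi-valued map: each field name (such as content or lang_detection) can hold several values, which is why every field in the JSON output is an array. The following is a minimal, self-contained sketch of that idea; ParserDocumentSketch and its methods are hypothetical names for illustration, not the actual QWAZR classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a parser document as a list of name/value pairs,
// where a single field name may carry several values (hence the arrays
// in the JSON output shown earlier).
public class ParserDocumentSketch {

    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // Append a value under the given field name, creating the list if needed.
    public void add(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Return every value recorded for the field, or an empty list.
    public List<Object> get(String name) {
        return fields.getOrDefault(name, Collections.emptyList());
    }

    public static void main(String[] args) {
        ParserDocumentSketch doc = new ParserDocumentSketch();
        doc.add("content", "Table of contents Requirements Getting Started");
        doc.add("lang_detection", "en");
        System.out.println(doc.get("content"));
        System.out.println(doc.get("lang_detection"));
    }
}
```

In the real Rtf parser above, getNewParserDocument() hands back such a document and result.add(...) appends one value per call, exactly as sketched here.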