ChupaText Reference Manual
DESCRIPTION
ChupaText is a text extraction tool intended for use in full-text search systems.
It reads a file, extracts its text and metadata, and outputs them in MIME format. This means that you can use a mail parser to parse ChupaText output.
It supports the following file formats (encrypted files are not supported):
Adobe PDF
Microsoft Word
Microsoft Excel
Microsoft PowerPoint
HTML
text
It also supports the following archive and compression formats:
zip
tar
gz
OUTPUT FORMAT
The chupatext command outputs extracted text data and metadata in MIME format. The header fields carry the metadata and the body carries the text data. The following header fields are always included:
It is always "text/plain; charset=UTF-8".
It is the number of bytes of text data. chupatext appends a newline to the text data, so the body it actually outputs is the number of bytes of text data + 1 byte (for the newline). For example, chupatext outputs 7 bytes for the 6-byte text "Sample": "Sample" (6 bytes) + a newline "\n" (1 byte).
It is the input filename.
It is the MIME type of the input file. It may include the following optional parameters:
It is presentation information for the input file. The type is always "inline".
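The byte-count rule above can be checked with an ordinary mail parser. The following is a minimal sketch using Python's standard email package. The sample output is hand-written for illustration and is not captured from a real chupatext run, and the header names Content-Type and Content-Length are assumptions based on the usual MIME names.

```python
from email import message_from_bytes

# Hypothetical chupatext output for an input whose text is "Sample".
# The header names below are assumed for illustration only.
SAMPLE_OUTPUT = (
    b"Content-Type: text/plain; charset=UTF-8\n"
    b"Content-Length: 6\n"
    b"\n"
    b"Sample\n"
)


def parse_output(raw):
    """Parse MIME-style output and return (message, text, body_length)."""
    message = message_from_bytes(raw)
    # Raw body bytes, including the trailing newline that is appended.
    body = message.get_payload(decode=True)
    # The declared length counts only the text data, not the newline.
    declared = int(message["Content-Length"])
    charset = message.get_content_charset() or "utf-8"
    text = body[:declared].decode(charset)
    return message, text, len(body)


message, text, body_length = parse_output(SAMPLE_OUTPUT)
print(text)         # the extracted text without the trailing newline
print(body_length)  # declared length + 1 byte for the newline
```

Slicing the body by the declared length, rather than stripping whitespace, keeps any newlines that belong to the text data itself.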
Here are optional fields:
EXIT STATUS
The exit status is 0 if extraction succeeds and non-zero otherwise. If --ignore-errors is specified, the exit status is 0 even if extraction fails.
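A caller can rely on this exit status to decide whether the output is usable. The sketch below wraps any extraction command with Python's subprocess module. The helper name run_extractor is hypothetical, and the demonstration uses the Python interpreter as a stand-in command so that it runs without chupatext installed. An invocation such as ["chupatext", "sample.pdf"] is an assumed form, not one confirmed by this manual.

```python
import subprocess
import sys


def run_extractor(command):
    """Run an extraction command (hypothetical helper).

    Returns the command's standard output as bytes when the exit
    status is 0, and None when the exit status is non-zero.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        return None
    return result.stdout


# Stand-in commands that mimic a successful and a failed extraction.
succeeded = run_extractor([sys.executable, "-c", "print('ok')"])
failed = run_extractor([sys.executable, "-c", "import sys; sys.exit(3)"])
print(succeeded is not None)  # True
print(failed is None)         # True
```

Note that when --ignore-errors is specified the exit status no longer distinguishes failure from success, so a check like this one cannot detect a failed extraction in that mode.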
FILES
Those files describe text extraction module information.
Those files are text extraction modules.
Those files are text extraction modules implemented in Ruby.