AWS Textract
This task extracts text from PNGs, JPGs, and PDFs and stores it on the repository document in the simflofy_ai_text field.
Configuration
- Authentication Connection: An Amazon authentication connection with your Amazon AWS credentials.
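As a rough illustration of what "extracting text" means here, the sketch below flattens a Textract-style response into a single string and places it in the simflofy_ai_text field. It assumes the response shape of Textract's DetectDocumentText operation (a list of Blocks, where LINE blocks carry a Text value). The sample response and the document dictionary are hypothetical, and how Federation Services actually calls Textract is not described in this section.

```python
from typing import Any, Dict


def text_from_textract_response(response: Dict[str, Any]) -> str:
    """Join the text of every LINE block, in the order the blocks appear."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)


# Hand-written sample shaped like a DetectDocumentText response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 1001"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "WORD", "Text": "1001"},
        {"BlockType": "LINE", "Text": "Total: 42.00"},
    ]
}

# Hypothetical document record; only the field name comes from the text above.
document = {"simflofy_ai_text": text_from_textract_response(sample_response)}
print(document["simflofy_ai_text"])
```

Only LINE blocks are used so that each word is not repeated; WORD blocks hold the same text at a finer granularity.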
Google Vision Text Extraction
Extracts text from .tiff, .pdf and .gif files and stores it on the repository document in the simflofy_ai_texts field.
Configuration
- Authentication Connector: Your authentication connection for Google. You can find it in the URL while editing or viewing the connection's page.
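For multi-page formats such as PDF and TIFF, the Vision API returns text per page. The sketch below collects that text, assuming the JSON shape of the Vision API's file annotation response (nested responses lists, each page carrying fullTextAnnotation.text). The sample data and function name are hypothetical, and whether Federation Services uses this exact endpoint is an assumption.

```python
from typing import Any, Dict, List


def page_texts_from_vision_response(response: Dict[str, Any]) -> List[str]:
    """Collect fullTextAnnotation.text for each page of each annotated file."""
    texts: List[str] = []
    for file_response in response.get("responses", []):
        for page_response in file_response.get("responses", []):
            annotation = page_response.get("fullTextAnnotation", {})
            # A page with no detected text yields an empty string.
            texts.append(annotation.get("text", ""))
    return texts


# Hand-written sample shaped like a file annotation response (not real output).
sample_response = {
    "responses": [
        {
            "responses": [
                {"fullTextAnnotation": {"text": "Page one text"}},
                {"fullTextAnnotation": {"text": "Page two text"}},
                {},  # a page where nothing was detected
            ]
        }
    ]
}

# Hypothetical document record; only the field name comes from the text above.
document = {"simflofy_ai_texts": page_texts_from_vision_response(sample_response)}
print(document["simflofy_ai_texts"])
```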
Tesseract Text Extraction
PREREQUISITE: This task requires Tesseract to be installed on the system that Federation Services is running on.
This task uses Tesseract OCR to extract text from images and PDF files, saving that text to a field on the repository document called simflofy_ai_texts. Supported formats are .png, .jpg, .pdf, .tiff, .gif, and .bmp. PDFs are saved on a per-page basis to simflofy_ai_texts.
Note: Tesseract OCR will be an optional dependency of Federation Services.
Configuration
- Tessdata Directory: The path to your Tessdata folder. This folder should have the trained data of the language you plan to OCR.
- Tesseract Library: The path to your Tesseract library folder containing the proper library files for your OS.
- Engine Mode: Select which engine Tesseract should use, legacy or LSTM. Ensure that the engine is installed before selecting it, or leave this on the default setting so that Tesseract detects the available engine.
- Page Segmentation Mode: By default Tesseract expects a page of text. You can change the way it segments a page if your images differ from this.
- Tesseract Language Code: The language code for the installed trained data in your Tessdata directory. This is in ISO 639-2/T format and is the part of the file name before the .traineddata extension (for example, eng for eng.traineddata).
- Use HOCR: Whether to use HOCR. When enabled, text will be output in HTML format rather than as raw text.
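To check these settings outside Federation Services, it can help to run the equivalent Tesseract command by hand. The sketch below maps the options above onto the standard tesseract command-line flags (--tessdata-dir, -l, --oem, --psm, and the hocr config) and lists the language codes found in a Tessdata folder. The function names and example paths are hypothetical, the Tesseract Library setting has no command-line counterpart and is left out, and this is not a description of how the task invokes Tesseract internally.

```python
from pathlib import Path
from typing import List, Optional


def available_languages(tessdata_dir: str) -> List[str]:
    """Return language codes: the file names before the .traineddata extension."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def build_tesseract_command(
    image_path: str,
    tessdata_dir: str,
    language: str = "eng",
    engine_mode: Optional[int] = None,    # --oem: 0 = legacy, 1 = LSTM, 3 = default
    page_seg_mode: Optional[int] = None,  # --psm: 3 = automatic page segmentation
    use_hocr: bool = False,
) -> List[str]:
    """Build an argument list suitable for subprocess.run; nothing is executed here."""
    # "stdout" as the output base makes Tesseract print the result instead of writing a file.
    cmd = ["tesseract", image_path, "stdout", "--tessdata-dir", tessdata_dir, "-l", language]
    if engine_mode is not None:
        cmd += ["--oem", str(engine_mode)]
    if page_seg_mode is not None:
        cmd += ["--psm", str(page_seg_mode)]
    if use_hocr:
        cmd.append("hocr")  # emit hOCR (HTML) rather than plain text
    return cmd


# Example paths are placeholders, not defaults used by Federation Services.
command = build_tesseract_command(
    "scan.png", "/opt/tessdata", language="eng", engine_mode=1, page_seg_mode=6, use_hocr=True
)
print(" ".join(command))
```

Leaving engine_mode and page_seg_mode unset omits the flags entirely, so Tesseract falls back to its own defaults, which mirrors leaving the corresponding options on their default values.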