The Language Classify Engine automates the classification of processed documents by language.
The engine contains predefined, pretrained learnsets for the following languages:
- Danish
- English
- Finnish
- French
- German
- Greek
- Russian
- Spanish
- Swedish
The learnsets for English, French, German, Greek, and Russian are specifically optimized for invoice processing. You can add new languages and adjust the predefined ones.
The Language Classify Engine uses the ASSA content classification technology to distinguish between processed documents written in different languages. The engine imports the data in Unicode format and converts non-western language learnsets to a special phonetic English representation suitable for the core ASSA engine. After this conversion, the engine creates an ASSA classification pool, which it then uses for classification. A pretrained version of this pool is distributed with the BIC setup and is available without any training. It remains adjustable if the customer wants to extend the predefined learnset.
When the next document comes in, the system classifies it using the available ASSA classification pool. During classification, the engine ignores all numeric words found in the document, because they are usually irrelevant to language classification.
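The effect of ignoring numeric words can be illustrated with a minimal Python sketch. It is not the engine's implementation: the function name drop_numeric_words is hypothetical, and the sketch assumes that a "numeric word" is a whitespace-delimited token that contains at least one digit and no letters, which the product documentation does not define.

```python
import re

# Assumption: a "numeric word" is a token with at least one digit and no
# letters, such as 12345, 01.02.2024, or 1.234,56. Punctuation is allowed.
_NUMERIC_WORD = re.compile(r"^[\W_]*\d[\d\W_]*$")


def drop_numeric_words(text: str) -> list[str]:
    """Return the whitespace-delimited tokens of text, minus numeric words.

    Hypothetical helper for illustration only; the real engine's
    tokenization rules may differ.
    """
    return [tok for tok in text.split() if not _NUMERIC_WORD.match(tok)]


if __name__ == "__main__":
    sample = "Rechnung Nr. 12345 vom 01.02.2024 Betrag 1.234,56 EUR"
    print(drop_numeric_words(sample))
```

Under this assumption, invoice numbers, dates, and amounts are removed before classification, while mixed tokens such as A4 are kept because they contain a letter.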
To use the engine for non-western languages, complete the steps in Activate Multi-Byte Encoding for CJKT Languages.