Use the engine for generic training only, that is, for data extraction from unclassified documents. Train the engine with at least 10 documents per extracted field; the ideal number is 50-100 samples per field.
We do not recommend using the Brainware Field Extraction engine for vendor-level training, because the Brainware Extraction engine achieves perfect extraction results with just a few documents (1-5) for a particular high-volume vendor class.
Even though the Brainware Field Extraction engine is integrated in the Supervised Learning Workflow (SLW), for the reasons mentioned above we recommend using the Brainware Extraction engine for automatic SLW vendor-level training. To do this, assign the Brainware Extraction engine to the header fields of the SLW root classes, and use a dedicated class for generic extraction through the Brainware Field Extraction engine. For example, consider the following class hierarchy:
Invoices
- Generic
- VendorClass1
- ...
- VendorClassN
The Invoices class is the root template class for SLW training, while the Generic class defines generic extraction through the Brainware Field Extraction (BFE) engine. Set the standard classification result for Invoices to Generic, so that a document that has not been classified to one of the available vendor classes with precise layout extraction (VendorClassX) goes to the Generic class, where the generic extraction pattern is defined. After the generic processing has been applied, set the classification result back to Invoices if further SLW processing is to be maintained. To achieve this, place the following script code into the PostExtract event handler of the Generic document class.
Example: Reset the classification result
' Cedar Document Class Script for Class "Generic"
Private Sub Document_PostExtract(pWorkdoc As SCBCdrPROJLib.SCBCdrWorkdoc)
    ' Hand the document back to the SLW root class after generic extraction
    pWorkdoc.DocClassName = "Invoices"
End Sub
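The routing described above can also be modeled outside the product to check the intended behavior. The following is a minimal sketch in Python, not Cedar script: the function names classify, post_extract, and route are hypothetical and are not part of the scripting API; only the class names Invoices, Generic, and VendorClass1 come from the example hierarchy.

```python
# Illustrative model only; none of these functions exist in the Cedar scripting API.
ROOT_CLASS = "Invoices"
GENERIC_CLASS = "Generic"


def classify(vendor_match):
    """Return the matched vendor class, or the default result (Generic) if none matched."""
    return vendor_match if vendor_match else GENERIC_CLASS


def post_extract(doc_class):
    """Mirror the PostExtract handler of the Generic class: reset the class to the root."""
    return ROOT_CLASS if doc_class == GENERIC_CLASS else doc_class


def route(vendor_match):
    """Return (class used for extraction, class name after post-processing)."""
    extraction_class = classify(vendor_match)
    return extraction_class, post_extract(extraction_class)
```

In this model, a document without a vendor match is extracted in the Generic class and then returned to Invoices, while a document matched to VendorClass1 keeps its vendor class throughout.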
In some project configurations, vendor classification succeeds but does not deliver results for some of the selected fields. In that case, it may make sense to take advantage of the Retain previous extraction results option and define double extraction processing that combines generic extraction with vendor-level extraction.