To begin configuring a text extraction column for Grouped Line Item Extraction, ensure that the Text Extraction tab is selected in the Modify Line Item Column dialog box. Then follow the procedures below.
-
Use the Keyword Type drop-down list to select the Keyword Type that values in the selected column are assigned to as Keyword Values.
To ignore the data in the selected column or capture it as XML data only, select <None> from the Keyword Type drop-down list.
Note: You can configure the Advanced Capture engine to split each value in the selected column into two separate Keyword Values. To do so, specify a regular expression rule for the extracted text, set two capture groups within the rule, and assign a Keyword Type to each capture group. This is useful when two different types of data are listed on a document as one value (e.g., a transcript course subject and course number need to be captured as two separate values, but they appear together on a document as ENG101). If you are configuring the selected column for two Keyword Values, the value extracted from the first capture group is assigned to your Keyword Type selection and the value extracted from the second capture group is assigned to your Second Capture Group Keyword Type selection. If you are configuring the selected column for just one Keyword Value, only the value extracted from the first capture group is assigned to your Keyword Type selection, even if the regular expression rule contains multiple capture groups; the values extracted from subsequent capture groups are discarded.
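The two-capture-group behavior described in this note can be sketched as follows. This is an illustration only: it uses Python's re module rather than the product's engine, and the pattern, the function name split_value, and the second_keyword_configured flag are hypothetical stand-ins for the Expression to Match field and the Second Capture Group Keyword Type selection.

```python
import re

# Hypothetical rule: letters for the course subject, digits for the course number.
COURSE_RULE = re.compile(r"([A-Za-z]+)(\d+)")

def split_value(text, second_keyword_configured=True):
    """Return the Keyword Values produced from one column value (sketch)."""
    match = COURSE_RULE.fullmatch(text)
    if match is None:
        return []  # text does not match the rule, so nothing is captured
    first, second = match.group(1), match.group(2)
    if second_keyword_configured:
        # First group -> Keyword Type, second group -> Second Capture Group Keyword Type.
        return [first, second]
    # Only one Keyword Value configured: later capture groups are discarded.
    return [first]

print(split_value("ENG101"))                                   # ['ENG', '101']
print(split_value("ENG101", second_keyword_configured=False))  # ['ENG']
```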
-
If the form is configured to create an XML rendition of documents matched to it (i.e., the Create XML data rendition option is selected for the form), the XML Node drop-down list is displayed next to the corresponding Keyword Type drop-down list.
Use the XML Node drop-down list to select the XML node (i.e., the element) that Keyword Values extracted from this column are contained in when the XML rendition of the document is created.
-
If you selected a Date, Date & Time, or Currency Keyword Type for the selected column, the Override Default Regional Setting drop-down list is displayed below this selected Keyword Type. If you have not selected one of these Keyword Types, the Override Default Regional Setting drop-down list is disabled.
When available, you can use the Override Default Regional Setting option to override your system's default regional settings when extracting values for the applicable Keyword Types in the Data Field Zone. To use this option, do one of the following:
-
Select a regional language from the drop-down list to parse the applicable Keyword Values using the selected region's formatting rules and language-specific names for days, months, etc.
-
Select <Runtime System Default> to maintain your system's default regional settings when parsing these values.
Note: This option does not affect which languages the OCR format is configured to recognize. For information on configuring OCR formats, see the Full-Page OCR module reference guide or help files.
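To see why the regional setting matters, consider how the same extracted text can parse to different dates. The sketch below is not the product's parser; the region names and the DATE_FORMATS mapping are assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical mapping from a regional setting to its short-date format.
DATE_FORMATS = {
    "English (United States)": "%m/%d/%Y",   # month first
    "English (United Kingdom)": "%d/%m/%Y",  # day first
}

def parse_date(text, region):
    """Parse extracted text using the formatting rules of the given region."""
    return datetime.strptime(text, DATE_FORMATS[region]).date()

# The same text yields March 4 under one setting and April 3 under the other.
print(parse_date("03/04/2021", "English (United States)"))
print(parse_date("03/04/2021", "English (United Kingdom)"))
```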
-
If you would like to assign specific colors to the selected Keyword Type, click the Colors button. The Display Colors dialog box is displayed.
Here you can change the colors in which any regular or suspect values for the corresponding Keyword Type are displayed in the Indexing panel once processing has taken place.
-
To change the display color for regular values, click Display Color to open your machine's color palette, select a color, and click OK.
-
To change the display color for suspect values, click Suspect Color to open your machine's color palette, select a color, and click OK.
-
To revert to the default display color for regular or suspect values, click the Automatic button that corresponds to the desired type of values (i.e., the left button for regular values, the right button for suspect values).
Note: Any colors assigned here can be overridden by colors assigned through Keyword Lookup/Replace settings and/or VB scripting.
-
If you are configuring the selected column for two Keyword Values, use the Second Capture Group Keyword Type drop-down list to select the second Keyword Type that values in the selected column are assigned to as Keyword Values.
If you are configuring the selected column for just one Keyword Value, or if you wish to capture a second Keyword Value as XML data only, select <None> from the Second Capture Group Keyword Type drop-down list.
Note: If you are configuring the selected column for just one Keyword Value and you have set capture groups within the regular expression rule, be aware that only the value extracted from the first capture group is assigned to the Keyword Type you selected in the Keyword Type drop-down list.
-
If you wish to specify a regular expression rule for the extracted text, enter the rule in the Expression to Match field. The OCR engine will compare the extracted text to the defined regular expression rule; if the text is a match, the value is stored as a Keyword Value. If the text does not match, it is discarded.
For example, if you specify the following regular expression rule, any value containing a character that is not a letter, number, or space is discarded: [[:upper:][:lower:][:digit:][:space:]]+
Note: To access the Regular Expression Library, click in the field and press F2. See The Regular Expression Library for more information.
Note: All regular expressions must be ECMA compliant.
Note: If you are configuring the selected column for two Keyword Values, you must specify a regular expression rule for the extracted text and set two capture groups within the rule to align to these two values.
Tip: Regular expressions can be used to discard column data that consists entirely of unwanted characters (e.g., 123***456, where you want to capture 123 and 456 as separate Keyword Values and discard the asterisk separator). However, if the unwanted characters are located in the middle of valid data (e.g., 123***456, where you want to capture 123456 as one Keyword Value), you can configure a Keyword Lookup/Replace dictionary entry to replace the data.
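As a rough illustration of the matching and discarding behavior described in this step, the sketch below uses Python's re module, which is not the ECMA dialect the product requires and does not support POSIX classes such as [[:upper:]]; the ASCII character class shown is an approximate stand-in. It also assumes the whole value must match the rule, which is how the example in this step reads. The function names are hypothetical.

```python
import re

# Approximate ASCII equivalent of [[:upper:][:lower:][:digit:][:space:]]+
ALLOWED = re.compile(r"[A-Za-z0-9\s]+")

# Hypothetical rule for the Tip: two digit runs separated by asterisks.
SEPARATED = re.compile(r"(\d+)\*+(\d+)")

def keep_value(text):
    """Return the text if it matches the rule; otherwise discard it (None)."""
    return text if ALLOWED.fullmatch(text) else None

def split_on_asterisks(text):
    """Capture the digits on either side of the asterisk separator."""
    match = SEPARATED.fullmatch(text)
    return list(match.groups()) if match else []

print(keep_value("PSY 103"))            # PSY 103
print(keep_value("123***456"))          # None (asterisks are not allowed)
print(split_on_asterisks("123***456"))  # ['123', '456']
```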
-
If you entered a regular expression rule containing a capture group in the previous step, and you would like to have Keyword Values extracted for only this capture group, select the Extract capture group only check box. If you would like to have Keyword Values extracted for the entire regular expression (including the capture group), deselect the Extract capture group only check box.
Note: This option is not available if the Second Capture Group Keyword Type drop-down list is set to anything other than <None>, or if the Value must be NULL text check box is selected.
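The difference between extracting the entire match and extracting only the capture group can be sketched as follows. The rule, the sample value, and the capture_group_only parameter are hypothetical, and Python's re module stands in for the product's engine.

```python
import re

def extract(text, rule, capture_group_only):
    """Return the capture group or the whole match, depending on the option."""
    match = re.fullmatch(rule, text)
    if match is None:
        return None  # no match: the value is discarded
    # group(1) is the first capture group; group(0) is the entire match.
    return match.group(1) if capture_group_only else match.group(0)

rule = r"INV-(\d+)"  # hypothetical rule with one capture group
print(extract("INV-20431", rule, capture_group_only=True))   # 20431
print(extract("INV-20431", rule, capture_group_only=False))  # INV-20431
```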
- If you wish to set a different Suspect Level for the selected Keyword Type(s) than the parent Suspect Level set for the entire Line Item Extraction Zone, enter this value (0 to 99) in the Suspect Level field. By default, this value is set to 0, which retains the parent Suspect Level.
- If you would like to have Keyword Values extracted from the row only if no value exists in the specified column, select the Value must be NULL text check box. When this option is selected, the Column Data Required check box is selected automatically and the Expression to Match field is disabled.
- If a Keyword Value is required, select the Column Data Required check box. If a Keyword Value that is marked as required is missing, the data from the entire row is discarded.
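One way to read the Suspect Level, Value must be NULL text, and Column Data Required options is sketched below. The ColumnRule class, the function names, and the dictionary-based row are assumptions made for illustration; they are not part of the product.

```python
from dataclasses import dataclass

@dataclass
class ColumnRule:
    required: bool = False      # Column Data Required
    must_be_null: bool = False  # Value must be NULL text
    suspect_level: int = 0      # 0 retains the parent Suspect Level

def effective_suspect_level(rule, parent_level):
    """A column value of 0 falls back to the zone's parent Suspect Level."""
    return parent_level if rule.suspect_level == 0 else rule.suspect_level

def keep_row(row, rules):
    """Decide whether a row's data is kept, given per-column rules."""
    for name, rule in rules.items():
        value = (row.get(name) or "").strip()
        if rule.must_be_null:
            if value:          # a value exists where none is allowed
                return False
        elif rule.required and not value:
            return False       # required value missing: discard the whole row
    return True

rules = {"course": ColumnRule(required=True), "note": ColumnRule(must_be_null=True)}
print(keep_row({"course": "ENG101", "note": ""}, rules))   # True
print(keep_row({"course": "", "note": ""}, rules))         # False
print(keep_row({"course": "ENG101", "note": "x"}, rules))  # False
```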
-
If you wish to restrict the Keyword Values that are extracted from the column to only those characters fully contained within the column's boundaries, select the Enforce absolute column break on right check box.
When this option is not selected, some carryover is allowed when characters on either side of the column's right boundary are very close together, assuming that these characters belong to the same Keyword Value. The characters just outside the right boundary will be included in the extracted value.
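The carryover behavior can be pictured with a simplified model. The character boxes, the max_gap tolerance, and the function below are all hypothetical; the product's actual proximity rule is not documented here, so treat this purely as a sketch of the idea.

```python
def extract_column(chars, right_boundary, enforce_break, max_gap=2.0):
    """chars: list of (character, left_x, right_x), sorted left to right."""
    kept = []
    prev_right = None
    for ch, left, right in chars:
        if right <= right_boundary:
            kept.append(ch)            # fully inside the column
            prev_right = right
        elif enforce_break:
            break                      # absolute break: ignore anything outside
        elif prev_right is not None and left - prev_right <= max_gap:
            kept.append(ch)            # close enough to carry over
            prev_right = right
        else:
            break                      # too far away to belong to this value
    return "".join(kept)

# "101A" printed so that the trailing "A" falls just past the right boundary.
chars = [("1", 0, 5), ("0", 6, 11), ("1", 12, 17), ("A", 18, 23), ("X", 40, 45)]
print(extract_column(chars, right_boundary=17.5, enforce_break=True))   # 101
print(extract_column(chars, right_boundary=17.5, enforce_break=False))  # 101A
```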
-
If you wish to remove all whitespace (i.e., spaces, paragraph returns, etc.) from a Keyword Value after it is read by the OCR engine, select the Remove all whitespace from result check box.
For example, if the Keyword Value read by the OCR engine is PSY 103 and this option is selected, the Keyword Value is modified to PSY103.
Note: If you specified a regular expression rule for the extracted text in the Expression to Match field, the OCR engine will compare the extracted text without any whitespace to the defined regular expression rule.
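Because whitespace is removed before the comparison, a rule that expects a space will no longer match. The sketch below illustrates this ordering with Python's re module; the function names and sample rules are hypothetical.

```python
import re

def remove_whitespace(text):
    """Strip spaces, tabs, and line breaks from the value read by OCR."""
    return re.sub(r"\s+", "", text)

def extract(text, rule):
    """Remove whitespace first, then compare the result to the rule."""
    cleaned = remove_whitespace(text)
    return cleaned if re.fullmatch(rule, cleaned) else None

print(extract("PSY 103", r"[A-Z]{3}\d{3}"))   # PSY103
print(extract("PSY 103", r"[A-Z]{3} \d{3}"))  # None (rule expects a space)
```

When this option is selected, write the rule against the whitespace-free form of the value.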