Commonly Used - Alfresco Federation Services 3.2 - 2025-03-04

Alfresco Federation Services

Platform
Alfresco
Product
Alfresco Federation Services
Release
3.2
License

Duplication Detection

This task checks the chosen field to find duplicate documents during the job run and takes the selected action against a document if a duplicate exists. To see how each repository handles duplicates and versioning, see the individual connector page.

When the duplication check is run multiple times, the original file will not be marked as a duplicate.

Configuration:

Field to Compare: The field whose value will be used to check for duplicates. If this value is found in any other document, the document will be considered a duplicate. The default is File Content Hash.

  • File Content Hash
  • Document Type
  • Document Source Id
  • Document URI
  • Version ID
  • Version Series Id

If you wish to compare file hashes (a sort of fingerprint for a document), you will need to precede this task with a Hash Value Generator Task (see File).

Duplication Check Scope:

  • Run Job: Check the documents associated with this job run only
  • Job: Check all documents ever run for that job
  • Enterprise: Check all documents ever processed through 3Sixty

Action: What to do if a document is found.

  • Audit and continue
  • Skip the document
  • Fail the job
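The three actions can be sketched as a simple seen-value check. The following is an illustrative Python sketch, not 3Sixty's implementation; the document dictionaries, field names, and string action values are assumptions made for the example:

```python
import hashlib

def file_content_hash(path: str) -> str:
    """Compute a SHA-256 hex digest of a file's content
    (one possible form of 'File Content Hash')."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def check_duplicates(docs, field="hash", action="audit"):
    """Flag documents whose chosen field value was already seen.

    action: 'audit' tags the document and continues, 'skip' drops it,
    'fail' raises and stops the run (mirroring the three Actions above).
    """
    seen = {}      # field value -> id of the first (original) document
    results = []
    for doc in docs:
        value = doc[field]
        if value in seen:
            if action == "fail":
                raise RuntimeError(
                    f"Duplicate found: {doc['id']} matches {seen[value]}")
            if action == "skip":
                continue
            doc = {**doc, "isDuplicate": "true", "baseParentID": seen[value]}
        else:
            seen[value] = doc["id"]
            doc = {**doc, "isDuplicate": "false"}
        results.append(doc)
    return results
```

Note that isDuplicate is written as the string "true"/"false" rather than a boolean, matching the String target-type requirement described under Tagging Duplicate Documents below.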

Tagging Duplicate Documents

Metadata can be added through mapping to tag documents discovered as duplicates when the Action field is set to Audit and Continue.

The fields that can be added are:

  • isDuplicate: true (if a duplicate is found) or false (if not). Important: when mapping the isDuplicate field, set the Target type to String, not Boolean; otherwise the job fails with an error that text cannot be changed to boolean. If this error occurs, drop the index and run the job again.
  • baseParentID: doc ID of the original document
  • duplicationParentID: comma separated list of doc IDs of the documents that it found duplicates against - blank if no duplicate detected
  • duplicationScope: the scope the check ran under (Run Job, Job, or Enterprise) - blank if no duplicate detected
  • duplicationCriteria: the field the duplicate was matched against, as selected in Field to Compare above - blank if no duplicate detected

Filename Cleanse

This task uses regex (Regular Expressions) to alter filenames.

Common use cases are clearing unwanted characters, such as whitespace, or non-alphanumeric characters.

  • Regex to Match: The regex pattern to search for in the filename.

The default value matches any character that isn't a letter, number, space, or period, an unlimited number of times.

  • Replacement: Replacement for matches.

EXAMPLE - ALPHANUMERICS

The pattern for alphanumeric characters is [a-zA-Z0-9], or \w if you also wish to include underscores.

To select non-alphanumeric characters, add a caret (^) just inside the opening bracket: [^a-zA-Z0-9].

Inside a character class the caret simply translates to "not": it negates the set that follows it. (Outside a character class, ^ instead anchors the match to the start of the string.)

EXAMPLE - CLEARING UNWANTED SPACES

The pattern \s is regex shorthand for whitespace, which already covers spaces, tabs, line breaks, and so on. To match whole runs of whitespace in a single match, add a plus (+) after the pattern: \s+.

Set \s+ as your regex and set the replacement to '' (an empty string) to strip all whitespace from the filename.
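Both example patterns can be tried out directly; here is a quick illustration using Python's re module (3Sixty itself runs on Java, but these particular patterns behave identically in both regex dialects; the sample filename is invented):

```python
import re

filename = "Q1 report (final)\t v2..pdf"

# Strip everything that is not a letter, digit, space or period
# (note the caret INSIDE the brackets, negating the class).
cleaned = re.sub(r"[^a-zA-Z0-9 .]", "", filename)
# -> "Q1 report final v2..pdf"

# \s matches spaces, tabs and line breaks; + matches one or more
# of them in a row, so whole runs disappear in one replacement.
no_spaces = re.sub(r"\s+", "", cleaned)
# -> "Q1reportfinalv2..pdf"
```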

Filename Extraction

Extracts the file name from another field using regex (Regular Expressions). It will set the file name to the value that matches the regex.

  • Regex to Match: The regex whose matched value will be set as the file name.
  • Data Field: The repository document field to match the regex on.

EXAMPLE - ALPHANUMERICS

The pattern for alphanumeric characters is [a-zA-Z0-9], or \w if you also wish to include underscores.

To select non-alphanumeric characters, add a caret (^) just inside the opening bracket: [^a-zA-Z0-9].

Inside a character class the caret simply translates to "not": it negates the set that follows it.
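As a rough illustration of the extraction, here is a Python sketch that pulls a file name out of another field; the URI value and the pattern are invented for the example:

```python
import re

# Suppose the configured Data Field holds a full URI and we want
# only the last path segment as the file name.
data_field = "https://example.com/sites/legal/Contract_2024.pdf"

# [^/]+$ matches everything after the final slash.
match = re.search(r"[^/]+$", data_field)
filename = match.group(0) if match else data_field
# -> "Contract_2024.pdf"
```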


Folder Path Cleanse

This task uses Regular Expressions (regex) to alter folder names. It is functionally identical to the Filename Cleanse task, except that it changes the document parent path.

Regex to Match: The regex pattern to search for in the Folder Name.

The default value matches any character that isn't a letter, number, space, or period, an unlimited number of times.

Replacement: Replacement for matches.

Obsolete Detection

The Obsolete Detection task can be used to identify obsolete documents when reading them from the source repository. In a federation view it can flag documents that are obsolete, based on the definition set in the task, by adding metadata, or it can skip or process those documents. You define what obsolete means for your organisation: based on the date created or date updated, content older than the chosen timeframe is considered obsolete.

Note: The Obsolete Detection task uses the system time of the 3Sixty server when calculating date/time scenarios.
Fields
  • Before when will the files be considered obsolete
  • Before
  • Custom Date
  • Action
Metadata
  • isObsolete - Yes/No
  • obsoleteField - Date Created or Date Updated
  • obsoleteBefore - the cut-off date; content dated before this is considered obsolete
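A minimal sketch of the obsolete decision, assuming a simple cut-off date comparison. The output field names mirror the metadata listed above; the function and its arguments are illustrative, not 3Sixty's actual API:

```python
from datetime import datetime, timezone

def tag_obsolete(doc: dict, obsolete_field: str, cutoff: datetime) -> dict:
    """Mark a document obsolete when its chosen date (Date Created or
    Date Updated) falls before the configured cut-off."""
    doc_date = doc[obsolete_field]
    return {
        **doc,
        "isObsolete": "Yes" if doc_date < cutoff else "No",
        "obsoleteField": obsolete_field,
        "obsoleteBefore": cutoff.isoformat(),
    }
```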

Override Filename

The task uses the 3Sixty Expression Language to override the file name of each document. The functionality of this task is identical to the Override Folder Path task.

Configuration
  • Pattern: List the fields you want to use to rename the file
  • Deep Change: If the file has versions, update their names as well.

Override Folder Path

Add the Override Folder Path task from the drop-down on your job.

A sample pattern has been provided for you by default. You can also leverage the 3Sixty expression language when modifying your path. More information on the Expression Language can be found here: Federation Services Expression Language. Click the Done button when you have finished modifying your job task.

The example pattern is:

'/' + '#{rd.filename}' + '/simflofy'

Using the Expression Language, note that fields prefixed with rd. are internal 3Sixty fields.

However, you may want to use metadata from the source system to generate your path.

Use field in place of rd to accomplish this or leave off the prefix entirely.

The best way to put your path together is to know what fields are available and what the field values look like. We suggest running the BFS output with no mappings and Include Un-Mapped Properties set to True. This will generate an XML file such as:

<properties>
  <entry key="document.name">Alfresco Ingestion.pptx</entry>
  <entry key="type">document</entry>
  <entry key="folderpath">test3Sixty_Partners</entry>
  <entry key="separator">,</entry>
  <entry key="document.Culture">en-US</entry>
  <entry key="document.CustomerId">123</entry>
  <entry key="document.Category">legal</entry>
  <entry key="document.lastindex">23</entry>
</properties>

Now let’s say we want the actual folder path to be a combination of folderpath + Culture + Category + Customer ID. To do that we just reference each field like:

'/' + '#{rd.path}' + '/' + '#{document.Culture}'+'/' +
'#{document.Category}'+'/' + '#{document.CustomerId}'
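Given the sample properties above, that expression resolves roughly as follows. This is a plain-Python imitation of the concatenation, not the actual expression-language evaluator, and it assumes rd.path holds the source folder path:

```python
# Resolved values for the fields referenced in the expression.
props = {
    "rd.path": "test3Sixty_Partners",
    "document.Culture": "en-US",
    "document.Category": "legal",
    "document.CustomerId": "123",
}

# '/' + '#{rd.path}' + '/' + '#{document.Culture}' + ... becomes:
folder_path = ("/" + props["rd.path"]
               + "/" + props["document.Culture"]
               + "/" + props["document.Category"]
               + "/" + props["document.CustomerId"])
# -> "/test3Sixty_Partners/en-US/legal/123"
```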

PII Detection

The PII Detection Job Task uses regular expressions to detect PII in any document or metadata passing through 3Sixty. The expressions are stored in a .properties file.

CAUTION: File size limit is 95MB.

PII FLAG

This task will always add the boolean field hasPii for the purposes of mapping and analysis.

DEFAULT FILE LOCATION

The default file is located at 3sixty-admin/WEB-INF/classes/simflofy-pii-detection.properties:

  • Field To Mark: The output metadata property in which to store detected PII. The value of this field will be a map, for example:

    {
      "PhoneNumber": 20,
      "Names": 200
    }
  • Break up PII data into individual fields: Instead of adding the PII as a map, 3Sixty will break it up into individual fields for easier mapping/processing.
  • Prefix for PII fields: If breaking up PII data, the prefix to use for each field. If left blank ‘pii’ will be used.
  • Fields To Check: Source properties and/or document to check for PII. Use ALL_PROPS to check all properties, BINARY to check the document (extracted via Tika) or individual property names.

In this case, the above fields will come across as

pii.phonenumber and pii.names
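Under the hood the detection amounts to counting regex matches per category. A simplified sketch follows; the patterns here are toy examples, not the ones shipped in the properties file, and the function itself is illustrative:

```python
import re

# Toy category -> pattern table; the real task loads these
# from the .properties file described above.
PII_PATTERNS = {
    "PhoneNumber": r"\b\d{3}-\d{3}-\d{4}\b",
    "Email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}

def detect_pii(text: str, prefix: str = "pii") -> dict:
    """Return hasPii plus one count field per category,
    using the broken-up 'prefix.name' field form."""
    fields = {"hasPii": False}
    for name, pattern in PII_PATTERNS.items():
        count = len(re.findall(pattern, text))
        if count:
            fields["hasPii"] = True
        fields[f"{prefix}.{name.lower()}"] = count
    return fields
```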

Rename File On Duplicate File Path

Functions similarly to the Duplication Check task, except if it finds a duplicate, it will rename the file using the supplied pattern.

Tika Text Extractor

Apache Tika is an open-source tool used to extract text from documents. 3Sixty most commonly uses it to extract text during indexing for federated search.

For this feature to work on larger files, a memory pool setting of at least 4 GB is required, with 8 GB recommended. This can be updated in the Java tab of your Apache Tomcat Properties window.

CAUTION: File size limit is 95MB.
  • Tika Content Field: The field where the task will put the extracted content.
  • Max Content Length (B): Set the max content length which is checked before processing. The job will not process documents over this size. Set to 0 to process documents of any length.
  • File Extensions to Extract: Comma-delimited list of file extensions to process, or leave blank to process all. The extensions are checked at the same time as content length.
  • Fail Document on Extraction Error: Fail the document if there is an extraction error during processing.
  • Remove Content After Extraction: Remove the content from the documents. This will happen even if the document exceeds the maximum length.
  • Stage on Filesystem: Stage content on the filesystem for extracting text or set to false to use in memory.
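The size and extension checks run before any extraction. Here is a sketch of that gate as a pure function; it is illustrative only, with parameter names assumed rather than taken from 3Sixty:

```python
def should_extract(size_bytes: int, extension: str,
                   max_length: int = 0,
                   extensions: frozenset = frozenset()) -> bool:
    """Mirror the pre-checks above: reject documents over max_length
    (0 means no size limit) and, when an extension list is configured,
    documents whose extension is not on it (empty set = process all)."""
    if max_length and size_bytes > max_length:
        return False
    if extensions and extension.lower().lstrip(".") not in extensions:
        return False
    return True
```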

Trivial Detection

The Trivial Detection task can be used to identify content that is trivial in nature, meaning it holds no importance from a corporate knowledge perspective. The definition of trivial depends on your organisation: you can filter on documents of certain sizes, or filter out files with certain extensions or document types. For example, dmg and exe files are installers that may hold no corporate importance; to stop such content from being registered as a record, add those extensions to the filtered list.

Configuration

You can add several filters to include documents that meet all the stated criteria.

  • Filter on document’ below a specific size
  • Filter files below size (bytes)
  • Filter on document’s above a specified size
  • Filter files above size (bytes)
  • Filter on file extension
  • Filter on document type

Once the filters are selected you can then determine what action should be taken with the files that meet the selected criteria.

  • Audit and Continue
  • Skip the files
  • Fail the job
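The filters combine as a single predicate: a document is trivial when it meets every configured filter. An illustrative sketch, with document field names assumed for the example:

```python
def is_trivial(doc: dict, size_below=None, size_above=None,
               extensions=None, doc_types=None) -> bool:
    """True when the document meets every configured filter.
    Unset filters (None) are ignored; with no filters configured,
    nothing is considered trivial."""
    checks = []
    if size_below is not None:
        checks.append(doc["size"] < size_below)
    if size_above is not None:
        checks.append(doc["size"] > size_above)
    if extensions is not None:
        checks.append(doc["extension"].lower() in extensions)
    if doc_types is not None:
        checks.append(doc["type"] in doc_types)
    return bool(checks) and all(checks)
```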

Metadata

The trivial detection task also includes the following default set of metadata:

  • isTrivial - Yes/No
  • ignoreSizeBelow - content ignored below size in bytes
  • ignoreSizeAbove - content ignored above size in bytes
  • ignoreExtensions - comma separated list of extensions that were listed in the criteria
  • ignoreDocTypes - comma separated list of doc types that were listed in the criteria