Document Duplication Detection and Reporting - Alfresco Federation Services - 3.2 - 3.2 - Ready - Alfresco - external - Alfresco/Alfresco-Federation-Services/3.2/Alfresco-Federation-Services/Using/Integrations/Advanced-Configurations/Document-Duplication-Detection-and-Reporting - 2025-03-04

Alfresco Federation Services

Platform
Alfresco
Product
Alfresco Federation Services
Release
3.2
License

There may be a need to identify duplicate documents in your enterprise and Federation Services allows you to identify these duplicate documents in a variety of ways.

One way to identify duplicate documents is by using the Duplication Check Job Task which allows you to log, skip or fail documents that are duplicates. This works well for large scale integrations when combining a number of legacy source systems into one new enterprise content management system.

Another way to identify duplicate documents would be by leveraging Federation Services’s Reporting Output Connector. Using the reporting output connector you can read content in from any source system Federation Services supports and report on the content that is found. One of these reports is a hash of each document seen. Using this hash plus MongoDB’s aggregation framework we can generate a CSV or JSON reports of all duplicate records. You can obtain the hash of a document by including the Hash Generator Job Task in your Job Tasks.

After crawling your source system and outputting to the Federation Services Reporting Connector you can now run the following commands against MongoDB. Start my typing mongo in your terminal:

Depending on how many documents you found during the crawl > 100,000, you may need to add docHash index.

db.tsRecordProcessed.createIndex( { docHash: 1 } )

Next we group by docHash and output to a new collection named duplicates (you can name the new output collection anything you like):

db.tsRecordProcessed.aggregate([{$group:{_id:"$docHash",docs:{$push:"$doc_id"},doc_names:{$push:"$doc_name"}}},{$project:{docs:1,doc_names:1,numDocs:{$size:"$docs"}}},{$match:{numDocs:{$gt:1}}},{$out:"duplicates"}])

You can now export de-dupes collection to CSV or JSON using mongoexport:

mongoexport --db simflofy --collection duplicates --fields _id,docs,doc_names
--username user --password "pass" --type=csv --out duplicates.csv