Metadata Extractors and Embedders - Alfresco Content Services 23.4

Platform: Alfresco
Product: Alfresco Content Services
Release: 23.4

Content Services performs metadata extraction on content automatically. However, you may wish to create custom metadata extractors to handle custom file properties and custom content models.

Architecture Information: Software Architecture

Every time a file is uploaded to the repository, the file’s MIME type is automatically detected. Based on the MIME type, a related Metadata Extractor is invoked on the file. It extracts common properties from the file, such as author, and sets the corresponding content model properties. Each Metadata Extractor has a mapping between the properties it can extract and the content model properties.
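
The flow described above can be pictured with a small sketch. This is a conceptual Python model and not the Content Services API: the registry `EXTRACTORS`, the functions `extract_pdf` and `handle_upload`, and the sample values are all hypothetical. The standard library's filename-based `mimetypes.guess_type` stands in for the repository's own MIME type detection.

```python
import mimetypes

# Hypothetical mapping for one extractor:
# raw property name -> content model property.
PDF_MAPPING = {"author": "cm:author", "title": "cm:title"}


def extract_pdf(path):
    """Stand-in for a real extractor; returns canned raw properties."""
    return {"author": "Jane Doe", "title": "Quarterly Report", "producer": "SomeTool"}


# Hypothetical registry: MIME type -> (extractor function, property mapping).
EXTRACTORS = {"application/pdf": (extract_pdf, PDF_MAPPING)}


def handle_upload(path):
    """Detect the MIME type, pick the extractor, and map what it extracts."""
    mime_type, _ = mimetypes.guess_type(path)
    entry = EXTRACTORS.get(mime_type)
    if entry is None:
        return {}  # no extractor registered for this MIME type
    extractor, mapping = entry
    raw = extractor(path)
    # Only properties present in the mapping reach the content model.
    return {mapping[name]: value for name, value in raw.items() if name in mapping}
```

In this sketch, `handle_upload("report.pdf")` yields values for `cm:author` and `cm:title` only; the raw `producer` property is dropped because the mapping does not mention it.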

Metadata extraction is primarily based on the Apache Tika library. This means that whatever file formats Tika can extract metadata from, Content Services can also handle. To give you an idea of what file formats Content Services can extract metadata from, here is a list of the most common formats:

  • PDF
  • MS Office
  • Open Office
  • MP3, MP4, QuickTime
  • JPEG, TIFF, PNG
  • DWG
  • HTML
  • XML
  • Email

The properties that are extracted are limited to the out-of-the-box content model, which is very generic. Here are some examples of extracted property names and the content model properties they map to:

  • author -> cm:author
  • title -> cm:title
  • subject -> cm:description
  • created -> cm:created
  • description -> NOT MAPPED - you could map it in a custom configuration
  • comments -> NOT MAPPED - you could map it in a custom configuration
  • If it is an image file:
      • EXIF metadata -> exif:exif (pixel dimensions, manufacturer, model, software, date-time etc.)
      • Geo metadata -> cm:geographic (longitude & latitude)
  • If it is an audio file -> audio:audio (album, artist, composer, engineer, genre etc.)
  • If it is an email file -> cm:emailed (from, to, subject, sent date)
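
To make the difference between mapped and unmapped properties concrete, the sketch below applies the default mapping listed above and then a custom one that also maps `description` and `comments`. It is an illustrative Python model only: `apply_mapping` is not a Content Services function, and the targets `acme:summary` and `acme:comments` are assumed to belong to a hypothetical custom content model.

```python
# Default mapping, taken from the list above.
DEFAULT_MAPPING = {
    "author": "cm:author",
    "title": "cm:title",
    "subject": "cm:description",
    "created": "cm:created",
}

# Hypothetical custom configuration that also maps the two unmapped properties.
CUSTOM_MAPPING = {
    **DEFAULT_MAPPING,
    "description": "acme:summary",   # assumed custom model property
    "comments": "acme:comments",     # assumed custom model property
}


def apply_mapping(raw, mapping):
    """Keep only raw properties that have a mapping and rename them."""
    return {mapping[name]: value for name, value in raw.items() if name in mapping}


# Sample raw properties, as an extractor might report them.
raw = {"author": "Jane Doe", "subject": "Budget", "description": "Draft for review"}

default_result = apply_mapping(raw, DEFAULT_MAPPING)  # description is dropped
custom_result = apply_mapping(raw, CUSTOM_MAPPING)    # description is kept
```

With the default mapping the `description` value is discarded, while the custom mapping stores it in `acme:summary`.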

Note that even if an extractor can extract system-controlled properties, such as the created date, the extracted values are not used. The created date, creator, modified date, and modifier are always controlled by the Content Services system, unless you are using the Bulk Import tool, in which case the last modified date can be preserved.
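
The rule about system-controlled properties can be sketched as a simple filter. This is a simplified Python model under stated assumptions, not how the repository implements it: `filter_system_properties` is a hypothetical helper, and the `bulk_import` flag only models the statement that the last modified date can be preserved by the Bulk Import tool.

```python
# Properties the system always controls (creator, created, modifier, modified).
SYSTEM_CONTROLLED = {"cm:created", "cm:creator", "cm:modified", "cm:modifier"}


def filter_system_properties(extracted, bulk_import=False):
    """Drop system-controlled properties from a dict of extracted values.

    Assumption: with bulk_import=True only cm:modified may be preserved.
    """
    preserved = {"cm:modified"} if bulk_import else set()
    return {
        name: value
        for name, value in extracted.items()
        if name not in SYSTEM_CONTROLLED or name in preserved
    }


# Sample extracted values (hypothetical).
extracted = {
    "cm:author": "Jane Doe",
    "cm:created": "2020-01-01",
    "cm:modified": "2021-01-01",
}

normal_upload = filter_system_properties(extracted)
bulk_upload = filter_system_properties(extracted, bulk_import=True)
```

Here `normal_upload` keeps only `cm:author`, while `bulk_upload` additionally keeps `cm:modified`.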