Changing Default Property Mappings for XML Metadata Extraction - Alfresco Content Services - 23.4 - 23.4 - Ready - Alfresco - external

Alfresco Content Services

Platform
Alfresco
Product
Alfresco Content Services
Release
23.4
License

Now, what if you would like to extract metadata from an XML file, how would you go about that? This can be achieved with the XmlMetadataExtracter, which in-turn uses the XPathMetadataExtracter to navigate the XML and extract metadata. These extractors are still in the repository, see Metadata Extractors Moved to T-Engines.

Let’s say we had XML files looking like this:

<?xml version="1.0" encoding="UTF-8"?>
<doc id="doc001">
    <project>
        <number>PX001</number>
    </project>
    <securityClassification>Company Confidential</securityClassification>
    <text>
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tincidunt luctus ante, in pulvinar ante rutrum
        quis. Etiam maximus arcu ut metus sollicitudin laoreet. Pellentesque ac purus nec massa euismod iaculis a sed
        sapien. Integer id nisi eu tellus commodo congue. In bibendum dapibus porttitor. Aenean lobortis sodales risus
         ....
    </text>
</doc>

And whenever we upload one we want to have the /doc/@id attribute set as acme:documentId, /doc/project/number set as acme:projectNumber, and /doc/securityClassification set as acme:securityClassification. This will require configuration like this, note these are new bean definitions, no overrides:

<beanid="org.alfresco.tutorial.metadataextracter.xml.AcmeDocXPathMetadataExtracter"class="org.alfresco.repo.content.metadata.xml.XPathMetadataExtracter"parent="baseMetadataExtracter"init-method="init">
  <property name="mappingProperties">
      <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
          <property name="location">
              <value>
                  classpath:alfresco/module/${project.artifactId}/metadataextraction/acme-content-model-mappings.properties
              </value>
          </property>
      </bean>
  </property>
  <property name="xpathMappingProperties">
      <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
          <property name="location">
              <value>
                  classpath:alfresco/module/${project.artifactId}/metadataextraction/acme-xml-doc-xpath-mappings.properties
              </value>
          </property>
      </bean>
  </property>
</bean>

<bean id="org.alfresco.tutorial.metadataextracter.xml.selector.AcmeDocXPathSelector"
    class="org.alfresco.repo.content.selector.XPathContentWorkerSelector"
    init-method="init">
  <property name="workers">
      <map>
          <entry key="/*">
              <ref bean="org.alfresco.tutorial.metadataextracter.xml.AcmeDocXPathMetadataExtracter"/>
          </entry>
      </map>
  </property>
</bean>

<bean id="org.alfresco.tutorial.metadataextracter.xml.AcmeDocXMLMetadataExtracter"
      class="org.alfresco.repo.content.metadata.xml.XmlMetadataExtracter"
      parent="baseMetadataExtracter">
  <property name="overwritePolicy">
      <value>EAGER</value> <!-- Put the extracted metadata into the content model property as long as it is not null -->
  </property>
  <property name="selectors">
      <list>
          <ref bean="org.alfresco.tutorial.metadataextracter.xml.selector.AcmeDocXPathSelector"/>
      </list>
  </property>
</bean>

The acme-content-model-mappings.properties file contains mappings from the extracted XML doc properties to the content model properties:

# Namespaces
namespace.prefix.acme=http://www.acme.org/model/content/1.0

# Mappings - metadata property -> content model property
documentId=acme:documentId
securityClassification=acme:securityClassification
projectNumber=acme:projectNumber

The property mapping can always be done in .properties files if we like, and we could have used a .properties file for the PDFBoxMetadataExtracter too. The other properties file called acme-xml-doc-xpath-mappings.properties contains the XPath expression configuration for where to find the metadata in the XML file:

# XPath Mappings - metadata property -> XML Document XPATH
documentId=/doc/@id
securityClassification=/doc/securityClassification
projectNumber=/doc/project/number