Enabling search on additional unstructured content types

You can enable searching on additional unstructured content types so that custom attachment data can be processed by search and returned in store search results.

Important: By default, HCL Commerce Search indexes only unstructured data that is not encrypted. Processing encrypted data with HCL Commerce Search is not supported.

Before you begin

Ensure that the following prerequisite is met:
  • Your database contains customized content types.

Procedure

  1. Create a parser for the new file type.

    HCL Commerce supports extra parsers to enable searching on more file types.

    1. Prepare for the extension.

      Before you implement the logic for the new file type, select the MIME types that the new parser must support.

      1. Open the tika-mimetypes.xml file. The file is in the tika-core-0.4.jar file, under org/apache/tika/mime.
      2. Add an entry for the MIME type that you want to support. For example, for media of type application/vnd.rn-realmedia:
        
        <mime-type type="application/vnd.rn-realmedia">
          <magic priority="50">
            <match value=".RMF" type="string" offset="0" />
          </magic>
          <glob pattern="*.rm"/>
        </mime-type>
      3. Find a reader that understands the file format so that it can be parsed successfully.
      4. If the parser must support additional types, select them as well. You need these MIME types when you implement the logic.
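      The magic and glob rules in the MIME type entry can be read as two simple checks: the file starts with the string .RMF at offset 0, or the file name ends in .rm. The following is a minimal sketch of those two checks in plain Java, written only to illustrate what the entry expresses. The class RealMediaSniffer and its methods are hypothetical and are not part of Tika or HCL Commerce; real glob matching might treat letter case differently from this simple suffix check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Tika or HCL Commerce.
final class RealMediaSniffer {
    // Mirrors <match value=".RMF" type="string" offset="0" />.
    private static final byte[] MAGIC = ".RMF".getBytes(StandardCharsets.US_ASCII);
    private static final int OFFSET = 0;

    // Returns true when the bytes at OFFSET equal the magic string.
    static boolean matchesMagic(byte[] head) {
        if (head == null || head.length < OFFSET + MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[OFFSET + i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Mirrors <glob pattern="*.rm"/> as a plain, case-sensitive suffix check.
    static boolean matchesGlob(String fileName) {
        return fileName != null && fileName.endsWith(".rm");
    }
}
```

      For example, RealMediaSniffer.matchesGlob("clip.rm") is true, while a file named clip.rmvb does not match the pattern.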
    2. Implement the extension logic.
      1. Create a class that implements the org.apache.tika.parser.Parser interface, for example com.ibm.commerce.tika.parser.video.VideoParser. Its getSupportedTypes(ParseContext) method must return the set of supported media types.
        For example:
        
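        // Requires java.util.Arrays, Collections, HashSet, and Set,
        // org.apache.tika.mime.MediaType, and org.apache.tika.parser.ParseContext.
        private static final Set<MediaType> SUPPORTED_TYPES =
                Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
                        MediaType.application("vnd.rn-realmedia"))));

        // Advertise the media types that this parser can handle.
        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return SUPPORTED_TYPES;
        }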
        The application media type is given the value vnd.rn-realmedia to match the previously selected MIME type.
      2. Implement the com.ibm.commerce.tika.parser.video.VideoParser.parse(InputStream, ContentHandler, Metadata, ParseContext) method. It must handle the media content, which is passed as the InputStream parameter, and the metadata container of the media, which is passed as the Metadata parameter.
        For example:
        
        // Record the content type and the descriptive metadata for the media.
        metadata.set(Metadata.CONTENT_TYPE, "application/vnd.rn-realmedia");
        metadata.add(Metadata.PUBLISHER, "Publisher");
        metadata.add(Metadata.LANGUAGE, "RM_language");
        metadata.add(Metadata.COMPANY, "IBM Commerce");
        // Emit an empty XHTML document; no body content is extracted.
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.endDocument();
        When this method returns, the metadata contains the extra publisher, language, and company information. However, no content is extracted.
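      The parse example calls both set and add on the metadata container. In Tika's Metadata class, set replaces any values already stored under a name, whereas add appends another value. The following stand-in shows that difference with plain Java collections. The class MetadataSketch is hypothetical and only models the behavior; it is not the Tika class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a multi-valued metadata container; not the Tika class.
final class MetadataSketch {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Replaces every value stored under the name with the single new value.
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    // Appends the value to whatever is already stored under the name.
    void add(String name, String value) {
        values.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }

    // Returns a copy of the stored values, or an empty list for an unknown name.
    List<String> getValues(String name) {
        return new ArrayList<>(values.getOrDefault(name, new ArrayList<>()));
    }
}
```

      Under this model, calling set for the content type keeps a single value no matter how often it is called, while repeated add calls for the publisher accumulate values.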
    3. Assemble the logic and enable HCL Commerce Search to recognize it.
      A service registry file makes the new parser known to the HCL Commerce Search framework.
      1. Create the following file:
        • META-INF/services/org.apache.tika.parser.Parser
      2. Insert the parser's full class name into the file. For example:
        
        com.ibm.commerce.tika.parser.video.VideoParser
        
      3. Export the code and the registry file into a JAR file and save it in the same directory as the tika-parser-version.jar file.
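      The registry file follows the general Java META-INF/services convention: the file is named after the service interface, and each line names an implementing class. The following sketch illustrates that convention with the JDK's java.util.ServiceLoader. It is an assumption that this resembles what happens inside HCL Commerce Search, because the lookup mechanism that the product uses is not described here. The names RegistryDemo, DemoParser, and DemoVideoParser are hypothetical stand-ins.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-ins; none of these names come from Tika or HCL Commerce.
final class RegistryDemo {
    /** Plays the role of the service interface (such as org.apache.tika.parser.Parser). */
    public interface DemoParser {
        String supportedType();
    }

    /** Plays the role of the custom parser; needs a public no-argument constructor. */
    public static final class DemoVideoParser implements DemoParser {
        public DemoVideoParser() { }

        @Override
        public String supportedType() {
            return "application/vnd.rn-realmedia";
        }
    }

    /** Writes a registry file into a temporary directory and loads providers through it. */
    static List<String> discover() {
        List<String> found = new ArrayList<>();
        try {
            Path root = Files.createTempDirectory("registry-demo");
            Path services = Files.createDirectories(root.resolve("META-INF/services"));
            // File name = service interface name; file content = provider class name.
            Files.write(services.resolve(DemoParser.class.getName()),
                    DemoVideoParser.class.getName().getBytes(StandardCharsets.UTF_8));
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] {root.toUri().toURL()}, RegistryDemo.class.getClassLoader())) {
                for (DemoParser parser : ServiceLoader.load(DemoParser.class, loader)) {
                    found.add(parser.supportedType());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }
}
```

      Calling RegistryDemo.discover() returns the type reported by the one registered provider. If the registry file is missing or names the wrong class, the loader finds nothing or fails, which is why the full class name in the file must be exact.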
  2. Confirm the results in HCL Commerce Search.

    HCL Commerce Search automatically finds the proper parser for the file content. For example, if a RealMedia file is included in an extraction request, HCL Commerce Search returns the result from the matching parser. Solr Cell uses the result to compose a new document and sends it to the search server for create and update commands.

    For example, you can check the index content, where the result resembles the following snippet:
    
    content_type:=>application/vnd.rn-realmedia
    tika_company:=>IBM Commerce
    tika_publisher:=>Publisher
    tika_language:=>RM_language
    tika_stream_size:=>614135
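    If you copy such a snippet out of the index, a small helper can confirm that the fields set by the parser are present. The following is a minimal sketch that assumes the name:=>value line layout shown in the snippet. The class IndexSnippet and its methods are hypothetical and are not part of HCL Commerce Search or Solr.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for checking a copied index snippet.
final class IndexSnippet {
    private static final String SEPARATOR = ":=>";

    // Splits each "name:=>value" line into a map entry; other lines are skipped.
    static Map<String, String> parse(String snippet) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : snippet.split("\\R")) {
            int sep = line.indexOf(SEPARATOR);
            if (sep < 0) {
                continue;
            }
            String name = line.substring(0, sep).trim();
            String value = line.substring(sep + SEPARATOR.length()).trim();
            fields.put(name, value);
        }
        return fields;
    }

    // Returns true when the field exists and holds exactly the expected value.
    static boolean hasField(Map<String, String> fields, String name, String expected) {
        return expected.equals(fields.get(name));
    }
}
```

    For example, after parsing the snippet, hasField(fields, "tika_company", "IBM Commerce") indicates whether the company value that the parser set reached the index.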

What to do next

After you enable searching on more unstructured content types by creating a new parser, you can search the storefront to confirm that the search results contain your custom unstructured content types.