Introduced in Feature Pack 2

Enabling search on additional unstructured content types

You can enable searching on additional unstructured content types so that custom attachment data can be processed by search and returned in store search results.

Important: By default, WebSphere Commerce search indexes only unencrypted unstructured data. Processing encrypted data with WebSphere Commerce search is not supported.

Before you begin

Ensure that you have completed the following tasks:

Procedure

  1. Create a new parser for the new file type.

    WebSphere Commerce supports using additional parsers to enable searching on additional file types.

    1. Prepare for the extension.

      Before you implement the logic for the new file type, select the MIME types that the new parser must support.

      1. Open the tika-mimetypes.xml file. The file is located in the tika-core-0.4.jar file, under org/apache/tika/mime.
      2. Select the MIME type that you want to implement. For example, for media of type application/vnd.rn-realmedia:
        
        <mime-type type="application/vnd.rn-realmedia">
          <magic priority="50">
            <match value=".RMF" type="string" offset="0" />
          </magic>
          <glob pattern="*.rm"/>
        </mime-type>
      3. Find a reader that understands the file format so that it can be parsed successfully.
      4. If the parser must support additional types, select those MIME types as well. You need every selected MIME type when you implement the logic.
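      To make the selected definition concrete, the following minimal sketch mirrors its two rules in plain Java: the magic string .RMF at offset 0 and the *.rm glob. The RealMediaSniffer class and its methods are hypothetical names for illustration only. They are not part of Tika or WebSphere Commerce, and Tika performs this detection itself from the tika-mimetypes.xml file.

      ```java
      import java.nio.charset.StandardCharsets;
      import java.util.Arrays;

      /** Hypothetical helper for illustration; not a Tika or WebSphere Commerce class. */
      final class RealMediaSniffer {

          // Magic bytes taken from the <match value=".RMF" type="string" offset="0" /> rule.
          private static final byte[] MAGIC = ".RMF".getBytes(StandardCharsets.US_ASCII);

          /** Returns true when the data starts with ".RMF" at offset 0. */
          static boolean matchesMagic(byte[] header) {
              if (header == null || header.length < MAGIC.length) {
                  return false;
              }
              return Arrays.equals(Arrays.copyOfRange(header, 0, MAGIC.length), MAGIC);
          }

          /** Returns true when the file name matches the *.rm glob pattern. */
          static boolean matchesGlob(String fileName) {
              return fileName != null && fileName.endsWith(".rm");
          }

          public static void main(String[] args) {
              byte[] sample = ".RMFpayload".getBytes(StandardCharsets.US_ASCII);
              System.out.println(matchesMagic(sample));
              System.out.println(matchesGlob("trailer.rm"));
          }
      }
      ```

      A check like this can help you confirm that a sample file really belongs to the MIME type that you selected before you write the parser.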
    2. Implement the extension logic.
      1. Create a class that implements the org.apache.tika.parser.Parser interface, such as com.ibm.commerce.tika.parser.video.VideoParser. Its getSupportedTypes(ParseContext) method must return the set of supported media types.
        For example:
        
        private static final Set<MediaType> SUPPORTED_TYPES =
                Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
                        MediaType.application("vnd.rn-realmedia"))));

        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return SUPPORTED_TYPES;
        }
        The application media type is given the value vnd.rn-realmedia to match the previously selected MIME type.
      2. Implement the com.ibm.commerce.tika.parser.video.VideoParser.parse(InputStream, ContentHandler, Metadata, ParseContext) method. It must handle the content of the media, which is passed as the InputStream parameter, and the metadata container of the media, which is passed as the Metadata parameter.
        For example:
        
        metadata.set(Metadata.CONTENT_TYPE, "application/vnd.rn-realmedia");
        metadata.add(Metadata.PUBLISHER, "Publisher");
        metadata.add(Metadata.LANGUAGE, "RM_language");
        metadata.add(Metadata.COMPANY, "IBM Commerce");
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.endDocument();
        When this method returns, the metadata contains the extra publisher, language, and company information. However, no content is extracted, because the handler only starts and ends the document.
    3. Assemble the logic and enable WebSphere Commerce search to recognize it.
      A service registry file makes the new parser known to the WebSphere Commerce search framework.
      1. Create the following file:
        • META-INF/services/org.apache.tika.parser.Parser
      2. Insert the parser's full class name into the file. For example:
        
        com.ibm.commerce.tika.parser.video.VideoParser
        
      3. Export the code and the registry file into a JAR file and save it in the same directory as the tika-parser-version.jar file.
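      The registry file follows the standard Java service-provider convention: a file under META-INF/services, named after the interface, that lists implementation class names. The following sketch demonstrates that convention with java.util.ServiceLoader and a temporary directory. The DemoParser and DemoVideoParser types are hypothetical stand-ins for org.apache.tika.parser.Parser and the custom parser, and the way that Tika and WebSphere Commerce search read the file internally is an assumption that might differ from this demo.

      ```java
      import java.io.IOException;
      import java.io.UncheckedIOException;
      import java.net.URL;
      import java.net.URLClassLoader;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.util.ArrayList;
      import java.util.List;
      import java.util.ServiceLoader;

      public class ServiceRegistryDemo {

          /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
          public interface DemoParser {
              String supportedType();
          }

          /** Hypothetical stand-in for the custom parser; needs a public no-argument constructor. */
          public static class DemoVideoParser implements DemoParser {
              public DemoVideoParser() { }

              @Override
              public String supportedType() {
                  return "application/vnd.rn-realmedia";
              }
          }

          /** Writes a registry file to a temporary directory and lists the providers found through it. */
          public static List<String> discover() {
              try {
                  Path root = Files.createTempDirectory("registry-demo");
                  Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
                  // The file is named after the interface and contains the implementation class name.
                  Path registry = services.resolve(DemoParser.class.getName());
                  Files.write(registry, DemoVideoParser.class.getName().getBytes(StandardCharsets.UTF_8));

                  List<String> found = new ArrayList<>();
                  try (URLClassLoader loader = new URLClassLoader(
                          new URL[] { root.toUri().toURL() },
                          ServiceRegistryDemo.class.getClassLoader())) {
                      for (DemoParser parser : ServiceLoader.load(DemoParser.class, loader)) {
                          found.add(parser.supportedType());
                      }
                  }
                  return found;
              } catch (IOException e) {
                  throw new UncheckedIOException(e);
              }
          }

          public static void main(String[] args) {
              System.out.println(discover());
          }
      }
      ```

      In the real JAR file, the registry file name is the interface name org.apache.tika.parser.Parser, and its single line is the full class name of your parser, exactly as shown in the previous steps.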
  2. Confirm the results in WebSphere Commerce search.

    WebSphere Commerce search automatically selects the appropriate parser for the file content. For example, if an extraction request contains a RealMedia file, WebSphere Commerce search returns the parser result. Solr Cell then uses that result to compose a new document and sends it to the search server for create and update commands.

    For example, you can check the index content, where the result should resemble the following snippet:
    
    content_type:=>application/vnd.rn-realmedia
    tika_company:=>IBM Commerce
    tika_publisher:=>Publisher
    tika_language:=>RM_language
    tika_stream_size:=>614135

What to do next

After you enable searching on additional unstructured content types by creating a new parser, you can search the storefront to confirm that the search results contain your custom unstructured content types.