Preparing data for Natural Language Processing

Incoming data must be pre-processed to be usable by HCL Commerce Search's Natural Language Processing feature.

HCL Commerce Search uses the Stanford CoreNLP language parser to provide the Query service with multilingual support, full grammatical parsing, and extensibility. The enhancements provided by HCL Commerce Search specifically target the needs of online shoppers, giving greater responsiveness and intelligence to the search system.

The Matchmaker is another important feature of the Natural Language Processor's AI, and data must be prepared for its consumption as well.

During Query processing, textual data is analyzed in the following ways. This process identifies features in the text that the NLP processor can work with at query time.
Tokenization
The process of breaking the text down into smaller units called tokens that can be worked with in various ways. For a complete discussion of the tokenization process, see Tokenization in the Stanford CoreNLP documentation.
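
A minimal sketch of this step with Stanford CoreNLP is shown below; the pipeline configuration is illustrative only and is not the Query service's actual setup.

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import java.util.Properties;

    public class TokenizeExample {
        public static void main(String[] args) {
            // Configure a pipeline that only tokenizes and splits sentences.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            CoreDocument doc = new CoreDocument("white shirt girls under $37");
            pipeline.annotate(doc);

            // Each CoreLabel is one token produced by the tokenizer.
            for (CoreLabel token : doc.tokens()) {
                System.out.println(token.word());
            }
        }
    }
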
Stop word removal
Common words are removed so that unique terms stand out to the processor. For more information, see Dropping common terms: stop words.
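
Stop word removal itself is a simple filtering step. The sketch below uses plain Java and a hypothetical stop-word list; it is not the list that HCL Commerce Search ships.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class StopWordExample {
        // Hypothetical stop-word list, for illustration only.
        private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "for", "under", "of");

        // Keep only the tokens that carry meaning for the search.
        public static List<String> removeStopWords(List<String> tokens) {
            return tokens.stream()
                    .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            System.out.println(removeStopWords(Arrays.asList("shirts", "for", "the", "girls")));  // [shirts, girls]
        }
    }
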
Lemmatization and stemming
Words are reduced to their base form, removing inflections, contractions, and other variations so that different forms of the same word can be matched. See Stemming and lemmatization.
Part-of-speech tagging
Individual words and phrases are categorized by type: noun, verb, preposition, etc. See Parts of Speech.
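
Lemmatization and part-of-speech tagging can be seen together in one Stanford CoreNLP sketch; the annotator list below is illustrative rather than the Query service's actual configuration. Each token is printed with its part-of-speech tag and its lemma.

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import java.util.Properties;

    public class PosLemmaExample {
        public static void main(String[] args) {
            // pos needs the tokenizer and sentence splitter; lemma needs pos.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            CoreDocument doc = new CoreDocument("The girls were looking for white shirts");
            pipeline.annotate(doc);

            for (CoreLabel token : doc.tokens()) {
                // tag() is the part-of-speech tag, lemma() the base form of the word.
                System.out.printf("%s\t%s\t%s%n", token.word(), token.tag(), token.lemma());
            }
        }
    }
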
Named entity recognition (NER)
Identifies people, companies, and products in the text. The Query service constructs a custom NER file, which is a tab-separated list of word and value, where value is the classification given to the word. For example, a search term "white shirt girls" will be broken into three tokens: white/color, shirt/category, and girls/category. "white shirt girls under $37" would add under 37/filter as the fourth token.
You can add your own terms to the custom NER file; for more information, see Adding custom nouns and classifications to NLP Name-Entity-Recognition (NER).
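
As a sketch of the file format (the entries below follow the example above and are illustrative; the actual contents depend on your catalog data), each line of the custom NER file holds a word, a tab character, and its classification:

    white	color
    shirt	category
    girls	category
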
Preparing data for Matchmaker
The Ingest service analyzes incoming data for three features that are relevant to the Matchmaker:
  • Color Matchmaker. Color names that appear as attribute values in the incoming data are indexed with predefined color family names. At query time, a similar analysis is performed against the search phrase to identify the appropriate color families to use for filtering, so that only products of the same color family are returned. For more information about the color families and how they are administered, see Color Matchmaker.
  • Measurement Matchmaker. Whenever a unit of measure is detected in an attribute value at indexing time, its cardinal number is automatically converted into all supported measurement units within the same measurement family (see the conversion sketch after the list of supported units below). At query time, a similar analysis is performed against the search phrase to identify the requested unit of measure, so the filter matches the indexed measurement even when the shopper provides a different unit than the one specified with the product. For more information, see Adding custom configuration to Measurement Matchmaker.
  • Dimension Matchmaker. Similarly to the Measurement Matchmaker, the indexing parser attempts to detect a dimension provided in an attribute value and indexes it into the appropriate length and dimension category. These dimensions can be used for more precise filtering at query time. For more information, see Adding custom configuration to Dimension Matchmaker.
The supported measurement systems are:
  • Length: centimeter, millimeter, nanometer, kilometer, meter, inch, foot, mile, yard
  • Weight: tonne, kilogram, gram, milligram, stone, pound, ounce
  • Time: nanosecond, microsecond, millisecond, second, minute, hour, day, week, month, year
  • Volume: gallon, liter, milliliter
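
To illustrate how a single value can be expanded into every unit of its measurement family, the following is a minimal sketch; the class name and conversion factors are hypothetical and are not part of the HCL Commerce Search code base.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LengthFamilyExample {
        // Hypothetical conversion factors: how many meters one unit represents.
        private static final Map<String, Double> METERS_PER_UNIT = new LinkedHashMap<>();
        static {
            METERS_PER_UNIT.put("nanometer", 1e-9);
            METERS_PER_UNIT.put("millimeter", 0.001);
            METERS_PER_UNIT.put("centimeter", 0.01);
            METERS_PER_UNIT.put("meter", 1.0);
            METERS_PER_UNIT.put("kilometer", 1000.0);
            METERS_PER_UNIT.put("inch", 0.0254);
            METERS_PER_UNIT.put("foot", 0.3048);
            METERS_PER_UNIT.put("yard", 0.9144);
            METERS_PER_UNIT.put("mile", 1609.344);
        }

        // Expand one measurement into every unit of the length family.
        public static Map<String, Double> expand(double value, String unit) {
            double meters = value * METERS_PER_UNIT.get(unit);
            Map<String, Double> result = new LinkedHashMap<>();
            METERS_PER_UNIT.forEach((u, factor) -> result.put(u, meters / factor));
            return result;
        }

        public static void main(String[] args) {
            // A product attribute of "32 inch" becomes comparable in any length unit.
            System.out.println(expand(32, "inch"));
        }
    }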

The Query service initializes Stanford CoreNLP by passing the custom NER file to the CoreNLP object. When a query is made, the search term is passed to SearchNLPSupportProvider, which in turn passes it to the Stanford CoreNLP object. SearchNLPSupportProvider then returns the result.
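
In outline, initializing a CoreNLP pipeline with a tab-separated mapping file and classifying the tokens of a search term can look like the sketch below. The annotators and the regexner.mapping property are standard Stanford CoreNLP usage, the file path is illustrative, and the sketch does not show the Query service's internal wiring or SearchNLPSupportProvider itself.

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import java.util.Properties;

    public class CustomNerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            // regexner tags tokens from a tab-separated mapping file (word pattern, then class).
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner");
            props.setProperty("regexner.mapping", "custom-ner.tab");  // illustrative path
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            CoreDocument query = new CoreDocument("white shirt girls");
            pipeline.annotate(query);

            for (CoreLabel token : query.tokens()) {
                // ner() returns the classification assigned to each token.
                System.out.println(token.word() + "/" + token.ner());
            }
        }
    }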