Crawling WebSphere Commerce site content

You can use the site content crawler utility to crawl WebSphere Commerce site content in starter stores.

Before you begin

  • WebSphere Commerce Developer: Ensure that the test server is started.
  • Ensure that your administrative server is started. For example:
    • If WebSphere Commerce is managed by WebSphere Application Server Deployment Manager (dmgr), start the deployment manager and all node agents. Your cluster can also be started.
    • If WebSphere Commerce is not managed by WebSphere Application Server Deployment Manager (dmgr), start the WebSphere Application Server server1.
  • Important: Ensure that you configure the site content crawler configuration files for your site:
    • droidConfig.xml
    • filters.txt
    For more information, see Site content crawler configuration.
    Note (Apache Derby, WebSphere Commerce Developer): To index site content, you must either set auto index to true in the droidConfig.xml file, or pass the -basePath, -storeId, and -localename parameters to the di-buildindex utility.
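For orientation, the following is a minimal, illustrative sketch of a droidConfig.xml file. The element names and values shown here are assumptions for illustration only, not the product schema; see Site content crawler configuration for the actual file layout.

    <!-- Illustrative sketch only: element names and values are assumptions,
         not the actual droidConfig.xml schema. -->
    <droidConfig>
      <!-- Starting URL for the crawl and the directory for downloaded HTML pages -->
      <seedUrl>http://storeHost/webapp/wcs/stores/servlet/en/aurora</seedUrl>
      <destinationDir>/opt/crawler/output</destinationDir>
      <!-- When auto index is true, the indexer is invoked after the crawl completes -->
      <autoIndex>true</autoIndex>
    </droidConfig>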
Note: For crawling site content in a clustered environment:
  • Run the crawler from a staging environment; do not run it in a production environment. If production content must be crawled, configure the crawler to point at the production site rather than running it directly on the production environment. This approach simplifies the setup: the crawler runs only in a WebSphere Commerce staging environment and updates the index in the repeater.
  • When you manage index configurations in a clustered environment where a deployment manager manages the Solr EAR, each Solr node is a search index subordinate that replicates from the repeater. Each subordinate Solr node has its own local configuration files and search index directories, and the index is synchronized across the entire cluster through Solr replication. That is, the deployment manager manages the Solr EAR, while each local index copy is kept current from the repeater through Solr replication.
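For reference, a subordinate node in standard Apache Solr pulls the index from the repeater through a replication handler in solrconfig.xml, along the lines of the following sketch. The host, port, and core name are placeholders; the actual configuration is generated when you set up WebSphere Commerce Search.

    <!-- Sketch of a subordinate replication handler; host, port, and core name are placeholders -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <!-- Pull index changes from the repeater core -->
        <str name="masterUrl">http://repeaterHost:3737/solr/MC_10001_CatalogEntry_en_US/replication</str>
        <!-- Poll the repeater for a newer index every 60 seconds -->
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>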

Procedure

  1. Complete one of the following tasks:
    • Linux, AIX: Log on as a WebSphere Commerce non-root user.
    • Windows: Log on with a user ID that is a member of the Windows Administration group.
    • IBM i OS: Log on with a user profile that has *SECOFR authority.
  2. Go to the following directory:
    • Linux, AIX, IBM i OS: WC_installdir/bin
    • WebSphere Commerce Developer: WCDE_installdir\bin
  3. Run the crawler utility:
    • Windows: crawler.bat -cfg cfg -instance instance_name [-dbtype dbtype] [-dbname dbname] [-dbhost dbhost] [-dbport dbport] [-dbuser db_user] [-dbuserpwd db_password] [-searchuser searchuser] [-searchuserpwd searchuserpwd]
    • Linux, AIX, IBM i OS: crawler.sh -cfg cfg -instance instance_name [-dbtype dbtype] [-dbname dbname] [-dbhost dbhost] [-dbport dbport] [-dbuser db_user] [-dbuserpwd db_password] [-searchuser searchuser] [-searchuserpwd searchuserpwd]
    • WebSphere Commerce Developer (DB2, Oracle): crawler.bat -cfg cfg -instance instance_name [-dbtype dbtype] [-dbname dbname] [-dbhost dbhost] [-dbport dbport] [-dbuser db_user] [-dbuserpwd db_password] [-searchuser searchuser] [-searchuserpwd searchuserpwd]
    • WebSphere Commerce Developer (Apache Derby): crawler.bat -cfg cfg [-searchuser searchuser] [-searchuserpwd searchuserpwd]
    Where:
    cfg
    The location of the site content crawler configuration file. For example, solrhome/droidConfig.xml
    instance
    The name of the WebSphere Commerce instance with which you are working (for example, demo).
    dbtype
    Optional: The database type. For example, cloudscape, db2, or oracle.
    dbname
    Optional: The name of the database to connect to.
    dbhost
    Optional: The database host to connect to.
    dbport
    Optional: The database port to connect to.
    dbuser
    DB2: Optional: The name of the user that is connecting to the database.
    Oracle: Optional: The user ID that is connecting to the database.
    dbuserpwd
    Optional: The password for the user that is connecting to the database.
    If the dbuser and dbuserpwd values are not specified, the crawler can run successfully, but cannot update the database.
    searchuser
    Optional: The user name for the search server.
    searchuserpwd
    Optional: The password for the search server user.
    Note: If you specify any optional database parameter, such as dbuser, you must also specify the related database parameters, such as dbuserpwd.
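    For example, a hypothetical Windows invocation against a DB2 database might look like the following; the configuration path, instance name, database values, and credentials are all placeholders:

    crawler.bat -cfg solrhome\droidConfig.xml -instance demo -dbtype db2 -dbname mall -dbhost dbhost.example.com -dbport 50000 -dbuser wcsuser -dbuserpwd wcsuserpwd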
  4. Alternatively, you can run the utility by using a URL on the WebSphere Commerce Search server.
    
    http://solrHost:port/solr/crawler?action=actionValue&cfg=pathOfdroidConfig
    
    Where action is the action that the crawler should perform. The possible values are:
    start
    Starts the crawler.
    status
    Shows the crawler status.
    stop
    Stops the crawler.
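    For example, assuming the search server runs at solrHost on port 3737 (both placeholders), you can start a crawl and then check its progress with a tool such as curl. Quote the URL so that the shell does not interpret the ampersand:

    curl "http://solrHost:3737/solr/crawler?action=start&cfg=/path/to/droidConfig.xml"
    curl "http://solrHost:3737/solr/crawler?action=status"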
  5. Ensure that the utility runs successfully.
    Running the utility with all the parameters involves the following tasks:
    • Crawling and downloading the crawled pages in HTML format into the destination directory.
    • Updating the database with the created manifest.txt file.
    • Invoking the indexer.
    The status of each of these tasks is reported separately.
    Depending on the passed parameters, you can check that the utility runs successfully by:
    1. Verifying that the crawled pages are downloaded into the destination directory.
    2. If passing the database information: Verifying that the database has been updated with the correct manifest.txt location.
    3. If setting auto index to true: Verifying that the crawled pages are also indexed.
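    As an illustration, on Linux or AIX you might spot-check the crawl output as follows. The destination directory is a placeholder and must match the directory that is configured in your droidConfig.xml; the manifest.txt location within it is an assumption:

    # List the downloaded HTML pages (placeholder path)
    ls /opt/crawler/output
    # Inspect the generated manifest (location is an assumption; check your crawl output)
    cat /opt/crawler/output/manifest.txt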

What to do next

After you crawl WebSphere Commerce site content, you can verify the changes in the storefront.