Perform cognitive search on static webpages

Follow the instructions in this tutorial to perform cognitive search on static webpages.

In this tutorial, you'll learn how to:

  • Web scrape a static webpage and mine its content and PDF files
  • Test cognitive search

Prerequisites

Before you upload and train documents for cognitive search, make sure you have the following:

  • Access to the Cognitive search feature
  • Static webpages with content and PDF or Word documents

Web scrape a static webpage

To begin, create a new project, navigate to the Settings page, and perform the following steps:

  1. In the Knowledge Mining page, enter the URL of the static webpage and the keywords to include or exclude when mining URLs. For example, if you are looking for Google's cloud products, you can enter the URL "https://cloud.google.com/products", set the level to "1", and enter "Natural Language AI" as an include keyword.

  2. Leave Crawl documents set to its default value, Yes, to include the PDF files available on the static pages.
  3. If you want to crawl second-level pages from the static webpage, select the appropriate Levels option.
    Note: Make sure the number of URLs in the second level does not exceed 200.
  4. Click the Start Crawling button to start mining URLs. The system mines and lists all the URLs from the first level that contain the keywords.
  5. You can either select the required URLs or simply click the Download data button to scrape the web content and PDF files from the mined URLs. In our example, the system downloaded approximately six URLs, without any PDF files, for the static base URL, as shown below.
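
For readers who want an intuition for what the crawl settings in the steps above do, here is a minimal Python sketch of a level-limited URL miner with include and exclude keywords. It is not the product's implementation. The names mine_urls, keep, LinkParser, and the fetch callback are hypothetical. Matching keywords against the URL and link text, and following every discovered link up to the chosen level, are assumptions made for illustration. The max_urls default of 200 simply mirrors the limit mentioned in the note.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def keep(url, text, include, exclude):
    """Assumed rule: reject on any exclude keyword, then require
    at least one include keyword when include keywords are given."""
    haystack = f"{url} {text}".lower()
    if any(k.lower() in haystack for k in exclude):
        return False
    return not include or any(k.lower() in haystack for k in include)


def mine_urls(base_url, fetch, levels=1, include=(), exclude=(), max_urls=200):
    """Breadth-first crawl from base_url, at most `levels` links deep.

    `fetch` is a caller-supplied function mapping a URL to its HTML text,
    so the sketch stays independent of any particular HTTP library.
    """
    found = []
    seen = {base_url}
    queue = deque([(base_url, 0)])
    while queue and len(found) < max_urls:
        page, depth = queue.popleft()
        if depth >= levels:
            continue  # pages at the last level are listed but not expanded
        parser = LinkParser()
        parser.feed(fetch(page))
        for href, text in parser.links:
            url = urldefrag(urljoin(page, href)).url  # absolute, no #fragment
            if url in seen:
                continue
            seen.add(url)
            queue.append((url, depth + 1))
            if keep(url, text, include, exclude):
                found.append(url)
                if len(found) >= max_urls:
                    break
    return found
```

To try it against a live site, pass a fetch function of your own, for example one built on urllib.request.urlopen. With levels=1 only links on the base page are listed, which corresponds to the first-level mining in step 4. With levels=2 the links found on those pages are listed as well.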