Best practices for content scanning

Determine the technologies in use on your application that the content job will scan and refer to the following best practices for each type.

Note: Over time, the configuration gap between a scan created with a Content Scan job (old style) and a scan created with a Job using the template has widened. Creating scans with the Job using the template option is therefore recommended.

Use a test user account

Use a trackable test account so that no real services are ordered and so that the account can be reset if it becomes corrupted. A test account also makes it easier for administrators to clean up the site after the test. Consider the following factors for test accounts:

It should only have access to test records in the database, so that modified records can be restored.
Delete new records created by the test account.
Ignore purchase orders or other transactions from the test account.
If the site has forums, the test account should only access test forums, so real customers will not be able to see the tests during the test phase. For example, seeing a cross-site scripting pop-up window can be frightening.
Use more than one test account if the site uses different privileges for different accounts. Using multiple test accounts will ensure a more comprehensive test of the application.

Starting URLs

Determine whether you want to scan above the directory included in your starting URL. For example, suppose you are given a starting URL of www.example.com/customers/default.aspx and you have selected the "In starting domains only scan links in and below the directory of each starting URL" check box. In this situation, the job cannot scan www.example.com/partners because it is not inside the customers directory. To scan inside the partners directory, you must clear the check box.

The "In starting domains only scan links in and below the directory of each starting URL" check box is selected by default.
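
The directory rule above can be illustrated with a short Python sketch. It is not the product's implementation; the function name in_or_below_start_dir and the comparison logic are assumptions made only to show which URLs fall inside or outside the directory of a starting URL.

```python
from urllib.parse import urlsplit
import posixpath


def _split(url):
    # The examples in this topic omit the scheme, so add one before parsing.
    return urlsplit(url if "://" in url else "http://" + url)


def in_or_below_start_dir(start_url, candidate_url):
    """Return True if candidate_url is in or below the directory of start_url.

    Hypothetical helper: it mimics the check box behavior described above.
    """
    start, cand = _split(start_url), _split(candidate_url)
    if start.hostname != cand.hostname:
        return False
    # "/customers/default.aspx" -> "/customers/"
    start_dir = posixpath.dirname(start.path).rstrip("/") + "/"
    return (cand.path or "/").startswith(start_dir)


start = "www.example.com/customers/default.aspx"
print(in_or_below_start_dir(start, "www.example.com/customers/orders/list.aspx"))
print(in_or_below_start_dir(start, "www.example.com/partners"))
```

With the check box selected, only the first URL would be in scope; clearing it corresponds to skipping this check for the starting domain.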

The Security Test Policy and its Server Group selected on the job's Security page must correspond to the starting URLs. If the URLs or IP addresses that your server group permits you to scan are not in the starting URLs, then they will not be tested for security issues.

It is common practice to limit each scan job to one site. This best practice provides better executive level reporting in the dashboards and usually better reflects the different lines of business or areas of responsibility within an organization. Use the reporting mechanism to aggregate data from different jobs any way you want.

Additional domains

Determine whether there are domains outside your starting URLs whose content you also want to check. If a first scan covers less content than you expected, check the Website Architecture Report to see whether there are additional domains that you can add as internal to the scan. Use the What to Scan page to add additional domains.

Exclude cookies and parameters

Specifying excluded cookies and parameters is an effective way to count only one instance of a page that might otherwise be counted again whenever a specific query string, POST data, or cookie value changes. For example, the URL to a page might have a parameter called "navmenuhide=". The value of the parameter determines whether to hide or display the navigation menu on the page, and it might take the values 0 or 1. If you want to scan only one version of this page, insert "navmenuhide=" as a Parameter and Cookie Exclusion on the Parameters and Cookies page of the job. This method is useful for narrowing the scope of scans to exclude duplicate content where pages exhibit only superficial differences.
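
To see why excluding a parameter collapses duplicates, consider the following Python sketch. It is an illustration only, not how the scan job is implemented; the normalize function and the sample page.aspx URLs are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter named in the example above; treat the set as configurable.
EXCLUDED_PARAMS = {"navmenuhide"}


def normalize(url, excluded=EXCLUDED_PARAMS):
    """Drop excluded query parameters so that superficially different
    URLs compare as equal (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in excluded
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


a = normalize("http://www.example.com/page.aspx?id=7&navmenuhide=0")
b = normalize("http://www.example.com/page.aspx?id=7&navmenuhide=1")
print(a == b)  # both variants reduce to the same URL
```

Both variants reduce to http://www.example.com/page.aspx?id=7, so the page is counted once.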

Static URLs

If the URLs on your site do not change, the scan needs to be aware of this fact so that it can use an alternate means of distinguishing them. Whether the site uses parameters (query string or POST data) or cookies to distinguish different pages, you can identify them as part of the normalization rules for a domain. To configure normalization on a domain, and have all your jobs recognize pages from that domain in the same manner, go to the What to Scan page of the job and click a domain to edit its properties.

JavaScript™ and Flash

When conducting preliminary scans of a site, select the Parse JavaScript™ to discover URLs check box. After you understand the options, determine to what extent JavaScript™ is being used on the site and the complexity of that JavaScript™.

If there are any links in the JavaScript™ and Flash files on the site, ensure that the scan can find them. If it cannot, add them to the Starting URLs, or use XRules to mimic Flash logic. XRules are added on the Advanced Scan Options > XRules page of the job.

Custom error pages

Some websites use custom 404 pages to redirect the browser when it hits an internal broken link. Depending on how these pages have been set up, they might result in a 200-type response from the server, which the scan job interprets as unbroken, or OK. To avoid this misclassification, identify these custom error pages so that the scan job recognizes and reports them as broken links when they are encountered.

Use the General Scan Options > Custom Error Pages page to tell the scan job which pages to consider broken. Custom error pages can be set up from a job or from the Administration tab.

You can quickly determine whether the site uses custom 404 pages by entering an incorrect URL for the site in your browser and checking whether it produces a page with a unique error title or redirects to a unique URL. If it does, add the resulting page title or URL to the Custom Error Pages page.
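
The decision that a custom error page rule encodes can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function is_custom_error_page, the title "Page Not Found", and the path /error.aspx are hypothetical and are not taken from the product.

```python
def is_custom_error_page(status, title, final_url, error_titles, error_url_fragments):
    """Return True if a response should be reported as a broken link.

    status              -- HTTP status code of the final response
    title               -- page title of the final response
    final_url           -- URL after any redirects
    error_titles        -- titles you identified as custom error pages
    error_url_fragments -- URL fragments you identified as custom error pages
    """
    if status >= 400:
        return True  # a genuine error status needs no extra rule
    title_hit = any(t.lower() in title.lower() for t in error_titles)
    url_hit = any(f.lower() in final_url.lower() for f in error_url_fragments)
    return title_hit or url_hit


# A "200 OK" response that is really an error page (hypothetical values).
print(is_custom_error_page(
    200,
    "Page Not Found - Example",
    "http://www.example.com/error.aspx",
    ["Page Not Found"],
    ["/error.aspx"],
))
```

Without the title or URL rule, the 200 status alone would cause the link to be treated as OK.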

Exclusions

Identify any links that can be excluded from the scan, or at least from the preliminary scan, such as links:

that change the password
that disable the account
that delete items, especially if the deletion is irreversible
that offer the same content in a printer friendly format
to nonexistent spacer images, such as blank.gif or spacer.gif, that serve as HTML placeholders. Typically these images do not exist on the web server and they can be excluded to remove clutter from your reports.

Shopping cart functions

Always exclude URL patterns that result from "Add to Cart" type applications. Scanning might unduly strain these applications when multiple threads hit them every second. Perform a manual explore of one "Add to Cart" item to ensure that it's tested.

Use the Exclude Paths and Files page to exclude sections of your site from the scan. For example, exclude the pattern regexp:.*addtocart.* to skip these URLs.
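
Before adding an exclusion pattern, it can help to check which URLs it would match. The following Python sketch tests the expression .*addtocart.* with Python's re module. It assumes that the product's regexp: syntax behaves like a case-sensitive match of the expression against the whole URL, which this topic does not confirm, and the sample URLs are hypothetical.

```python
import re

# The expression that follows the "regexp:" prefix in the exclusion above.
ADD_TO_CART = re.compile(r".*addtocart.*")


def is_excluded(url):
    """Return True if the whole URL matches the exclusion expression."""
    return ADD_TO_CART.fullmatch(url) is not None


for url in (
    "http://www.example.com/shop/addtocart.aspx?item=12",
    "http://www.example.com/shop/cart/view.aspx",
):
    print(url, is_excluded(url))
```

The first URL matches and would be excluded; the second contains "cart" but not "addtocart", so it would still be scanned.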

Calendars

Calendars can put the scan into an infinite loop by scanning every day of every year in the calendar. Use Session ID patterns and exclusions to minimize this occurrence.

Media files

Large media files such as .wmv and .mov can usually be excluded from the scan. If they are not excluded, they can dramatically increase the time it takes to scan your site.

Use the Exclude Paths and Files page to exclude sections of your site from the scan.
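
As a rough illustration of an extension-based exclusion, the following Python sketch flags URLs whose path ends in one of the media extensions named above. The helper is_media_file and the sample URLs are assumptions; the product's own matching rules might differ.

```python
from urllib.parse import urlsplit
import posixpath

# Extensions named in this section; extend the set as needed.
MEDIA_EXTENSIONS = {".wmv", ".mov"}


def is_media_file(url):
    """Return True if the URL path ends with a listed media extension.

    The query string is ignored and the comparison is case-insensitive.
    """
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lower() in MEDIA_EXTENSIONS


print(is_media_file("http://www.example.com/videos/intro.WMV?quality=hd"))
print(is_media_file("http://www.example.com/videos/index.html"))
```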

Sorts on rows or columns in a table

When the content on a page can be sorted, as with a table that has sortable columns, consider excluding the URLs of each sorted page. For example, a report might have two different URLs with the same content; the only thing that has changed is the sorting of the Vulnerability column. If you do not exclude the URLs of each re-sorted page, the scan job scans the same content several times. Use the Exclude Paths and Files page to exclude sections of your site from the scan.

Logins and stepped applications

When attempting to scan past a login page, here are some things to consider:

Does the application require a one-time login? Use the Login Management page to enter a user name and password so the scan can log in for you.
Does the login page redirect to other domains? If so, add the other domains as internal to the job. To add domains as internal, go to the What to Scan page and click Add Domain.
Does the login or entry page to your site or application consist of a series of stepped forms? If so, start the content scan with a recorded login sequence. Use the Login Management page to record a login sequence.
Does the site use session IDs? If so, you might need to add the session IDs as part of the scan job properties. To set up session IDs, go to the Parameters and Cookies page. When configured as parameter and cookie exclusions, they are added to normalize URLs. When configured as Session IDs, they enable the scan to continue through the site without being interrupted by incorrect session values.
Does the site use cookies to log in? Scanning past the login can only work if you set the cookie to automatically log in before you run the scan.
Is there a logout link after the login form? If so, exclude the logout link so that the content scan does not follow that link and log itself out. Most often a logout is in the form of a Logout button. Other common variations are Log Off, Sign Out, and Exit. Some cases have even been reported where links to certain web pages can force the user to be logged out. If it is not obvious where the logout button is, or what conditions would cause you to be logged out of the session, it is best to contact the website developer.
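
When reviewing a site for logout links to exclude, a simple text match on the common labels listed above can serve as a first pass. The following Python sketch is a heuristic only; the function looks_like_logout and the sample links are hypothetical, and it cannot detect pages that end the session without using one of these labels.

```python
import re

# Common labels named above: Logout, Log Off, Sign Out, Exit.
LOGOUT_PATTERN = re.compile(
    r"\b(log[\s_-]*out|log[\s_-]*off|sign[\s_-]*out|exit)\b",
    re.IGNORECASE,
)


def looks_like_logout(link_text, href):
    """Return True if the link text or target resembles a logout action."""
    return bool(LOGOUT_PATTERN.search(link_text) or LOGOUT_PATTERN.search(href))


links = [
    ("Log Off", "/account/end.aspx"),
    ("Sign Out", "/session/close.aspx"),
    ("View orders", "/orders.aspx"),
]
for text, href in links:
    print(text, looks_like_logout(text, href))
```

Links that this check flags are candidates for exclusion; confirm any that are unclear with the website developer.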

Forms

Some forms have attributes whose values change. If these attributes change between scans, then the scan job considers each change to reflect a unique form and reports each changed form, which means the report results are inflated with duplicate forms. To avoid this issue, apply normalization rules to the domain that is being scanned. URLs and forms found by the scan job can be normalized from the What to Scan page by editing the properties of the domain.
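
The effect of normalizing forms can be illustrated with a short Python sketch that compares forms after ignoring attributes whose values change. The representation of a form as a dictionary, the function form_fingerprint, and the choice of "id" as the changing attribute are all assumptions for illustration; the topic does not state which attributes the product's normalization rules compare.

```python
def form_fingerprint(attributes, volatile):
    """Return a comparable identity for a form, ignoring volatile attributes.

    attributes -- mapping of form attribute name to value
    volatile   -- names of attributes whose values change between scans
    """
    return tuple(sorted(
        (name, value) for name, value in attributes.items() if name not in volatile
    ))


# The same form seen in two scans; only the generated id differs (hypothetical).
scan_1 = {"action": "/search.aspx", "method": "post", "id": "form_8412"}
scan_2 = {"action": "/search.aspx", "method": "post", "id": "form_9937"}

print(form_fingerprint(scan_1, set()) == form_fingerprint(scan_2, set()))
print(form_fingerprint(scan_1, {"id"}) == form_fingerprint(scan_2, {"id"}))
```

Without normalization the two forms compare as different and would be reported twice; once the changing attribute is ignored, they compare as the same form.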

Automatically supply the scan with values for your common forms so the scan can continue without interruption. For example, if the scan job encounters a form several times on your website or application, it would need to supply contents to the form each time it is encountered. Use the Automatic Form Fill page to add form values, such as a country or region name, so you do not have to personally interact with each occurrence of the country or region name form.