Troubleshooting agent connections

Work through the following questions and checks to isolate the cause of agent connection problems.

By default, the logs for the agent are in the folder agent_install/var/log.

Starting with V7.0.3, you can download and display agent logs in the user interface. This feature is available only for web agents at V7.0.3 and later. The Request Agent Logs option is displayed on the page for the selected agent. After you click Request Agent Logs, the agent.out file (the default file name) appears under the Available Logs section on the same page.

If the agent isn't displayed in the user interface, or if the agent is reported as offline, complete these actions:
  • Verify that the agent started successfully:
    • Look for error messages in the log. If the agent started successfully, the agent.out file contains a message similar to this one:
      2017-07-18 08:37:47,575 INFO  AgentWorkerThread com.urbancode.air.agent.AgentWorker - Agent started
    • Check the prerequisites for the agent. See Installing agents.
    • If the agent prompt returns right away or the window flashes and closes after you try to start it, check the Java version and Java home. This situation typically happens when Java cannot be found.
    • Verify that the system that hosts the agent has enough free disk space, and that the file system and permissions are properly configured.
    • If the agent is configured as a service on Linux, start the agent from the command line. If you cannot start the agent from the command line, the problem is likely related to the way that the service was configured. See Running agents as services on Linux.
    • Verify that the agent installation properties are correct. See Agent installation properties.
    • If the agent is configured as a service on Windows, complete these actions:
      1. Remove the service and add it again. See Removing and reinstalling agents as Windows services.
      2. Remove the service and try to start the agent from the command line. If it starts, check permissions for the user who is running the service. See step 3 of Installing agents from the command line.
  • If the agent started successfully but the server shows the agent as offline or does not show the agent at all, check these things:
    • If the agent connects through an agent relay, verify that the relay is running.
    • Verify that the system that hosts the server has enough free disk space, and that the file system and permissions are properly configured.
    • Search the user interface for the agent endpoint ID from the file. If an agent with the same name is already in the system, HCL Launch adds the agent to the user interface as agent-<agent id>. The agent might therefore be listed, but under a different name than the one you expect. This can happen if you clone a virtual machine or create a cloud environment without changing the agent name.
    • If the agent is using failover, make sure that failover is configured correctly.
    • Check for licensing issues.
    • Check the network connection between the agent and the server or between the agent and agent relay:
      • Verify that the Web relays are connected to Web agents.
      • Verify that a firewall is not preventing the agent from connecting to the server or agent relay. Agents always initiate connections to servers and agent relays, not the other way around.
      • Check to see that the agent system can ping the HCL Launch server or agent relay.
      • If agents are not connecting correctly on high-availability environments, make sure that the servers have cluster connections to each other.
      • Firewalls might prevent the agent from communicating with the server, particularly if the agent is on a public cloud and trying to connect to an HCL Launch server that is within a firewall. In this case, use an agent relay.
      • If the agent is on a cloud environment and the target cloud environment can connect to the HCL Launch server, log in to the cloud image and check the cloud-init log files in this folder: /var/lib/cloud/instance/.
      • If an agent on a cloud environment times out with a message like "Agent did not come online after 6.0 minutes! Process not executed.", check that the agent package was downloaded correctly. If it was, try lengthening the timeout by increasing the value of the agent_timeout parameter in the blueprint or configuration file.
      • Check for DNS issues that might prevent the agent from resolving the host name of the server or agent relay.
    • Can the built-in admin ID see the agent online? If so, permissions might be incorrectly configured on the agent. Remember that the built-in admin ID bypasses team permission checks.
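
The first check in the list above, looking for the startup message in agent.out, can be scripted. The following is a minimal sketch, not part of the product: it assumes that the startup line keeps the layout of the sample message shown above, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches the sample startup line from the checklist, for example:
# 2017-07-18 08:37:47,575 INFO  AgentWorkerThread com.urbancode.air.agent.AgentWorker - Agent started
# Assumption: other agent versions keep the same "timestamp INFO ... - Agent started" layout.
STARTED = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+INFO\s+.*- Agent started\s*$"
)


def last_agent_start(lines):
    """Return the timestamp text of the most recent 'Agent started' line, or None."""
    found = None
    for line in lines:
        match = STARTED.match(line)
        if match:
            found = match.group(1)
    return found


def check_log(path):
    """Scan a log file on disk; undecodable bytes are replaced rather than raising."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return last_agent_start(handle)
```

For example, calling check_log("agent_install/var/log/agent.out") with agent_install replaced by your installation directory returns the time of the last successful start, or None if the agent never reported that it started.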

If you do not know the location of the agent that cannot connect to the server, go to the agent properties, which include the last known host name and IP address of the agent.

If the agent is reported as connecting, complete these actions:
  When this situation occurs, the problem is an HTTP communication issue between the agent and the server. To resolve it:
    1. Check the external agent URL. In the product, click Settings, and then click System Settings. Make sure that the host name or IP address matches or resolves to the entry in the agents file.
    2. Use Telnet to connect to the HTTP port on the server to confirm that the port is reachable.
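
If Telnet is not installed on the agent system, the same reachability test can be approximated with a short script. This is a sketch that uses only the Python standard library; the host name and port in the usage note are placeholders, not values taken from the product.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return (True, "") if a TCP connection succeeds, otherwise (False, reason)."""
    try:
        # create_connection resolves the host name and opens a TCP socket,
        # which is the same path that a manual Telnet test exercises.
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except socket.gaierror as exc:
        # Name resolution failed: points to a DNS problem, not a firewall.
        return False, f"DNS lookup failed: {exc}"
    except OSError as exc:
        # Refused, timed out, or unreachable: check firewalls and the port number.
        return False, f"connection failed: {exc}"
```

For example, can_connect("launch.example.com", 8080) returns a pair that tells you whether the port answered and, if not, whether the failure was in name resolution or in the connection itself. A successful TCP connection shows only that the port is reachable; it does not validate the HTTP or SSL exchange.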

If the agent is reported as online but you continue to have problems, complete these actions:

  1. Verify that the agent and agent relay are running at least the minimum version, and preferably the recommended version, as indicated in the user interface.
  2. Check the server, agent relay, and agent logs for errors. Look for matching time stamps among the logs.
  3. Enable more detailed logging by inserting these lines in the agent_install/conf/agent/log4j file:
    Check the output for errors.
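
Matching time stamps across logs, as described in step 2, is tedious by hand. The following sketch pairs error entries from two logs that occur close together in time. It assumes that the server and agent relay logs use the same time stamp layout as the agent.out sample line, and that the systems' clocks are roughly synchronized; the function names and the 5-second window are illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the agent.out sample: "2017-07-18 08:37:47,575 INFO  <rest>"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\s+(\w+)\s+(.*)$")


def parse_line(line):
    """Return (timestamp, level, rest), or None for lines without a leading time stamp."""
    match = LINE.match(line.rstrip("\n"))
    if not match:
        return None  # for example, a stack trace continuation line
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    stamp += timedelta(milliseconds=int(match.group(2)))
    return stamp, match.group(3), match.group(4)


def problem_entries(lines, levels=("WARN", "ERROR", "FATAL")):
    """Keep only parsed entries whose log level is in `levels`."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry[1] in levels]


def matching(entries_a, entries_b, window=timedelta(seconds=5)):
    """Pair entries from two logs whose time stamps are within `window` of each other."""
    return [
        (a, b)
        for a in entries_a
        for b in entries_b
        if abs(a[0] - b[0]) <= window
    ]
```

Feed the lines of two log files to problem_entries and pass the results to matching. Each returned pair is a candidate for a single failure that was recorded on both sides of the connection.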

If the agent fails to connect to the server, check whether agent.log shows either of these errors:

  • A clock skew exception - Set the system time on the server and on the computers that agents run on so that the times are the same or within a few minutes of each other. If you use end-to-end encryption, the server and agent clocks must be close together. They do not need to be in the same time zone, but they must agree about the global time to within approximately 5 minutes.
  • An SSL handshake error - A Web agent fails to connect to the server and returns the error com.urbancode.ds.subsys.deploy.agent.comm.netty.server.ServerPublicKeyPinHandler - SSL handshake failed. Configure the load balancer to pass the SSL connection through to the server instead of terminating it at the load balancer. In this configuration, you use TCP mode rather than HTTP mode in both the front-end and back-end configurations. For instance, if the load balancer is HAProxy, it should treat the connection as a stream of information to proxy to a server, rather than use the functions that it has available for HTTP requests. The following example is a front-end configuration on HAProxy:
    frontend localhost
        bind *:80
        bind *:443
        option tcplog
        mode tcp
        default_backend nodes
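
The clock skew rule described in the list above compares absolute time, not local wall-clock time. The following sketch illustrates that distinction. The 5-minute tolerance is the approximate figure from the text, and the function names are hypothetical; how you obtain the two time stamps (for example, from each system's clock) is left open.

```python
from datetime import datetime, timedelta, timezone

# Approximate tolerance stated in the text; the exact product limit is not given here.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(agent_time, server_time):
    """Return the absolute difference between two timezone-aware time stamps."""
    if agent_time.tzinfo is None or server_time.tzinfo is None:
        # Naive time stamps cannot be compared safely across time zones.
        raise ValueError("both time stamps must be timezone-aware")
    return abs(agent_time - server_time)


def within_tolerance(agent_time, server_time, max_skew=MAX_SKEW):
    """True if the two clocks agree about the global time to within max_skew."""
    return clock_skew(agent_time, server_time) <= max_skew
```

A server that reads 12:00 UTC and an agent that reads 07:03 at UTC-5 are only 3 minutes apart, so they pass the check even though their local times differ by several hours.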