Troubleshooting Cloud Pak for Data using Diagnostics Data Viewer

Sanjit Chakraborty
5 min read · Apr 1, 2022

Diagnostics data is information that software and devices capture, manually or automatically, for the purpose of troubleshooting problems. It plays a key role when you’re trying to debug a problem and need to know exactly what happened at the time of failure. Whether you’re troubleshooting a problem or raising a product issue, it always helps to have the diagnostics information at hand.

In this blog I will go over the diagnostics data collection process in Cloud Pak for Data (CPD) and explore the captured data in regard to problem determination.

How to gather diagnostics data?

There are two ways to capture diagnostics data from CPD.

  1. CPD Web Console
    a. Log in to the web console as an administrator (or as any user with the Manage platform health permission).
    b. From the navigation menu, select Support > Diagnostics.
    c. Click New diagnostics job.
    d. Specify the time frame for which you want to gather diagnostic information by clicking the Gather diagnostics from the past menu.
    e. Select services for which you want to gather diagnostic information.
    f. Click Start.
    g. When the job finishes, select it, click the vertical ellipsis icon, and select Download ZIP to download the compressed file of diagnostic information to your computer.
  2. Command Line Interface — cpd-cli diag
    Alternatively, you can use the cpd-cli diag utility to acquire the diagnostics through the command line.
    a. Before you run any cpd-cli diag commands, ensure you download the cpd-cli v10.0.x command line utility for your operating system.
    b. Complete the steps in Creating a profile to use the cpd-cli management commands. Every cpd-cli diag command requires you to specify your own --profile name.
    c. Use the list, delete, and download options of the command line tool to manage diagnostics jobs.

What kind of data gets captured?

Three kinds of information are captured as part of the diagnostics collection: 1) OpenShift platform information, 2) container logs, and 3) custom service diagnostics. Inside the captured diagnostics (ZIP) file, the data is organized into several folders.

Diagnostics collection captures information about the OpenShift resources used by the CPD projects. The healthcheck folder covers different aspects of the system’s functionality, from pods to replica sets, metrics to cron jobs, and so on. Inside the same folder, a portal file called summary.html lets you easily navigate through the health-related data.
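Once the ZIP file is extracted, locating the viewer file is a one-liner. The following is a minimal sketch, not part of the product: the helper name find_summary, the archive name diagnostics.zip, and the target directory cpd-diag are all hypothetical, and the exact folder nesting inside the archive is not assumed.

```shell
# Hypothetical helper: print the path of every summary.html found under
# an extracted diagnostics directory (defaults to the current directory).
find_summary() {
  dir="${1:-.}"
  find "$dir" -type f -name summary.html
}

# Example usage (archive and directory names are placeholders):
#   unzip -q diagnostics.zip -d cpd-diag
#   find_summary cpd-diag
```

Open the printed path in a web browser to start browsing the health check data.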

There may be many services deployed on a CPD cluster, and each service consists of several pods. By default, logs from all pods of the service are collected. When investigating a pod restart, you can find the previous pod log under the CPD container logs folder.

The monitoring and alerting framework captures health check data for each individual service installed on the cluster. This type of diagnostics data is very specific to the containers that run as part of the service pods and may also include health check information for those containers. Since every application service is different, the logs that need to be captured vary from service to service. Currently Db2, Data Virtualization, Watson Knowledge Catalog, and Streams support custom log collection.

Where is the captured data stored?

Whenever you create a diagnostics job, it captures data and stores it in the ibm-nginx pod under the /user-home/serviceability/collectedLogs directory. It’s not common, but once in a while you may be unable to download diagnostics data from the CPD web console. In those situations, you can download the diagnostics data from the ibm-nginx pod. There are multiple replicas of the ibm-nginx pod, but they are identical, so you can use any one of them to retrieve the diagnostics data.

# oc get pods | grep ibm-nginx
ibm-nginx-56d-88tkw     1/1     Running     0          31d
ibm-nginx-56d-rgnb4     1/1     Running     0          14d
# oc rsh ibm-nginx-56d-88tkw ls -l /user-home/serviceability/collectedLogs
total 278516
-rw-r--r--. 1 1000650000 root 285197822 Mar 7 16:04 hksxryb.zip
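To pull the archive from the pod onto your workstation, one option is oc cp. The sketch below only prints the command so that you can review it before running it. The helper name diag_copy_cmd is hypothetical, and it assumes that your current oc project is the CPD project and that the ibm-nginx container image provides tar, which oc cp relies on.

```shell
# Hypothetical helper: print (not run) an oc cp command that copies one
# diagnostics archive from the collectedLogs directory of a pod into the
# current local directory.
diag_copy_cmd() {
  pod="$1"
  zip="$2"
  printf 'oc cp %s:/user-home/serviceability/collectedLogs/%s ./%s\n' \
    "$pod" "$zip" "$zip"
}

# Pod and file names taken from the listing above.
diag_copy_cmd ibm-nginx-56d-88tkw hksxryb.zip
```

If tar is not available in the container, streaming the file with oc exec and cat into a local file is a possible alternative.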

How to view/analyze captured data?

The summary.html file inside the healthcheck folder is the key file for navigating through the OpenShift resources used by the CPD project. It gives a simple but very useful overview of the CPD environment from a web browser. The file is organized into tabs that make it easy to browse OpenShift cluster information, with highlights that help you find the areas that need attention. Every tab in summary.html provides a search facility to narrow down the output.

You can easily find out all services installed on the cluster along with their product version from the CLUSTER INFORMATION tab.

The PODS tab provides a comprehensive list of pods in the CPD environment. It displays all relevant metrics and metadata for each pod. Any warning or error message in a pod is highlighted in yellow or red to help you identify the problem easily. The tab also gives convenient access to the current pod logs and pod description from a single place. The PODS tab can help answer several troubleshooting questions:
- Are there any pods in a down state?
- Which pods are associated with which service?
- Does any pod have a high number of restarts?
- Are all containers up and running?
- Is there any Service Instance associated with the pod?
- What is the memory and CPU utilization of each pod?
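The first few of these questions can also be checked against a live cluster from the command line. The following is a rough sketch, not something the diagnostics viewer provides: the helper name flag_pods and its default restart threshold of 5 are arbitrary choices, the sample pod names are made up, and the input is assumed to use the default column layout of oc get pods (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: read `oc get pods` output on stdin and print pods
# that are not Running/Completed, have unready containers, or have
# restarted more often than a threshold (first argument, default 5).
flag_pods() {
  awk -v max="${1:-5}" '
    NR == 1 && $1 == "NAME" { next }          # skip the header row
    {
      split($2, ready, "/")                   # READY column, e.g. 1/2
      bad = ""
      if ($3 != "Running" && $3 != "Completed") bad = bad " status=" $3
      if ($3 == "Running" && ready[1] != ready[2]) bad = bad " ready=" $2
      if ($4 + 0 > max) bad = bad " restarts=" $4
      if (bad != "") print $1 ":" bad
    }'
}

# Against a live cluster this could be fed with: oc get pods | flag_pods 3
# Self-contained sample input (pod names are made up):
flag_pods 3 <<'EOF'
NAME                  READY   STATUS             RESTARTS   AGE
ibm-nginx-56d-88tkw   1/1     Running            0          31d
example-db-0          0/1     CrashLoopBackOff   12         2d
example-api-7c9       1/2     Running            4          5d
EOF
```

With the sample input, the healthy ibm-nginx pod is skipped and the two example pods are reported with the reason each one was flagged.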

The PERSISTENT VOLUME CLAIMS tab lists all PVCs associated with the CPD environment. The Status column helps identify any problematic PVC.

The SERVICES tab lists all Kubernetes services associated with pods. Information such as namespace and cluster IP is displayed along with the selector key mapping, which is useful for a variety of purposes.

The JOBS tab lists all configured jobs, including metrics for each job and whether it was successful, failed, or active at the time of data collection.

The EVENTS tab provides a comprehensive list of events, which is very useful for identifying a potential problem. It includes all relevant event information, such as number of occurrences, objects involved, time of occurrence, and message details. Entries are color-coded as warnings or errors, which helps you identify the problem easily.

The DEPLOYMENTS tab lists all active deployments. You can use the active, unavailable, and ready replica counts to spot a problem.

Similar to deployments, the REPLICA SETS tab lists all replica sets used in the CPD environment. The available and ready replica counts can be used to spot any problem. You can also use this information to determine which deployment a replica set is associated with.

At this stage, CPD diagnostics data captures mostly platform-specific information, and the summary.html viewer provides an easy way to identify problems with the platform. Overall, the summary.html viewer is interesting from an analysis and troubleshooting point of view. It still needs many more diagnostics capabilities for the individual application services hosted on the CPD platform. More diagnostics options will be introduced in upcoming versions.

Sanjit Chakraborty

Sanjit enjoys building solutions that incorporate business intelligence, predictive and optimization components to solve complex real-world problems.