Best practices for keeping environments from running out of memory
In Cloud Pak for Data, an environment consists of the hardware and software configuration that you use to run tools such as Notebooks, Model Builder, and Data Refinery. Within an environment, you have dedicated resources and flexible options for the compute resources these tools consume. However, despite your best efforts, in certain situations the Cloud Pak for Data cluster can become starved for memory when environments are overloaded. This post covers proactive monitoring of memory usage and some common practices to avoid a cluster hang caused by lack of memory.
Monitoring memory usage
The user interface of Cloud Pak for Data provides different options to monitor resources in the cluster. The list of active environments can help you identify the environments that are consuming high CPU and memory. To view all runtime environments, go to the ‘My Instances’ page from the menu icon. A node in the cluster can run out of resources because of both resource reservation and actual usage; if any node reaches 100 percent usage for either memory or CPU, it can become unstable.
The Admin dashboard gives a high-level overview of the status of your cluster. Select ‘Administer > Admin dashboard’ from the menu to monitor memory along with other resource usage. The dashboard shows the average memory usage across all of the nodes in the cluster, and you can expand the cards to see the specific usage for each node. If you notice memory starvation on any node, take action to free up memory.
Built-in alerts within Cloud Pak for Data can automatically notify you when a node or pod goes down or when a node is at risk of overloading a resource. Among many other alerts, there is one that fires when memory usage on a node goes over 90%. You access the alerts page by selecting ‘Administer > Monitor platform > Alerts’ from the menu. Take action when you see an alert that a node is running out of CPU or memory.
If needed, you can adjust the alert thresholds from the ‘Administer > Configure platform’ menu.
The Support Tool (icp4d_tools.sh) is another utility that checks the health of your cluster. It provides information about the health of pods, nodes, Docker, resources, and other services that Cloud Pak for Data requires. The utility flags user deployments that have been running unusually long, which can hold on to memory and other resources and lead to memory starvation, and it helps you identify any node with insufficient memory.
The support tool is included with your installation. However, updates are available on GitHub (https://github.com/IBM-ICP4D/icp4d-serviceability-cli.git) between releases of Cloud Pak for Data.
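For example, to pick up the latest version between releases, you can clone the repository onto a cluster node (the destination directory is up to you) and run the script from there; consult the repository README for the exact health-check options it accepts:
git clone https://github.com/IBM-ICP4D/icp4d-serviceability-cli.git
cd icp4d-serviceability-cli
./icp4d_tools.sh <options>    # see the repository README for the supported health-check options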
The following is an excerpt of the health check output that shows user deployments running for an unusually long time and the memory status of the nodes.
The kubectl command-line interface can be used to troubleshoot cluster memory starvation further. The “kubectl top node” command shows the resource consumption of each node in the cluster, so you can easily identify any node with high memory usage. The “kubectl describe node” command provides details about resource usage on a node, including memory, and can tell you whether the node has the ‘MemoryPressure’ condition. To drill down further, use the “kubectl top pods --all-namespaces” command to figure out which pods are using the most memory. If you find memory starvation, take action to address it.
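A typical drill-down looks like this (the node name is a placeholder, and the --sort-by flag on kubectl top requires a reasonably recent kubectl; omit it on older clusters):
kubectl top nodes                                            # per-node CPU and memory consumption
kubectl describe node <node-name> | grep -A 10 Conditions    # look for MemoryPressure set to True
kubectl top pods --all-namespaces --sort-by=memory           # identify the heaviest pods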
Existing Solutions
There are different approaches to address the memory starvation challenge, and they can be applied together.
Configure compute resources for runtime environments
On the Environments page within a project, you can define the hardware size and software configuration for the runtime environments that you associate with analytic assets, such as notebooks.
You should reserve memory resources in advance from the Environments page in your project to avoid slow compute performance in your project runtime environments. However, pay attention when reserving resources, because over-allocating memory reduces the memory available to other runtime environments. A runtime environment represents a number of Docker containers that consume compute resources on the cluster. You can define environment settings for specific images, such as RStudio, Jupyter, Zeppelin, SPSS Modeler, and Watson Explorer. Use ‘Projects > Select your project > Environments > Choose the runtime environment’ to adjust the CPU and memory allocated to an environment. In this example, ‘mortgage’ is the project name.
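Each runtime environment ultimately runs as one or more pods in the Cloud Pak for Data namespace (zen by default), so you can also see them from the command line; the name filters below are based on the environment prefixes listed later in this post:
kubectl -n zen get pods | grep -E 'jupyter|rstudio|spss-modeler|zeppelin'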
Stop runtime environments
You should stop all active environments when you no longer need them, to free up resources. Project users with the Admin role can stop any environment in the project. Users added to the project with the Editor role can stop the environments that they started but can’t stop other users’ environments. From the menu, use ‘Administer > Manage instances > Analytics environments > Stop now’ to stop runtime environments and release the associated resources.
Clean up environments
Currently, a limitation in Cloud Pak for Data is that there is no way to automatically release active environments within projects that have been idle for a particular period of time. As a result, cluster performance can degrade over time due to high memory usage. This will be addressed in a future release.
As a workaround, you can create a Kubernetes CronJob to run jobs on a time-based schedule. These automated jobs run like cron tasks on a Linux or UNIX system. The rest of this document helps you set up a Kubernetes CronJob that runs every hour and deletes the following environments (pods) that have been idle for longer than a day, freeing up their resources:
- SPSS-Modeler-Server
- Zeppelin-Server
- DODs-processor-server
- WEX-server
- Shaper-server
- R-Studio-server
- Jupyter-server
- Jupyter-py36-server
- Jupyter-py35-server
To learn more about running automated tasks with Kubernetes CronJobs, check the Kubernetes documentation: https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/
Use the following steps to schedule the environment cleanup job with a Kubernetes CronJob.
1. Log in to the Master#1 node of your cluster
ssh root@<Master#1>
2. Clone the repository on the Master#1 node
git clone https://github.com/IBM-ICP4D/Clean-ENV.git
cd /root/Clean-ENV
3. Log in to ICP so that you can access kubectl and docker. If you can already execute kubectl and docker commands, you can skip this step.
cloudctl login -a https://mycluster.icp:8443 --skip-ssl-validation -u admin -p <password> -n default
4. Build and push the Docker image
bash build.sh
5. Adjust cronjob.yaml according to your requirements.
vi cronjob.yaml
schedule:
By default this job runs every hour; check the Kubernetes documentation if you want a different schedule: https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#schedule
TARGET_DEPLOY:
The array of target deployment prefixes; only environments whose names start with one of these prefixes are terminated. Default value: all of the environments listed above.
- name: TARGET_DEPLOY
  value: "spss-modeler-server, zeppelin-server, dods-processor-server, wex-server, shaper-server, rstudio-server, jupyter-server, jupyter-py36-server, jupyter-py35-server"
KILL_AGE:
Only target environments that have been alive longer than this age (in seconds) are terminated. Default value: 86400 s (1 day).
- name: KILL_AGE
  value: "86400"
TARGET_NAMESPACE:
Only environments in the specified namespace are terminated. Default value: zen.
- name: TARGET_NAMESPACE
  value: "zen"
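For reference, here is a minimal sketch of how these settings typically fit together in cronjob.yaml; the API version, image reference, and container name are illustrative assumptions, and the file shipped in the Clean-ENV repository is the source of truth:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: clean-env
  namespace: zen
spec:
  schedule: "0 * * * *"                # run at the top of every hour
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: clean-env
            image: <your-registry>/clean-env:latest   # image built and pushed by build.sh
            env:
            - name: TARGET_DEPLOY
              value: "spss-modeler-server, zeppelin-server, dods-processor-server, wex-server, shaper-server, rstudio-server, jupyter-server, jupyter-py36-server, jupyter-py35-server"
            - name: KILL_AGE
              value: "86400"
            - name: TARGET_NAMESPACE
              value: "zen"
          restartPolicy: OnFailure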
6. Start cronjob
kubectl apply -f cronjob.yaml
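Once applied, you can verify that the CronJob is registered and watch the jobs it spawns each hour:
kubectl -n zen get cronjob clean-env
kubectl -n zen get jobs --watch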
7. Stop cronjob
kubectl -n zen delete cronjob clean-env
As you can see, memory issues may crop up in your Cloud Pak for Data cluster, but with proper monitoring and proactive preventive measures you can avoid these unpleasant situations.
Credit
1. Thanks to Frank Li and Owolabi Adekoya for helping with the CronJob.