In a previous post, I laid out the process of setting up TensorFlow for CPU-based model training. However, as this graph shows, a GPU is significantly faster at training. That’s why I’m documenting my attempt to install the GPU-based version of TensorFlow on an HPC.
Effectiveness of GPU vs CPU in Model Training
- Start dangerously modifying graphics card drivers straight away.
- Have your boss wisely recommend the merits of using Docker.
- Reluctantly begin researching Docker.
- Gradually realize the merits of using Docker.
Install the necessary prerequisites: Docker and Nvidia-Docker. Nvidia-Docker takes care of setting up the Nvidia host driver environment inside the Docker containers, along with a few other things.
Nvidia-Docker has its own prerequisites: an installed Nvidia driver and CUDA. CUDA is the API that allows the GPU to be used for general-purpose computing.
Before installing new drivers, check to see whether they are already installed.
Nvidia driver check:
$ cat /proc/driver/nvidia/version
or
$ nvidia-smi
CUDA check:
$ cat /usr/local/cuda/version.txt
or
$ nvcc --version
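The checks above can be rolled into one small script. This is only a sketch: the helper name tool_status is my own, and the two version files are simply the paths already mentioned above, which may not exist on every system.

```shell
#!/bin/sh
# Print "found" or "missing" depending on whether a command is on PATH.
# tool_status is a hypothetical helper name, not part of any Nvidia tooling.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

echo "nvidia-smi (driver): $(tool_status nvidia-smi)"
echo "nvcc (CUDA toolkit): $(tool_status nvcc)"

# These files only exist once the driver / CUDA toolkit is installed.
for f in /proc/driver/nvidia/version /usr/local/cuda/version.txt; do
    if [ -r "$f" ]; then
        echo "$f:"
        cat "$f"
    else
        echo "$f: not present"
    fi
done
```

If both tools are reported as found and the version files print, the existing install is probably usable and the purge step can be skipped.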
If these drivers do not work, remove them completely before attempting fresh installs. I used:
sudo apt-get remove --purge 'cuda*'
sudo apt-get remove --purge 'nvidia*'
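Since a purge is hard to undo, it may be worth simulating it first. The sketch below uses apt-get's -s (simulate) flag, which prints what would be removed without changing anything. The function name simulate_purge is hypothetical, and the patterns are quoted so the shell passes them to apt-get unexpanded.

```shell
#!/bin/sh
# Dry-run a purge for one package pattern. Nothing is removed.
# simulate_purge is a hypothetical helper name.
simulate_purge() {
    if command -v apt-get >/dev/null 2>&1; then
        # -s simulates the operation and only reports what would happen.
        apt-get -s remove --purge "$1" \
            || echo "apt-get reported an error for $1 (possibly no matching packages)"
    else
        echo "apt-get not available; skipping $1"
    fi
}

simulate_purge 'cuda*'
simulate_purge 'nvidia*'
```

Once the simulated list looks right, the real commands above can be run with sudo.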
The process of installing these drivers can be found here. Complete steps 2.1-2.4 and then 3.6. For my OS, I followed the recommended specifications from TensorFlow’s own website found here. Rebooting after installing any Nvidia graphics driver is essential, so don’t forget it. Then test that the drivers are working using the commands mentioned above.
The next step is to pull a blank Ubuntu image using the command
docker pull ubuntu
To run the image:
sudo nvidia-docker run -it -p 6006:6006 -v /sharedfolder:/root/sharedfolder ubuntu:latest bash
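To make the flags in that command easier to adapt, here is a sketch that assembles the same command from variables and prints it rather than running it. The variable names and the build_run_cmd helper are my own; the values are copied from the command above.

```shell
#!/bin/sh
# Values copied from the command above; change them to match your setup.
HOST_DIR=/sharedfolder            # folder on the host to share
CONTAINER_DIR=/root/sharedfolder  # where it appears inside the container
PORT=6006                         # TensorBoard's default port
IMAGE=ubuntu:latest

# Print the run command instead of executing it (a dry run).
build_run_cmd() {
    # -it : interactive session with a terminal attached
    # -p  : publish the container port on the same host port
    # -v  : bind-mount the host folder into the container
    echo "nvidia-docker run -it -p ${PORT}:${PORT} -v ${HOST_DIR}:${CONTAINER_DIR} ${IMAGE} bash"
}

build_run_cmd
```

Printing the command first makes it easy to check the port and folder mappings before prefixing it with sudo and running it for real.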
To save the image after any modifications you make, use the docker commit command. Note that images and containers can take up a lot of space. Use du -h to check how much space a directory is using, and df -h to check how much is left on your drive.
Once inside the Docker container, I began to set up the necessary libraries, packages and software:
- Replace the original models folder with the latest one from TensorFlow
- Download the CUDA and cuDNN drivers for the Ubuntu image as well and install them
- Carry out training
- Export/Freeze model into a usable state (Replace #### with the number of the latest model.ckpt file)
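Finding the number to substitute for #### can be scripted. The sketch below assumes the training directory contains files named model.ckpt-<step>.index, as TensorFlow's newer checkpoint format produces; the function name latest_ckpt_step and the default directory name training are hypothetical.

```shell
#!/bin/sh
# Print the highest step number among model.ckpt-<step>.index files in a directory.
# Prints an empty line if none are found. latest_ckpt_step is a hypothetical name.
latest_ckpt_step() {
    best=""
    for f in "$1"/model.ckpt-*.index; do
        [ -e "$f" ] || continue            # glob matched nothing
        n=${f##*/model.ckpt-}              # strip the path and prefix
        n=${n%.index}                      # strip the suffix
        case "$n" in
            ''|*[!0-9]*) continue ;;       # skip anything that is not a plain number
        esac
        if [ -z "$best" ] || [ "$n" -gt "$best" ]; then
            best=$n
        fi
    done
    echo "$best"
}

TRAIN_DIR="${TRAIN_DIR:-training}"         # assumed location of the checkpoints
step=$(latest_ckpt_step "$TRAIN_DIR")
if [ -n "$step" ]; then
    echo "Latest checkpoint: model.ckpt-${step}"
else
    echo "No model.ckpt-*.index files found in ${TRAIN_DIR}"
fi
```

The printed number is what goes in place of #### when exporting.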
You can now test this model using TensorFlow’s own object detection tutorial code. You’ll have to change the paths in boxes 5 and 9.
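With everything working, this is a good moment to save the container using the docker commit command mentioned earlier. The following is a rough sketch meant to be run on the host, not inside the container. The image name tf-gpu-setup:latest and the function name are hypothetical, and it assumes the container you want is the most recently created one. Depending on your setup, the docker commands may need sudo.

```shell
#!/bin/sh
IMAGE_NAME="tf-gpu-setup:latest"   # hypothetical name for the saved image

# Commit the most recently created container to a new image.
snapshot_latest_container() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found on PATH"
        return 0
    fi
    # -l selects the latest created container, -q prints only its ID.
    cid=$(docker ps -l -q 2>/dev/null)
    if [ -z "$cid" ]; then
        echo "no container found to commit"
        return 0
    fi
    docker commit "$cid" "$1" || echo "docker commit failed"
}

snapshot_latest_container "$IMAGE_NAME"

# Images add up quickly, so check the free space on the root filesystem afterwards.
df -h /
```

The saved image can then be started again with the same nvidia-docker run command, using the new image name in place of ubuntu:latest.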