Configuring your GPU Server for the AI Racing League

Dan McCreary
Jan 15, 2021
One of our older desk-side servers with two Nvidia GTX 1080 Ti GPU cards. Each card has 11GiB of RAM and 3,584 CUDA cores and originally retailed for around $700. These cards are now two generations old, and the latest RTX 3080 cards are considerably faster.

We are trying to make the process of setting up an AI Racing League event as turn-key as possible. Central to the event’s success is letting students train a deep learning model in under 10 minutes. Although the Raspberry Pi and Nvidia Jetson Nano are excellent for gathering driving images and running real-time driving inference, they are just too slow to train a model on 10,000 image files. There are two options:

  1. Get a high-speed internet connection at your events and give everyone cloud-based accounts for doing their training.
  2. Have an on-site GPU server that the students can use.

We have decided to go down route #2 for several reasons. First, we can’t guarantee high-speed internet access at these events. Second, we may not have the funds to pay for cloud-based training servers. Third, it would take extra time to set up and configure all the cloud-based accounts.

So how expensive is it to set up a GPU-based server that can train a model in around 10 minutes? We think a ballpark price of $2,000 is reasonable. If you have some used PCs sitting around and you are willing to bargain on Craigslist, you could do this for under $1,000. For our testing, we deliberately used a GPU server that was not state-of-the-art: it was about three years old and readily available on the used market. If you have friends who are gamers, they might be willing to donate their old GPUs to a worthy cause.

Here are the steps we need to go through to test the server:

  1. Purchase or find a used PC chassis that can hold PCI Express (PCIe) cards
  2. Find a GPU that TensorFlow can use (a CUDA-capable Nvidia card, in our case)
  3. Install the UNIX OS and the GPU drivers, and verify from the UNIX shell that the configuration works
  4. Enable the SSH server on the machine so students can log in
  5. Install Python, TensorFlow, and the DonkeyCar software onto the server
  6. Copy a sample DonkeyCar training set (a collection of images) from your Pi or Nano to the server and run the training software
  7. Copy the generated model to your DonkeyCar and verify it works
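Steps 6 and 7 mostly come down to copying files over SSH. The sketch below assumes rsync and scp are installed on both machines; the car's login (pi@donkeycar.local) and the mycar/data and mycar/models folders are hypothetical placeholders, so substitute whatever your car actually uses. The training command itself varies between DonkeyCar versions, so it is deliberately left out.

```shell
#!/usr/bin/env bash
# Sketch of steps 6 and 7, run on the GPU server.
# The host name and folder names below are hypothetical placeholders.
CAR_HOST="${CAR_HOST:-pi@donkeycar.local}"      # hypothetical login on the car
CAR_DATA_DIR="${CAR_DATA_DIR:-mycar/data}"      # hypothetical tub folder, relative to the car user's home
CAR_MODEL_DIR="${CAR_MODEL_DIR:-mycar/models}"  # hypothetical model folder on the car
SERVER_DATA_DIR="${SERVER_DATA_DIR:-$HOME/mycar/data}"

# Step 6: pull the recorded images from the car onto the server.
# rsync only transfers files that changed, so re-running it is cheap.
pull_training_data() {
  mkdir -p "$SERVER_DATA_DIR"
  rsync -avz "${CAR_HOST}:${CAR_DATA_DIR}/" "${SERVER_DATA_DIR}/"
}

# Step 7: push a trained model file back to the car.
push_model() {
  local model_file="$1"
  scp "$model_file" "${CAR_HOST}:${CAR_MODEL_DIR}/"
}
```

A typical session would be pull_training_data, then your DonkeyCar training run, then push_model with the path of the model file it produced.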

I would suggest configuring a desktop PC with about 16GiB of RAM and at least one PCIe GPU card. After you plug in the card, you will need to download the appropriate drivers. We are running a Linux server, and the following commands were run on CentOS 7.5.

The first step is to verify that the operating system correctly detects the PCI card. The lspci (“list PCI”) command will do this:

$ lspci | grep -i NVIDIA
15:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
15:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
2d:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
2d:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)

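Our system has found two graphics processors, both GeForce GTX 1080 Ti cards (hardware revision a1), each paired with its HDMI audio function. This confirms that the cards are detected on the PCI bus; on its own it does not prove that the Nvidia driver is loaded, which is what the next command checks.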
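Because lspci only reports the hardware it finds, it helps to also check which kernel driver is bound to each card. The sketch below is a pair of small shell helpers whose names are our own; lspci -k, which adds "Kernel driver in use" and "Kernel modules" lines under each device, is a standard lspci option.

```shell
#!/usr/bin/env bash
# Count NVIDIA display controllers in lspci output read from stdin.
# Each card also exposes an HDMI audio function, so only the VGA line is counted.
count_nvidia_gpus() {
  grep -i 'nvidia' | grep -ci 'vga compatible controller'
}

# Show each NVIDIA device with the kernel driver and modules lspci reports for it.
# With the proprietary driver loaded, expect "Kernel driver in use: nvidia".
show_nvidia_drivers() {
  lspci -k | grep -i -A 3 'nvidia'
}
```

On our server, lspci | count_nvidia_gpus should print 2.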

You can then run a command provided by the manufacturer to see what is going on inside the GPUs. For Nvidia cards this is the Nvidia System Management Interface (SMI) command:

$ nvidia-smi
Sun Sep 8 15:17:08 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:15:00.0 Off |                  N/A |
| 18%   34C    P8    17W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:2D:00.0  On |                  N/A |
| 25%   45C    P8    19W / 250W |    287MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      2939      G   /usr/bin/X                                    88MiB |
|    1      3788      G   /usr/bin/gnome-shell                         115MiB |
|    1      4752      G   /usr/lib64/firefox/firefox                    80MiB |
+-----------------------------------------------------------------------------+

This gives us a list of the cards along with statistics such as GPU utilization, temperature (in °C), fan speed as a percentage of its maximum, memory used and available (about 11GiB per card), and which processes are using each GPU.
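For a quick, script-friendly view of the same numbers, nvidia-smi can print selected fields as CSV with --query-gpu and --format=csv, and repeat every few seconds with -l. The helper names below are our own; the field names (temperature.gpu, fan.speed, utilization.gpu, memory.used, memory.total) are the ones listed by nvidia-smi --help-query-gpu, though it is worth confirming them against your driver version.

```shell
#!/usr/bin/env bash
# One CSV line per GPU: index, name, temperature (C), fan %, utilization %, memory used/total.
gpu_summary() {
  nvidia-smi --query-gpu=index,name,temperature.gpu,fan.speed,utilization.gpu,memory.used,memory.total \
             --format=csv
}

# Print utilization and memory every 5 seconds until Ctrl+C (handy during a training run).
watch_gpus() {
  nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5
}

# Hottest card: reads bare temperatures, one per line, for example from
#   nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits
max_temp() {
  sort -n | tail -n 1
}
```

With the two cards shown above, the temperature query would feed 34 and 45 into max_temp, which prints 45.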

Our next step is to install TensorFlow. We want the version of TensorFlow on the server to match the version on the DonkeyCar, and we need to install the “tensorflow-gpu” package rather than plain “tensorflow” so that TensorFlow takes advantage of our GPUs.

Here are the basic commands to install the GPU version of TensorFlow:

$ pip install --upgrade pip
$ pip install tensorflow-gpu
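To line the server up with the car, one approach is to print the TensorFlow version on the car, install that same release on the server, and then confirm that TensorFlow can see a GPU. This is a sketch: the helper names are our own, and tf.test.is_gpu_available() is used because it exists in both TensorFlow 1.x and 2.x, although 2.x marks it deprecated in favour of tf.config.list_physical_devices('GPU'). Also note that from TensorFlow 2.1 onward the plain tensorflow package includes GPU support, so newer releases may not need the separate tensorflow-gpu package.

```shell
#!/usr/bin/env bash
# Print the installed TensorFlow version (run this on the car and on the server).
tf_version() {
  python -c 'import tensorflow as tf; print(tf.__version__)'
}

# Loose check: succeed if two version strings share the same major.minor.
same_major_minor() {
  [ "$(echo "$1" | cut -d. -f1,2)" = "$(echo "$2" | cut -d. -f1,2)" ]
}

# Install the GPU build pinned to a specific version, e.g. the car's.
install_tf_gpu() {
  pip install --upgrade pip
  pip install "tensorflow-gpu==$1"
}

# Prints True if TensorFlow can use at least one GPU.
tf_sees_gpu() {
  python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'
}
```

For example, if tf_version prints X.Y.Z on the car, run install_tf_gpu X.Y.Z on the server, then tf_sees_gpu.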

There are several places where you might run into errors. Here are a few of them.

Using Virtual Environments

With the DonkeyCar we can get by with a single environment for gathering data and running the model. On the server, we might want to do many different tasks with different versions of software libraries. To make this easier, it is common to set up virtual environments. A virtual environment keeps the dependencies required by different projects separate by giving each project its own isolated set of packages.
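A minimal way to do this with Python's built-in venv module is sketched below; the helper names and the ~/envs/donkey location are hypothetical choices. Since this server runs Anaconda, conda create --name donkey python=3.6 followed by conda activate donkey is an equivalent route.

```shell
#!/usr/bin/env bash
# Create a virtual environment in the given directory if it does not exist yet.
# The optional second argument picks the interpreter, e.g. python3.6.
make_env() {
  local env_dir="$1" interp="${2:-python3}"
  if [ ! -d "$env_dir" ]; then
    "$interp" -m venv "$env_dir"
  fi
}

# Activate it in the current shell; afterwards python and pip resolve to the env's copies.
# Type deactivate to leave the environment again.
use_env() {
  . "$1/bin/activate"
}
```

Usage: make_env ~/envs/donkey python3.6 && use_env ~/envs/donkey. The environment inherits the version of whichever interpreter created it.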

Python Versions

The most common problem I have run into is juggling multiple versions of Python and its libraries. Make sure that you are running Python 3.6:

$ python --version
Python 3.6

You will also want to make sure all your libraries work with this version of Python.
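A quick way to confirm which interpreter (and which pip) you are actually using is sketched below. The helper names are our own; python_is compares the interpreter's major.minor version with the one you expect.

```shell
#!/usr/bin/env bash
# Succeed if the interpreter (default: python) reports the expected major.minor version.
python_is() {
  local want="$1" interp="${2:-python}"
  [ "$("$interp" -c 'import sys; print("%d.%d" % sys.version_info[:2])')" = "$want" ]
}

# Show where python lives and which Python pip installs into;
# pip --version ends with "(python X.Y)", which should match.
show_python_setup() {
  command -v python
  python -m pip --version
}
```

Usage: python_is 3.6 || echo "Warning: python is not 3.6".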

UNIX Permissions

Many people who are building a DonkeyCar just log in as root and execute commands as root. If you are the only person using that computer, that is reasonably safe. However, on our shared server we want finer-grained control.

We created a user called the “AI Racing League Admin”, with the login name arl-admin. All the key administration functions should be run by this user without needing to become root.
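On CentOS the account can be created with the standard useradd and passwd tools. The sketch wraps them in a helper (our own naming) that skips creation if the account already exists; it has to be run by someone with sudo rights.

```shell
#!/usr/bin/env bash
# Succeed if the named account already exists on this machine.
user_exists() {
  id -u "$1" >/dev/null 2>&1
}

# Create the shared admin account (default name: arl-admin) with a home directory.
create_admin_user() {
  local name="${1:-arl-admin}"
  if user_exists "$name"; then
    echo "$name already exists"
    return 0
  fi
  sudo useradd --create-home --comment "AI Racing League Admin" "$name"
  sudo passwd "$name"   # prompts interactively for the new password
}
```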

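We also wanted all the Python programs to be run by this user. Make sure that you change the owner and group of the Python installation itself (the real files, not a symbolic link to them) so that your user owns it. On our Anaconda install, that meant the library directory: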

sudo chown -R arl-admin:arl-admin /usr/share/anaconda3/lib/python3.6
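After the chown, it is worth confirming that nothing under the directory was missed. The helpers below are our own naming: owner_of relies on GNU stat (as shipped with CentOS), not_owned_by uses find with ! -user to list stragglers, and real_path uses readlink -f to show the real location behind a symbolic link, which is the one to hand to chown.

```shell
#!/usr/bin/env bash
# Print "user:group" for a path (GNU stat syntax, as on CentOS).
owner_of() {
  stat -c '%U:%G' "$1"
}

# Print up to 10 paths under a directory that the given user does NOT own.
# Empty output means the recursive chown covered everything.
not_owned_by() {
  find "$1" ! -user "$2" | head -n 10
}

# Resolve a symbolic link to the real file or directory it points at.
real_path() {
  readlink -f "$1"
}
```

Usage: not_owned_by "$(real_path /usr/share/anaconda3/lib/python3.6)" arl-admin should print nothing.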

Setting up the Secure Shell Server

To enable remote logins, you will need to enable the SSH daemon, or “sshd”. You can use standard Linux admin tools to do this. On systemd-based distributions such as CentOS 7, the commands are below; on other distributions the service name may differ (for example, ssh on Ubuntu).

sudo systemctl enable sshd
sudo systemctl start sshd

Just Google “Enable SSHD on Linux” to get the full details.
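On CentOS 7 the default firewalld firewall may also block incoming SSH, and it is worth confirming that the daemon is really listening. The sketch below uses ss from iproute2 and firewall-cmd; the helper names are our own, and the port check assumes sshd is on the default port 22.

```shell
#!/usr/bin/env bash
# Succeed if "ss -tln" output on stdin shows a listener on the given TCP port.
port_listening() {
  grep -Eq ":${1}[[:space:]]"
}

# Check the live system: is anything listening on port 22?
sshd_listening() {
  ss -tln | port_listening 22
}

# Permanently allow the ssh service through firewalld, then apply the change.
open_ssh_in_firewall() {
  sudo firewall-cmd --permanent --add-service=ssh
  sudo firewall-cmd --reload
}
```

Once sshd_listening succeeds and the firewall is open, a student should be able to connect from a laptop with ssh followed by their username, an @, and the server's address.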

Dan McCreary

Distinguished Engineer that loves knowledge graphs, AI, and Systems Thinking. Fan of STEM, microcontrollers, robotics, PKGs, and the AI Racing League.