Introduction
With ever more big data being generated, it was only natural that JTA The Data Scientists would need to acquire and maintain High Performance Computing (HPC) solutions. As usual, there is no single answer to our HPC requirements; it all depends on the client and the task at hand. These are some of the considerations:
- We may need to perform HPC in the cloud or on our premises
- Our project may only use the compute resource as a one-off, for example when we train a model
- We may need to connect the compute resource to our users, for example when they drill into vast data sets exposed in a custom visualization
- Our solution should host existing tools that require a high performance computing platform, such as Keras running on TensorFlow
- We want to develop our own custom tools built around HPC technology
As we at JTA pride ourselves on being both disruptive and innovative, we intend to do all of the above!
In this post we are going to talk about our OnPrem solution and how we use it.
OnPrem High Performance Computing
Early supercomputers were cooled with liquid Freon so that their circuits could run at speed. They were large, expensive, specialist pieces of kit, and their performance, stunning at the time, is poor when compared to today's processors. The Cray-1, often described as the first supercomputer, was 2 metres tall, weighed over 5 tons and needed 115 kW of power. That is the sort of power a modern electric car fast charger delivers. Even so, the system managed to provide just 0.16 GFLOPS of compute power.
The Cray-1:
Hardware Choices
JTA has its own OnPrem high performance computing platform, built by our data scientists using GPUs. The design was quite difficult because there are so many options available and it is hard to know what to buy. After a lot of study we decided to build using the following options:
- We chose to use GPU cards built by NVIDIA®. This was not a difficult choice, because NVIDIA® leads in producing GPUs aimed specifically at compute rather than graphics. NVIDIA® has also developed CUDA (Compute Unified Device Architecture), which allows low-level programming of the GPU from C++ (a short illustrative sketch appears after this list)
- At the present time the cutting-edge solution is the V100 card using NVLink technology, which allows GPUs to communicate with each other very rapidly. We decided not to use NVLink because NVLink host machines are harder to buy, while the alternative, PCIe, is very well known and easily available. Besides, the PCIe solution is good enough for our workloads.
- We also decided that we did not need the latest V100 card; the cheaper P100 is perfect for our needs. The latest cards always attract a premium price, and one can save a lot of money by buying the previous generation.
- We also needed to buy host machines, and for those we chose the Dell PowerEdge C4140. These servers occupy only 1U of rack space, so it is easy to fit 30-odd machines in a rack. The design is neat in that the chassis is thin but deep: the rear houses a powerful server board, and the front half of the machine houses four GPUs.
Dell PowerEdge C4140 for High Performance Computing:
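To give a flavour of what CUDA programming looks like, here is a minimal vector-addition kernel. This is an illustrative sketch rather than code from our systems: one GPU thread handles one element of the output, which is the same pattern that lets thousands of CUDA cores work in parallel.

```cpp
// vector_add.cu - a minimal CUDA sketch (illustrative only).
// Build with: nvcc vector_add.cu -o vector_add
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread computes one element of c = a + b.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;              // one million elements
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);       // unified memory, visible to CPU and GPU
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(a, b, c, n);  // one thread per element
    cudaDeviceSynchronize();             // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);         // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```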
Each P100 has 3,584 CUDA cores and, depending on the numeric precision, runs at up to 18,700 GFLOPS. We average about 10,000 GFLOPS from each GPU, so each server delivers around 40,000 GFLOPS. That is equivalent to 250,000 Cray-1 machines, which at 115 kW apiece would draw nearly 29 GW, enough power for millions of homes!
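For the curious, the arithmetic behind that comparison uses only the figures quoted above:
- 40,000 GFLOPS per server ÷ 0.16 GFLOPS per Cray-1 = 250,000 Cray-1s
- 250,000 Cray-1s × 115 kW each ≈ 28.8 GW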
“A single GPU-accelerated node powered by four P100s interconnected with PCIe replaces up to 32 commodity CPU nodes for a variety of applications. Completing all the jobs with far fewer powerful nodes means that customers can save up to 70 percent in overall data center costs.” – NVIDIA®
We use the GPUs for training machine learning models such as neural networks. If you are interested, see our whitepapers explaining Neural Networks.
We are currently developing systems that implement our text matching algorithms on High Performance Computing, aiming to massively improve performance in our Natural Language Processing work.
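To show why this kind of workload maps so well onto a GPU, here is a deliberately naive sketch (not our production algorithm): each GPU thread tests the pattern against one starting offset of the text, so every candidate position is checked in parallel.

```cpp
// match.cu - naive parallel substring search (illustrative sketch only).
// Build with: nvcc match.cu -o match
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Each thread checks whether `pattern` occurs at one offset of `text`.
__global__ void naiveMatch(const char *text, int textLen,
                           const char *pattern, int patLen, int *hits) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // candidate start offset
    if (i > textLen - patLen) return;
    for (int j = 0; j < patLen; ++j)
        if (text[i + j] != pattern[j]) return;      // mismatch: give up
    hits[i] = 1;                                    // pattern found at offset i
}

int main() {
    const char *hostText = "the quick brown fox jumps over the lazy dog";
    const char *hostPat  = "the";
    int textLen = (int)strlen(hostText), patLen = (int)strlen(hostPat);

    char *text; char *pat; int *hits;
    cudaMallocManaged(&text, textLen);               // unified memory
    cudaMallocManaged(&pat, patLen);
    cudaMallocManaged(&hits, textLen * sizeof(int));
    memcpy(text, hostText, textLen);
    memcpy(pat, hostPat, patLen);
    memset(hits, 0, textLen * sizeof(int));

    int threads = 256, blocks = (textLen + threads - 1) / threads;
    naiveMatch<<<blocks, threads>>>(text, textLen, pat, patLen, hits);
    cudaDeviceSynchronize();

    for (int i = 0; i < textLen; ++i)
        if (hits[i]) printf("match at offset %d\n", i);  // offsets 0 and 31

    cudaFree(text); cudaFree(pat); cudaFree(hits);
    return 0;
}
```

In practice one would batch many documents and patterns per kernel launch to keep the GPU busy, but even this sketch shows why the workload parallelises so naturally.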