Facebook has made Torch available to everyone. Torch is an open source development environment for numerics, machine learning, and computer vision, with a particular emphasis on deep learning and convolutional nets. The latest release includes GPU-optimized modules for large convolutional nets (ConvNets), as well as networks with sparse activations that are commonly used in natural language processing applications. The ConvNet modules include a fast FFT-based convolutional layer covered in an earlier TechEnablement article, “Facebook Open Source GPU FFT 1.5x Faster Than NVIDIA CUFFT”.
Torch includes a number of other CUDA-based modules and containers, including:
- Containers that allow the user to parallelize training across multiple GPUs using either the data-parallel model (mini-batch split over GPUs) or the model-parallel model (network split over multiple GPUs).
- An optimized lookup table, often used when learning embeddings of discrete objects (e.g. words) and in neural language models.
- A hierarchical SoftMax module to speed up training over an extremely large number of classes.
- Cross-map pooling (sometimes known as MaxOut) often used for certain types of visual and text models.
- A GPU implementation of 1-bit SGD, based on the paper by Frank Seide et al.
- A significantly faster Temporal Convolution layer, which computes the 1-D convolution of an input with a kernel, typically used in ConvNets for speech recognition and natural language applications. The latest version improves upon the original Torch implementation by using the same BLAS primitives significantly more efficiently. Observed speedups range from 3x to 10x on a single GPU, depending on input size, kernel size, and stride.
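To make the data-parallel model concrete, here is a minimal CPU sketch in NumPy (not Torch's actual multi-GPU containers): the mini-batch is split into shards, each "device" computes a gradient on its shard, and the shard gradients are averaged by shard size. The function names are hypothetical.

```python
import numpy as np

def gradient(W, X, y):
    """Gradient of the mean squared error 0.5*mean((X @ W - y)^2) w.r.t. W."""
    return X.T @ (X @ W - y) / len(X)

def data_parallel_gradient(W, X, y, n_devices=2):
    """Hypothetical data-parallel step: split the batch, average shard gradients."""
    shards_X = np.array_split(X, n_devices)
    shards_y = np.array_split(y, n_devices)
    grads = [gradient(W, xs, ys) for xs, ys in zip(shards_X, shards_y)]
    sizes = [len(xs) for xs in shards_X]
    # weight each shard's gradient by its size so the result matches
    # the gradient computed on the whole mini-batch at once
    return sum(g * s for g, s in zip(grads, sizes)) / sum(sizes)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = rng.standard_normal((8, 1))
W = rng.standard_normal((3, 1))
full = gradient(W, X, y)
split = data_parallel_gradient(W, X, y)
```

For a loss that averages over examples, the size-weighted average of shard gradients is mathematically identical to the full-batch gradient, which is what makes the mini-batch split safe.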
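The lookup table bullet can be illustrated with a small sketch of what such a layer does (this is an assumption-laden NumPy toy, not the Torch module): the forward pass gathers one embedding row per discrete token index, and the backward pass scatter-adds gradients into only those rows, which is why the update is cheap even when the vocabulary is huge.

```python
import numpy as np

class LookupTable:
    """Hypothetical embedding lookup: n_tokens rows of dimension dim."""

    def __init__(self, n_tokens, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.standard_normal((n_tokens, dim))

    def forward(self, indices):
        # gather one row per token index
        return self.weight[indices]

    def backward(self, indices, grad_out, lr=0.1):
        # sparse update: only the looked-up rows change;
        # np.add.at accumulates correctly for repeated indices
        np.add.at(self.weight, indices, -lr * grad_out)

table = LookupTable(n_tokens=100, dim=4)
ids = np.array([3, 17, 3])          # token indices; repeats are allowed
emb = table.forward(ids)
```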
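The hierarchical SoftMax bullet rests on a factorization worth spelling out. A minimal two-level sketch (hypothetical names, plain NumPy, not the Torch module): classes are grouped into clusters and p(class) = p(cluster) * p(class | cluster), so scoring one class costs on the order of (#clusters + cluster size) instead of #classes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hierarchical_prob(x, cluster_W, class_Ws, cluster_of, within, c):
    """p(class c | x) = p(cluster of c | x) * p(c | cluster of c, x)."""
    p_cluster = softmax(cluster_W @ x)       # softmax over clusters only
    k = cluster_of[c]
    p_within = softmax(class_Ws[k] @ x)      # softmax within one cluster only
    return p_cluster[k] * p_within[within[c]]

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
cluster_W = rng.standard_normal((2, 5))                      # 2 clusters
class_Ws = [rng.standard_normal((2, 5)) for _ in range(2)]   # 2 classes each
cluster_of = [0, 0, 1, 1]    # which cluster each of the 4 classes belongs to
within = [0, 1, 0, 1]        # position of each class inside its cluster
total = sum(hierarchical_prob(x, cluster_W, class_Ws, cluster_of, within, c)
            for c in range(4))
```

Because each per-cluster softmax sums to one, the factorized probabilities still sum to one over all classes.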
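Cross-map pooling, as the bullet above notes, is a max taken across feature maps rather than across spatial positions. A minimal NumPy sketch under that reading (the function name is hypothetical): C maps are reduced to C // group_size maps by an element-wise max within each group.

```python
import numpy as np

def cross_map_pool(feats, group_size):
    """Element-wise max over groups of feature maps: (C, H, W) -> (C//g, H, W)."""
    c, h, w = feats.shape
    assert c % group_size == 0
    return feats.reshape(c // group_size, group_size, h, w).max(axis=1)

maps = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
pooled = cross_map_pool(maps, group_size=2)
```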
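The key idea in the 1-bit SGD paper by Seide et al. is aggressive gradient quantization with error feedback. A minimal CPU sketch of that idea (hypothetical names, not Facebook's GPU implementation): each gradient entry is reduced to one bit (its sign, times a per-tensor scale), and the quantization error is folded into the next step so nothing is lost over time.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize a gradient to sign * scale, carrying the error forward."""
    g = grad + residual                  # fold in the previous step's error
    scale = np.abs(g).mean()             # one scalar per tensor
    q = np.where(g >= 0, scale, -scale)  # 1 bit per entry, plus the scale
    return q, g - q                      # quantized gradient, new residual

rng = np.random.default_rng(1)
g1 = rng.standard_normal(6)
g2 = rng.standard_normal(6)
q1, r1 = one_bit_quantize(g1, np.zeros(6))
q2, r2 = one_bit_quantize(g2, r1)
```

By construction, q1 + q2 + r2 equals g1 + g2 exactly: the error feedback guarantees the quantized stream plus the final residual accounts for every gradient.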
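The Temporal Convolution speedup comes from restructuring the computation around BLAS primitives. One standard way to do that, sketched here in NumPy under assumed shapes (this is an illustration, not the actual Torch kernel): unfold the input into overlapping frames, then apply every kernel at every time step with a single matrix multiply.

```python
import numpy as np

def temporal_conv(x, weight, stride=1):
    """1-D convolution as one GEMM. x: (T, in_dim); weight: (out_dim, kw*in_dim)."""
    T, in_dim = x.shape
    kw = weight.shape[1] // in_dim       # kernel width
    n_out = (T - kw) // stride + 1
    # unfold: each output step sees a flattened window of kw frames
    frames = np.stack([x[t * stride : t * stride + kw].ravel()
                       for t in range(n_out)])   # (n_out, kw*in_dim)
    return frames @ weight.T                     # the single GEMM

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 3))        # 10 time steps, 3 input features
w = rng.standard_normal((4, 2 * 3))     # 4 output features, kernel width 2
fast = temporal_conv(x, w)
# reference result from explicit loops, for comparison
naive = np.array([[sum(w[o, k * 3 + i] * x[t + k, i]
                       for k in range(2) for i in range(3))
                   for o in range(4)] for t in range(9)])
```

Batching all time steps into one large matrix multiply is exactly the kind of change that lets the same BLAS primitives run far more efficiently on a GPU.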
Soumith Chintala claims in the Facebook research blog post that “Torch is widely used at a number of academic labs as well as at Google/DeepMind, Twitter, NVIDIA, AMD, Intel, and many other companies”. For more information see http://torch.ch/.
- Interested readers can also find the TechEnablement deep-learning teaching code, which achieved 13 PF/s average sustained performance, in the farbopt github repository. More about the parallel mapping that delivers petaflop performance on GPUs and Intel Xeon Phi can be found here.
- NVIDIA also provides the cuDNN deep-learning library.