
Graphcore

RESEARCH PAPERS

AstraZeneca & Graphcore Research: The Role of Graph Topology in the Performance of Biomedical Knowledge Graph Completion Models

Alberto Cattaneo, Stephen Bonner, Thomas Martynec, Carlo Luschi, Ian P. Barrett, Daniel Justus

Knowledge Graph Completion has been increasingly adopted as a useful method in biomedical research, for tasks such as drug repurposing and drug-target identification. A variety of datasets and Knowledge Graph Embedding models have been proposed over the years; however, little is known about the properties that render a dataset useful for a given task.

We conduct an investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions and a new suite of analysis tools we invite the community to build upon our work and continue improving the understanding of these crucial applications.

Aleph Alpha, Cohere & Graphcore: u-µP: The Unit-Scaled Maximal Update Parametrization

Charlie Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Y. Prince, Björn Deiseroth, Andres Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, Douglas Orr

The Maximal Update Parametrization (µP) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-µP, which improves upon µP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low-precision. The two techniques have a natural affinity: µP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-µP models reaching a lower loss than comparable µP models and working out-of-the-box in FP8.
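
For intuition, here is a toy sketch of the proxy-sweep workflow that µP-style parametrizations enable (not the u-µP implementation): sweep a hyperparameter on a cheap narrow model and reuse the best value at the target width. The `train_and_eval` function is a hypothetical stand-in for a full training run.

```python
import numpy as np

def train_and_eval(width: int, lr: float, steps: int = 200) -> float:
    """Hypothetical stand-in for training a model of a given width with
    learning rate `lr` and returning its validation loss. Here it is a toy
    quadratic whose optimal LR is width-independent by construction,
    mimicking the transfer behaviour that muP aims to provide."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=width) / np.sqrt(width)
    for _ in range(steps):
        grad = w              # gradient of 0.5 * ||w||^2
        w = w - lr * grad
    return float(np.mean(w ** 2))

# 1. Sweep the learning rate on a cheap proxy model.
proxy_width = 64
lrs = [2.0 ** -k for k in range(1, 8)]
best_lr = min(lrs, key=lambda lr: train_and_eval(proxy_width, lr))

# 2. Reuse the proxy-optimal LR on the full-size target model.
target_width = 4096
final_loss = train_and_eval(target_width, best_lr)
print(f"best proxy LR={best_lr}, target-model loss={final_loss:.6f}")
```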

Graphcore: Scalify: scale propagation for efficient low-precision LLM training

Paul Balança, Sam Hosegood, Carlo Luschi, Andrew Fitzgibbon

Low-precision formats such as float8 have been introduced in machine learning accelerator hardware to improve computational efficiency for large language model training and inference. Nevertheless, adoption by the ML community has been slowed down by the complex, and sometimes brittle, techniques required to match higher-precision training accuracy.

In this work, we present Scalify, an end-to-end scale propagation paradigm for computational graphs, generalizing and formalizing existing tensor scaling methods. Experimental results show that Scalify supports out-of-the-box float8 matrix multiplication and gradient representation, as well as float16 optimizer state storage.
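
A minimal sketch of the scale-propagation idea (a simplification for illustration, not the Scalify API; the `ScaledTensor` class and helpers below are hypothetical): each tensor carries an explicit scale alongside its low-precision payload, and operations combine scales analytically so the stored values stay near unit magnitude.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ScaledTensor:
    """Low-precision payload plus an explicit FP32 scale: value ~= data * scale."""
    data: np.ndarray   # would be float8/float16 on real hardware; float32 here
    scale: float

def make_scaled(x: np.ndarray) -> ScaledTensor:
    # Rescale so the stored payload has roughly unit RMS.
    scale = float(np.sqrt(np.mean(x ** 2)) + 1e-12)
    return ScaledTensor((x / scale).astype(np.float32), scale)

def scaled_matmul(a: ScaledTensor, b: ScaledTensor) -> ScaledTensor:
    # Propagate scales analytically: the payload matmul of two unit-RMS
    # tensors grows roughly like sqrt(K), so fold that growth into the scale.
    k = a.data.shape[-1]
    out = a.data @ b.data / np.sqrt(k)
    return ScaledTensor(out.astype(np.float32), a.scale * b.scale * np.sqrt(k))

x = make_scaled(np.random.randn(8, 256) * 3.0)
w = make_scaled(np.random.randn(256, 128) * 0.02)
y = scaled_matmul(x, w)
print(np.allclose(y.data * y.scale, (x.data * x.scale) @ (w.data * w.scale), rtol=1e-4))
```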

Graphcore: MESS: Modern Electronic Structure Simulations

Hatem Helal, Andrew Fitzgibbon

Electronic structure simulation (ESS) has been used for decades to provide quantitative scientific insights on an atomistic scale, enabling advances in chemistry, biology, and materials science, among other disciplines. Following standard practice in scientific computing, the software packages driving these studies have been implemented in compiled languages such as FORTRAN and C. However, the recent introduction of ML into these domains has meant that models must be coded in these languages, or that complex software bridges have to be built between ML models in Python and these large compiled software systems. This is in contrast with recent progress in modern ML frameworks which aim to optimise both ease of use and high performance by harnessing hardware acceleration of tensor programs defined in Python.

We introduce MESS: a modern electronic structure simulation package implemented in JAX, porting the ESS code to the ML world. We outline the costs and benefits of following the software development practices used in ML for this important scientific workload. MESS shows significant speedups on widely available hardware accelerators and simultaneously opens a clear pathway towards combining ESS with ML.

Graphcore, RWTH Aachen University, New Jersey Institute of Technology, Mila: MiniMol: A Parameter-Efficient Foundation Model for Molecular Learning

Kerstin Kläser, Błażej Banaszewski, Samuel Maddrell-Mander, Callum McLean, Luis Müller, Ali Parviz, Shenyang Huang, Andrew Fitzgibbon

In biological tasks, data is rarely plentiful as it is generated from hard-to-gather measurements. Therefore, pre-training foundation models on large quantities of available data and then transferring to low-data downstream tasks is a promising direction. However, how to design effective foundation models for molecular learning remains an open question, with existing approaches typically focusing on models with large parameter capacities. In this work, we propose MiniMol, a foundational model for molecular learning with 10 million parameters. MiniMol is pre-trained on a mix of roughly 3300 sparsely defined graph- and node-level tasks of both quantum and biological nature. The pre-training dataset includes approximately 6 million molecules and 500 million labels.

To demonstrate the generalizability of MiniMol across tasks, we evaluate it on downstream tasks from the Therapeutics Data Commons (TDC) ADMET group, showing significant improvements over the prior state-of-the-art foundation model across 17 tasks. MiniMol will be released as a public, open-source model for future research.

Graphcore & Mila: Reducing the Cost of Quantum Chemical Data By Backpropagating Through Density Functional Theory

Alexander Mathiasen, Hatem Helal, Paul Balanca, Adam Krzywaniak, Ali Parviz, Frederik Hvilshøj, Blazej Banaszewski, Carlo Luschi, Andrew William Fitzgibbon

Density Functional Theory (DFT) accurately predicts the quantum chemical properties of molecules, but scales as O(N³) in the number of electrons. Schütt et al. (2019) successfully approximate DFT 1000x faster with Neural Networks (NN). Arguably, the biggest problem one faces when scaling to larger molecules is the cost of DFT labels. For example, it took years to create the PCQ dataset (Nakata & Shimazaki, 2017) on which subsequent NNs are trained within a week. DFT labels molecules by minimizing energy E(⋅) as a "loss function." We bypass dataset creation by directly training NNs with E(⋅) as a loss function. For comparison, Schütt et al. (2019) spent 626 hours creating a dataset on which they trained their NN for 160h, for a total of 786h; our method achieves comparable performance within 31h.
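
A conceptual sketch of the training loop implied above, under heavy simplification: instead of regressing to precomputed DFT labels, the network's output is fed into a differentiable energy function and the energy itself is minimised. Here `dft_energy` is a hypothetical placeholder (a toy quadratic, not a real DFT functional) and the "network" is a single linear map.

```python
import numpy as np

def dft_energy(density_coeffs: np.ndarray) -> float:
    """Hypothetical differentiable stand-in for a DFT energy functional E(.).
    A real implementation would build the Fock matrix and integrate the
    exchange-correlation functional; here it is a toy quadratic bowl."""
    return float(0.5 * np.sum((density_coeffs - 1.0) ** 2))

def dft_energy_grad(density_coeffs: np.ndarray) -> np.ndarray:
    # Analytic gradient of the toy energy; in practice obtained by autodiff.
    return density_coeffs - 1.0

# A linear "neural network" mapping molecule features to density coefficients.
rng = np.random.default_rng(0)
features = rng.normal(size=(16,))          # toy molecular descriptor
weights = rng.normal(size=(8, 16)) * 0.1   # NN parameters

lr = 0.05
for step in range(500):
    coeffs = weights @ features                     # NN forward pass
    grad_coeffs = dft_energy_grad(coeffs)           # dE/d(coeffs)
    grad_weights = np.outer(grad_coeffs, features)  # chain rule to parameters
    weights -= lr * grad_weights                    # minimise E directly, no labels

print("final energy:", dft_energy(weights @ features))
```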

Graphcore, Mila, Québec AI Institute, Valence & Université de Montréal: Generating QM1B with PySCFIPU

Alexander Mathiasen, Hatem Helal, Kerstin Klaser, Paul Balanca, Josef Dean, Carlo Luschi, Dominique Beaini, Andrew Fitzgibbon, Dominic Masters

The emergence of foundation models in Computer Vision and NLP has resulted in immense progress on downstream tasks, enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT).

In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCFIPU using IPUs. This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets.

Graphcore: Training and inference of large language models using 8-bit floating point

Sergio P. Perez, Yan Zhang, James Briggs, Charlie Blake, Josh Levy-Kramer, Paul Balanca, Carlo Luschi, Stephen Barlow, Andrew William Fitzgibbon

FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although there exists ample literature about selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8.

This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations. We apply this methodology to train and validate large language models of the GPT and Llama 2 type using FP8, for model sizes ranging from 111M to 70B. To facilitate the understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference.
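
The sketch below illustrates the general flavour of dynamic per-tensor scaling for an FP8 linear layer (a simplification for illustration, not the paper's exact scheme): track the absolute maximum of each tensor and choose a power-of-two scale so the scaled tensor fits the FP8 dynamic range. FP8 casting is only simulated here by clipping.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude of the E4M3 format

def choose_scale(amax: float, fmt_max: float = FP8_E4M3_MAX) -> float:
    """Pick a power-of-two scale so that amax * scale just fits in range."""
    if amax == 0.0:
        return 1.0
    return 2.0 ** np.floor(np.log2(fmt_max / amax))

def fake_quant_fp8(x: np.ndarray, scale: float) -> np.ndarray:
    """Simulate FP8 casting by clipping to the representable range.
    (Real hardware would also round the mantissa to 3 bits.)"""
    return np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

# Per-tensor scales for activations and weights, derived from observed amax.
x = np.random.randn(32, 1024).astype(np.float32) * 5.0
w = np.random.randn(1024, 1024).astype(np.float32) * 0.02

sx = choose_scale(np.abs(x).max())
sw = choose_scale(np.abs(w).max())

# The matmul runs on the scaled (FP8-range) tensors; the result is unscaled afterwards.
y = (fake_quant_fp8(x, sx) @ fake_quant_fp8(w, sw)) / (sx * sw)
print("max relative error vs FP32 matmul:",
      np.max(np.abs(y - x @ w) / (np.abs(x @ w) + 1e-6)))
```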

Graphcore Research: Unit Scaling: Out-of-the-Box Low-Precision Training

Charlie Blake, Douglas Orr, Carlo Luschi

Unit scaling is a paradigm for designing deep learning models that simplifies the use of low-precision number formats. Training in FP16 or the recently proposed FP8 formats offers substantial efficiency gains, but can lack sufficient range for out-of-the-box training. Unit scaling addresses this by introducing a principled approach to model numerics: seeking unit variance of all weights, activations and gradients at initialisation. Unlike alternative methods, this approach neither requires multiple training runs to find a suitable scale nor has significant computational overhead.

We demonstrate the efficacy of unit scaling across a range of models and optimisers. We further show that existing models can be adapted to be unit-scaled, training BERT-Large in FP16 and then FP8 with no degradation in accuracy.
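
A minimal numpy sketch of the unit-scaling idea for a single matmul (simplified from the paper's full scheme): the forward and backward passes each get their own fixed scaling factor, chosen from fan-in, fan-out and batch size so that outputs and gradients have approximately unit variance at initialisation.

```python
import numpy as np

def unit_scaled_linear(x, w):
    """Forward pass of a unit-scaled linear layer.
    x: (batch, fan_in), w: (fan_in, fan_out), both ~ unit variance."""
    fan_in = w.shape[0]
    return (x @ w) * fan_in ** -0.5   # sum of fan_in unit-variance terms -> rescale

def unit_scaled_linear_backward(grad_y, x, w):
    """Backward pass with separately chosen scales so that gradients also
    start training at roughly unit variance (a simplification of the
    separate forward/backward scaling used in Unit Scaling)."""
    fan_in, fan_out = w.shape
    batch = x.shape[0]
    grad_x = (grad_y @ w.T) * fan_out ** -0.5
    grad_w = (x.T @ grad_y) * batch ** -0.5
    return grad_x, grad_w

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 512))
w = rng.standard_normal((512, 1024))
y = unit_scaled_linear(x, w)
gx, gw = unit_scaled_linear_backward(rng.standard_normal(y.shape), x, w)
print([round(float(np.std(t)), 2) for t in (y, gx, gw)])   # all approximately 1.0
```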

Graphcore, Valence Discovery, Mila, McGill University, Université de Montréal: GPS++: Reviving the Art of Message Passing for Molecular Property Prediction

Dominic Masters, Josef Dean, Kerstin Klaser, Zhiyi Li, Sam Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Andrew Fitzgibbon, Shenyang Huang, Ladislav Rampášek, Dominique Beaini

GPS++ is a hybrid Message Passing Neural Network / Graph Transformer model for molecular property prediction. Our model integrates a well-tuned local message passing component and biased global attention with other key ideas from prior literature to achieve state-of-the-art results on the large-scale molecular dataset PCQM4Mv2.

Through a thorough ablation study we highlight the impact of individual components and find that nearly all of the model's performance can be maintained without any use of global self-attention, showing that message passing is still a competitive approach for 3D molecular property prediction despite the recent dominance of graph transformers. We also find that our approach is significantly more accurate than prior art when 3D positional information is not available.

Graphcore: PopSparse: Accelerated block sparse matrix multiplication on IPU

Zhiyi Li, Douglas Orr, Valeriu Ohan, Godfrey Da costa, Tom Murray, Adam Sanders, Deniz Beker, Dominic Masters

Reducing the computational cost of running large scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general purpose accelerators (GPAs) such as NVIDIA GPUs using low precision number formats.

In this work we introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs by leveraging both the unique hardware characteristics of IPUs as well as any block structure defined in the data. We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run. We present benchmark results for matrix multiplication for both of these modes on IPU with a range of block sizes, matrix sizes and densities.
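
For intuition, here is a small dense-reference sketch of block-sparse matrix multiplication (not the PopSparse API): the sparse operand is stored as a list of non-zero blocks with their block coordinates, and only those blocks contribute to the output.

```python
import numpy as np

def block_sparse_matmul(blocks, block_rows, block_cols, block_size, shape, dense):
    """Multiply a block-sparse matrix A (given as its non-zero blocks) by a dense
    matrix. blocks[i] is the (block_size x block_size) block located at block
    coordinates (block_rows[i], block_cols[i]) of the full matrix A of `shape`."""
    m, k = shape
    out = np.zeros((m, dense.shape[1]), dtype=dense.dtype)
    for blk, br, bc in zip(blocks, block_rows, block_cols):
        r0, c0 = br * block_size, bc * block_size
        out[r0:r0 + block_size] += blk @ dense[c0:c0 + block_size]
    return out

# Build a random 25%-dense block-sparse matrix with 16x16 blocks.
rng = np.random.default_rng(0)
bs, m, k, n = 16, 128, 128, 64
mask = rng.random((m // bs, k // bs)) < 0.25
rows, cols = np.nonzero(mask)
blocks = [rng.standard_normal((bs, bs)) for _ in rows]

dense = rng.standard_normal((k, n))
y = block_sparse_matmul(blocks, rows, cols, bs, (m, k), dense)

# Check against an explicit dense reconstruction of A.
a = np.zeros((m, k))
for blk, br, bc in zip(blocks, rows, cols):
    a[br * bs:(br + 1) * bs, bc * bs:(bc + 1) * bs] = blk
print(np.allclose(y, a @ dense))
```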

Graphcore Research: SparQ Attention: Bandwidth-Efficient LLM Inference

Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr

The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token-generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.
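A single-query numpy sketch of the selective-fetching idea (simplified for illustration; it omits details of the released method, such as how the residual attention mass is handled): approximate the attention scores using only the largest-magnitude components of the query, then fetch the full key/value rows only for the top-scoring positions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sparq_attention_step(q, K, V, r=16, k=32):
    """Approximate attention for one query vector q over cached K, V.
    r: number of query components used for score approximation.
    k: number of history positions whose full K/V rows are fetched."""
    d = q.shape[0]
    # 1. Use only the r largest-|q| components to approximate the scores,
    #    so only r columns of K need to be read from memory.
    idx = np.argsort(-np.abs(q))[:r]
    approx_scores = K[:, idx] @ q[idx] / np.sqrt(d)
    # 2. Fetch the full K/V rows for the top-k positions only.
    top = np.argsort(-approx_scores)[:k]
    w = softmax(K[top] @ q / np.sqrt(d))
    return w @ V[top]

rng = np.random.default_rng(0)
seq, d = 1024, 64
q, K, V = rng.standard_normal(d), rng.standard_normal((seq, d)), rng.standard_normal((seq, d))

exact = softmax(K @ q / np.sqrt(d)) @ V
approx = sparq_attention_step(q, K, V)
print("cosine similarity to exact output:",
      float(exact @ approx / (np.linalg.norm(exact) * np.linalg.norm(approx))))
```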

Argonne National Laboratory: Characterizing the Performance of Triangle Counting on Graphcore's IPU Architecture

Reet Barik, Siddhisanket Raskar, Murali Emani, Venkatram Vishwanath

In recent years, we have seen an emergence of novel spatial architectures to accelerate domain-specific workloads like Machine Learning. There is a need to investigate their performance characteristics for traditional HPC workloads to enable tighter integration with current and future heterogeneous compute resources.

In this work, we implement, optimize and evaluate a parallel algorithm for Triangle Counting for graphs in the Bulk Synchronous Parallel (BSP) model for Graphcore's IPU architecture, as well as discussing lessons learned. This study demonstrates the IPU's competency in handling such irregular workloads by providing an average speedup of up to 5.3x over an NVIDIA A100 GPU on real-world datasets.

Research Center Juelich, RWTH Aachen University & Graphcore: Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

Jan Finkbeiner, Thomas Gmeinder, Mark Pupilli, Alexander Titterton, Emre Neftci

Current AI training is dominated by GPUs, which excel at parallel workloads but struggle with sparse, recurrent models, limiting progress towards more efficient AI. MIMD architectures with distributed memory, such as the IPU, better support sparse models. We implement sparse, recurrent Spiking Neural Networks (SNNs), which have binary, sparse activations, on IPUs and train them with backpropagation through time. We achieve 5-10x speedups over GPUs, rising to 38x at high sparsity, with no loss in accuracy, and the approach scales well to larger models. This demonstrates the IPU's advantage for sparse models, enabling more efficient AI beyond standard GPU-based training.

University of Connecticut, University of Rochester & PNNL: Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

Hongwu Peng, Caiwen Ding, Tong Geng, Sutanay Choudhury, Kevin Barker, Ang Li

This research evaluates and compares emerging AI/ML hardware accelerators like the IPU, RDU, and enhanced GPUs. These accelerators use innovative dataflow architectures to deliver superior performance on AI workloads beyond traditional processors. Through benchmarking common DNN operators and other AI tasks, we analyze the strengths and tradeoffs of each platform. The findings provide insights into current accelerator capabilities and guide future accelerator development and design for advancing AI/ML applications. Our analysis aims to further overall understanding of specialized hardware needed to meet the growing demands of AI/ML.

Mila, Valence Labs, Université de Montréal, McGill University, Graphcore, New Jersey Institute of Technology, Aachen University, HEC Montréal, CIFAR AI Chair: Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, Ali Parviz, Michael Craig, Michał Koziarski, Jiarui Lu, Zhaocheng Zhu, Cristian Gabellini, Kerstin Klaser, Josef Dean, Cas Wognum, Maciej Sypetkowski, Guillaume Rabusseau, Reihaneh Rabbany, Jian Tang, Christopher Morris, Ioannis Koutis, Mirco Ravanelli, Guy Wolf, Prudencio Tossou, Hadrien Mary, Therence Bois, Andrew Fitzgibbon, Błażej Banaszewski, Chad Martin, Dominic Masters

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular ML, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models.

In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. 

Graphcore, Mila, Valence & Université de Montréal: PySCFIPU: Repurposing Density Functional Theory to Suit Deep Learning

Alexander Mathiasen, Hatem Helal, Kerstin Klaser, Paul Balanca, Josef Dean, Carlo Luschi, Dominique Beaini, Andrew Fitzgibbon, Dominic Masters

Density Functional Theory (DFT) accurately predicts the properties of molecules given their atom types and positions, and often serves as ground truth for molecular property prediction tasks. Neural Networks (NN) are popular tools for such tasks and are trained on DFT datasets, with the aim to approximate DFT at a fraction of the computational cost. Research in other areas of machine learning has shown that generalisation performance of NNs tends to improve with increased dataset size, however, the computational cost of DFT limits the size of DFT datasets. We present PySCFIPU, a DFT library that allows us to iterate on both dataset generation and NN training. We create QM10X, a dataset with 10⁸ conformers, in 13 hours, on which we subsequently train SchNet in 12 hours. We show that the predictions of SchNet improve solely by increasing training data without incorporating further inductive biases.

Charité Universitätsmedizin, Cornell University, Lawrence Berkeley National Laboratory, Simula Research: Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU

Luk Burchard, Max Xiaohang Zhao, Johannes Langguth, Aydın Buluç & Giulia Guidi

The sequence alignment problem is fundamental in bioinformatics; we have implemented the X-Drop algorithm, a heuristic method for pairwise alignment that reduces the search space, on the Graphcore IPU. Our implementation achieves a 10× speedup over a state-of-the-art GPU implementation and up to 4.65× compared to CPU.

Graphcore & PNNL: Extreme Acceleration of Graph Neural Network-based Prediction Models for Quantum Chemistry

Hatem Helal, Jesun Firoz, Jenna Bilbrey, Mario Michael Krell, Tom Murray, Ang Li, Sotiris Xantheas, Sutanay Choudhury

This paper demonstrates a novel hardware-software co-design approach to scale up the training of graph neural networks for molecular property prediction.

We introduce an algorithm that can reduce the training time of such molecular property prediction models from days to less than two hours, opening new possibilities for AI-driven scientific discovery.

Graphcore: BESS: Balanced Entity Sampling and Sharing for Large-Scale Knowledge Graph Completion

Alberto Cattaneo, Daniel Justus, Harry Mellor, Douglas Orr, Jerome Maloberti, Zhenying Liu, Thorin Farnsworth, Andrew Fitzgibbon, Blazej Banaszewski, Carlo Luschi

We present the award-winning submission to the WikiKG90Mv2 track of OGB-LSC@NeurIPS 2022. The task is link-prediction on the large-scale knowledge graph WikiKG90Mv2, consisting of 90M+ nodes and 600M+ edges. Our solution uses a diverse ensemble of 85 Knowledge Graph Embedding models combining five different scoring functions (TransE, TransH, RotatE, DistMult, ComplEx) and two different loss functions (log-sigmoid, sampled softmax cross-entropy).

Our final model achieved 1st place with a validation MRR of 0.2922 and a test-challenge MRR of 0.2562.
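
For reference, here are minimal numpy versions of three of the scoring functions named above (TransE, DistMult, ComplEx), scoring a (head, relation, tail) triple from its embeddings; a sketch for illustration rather than the BESS code.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE: the tail should lie close to head + relation (negative distance)."""
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    """DistMult: trilinear product with a diagonal relation matrix."""
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    """ComplEx: DistMult generalised to complex embeddings; the score is the
    real part of <h, r, conj(t)>."""
    return float(np.real(np.sum(h * r * np.conj(t))))

rng = np.random.default_rng(0)
dim = 128
h, r, t = (rng.standard_normal(dim) for _ in range(3))
hc, rc, tc = (rng.standard_normal(dim) + 1j * rng.standard_normal(dim) for _ in range(3))

print(transe_score(h, r, t), distmult_score(h, r, t), complex_score(hc, rc, tc))
```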

Graphcore, PNNL, IBM Research, University of Washington: Reducing Down(stream)time: Pretraining Molecular GNNs using Heterogeneous AI Accelerators

Jenna A. Bilbrey, Kristina M. Herman, Henry Sprueill, Sotiris S. Xantheas, Payel Das, Manuel Lopez Roldan, Mike Kraus, Hatem Helal, Sutanay Choudhury

We demonstrate finetuning for downstream tasks on a graph neural network (GNN) trained over a molecular database containing 2.7 million water clusters.

The use of Graphcore IPUs for training molecular GNNs reduces training time from a reported 2.7 days on 0.5M clusters to 1.2 hours on 2.7M clusters. Finetuning the pretrained model for downstream tasks of molecular dynamics and transfer to a different potential energy surface took only 8.3 hours and 28 minutes, respectively, on a single GPU.

Texas A&M University & Graphcore: Benchmarking the Performance of Accelerators on National Cyberinfrastructure Resources for Artificial Intelligence / Machine Learning Workloads

Abhinand S. Nasari, Tim Cockerill, Hieu T. Le, Richard Lawrence, Zhenhua He, Xin Yang, Mario M. Krell, Alex Tsyplikhin, Mahidhar Tatineni, Lisa M. Perez, Dhruva K. Chakravorty, Honggao Liu

This paper compares the performance of two different architectures, the commonly used GPU and the new generation of Intelligence Processing Units (IPUs), by running training benchmarks of common AI/ML models on national cyberinfrastructure resources.

Graphcore Research: 8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi

Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the advantages of floating-point over fixed-point representation, and present an in-depth study on the use of 8-bit floating-point number formats for activations, weights, and gradients for both training and inference. We explore the effect of different bit-widths for exponents and significands and different exponent biases. 
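
To make the explored design space concrete, the sketch below decodes a generic 8-bit floating-point value from a chosen exponent width, significand width and exponent bias (special values such as inf/NaN are ignored for brevity); it is an illustration of the format parameters discussed above, not the paper's code.

```python
def decode_fp8(byte: int, exp_bits: int = 4, man_bits: int = 3, bias: int = 7) -> float:
    """Decode one 8-bit float with layout [sign | exponent | significand].
    Handles normal and subnormal values; inf/NaN encodings are ignored for
    brevity. Defaults correspond to an E4M3-style format with bias 7."""
    assert 1 + exp_bits + man_bits == 8
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    if exponent == 0:                         # subnormal: no implicit leading 1
        return sign * (mantissa / 2 ** man_bits) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / 2 ** man_bits) * 2.0 ** (exponent - bias)

# Example: 0x48 = 0b0_1001_000 -> exponent 9, mantissa 0 -> 2^(9-7) = 4.0
print(decode_fp8(0x48))                                    # E4M3-style defaults
print(decode_fp8(0x48, exp_bits=5, man_bits=2, bias=15))   # same bits, E5M2-style
```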

Microsoft Research & Graphcore: Confidential Machine Learning within Graphcore IPUs

Kapil Vaswani, Stavros Volos, Cédric Fournet, Antonio Nino Diaz, Ken Gordon, Balaji Vembu, Sam Webster, David Chisnall, Saurabh Kulkarni, Graham Cunningham, Richard Osbourne, Dan Wilkinson

This paper presents IPU Trusted Extensions (ITX), a set of experimental hardware extensions that enable trusted execution environments in Graphcore's IPUs.

Its evaluation on a development board using standard DNN training workloads suggests that ITX adds less than 5% performance overhead, and delivers up to 17x better performance compared to CPU-based confidential computing systems relying on AMD SEV-SNP.

Imperial College London: Incremental Abstraction in Distributed Probabilistic SLAM Graphs

Joseph Ortiz, Talfan Evans, Edgar Sucar, Andrew Davison

The Robot Vision Laboratory at Imperial College London proposes a method for efficient incremental construction of probabilistic scene graphs from monocular input, based on two novel components: first, an incremental scene abstraction framework combining amortised inference with probabilistic inference; and second, a routing procedure that enables inference on dynamic graphs with Gaussian Belief Propagation (GBP), leveraging the parallelism of the Graphcore IPU.

This paper demonstrates the advantage of GBP over direct methods for complex factor graphs due to the structure-agnostic time per iteration. 

Imperial College London - Dyson Robotics Laboratory: From Scene Flow to Visual Odometry through Local and Global Regularisation in Markov Random Fields

Raluca Scona, Hidenobu Matsuki, Andrew Davison

This paper revisits pairwise Markov Random Field (MRF) formulations for RGB-D scene flow and leverages novel advances in processor design for real-time implementations.

Dyson Robotics Lab show that visual odometry and non-rigid scene flow can be unified into a single joint factor graph, and optimised highly efficiently with Gaussian Belief Propagation on the Graphcore IPU by leveraging the processor's distributed per-tile memory and ultrafast all-to-all communication fabric.

Graphcore: A Fast Hardware Pseudorandom Number Generator Based on xoroshiro128

James Hanlon, Stephen Felix

The IPU contains an original pseudorandom number generator (PRNG) called xoroshiro128aox, based on the F2-linear generator xoroshiro128. It is designed for cheap hardware implementation and high-quality statistical randomness.

We assess the generator's quality using standard statistical test suites and compare results against the PRNGs xoroshiro128+, pcg64 and philox4x32-10. We show that xoroshiro128aox mitigates a known weakness in xoroshiro128+ with a new 'AOX' output function, passing the BigCrush and PractRand suites.

We conclude that the non-uniformities and inherited linear artefacts are hard to detect, and so xoroshiro128aox provides a good trade-off between statistical quality and hardware implementation cost.
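
For context, here is a plain-Python version of the underlying xoroshiro128 engine with the '+' output function that the paper compares against, using one of the published parameter sets (24, 16, 37); the IPU's 'AOX' output function is Graphcore-specific and is not reproduced here.

```python
MASK64 = (1 << 64) - 1

def rotl(x: int, k: int) -> int:
    """64-bit left rotation."""
    return ((x << k) | (x >> (64 - k))) & MASK64

class Xoroshiro128Plus:
    """Plain-Python xoroshiro128+ with the published (24, 16, 37) parameters.
    xoroshiro128aox replaces the '+' output function below with Graphcore's
    'AOX' combination, which is not reproduced in this sketch."""
    def __init__(self, s0: int, s1: int):
        assert s0 or s1, "state must not be all zero"
        self.s0, self.s1 = s0 & MASK64, s1 & MASK64

    def next(self) -> int:
        result = (self.s0 + self.s1) & MASK64   # '+' output function
        s1 = self.s1 ^ self.s0
        self.s0 = rotl(self.s0, 24) ^ s1 ^ ((s1 << 16) & MASK64)
        self.s1 = rotl(s1, 37)
        return result

rng = Xoroshiro128Plus(0x243F6A8885A308D3, 0x13198A2E03707344)
print([hex(rng.next()) for _ in range(3)])
```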

Stanford University & Graphcore: NanoBatch DPSGD: Exploring Differentially Private learning on ImageNet with low batch sizes on the IPU

Edward H. Lee, Mario Michael Krell, Alexander Tsyplikhin, Victoria Rege, Errol Colak, Kristen W. Yeom

Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, the DPSGD algorithm imposes computational overheads that can undo the benefit of batching in GPUs.

In our work, we argue that low batch sizes using group normalization on ResNet-50 can yield high accuracy and privacy on Graphcore IPUs. This enables DPSGD training of ResNet-50 on ImageNet in just 6 hours (100 epochs) on an IPU-POD16 system.
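
A compact numpy sketch of the DP-SGD step being accelerated (textbook form, not the paper's IPU implementation): clip each per-example gradient to a fixed norm, sum, add calibrated Gaussian noise, and average over the (small) batch.

```python
import numpy as np

def dpsgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
               noise_mult=1.1, rng=None):
    """One differentially private SGD update over a (small) batch.
    per_example_grads: array of shape (batch, dim), one gradient per example."""
    if rng is None:
        rng = np.random.default_rng(0)
    batch = per_example_grads.shape[0]
    # 1. Clip each example's gradient to L2 norm <= clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    # 2. Sum, add calibrated Gaussian noise, then average over the batch.
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_mult * clip_norm, size=params.shape)
    return params - lr * noisy_sum / batch

params = np.zeros(10)
grads = np.random.default_rng(1).standard_normal((8, 10))   # toy per-example grads
print(dpsgd_step(params, grads))
```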

Université de Paris: Comparison of Graphcore IPUs and NVIDIA GPUs for cosmology applications

Bastien Arcelin

This paper represents the first investigation of the suitability and performance of Graphcore Intelligence Processing Units (IPUs) for deep learning applications in cosmology. It presents a benchmark between an NVIDIA V100 GPU and a Graphcore Mk1 (GC2) IPU on three cosmological use cases: a classical deep neural network and a Bayesian neural network (BNN) for galaxy shape estimation, and a generative network for galaxy image production.

The results show that IPUs can accelerate various cosmology applications, outperforming GPUs in some cases with up to 4x faster time to train.

Graphcore Research: Dynamic Sparse Pre-Training of BERT

Anastasia S. D. Dietrich, Frithjof Gressmann, Douglas Orr, Ivan Chelombiev, Daniel Justus, Carlo Luschi

In this work, we develop and study a simple, dynamic always-sparse pre-training approach for BERT language models, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation.

As a result, we achieve Pareto improvements in terms of number of FLOPs over both static and dense baselines across model sizes. Furthermore, we demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.
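
A toy numpy sketch of one compression step as described above (an illustration, not the training code; unstructured rather than block sparsity, for brevity): prune the smallest-magnitude active weights, then re-allocate the freed parameter budget to randomly chosen inactive positions.

```python
import numpy as np

def prune_and_regrow(w, mask, prune_frac=0.3, rng=None):
    """One dynamic-sparsity update: magnitude-prune a fraction of the currently
    active weights, then regrow the same number of connections at random
    inactive positions (initialised to zero and trained thereafter)."""
    if rng is None:
        rng = np.random.default_rng(0)
    active = np.flatnonzero(mask)
    n_prune = int(prune_frac * active.size)
    # Prune the smallest-magnitude active weights.
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:n_prune]]
    mask.ravel()[drop] = False
    w.ravel()[drop] = 0.0
    # Regrow the same number of connections at random inactive positions.
    inactive = np.flatnonzero(~mask)
    grow = rng.choice(inactive, size=n_prune, replace=False)
    mask.ravel()[grow] = True
    return w, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
mask = rng.random((64, 64)) < 0.25       # start at 25% density
w *= mask
w, mask = prune_and_regrow(w, mask, rng=rng)
print("density:", mask.mean())           # overall density unchanged (~0.25)
```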

Graphcore: Packing: Towards 2x NLP BERT Acceleration

Matej Kosec, Sheng Fu, Mario Michael Krell

By using a new packing algorithm, Graphcore engineers have sped up Natural Language Processing by more than 2 times while training BERT-Large. Our new packing technique removes padding, enabling significantly more efficient computation. We suspect this could also be applied to genomics and protein folding models and other models with skewed length distributions to make a much broader impact in different industries and applications.

We introduce Graphcore's highly efficient Non-Negative Least Squares Histogram-Packing algorithm (or NNLSHP) as well as our BERT algorithm applied to packed sequences in a new paper.
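
NNLSHP formulates packing as a non-negative least-squares problem over the sequence-length histogram; as a rough illustration of the packing goal (not that algorithm), here is a simple first-fit-decreasing sketch that combines short sequences into fixed-length packs to remove padding.

```python
def first_fit_pack(lengths, max_len=512):
    """Greedy first-fit-decreasing packing: place each sequence into the first
    pack that still has room, otherwise open a new pack. Returns a list of
    packs, each a list of sequence lengths summing to at most max_len."""
    packs, remaining = [], []
    for n in sorted(lengths, reverse=True):
        for i, free in enumerate(remaining):
            if n <= free:
                packs[i].append(n)
                remaining[i] -= n
                break
        else:
            packs.append([n])
            remaining.append(max_len - n)
    return packs

lengths = [500, 300, 200, 120, 90, 60, 30, 12] * 100   # skewed length distribution
packs = first_fit_pack(lengths)
tokens = sum(lengths)
print(f"packs: {len(packs)}, packing efficiency: {tokens / (len(packs) * 512):.1%}")
```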

Simula: iPUG: Accelerating Breadth-First Graph Traversals Using Manycore Graphcore IPUs

Luk Burchard, Johannes Moe, Daniel Thilo Schroeder, Konstantin Pogorelov, Johannes Langguth

This paper aims to test the IPU's suitability for algorithms with hard-to-predict memory accesses by implementing a breadth-first search (BFS) that complies with the Graph500 specifications. Precisely because of its apparent simplicity, BFS is an established benchmark that is not only a subroutine for a variety of more complex graph algorithms, but also allows comparability across a wide range of architectures.

The results indicate that the IPU delivers speedups of up to 4× over the fastest competing result on an NVIDIA V100 GPU, with typical speedups of about 1.5× on most test instances.
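
For reference, a minimal level-synchronous BFS in the spirit of the Graph500 kernel (a plain-Python sketch, not the iPUG implementation): each BSP-style superstep expands the current frontier and records parents.

```python
from collections import defaultdict

def bfs_levels(edges, source):
    """Level-synchronous BFS over an undirected edge list.
    Returns a parent dict, as required for Graph500-style validation."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    parent = {source: source}
    frontier = [source]
    while frontier:                      # one "superstep" per BFS level
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in parent:      # first visit: record parent, enqueue
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (5, 6)]
print(bfs_levels(edges, source=0))       # vertices 5 and 6 are unreachable
```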

Graphcore Research: GroupBERT - Enhanced Transformer Architecture with Efficient Grouped Structures

Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi

Attention based language models have become a critical component in state-of-the-art NLP systems. However, these models have significant computational requirements, due to long training times, dense operations and large parameter count.

In this paper, Graphcore Research demonstrates a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. Applied to language representation learning, this architecture achieves superior performance compared to BERT models of different scales, resulting in improved efficiency both in terms of floating-point operations (FLOPs) and time-to-train.

Oxford-Man Institute & University of Oxford: Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units

Zihao Zhang, Stefan Zohren

Researchers at the Oxford-Man Institute of Quantitative Finance have used Graphcore's Intelligence Processing Unit (IPU) to dramatically accelerate the training of advanced price prediction models, using techniques which are typically plagued by computational bottlenecks when run on other types of processor.

The IPU's designed-for-AI architecture allowed the OMI team to reduce the training times for their multi-horizon forecasting models to the point where they could deliver significant commercial advantage by more accurately estimating market price movements. Such models can be used in the development of alpha for fast trading and in market making strategies.

Graphcore Research: Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence

Antoine Labatie, Dominic Masters, Zach Eaton-Rosen, Carlo Luschi

We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity.

To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.

Graphcore Research: Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training

Dominic Masters, Antoine Labatie, Zach Eaton-Rosen,

In a new paper, Graphcore Research examines three methods for optimising the performance of the state-of-the-art computer vision model EfficientNet on Intelligence Processing Units (IPUs). These approaches are: (i) generalising depthwise convolutions to group convolutions; (ii) adding proxy-normalized activations to match batch normalization performance with batch-independent statistics; (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution.

By combining all three techniques, IPUs delivered accelerations of up to 7x on training and more than 3.6x on inference.

University of Bristol: Using the Graphcore IPU for traditional HPC applications

Thorben Louw, Simon McIntosh-Smith

The increase in ML workloads means that AI accelerators are expected to become common in supercomputers, evoking considerable interest in the scientific HPC community about how these devices might also be exploited for traditional HPC workloads.

In this paper, we report our early results using Graphcore's IPU for stencil computations on structured grid problems, which are used in solvers for differential equations in domains such as computational fluid dynamics. We demonstrate that the IPU and its low-level programming framework, Poplar, expose sufficient programmability to express these HPC problems, and achieve performance comparable to that of modern GPUs.

Graphcore & UMass Amherst: Accelerating Simulation-based Inference with Emerging AI Hardware

Sourabh Kulkarni, Alexander Tsyplikhin, Mario Michael Krell, Csaba Andras Moritz

In this work, we explore hardware-accelerated simulation-based inference over probabilistic models, by combining a massively parallelized Approximate Bayesian Computation (ABC) inference algorithm with cutting-edge AI chip solutions that are uniquely suited for this purpose. As a proof-of-concept, we demonstrate inference over a probabilistic epidemiology model used to predict the spread of COVID-19. Two hardware acceleration platforms are compared: the Tesla V100 GPU and the Graphcore Mk1 IPU. Our results show that while both of these platforms outperform multi-core CPUs, the Mk1 IPUs are 7.5x faster than the Tesla V100 GPUs for this workload.

Google Research, UC Berkeley & Graphcore Research: Parallel Training of Deep Networks with Local Updates

Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel

In this paper, we investigate how to continue scaling compute efficiently beyond the point of diminishing returns for large batches through local parallelism, a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. Local parallelism enables fully asynchronous layer-wise parallelism with a low memory footprint, and requires little communication overhead compared with model parallelism. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.

Graphcore Research: Improving Neural Network Training in Low Dimensional Random Bases

Frithjof Gressmann, Zach Eaton-Rosen, Carlo Luschi

Graphcore Research is exploring novel ways to train neural networks that could allow us to scale to substantially larger models in future.

In this paper, we revisit a simple approach to reduce the effective network dimensionality using random projections. We leverage the hardware-accelerated random number generation of the IPU to train in randomly selected directions of the weight space. Applying smaller independent random projections to different parts of the network and re-drawing them at every step significantly improves the obtained accuracy.
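
A small numpy sketch of the random-bases idea described above (simplified for illustration, not the paper's exact scheme): parameters are updated only along a freshly drawn low-dimensional random subspace at each step, with the subspace coordinates set by a gradient step.

```python
import numpy as np

def random_bases_descent(grad_fn, theta, dim_subspace=16, lr=0.1, steps=200, seed=0):
    """Update `theta` only along a low-dimensional random basis P that is
    re-drawn at every step: theta <- theta + P @ c, where c is a single
    gradient step taken inside the subspace."""
    rng = np.random.default_rng(seed)
    d = theta.size
    for _ in range(steps):
        # Fresh random basis with roughly unit-norm columns.
        P = rng.standard_normal((d, dim_subspace)) / np.sqrt(d)
        g = grad_fn(theta)               # full gradient (could also be estimated)
        c = -lr * (P.T @ g)              # gradient step in subspace coordinates
        theta = theta + P @ c            # map the low-dimensional step back
    return theta

# Toy objective: f(theta) = 0.5 * ||theta - 1||^2, so grad = theta - 1.
theta = np.zeros(1000)
theta = random_bases_descent(lambda t: t - 1.0, theta, dim_subspace=32, lr=1.0, steps=2000)
print("distance to optimum:", float(np.linalg.norm(theta - 1.0)))
```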

Graphcore & Ford: A Follow-The-Leader Strategy using Hierarchical Deep Neural Networks with Grouped Convolutions

José Solomon, François Charette

A follow-the-leader strategy can be implemented using a hierarchical Deep Neural Network (DNN) end-to-end driving model to match the direction and speed of a target pedestrian. Using a classifier DNN, pedestrian movements can be tracked to determine if the pedestrian is in the camera sensor's field of view. The autonomous vehicle's steering and throttle can then be adjusted by a regression DNN. These DNNs also incorporate grouped convolutions to boost model performance.

In this paper, Graphcore Research and Ford Motor Company leverage the fine-grained compute capabilities of the Graphcore IPU to minimise time-to-train for these Hierarchical Deep Neural Networks.

University of Bristol: Studying the potential of Graphcore IPUs for applications in Particle Physics

Lakshan Ram Madhan Mohan, Alexander Marshall, Samuel Maddrell-Mander, Daniel O'Hanlon, Konstantinos Petridis, Jonas Rademacker, Victoria Rege, Alexander Titterton

This paper presents the first study of Graphcore's Intelligence Processing Unit (IPU) in the context of particle physics applications.

Comparisons are made for neural-network-based event simulation, multiple-scattering correction, and flavour tagging, implemented on IPUs, GPUs and CPUs, using a variety of neural network architectures and hyperparameters. Additionally, a Kálmán filter for track reconstruction is implemented with promising results.

Imperial College London: Bundle Adjustment on a Graph Processor

Joseph Ortiz, Mark Pupilli, Stefan Leutenegger, Andrew J. Davison

This paper shows for the first time that the classical computer vision problem of bundle adjustment (BA) can be solved extremely fast on a graph processor such as Graphcore's Intelligence Processing Unit (IPU) using Gaussian Belief Propagation.

Gaussian Belief Propagation is an effective algorithmic framework for spatial AI problems where estimates are needed in real time with new measurements constantly being fed into the algorithm.

Qwant: Graphcore C2 Card performance for image-based deep learning application: A Report

Ilyes Kacher, Maxime Portaz, Hicham Randrianarivo, Sylvain Peyronnet

Graphcore's processor architecture has been designed to achieve state-of-the-art performance on current machine intelligence models for both training and inference.

In this paper, we report on a benchmark in which we have evaluated the performance of IPU processors on deep neural networks for inference. We focus on deep vision models such as ResNeXt. We report the observed latency, throughput and energy efficiency.

Citadel: Dissecting the Graphcore IPU Architecture via Microbenchmarking

Zhe Jia, Blake Tillman, Marco Maggioni, Daniele Paolo Scarpazza

This report focuses on the architecture and performance of the Intelligence Processing Unit (IPU), a novel, massively parallel platform introduced by Graphcore and aimed at Artificial Intelligence/Machine Learning (AI/ML) workloads.

The study dissects the IPU's performance behavior using microbenchmarks that were crafted for the purpose.

Graphcore Research: Revisiting Small Batch Training for Deep Neural Networks

Dominic Masters, Carlo Luschi

The team at Graphcore Research addresses mini-batch stochastic gradient optimization of modern deep network architectures.

In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental comparison of test performance for different mini-batch sizes. Our experiments show that small batch sizes produce the best results.