AI and Compute – Source openai.com
We’re releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5 month-doubling time (by comparison, Moore’s Law had an 18-month doubling period). Since 2012, this metric has grown by more than 300,000x (an 18-month doubling period would yield only a 12x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.
The chart shows the total amount of compute, in petaflop/s-days, that was used to train selected results that are relatively well known, used a lot of compute for their time, and gave enough information to estimate the compute used. A petaflop/s-day (pfs-day) consists of performing 1015 neural net operations per second for one day, or a total of about 1020 operations. The compute-time product serves as a mental convenience, similar to kW-hr for energy. We don’t measure peak theoretical FLOPS of the hardware but instead try to estimate the number of actual operations performed. We count adds and multiplies as separate operations, we count any add or multiply as a single operation regardless of numerical precision (making “FLOP” a slight misnomer), and we ignore ensemble models. Example calculations that went into this graph are provided in this appendix. Doubling time for line of best fit shown is 3.43 months.
Three factors drive the advance of AI: algorithmic innovation, data (which can be either supervised data or interactive environments), and the amount of compute available for training. Algorithmic innovation and data are difficult to track, but compute is unusually quantifiable, providing an opportunity to measure one input to AI progress. Of course, the use of massive compute sometimes just exposes the shortcomings of our current algorithms. But at least within many current domains, more compute seems to lead predictably to better performance, and is often complementary to algorithmic advances.
For this analysis, we believe the relevant number is not the speed of a single GPU, nor the capacity of the biggest datacenter, but the amount of compute that is used to train a single model — this is the number most likely to correlate to how powerful our best models are. Compute per model differs greatly from total bulk compute because limits on parallelism (both hardware and algorithmic) have constrained how big a model can be or how much it can be usefully trained. Of course, important breakthroughs are still made with modest amounts of compute — this analysis just covers compute capability.
The trend represents an increase by roughly a factor of 10 each year. It’s been partly driven by custom hardware that allows more operations to be performed per second for a given price (GPUs and TPUs), but it’s been primarily propelled by researchers repeatedly finding ways to use more chips in parallel and being willing to pay the economic cost of doing so.
Looking at the graph we can roughly see four distinct eras:
- Before 2012: It was uncommon to use GPUs for ML, making any of the results in the graph difficult to achieve.
- 2012 to 2014: Infrastructure to train on many GPUs was uncommon, so most results used 1-8 GPUs rated at 1-2 TFLOPS for a total of 0.001-0.1 pfs-days.
- 2014 to 2016: Large-scale results used 10-100 GPUs rated at 5-10 TFLOPS, resulting in 0.1-10 pfs-days. Diminishing returns on data parallelism meant that larger training runs had limited value.
- 2016 to 2017: Approaches that allow greater algorithmic parallelism such as huge batch sizes, architecture search, and expert iteration, along with specialized hardware such as TPU’s and faster interconnects, have greatly increased these limits, at least for some applications.
AlphaGoZero/AlphaZero is the most visible public example of massive algorithmic parallelism, but many other applications at this scale are now algorithmically possible, and may already be happening in a production context.
We see multiple reasons to believe that the trend in the graph could continue. Many hardware startups are developing AI-specific chips, some of which claim they will achieve a substantial increase in FLOPS/Watt (which is correlated to FLOPS/$) over the next 1-2 years. There may also be gains from simply reconfiguring hardware to do the same number of operations for less economic cost. On the parallelism side, many of the recent algorithmic innovations described above could in principle be combined multiplicatively — for example, architecture search and massively parallel SGD.
On the other hand, cost will eventually limit the parallelism side of the trend and physics will limit the chip efficiency side. We believe the largest training runs today employ hardware that cost in the single digit millions of dollars to purchase (although the amortized cost is much lower). But the majority of neural net compute today is still spent on inference (deployment), not training, meaning companies can repurpose or afford to purchase much larger fleets of chips for training. Therefore, if sufficient economic incentive exists, we could see even more massively parallel training runs, and thus the continuation of this trend for several more years. The world’s total hardware budget is 1 trillion dollars a year, so absolute limits remain far away. Overall, given the data above, the precedent for exponential trends in computing, work on ML specific hardware, and the economic incentives at play, we think it’d be a mistake to be confident this trend won’t continue in the short term.
Past trends are not sufficient to predict how long the trend will continue into the future, or what will happen while it continues. But even the reasonable potential for rapid increases in capabilities means it is critical to start addressing both safety and malicious use of AI today. Foresight is essential to responsible policymaking and responsible technological development, and we must get out ahead of these trends rather than belatedly reacting to them.
(If you’d like to help make sure that AI progress benefits all of humanity, join us at OpenAI. Our research and engineering roles range from machine learning researchers to policy researchers to infrastructure engineers.)
Two methodologies were used to generate these data points. When we had enough information, we directly counted the number of FLOPs (adds and multiplies) in the described architecture per training example and multiplied by the total number of forward and backward passes during training. When we didn’t have enough information to directly count FLOPs, we looked GPU training time and total number of GPUs used and assumed a utilization efficiency (usually 0.33). For the majority of the papers we were able to use the first method, but for a significant minority we relied on the second, and we computed both whenever possible as a consistency check. In the majority of cases we also confirmed with the authors. The calculations are not intended to be precise but we aim to be correct within a factor 2-3. We provide some example calculations below.
Example of Method 1: Counting operations in the model
This method is particularly easy to use when the authors give the number of operations used in a forward pass, as in the Resnet paper (the Resnet-151 model in particular):
(add-multiplies per forward pass) * (2 FLOPs/add-multiply) * (3 for forward and backward pass) * (number of examples in dataset) * (number of epochs) = (11.4 * 10^9) * 2 * 3 * (1.2 * 10^6 images) * 128 = 10,000 PF = 0.117 pfs-days
Operations can also be counted programmatically for a known model architecture in some deep learning frameworks, or we can simply count operations manually. If a paper gives enough information to make this calculation, it will be quite accurate, but in some cases papers don’t contain all the necessary information and authors aren’t able to reveal it publicly.
Example of Method 2: GPU Time
If we can’t count operations directly, we can instead look at how many GPUs were trained for how long, and use reasonable guesses at GPU utilization to try to estimate the number of operations performed. We emphasize that here we are not counting peak theoretical FLOPS, but using an assumed fraction of theoretical FLOPS to try to guess at actual FLOPS. We typically assume a 33% utilization for GPUs and a 17% utilization for CPU’s, based on our own experience, except where we have more specific information (e.g. we spoke to the author or the work was done at OpenAI).
As an example, in the AlexNet paper it’s stated that “our network takes between five and six days to train on two GTX 580 3GB GPUs”. Under our assumptions this implies a total compute of:
Number of GPUs * (peta-flops/GTX580) * days trained * estimated utilization = 2 * (1.58 * 10 ^ -3 PF) * 5.5 * 0.33 = 500 PF = 0.0058 pfs-days
This method is more approximate and can easily be off by a factor of 2 or occasionally more; our aim is only to estimate the order of magnitude. In practice when both methods are available they often line up quite well (for AlexNet we can also directly count the operations, which gives us 0.0054 pfs-days vs 0.0058 with the GPU time method).
1.2M images * 90 epochs * 0.75 GFLOPS * (2 add-multiply) * (3 backward pass) = 470 PF = 0.0054 pfs-days
Selected Additional Calculations
Method 2: 1 GPU * 4 days * 1.54 TFLOPS/GTX 580 * 0.33 utilization = 184 PF = 0.0021 pfs-days
Method 2: 1 GPU * 12 days * 1.54 TFLOPS/GTX 580 * 0.33 utilization = 532 PF = 0.0062 pfs-days
Method 1: Network is 84x84x3 input, 16, 8x8, stride 4, 32 4x4 stride 2, 256 fully connected First layer: 20*20*3*16*8*8 = 1.23M add-multiplies Second layer: 9*9*16*32*4*4 = 0.66M add-multiplies Third layer: 9*9*32*256 = 0.66M add-mutliplies Total ~ 2.55M add-multiplies 2.5 MFLOPs * 5M updates * 32 batch size * 2 multiply-add * 3 backward pass = 2.3 PF = 2.7e-5 pfs-days
Method 1: (348M + 304M) words * 0.380 GF * 2 add-multiply * 3 backprop * 7.5 epoch = 7,300 PF = 0.085 pfs-days
Method 2: 10 days * 8 GPU’s * 3.5 TFLOPS/ K20 GPU * 0.33 utilization = 8,100 PF = 0.093 pfs-days
Method 1: 1.2 M images * 74 epochs * 16 GFLOPS * 2 add-multiply * 3 backward pass = 8524 PF = 0.098 pfs-days
Method 2: 4 Titan Black GPU’s * 15 days * 5.1 TFLOPS/GPU * 0.33 utilization = 10,000 PF = 0.12 pfs-days
Method 1: 1 timestep = (1280 hidden units)^2 * (7 RNN layers * 4 matrices for bidirectional + 2 DNN layers) * (2 for doubling parameters from 36M to 72M) = 98 MFLOPs 20 epochs * 12,000 hours * 3600 seconds/hour * 50 samples/sec * 98 MFLOPs * 3 add-multiply * 2 backprop = 26,000 PF = 0.30 pfs-days
Method 2: 16 TitanX GPU’s * 5 days * 6 TFLOPS/GPU * 0.50 utilization = 21,000 PF = 0.25 pfs-days
Method 2: 60 K80 GPU’s * 30 days * 8.5 TFLOPS/GPU * 0.33 utilization = 4.5e5 PF = 5.0 pfs-days
Method 1: 50 epochs * 50,000 images * 10.0 GFLOPSs * 12800 networks * 2 add-multiply * 3 backward pass = 1.9e6 PF = 22 pfs-days
Method 2: 800 K40’s * 28 days * 4.2 TFLOPS/GPU * 0.33 utilization = 2.8e6 PF = 31 pfs-days Details given in a [later paper](https://arxiv.org/pdf/1707.07012.pdf).
Method 2: sqrt(10 * 100) factor added because production model used 2-3 orders of magnitude more data, but only 1 epoch rather than 10. 96 K80 GPU’s * 9 days * 8.5 TFLOPS * 0.33 utilization * sqrt(10 * 100) = 6.9e6 PF = 79 pfs-days
Appendix: Recent novel results that used modest amounts of compute
Massive compute is certainly not a requirement to produce important results. Many recent noteworthy results have used only modest amounts of compute. Here are some examples of results using modest compute that gave enough information to estimate their compute. We didn’t use multiple methods to estimate the compute for these models, and for upper bounds we made conservative estimates around any missing information, so they have more overall uncertainty. They aren’t material to our quantitative analysis, but we still think they are interesting and worth sharing:
Attention is all you need: 0.089 pfs-days (6/2017)
Adam Optimizer: less than 0.0007 pfs-days (12/2014)
Learning to Align and Translate: 0.018 pfs-days (09/2014)
GANs: less than 0.006 pfs-days (6/2014)
Word2Vec: less than 0.00045 pfs-days (10/2013)
Variational Auto Encoders: less than 0.0000055 pfs-days (12/2013)
The authors thank Katja Grace, Geoffrey Irving, Jack Clark, Thomas Anthony, and Michael Page for assistance with this post.