There’s a global competition to build the biggest, most powerful computers on the planet, and Meta (AKA Facebook) is about to jump into the melee with the “AI Research SuperCluster,” or RSC. Once fully operational, it may well sit among the top 10 fastest supercomputers in the world, and Meta will put it to work on the massive number crunching needed for language and computer vision modeling.
Large AI models, of which OpenAI’s GPT-3 is probably the best known, don’t get put together on laptops and desktops; they’re the final product of weeks and months of sustained calculations by high-performance computing systems that dwarf even the most cutting-edge gaming rig. And the faster you can complete the training process for a model, the faster you can test it and produce a new and better one. When training times are measured in months, that really matters.
RSC is up and running and the company’s researchers are already putting it to work… with user-generated data, it must be said, though Meta was careful to say that the data is encrypted until training time and the whole facility is isolated from the wider internet.
The team that put RSC together is rightly proud of having pulled this off almost entirely remotely — supercomputers are surprisingly physical constructions, with base considerations like heat, cabling and interconnect affecting performance and design. Exabytes of storage sound big enough digitally, but they actually need to exist somewhere too, on site and accessible at a microsecond’s notice. (Pure Storage is also proud of the setup they put together for this.)
RSC is currently 760 Nvidia DGX A100 systems with a total of 6,080 GPUs, which Meta claims should put it roughly in competition with Perlmutter at Lawrence Berkeley National Lab. That’s the fifth most powerful supercomputer in operation right now, according to longtime ranking site Top 500. (No. 1 is Fugaku in Japan by a long shot, in case you’re wondering.)
That could change as the company continues building out the system. Ultimately they plan for it to be about three times more powerful, which would in theory put it in the running for third place.
There’s arguably a caveat in there. Systems like second-place Summit at Oak Ridge National Laboratory are employed for research purposes, where precision is at a premium. If you’re simulating the molecules in a region of the Earth’s atmosphere at unprecedented detail levels, you need to take every calculation out to a whole lot of decimal points. And that means those calculations are more computationally expensive.
Meta explained that AI applications don’t require a similar degree of precision, since the results don’t hinge on that thousandth of a percent — inference operations end up producing things like “90% certainty this is a cat,” and whether that number is 89% or 91% doesn’t make a big difference. The difficulty is more about reaching 90% certainty for a million objects or phrases rather than a hundred.
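To make that concrete, here’s a toy sketch of the idea (everything in it is invented for illustration: the 4,096-dimensional “embedding,” the classifier weights and the 90% target have nothing to do with Meta’s actual systems). It computes the same “is it a cat?” score at progressively lower floating-point precision:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical image embedding and "cat" classifier weights, rescaled so the
# full-precision logit is 2.2 (sigmoid(2.2) is about 0.90, i.e. "90% certainty
# this is a cat").
x = rng.standard_normal(4096)
w = rng.standard_normal(4096)
w *= 2.2 / np.dot(x, w)

def cat_score(dtype):
    # Recompute the same score with the inputs rounded to a lower-precision float.
    logit = float(np.dot(x.astype(dtype), w.astype(dtype)))
    return 1.0 / (1.0 + np.exp(-logit))

for dtype in (np.float64, np.float32, np.float16):
    print(f"{np.dtype(dtype).name:>8}: {cat_score(dtype):.6f}")
```

The printed scores land within a whisker of each other, which is the point: the extra digits that double precision buys are wasted on this kind of answer, while the cheaper math lets you push far more examples through the hardware.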
It’s an oversimplification, but the result is that RSC, running in TensorFloat-32 math mode, can get more FLOP/s (floating point operations per second) per core than other, more precision-oriented systems. In this case it’s up to 1,895,000 teraFLOP/s, or 1.9 exaFLOP/s, more than 4x Fugaku’s. Does that matter? And if so, to whom? If anyone, it might matter to the Top 500 folks, so I’ve asked if they have any input on it. But it doesn’t change the fact that RSC will be among the fastest computers in the world, perhaps the fastest to be operated by a private company for its own purposes.
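For the curious, “TensorFloat-32 math mode” isn’t exotic to switch on. Here’s a minimal sketch of how it’s commonly enabled in PyTorch on an Ampere-class GPU like the A100 (the framework choice and the toy layer are my assumptions for illustration; Meta hasn’t detailed RSC’s software configuration):

```python
import torch

# A toy float32 workload; the point is only the TF32 flags below.
# Assumes a CUDA-capable Ampere GPU (e.g. an A100) is available.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

# These flags let float32 matmuls and cuDNN convolutions run on the TF32 tensor
# cores: same exponent range as float32, but a 10-bit mantissa instead of 23,
# in exchange for much higher throughput. (Defaults have shifted between PyTorch
# releases, so recent versions require opting in for matmuls.)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

with torch.no_grad():
    y = model(x)  # runs through TF32 tensor cores; y is still a float32 tensor
print(y.dtype)    # torch.float32
```

That trimmed mantissa is exactly the precision-for-throughput trade described above, and it’s where those eye-popping exaFLOP/s figures come from.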