Frontier, at the Oak Ridge National Laboratory.
Credit: ORNL
What do fusion research, weather forecasting, epidemiology, and population dynamics have in common? They’re all fields of study loaded with problems so big that they’re delegated to supercomputers.
A supercomputer is a computer of such implausible size and power that it’s in a whole different league. During the Cold War era, supercomputing architecture made unprecedented leaps, driven by a surge in demand for greater computing power across science, industry, and the military. But since computing tech is continually advancing, the top tier is a moving target. For example, today’s iPhones regularly clock in the teraflop range, to which even the supercomputers of the late 90s could only aspire.
“Computer” originally referred to a human, i.e., “One who computes.” The term originates in astronomy, and it wasn’t unusual for astronomers and, later, scientists more generally to have one or more computers on staff. By the 1800s, human computers were employed in processing astronomical data and creating navigational tables for ships at sea. Women made increasing contributions to the field well into the 20th century, as documented in the 2016 film Hidden Figures about the female mathematicians who worked at NASA during the early days of the Space Race.
The Road So Far
The term “supercomputer” first shows up in the historical record in 1929 as two words: “Super computer.” While the ‘computer’ in question is absolutely a machine—a custom-made tabulator IBM had built for Columbia University—its performance is described as “equal to 100 mathematicians,” illustrating the profoundly human terms for conceptualizing machine computational performance that were in play at the time. (I shall refrain, despite the temptation, from describing future CPU ratings in terms of Mathematician Instructions per Second).
Credit: New York World
Early computers were typically custom-built installations tailored for specialized high-end scientific, industrial, and governmental applications. These one-of-a-kind machines were designed to address specific computational needs and were often situated at major research institutions or government facilities.
The first programmable, general-purpose digital computer, ENIAC (Electronic Numerical Integrator and Computer), was designed to calculate artillery firing tables. ENIAC was commissioned by the US Army in 1943 and first put to use by the Army’s Ballistic Research Lab in December 1945. It was transferred to the Aberdeen Proving Ground in ’47 (no easy task for a thing the size of a three-bedroom house), where it remained in continuous use until 1955.
ENIAC at Aberdeen Proving Ground, Maryland. Glen Beck (background) and Betty Snyder (foreground) program the ENIAC in building 328 at the Ballistic Research Laboratory.
Credit: US Army (Public Domain)
J. Presper Eckert and John Mauchly, the same duo who invented ENIAC, went on to design another major figure in computer history: UNIVAC. UNIVAC was a line of stored-program computers first used by the United States Census Bureau. The most famous UNIVAC machine was the UNIVAC I, whose understated heritage as a tool for assessing relationships between groups threw it into the political spotlight in 1952, when it (correctly) predicted Dwight Eisenhower’s landslide victory in that year’s presidential election. The US Army requested a UNIVAC from Congress before the year was out.
Front panel of the UNIVAC LARC supercomputer.
Credit: U.S. Army (Public Domain)
UNIVACs were massive mainframe computers, and the LARC (“Livermore Advanced Research Computer”) was specifically designed for the Lawrence Livermore National Laboratory to handle complex scientific computations, including hydrodynamic simulations for nuclear weapons design. LARC CPUs were capable of performing an addition operation in about 4 microseconds.
Computers like ENIAC and UNIVAC used vacuum tubes, but each vacuum tube was another point of failure in systems that were becoming more and more complicated. The tubes were prone to exploding during times of great thermal stress, especially when warming up and shutting down the machine. In the end, computer operators had to pre-emptively replace vacuum tubes on a maintenance schedule in order to avoid unplanned downtime. Meanwhile, however, wartime innovation had produced altogether new technologies, such as solid-state electronics, which offered an upgrade to vacuum tubes on every level.
Transistors were theorized as early as 1929, but the first working device was built at Bell Labs in 1947. That same year, a patent was granted for a new kind of non-volatile data storage, a specially laced fabric of ferromagnetic donuts strung on conductive wires, called magnetic core memory.
Magnetic Core Memory and the ‘LOL’ Method
Magnetic core memory (sometimes called “core memory” or just “core”) bounced around academia for a few years before it found its breakout role as the main memory bank for the originally classified Project Whirlwind: a vacuum-tube-powered computer designed jointly by MIT and the US Navy to drive a flight simulator for training bomber crews. Core was already in use by the time it was patented, and after years of legal wrangling, IBM paid MIT $13 million for the rights to MIT engineer Jay Forrester’s core memory patent in the largest patent settlement to date.
Close-up of a core memory module.
Credit: Konstantin Lanzet/Wikimedia Commons
Threading a wire through one of the donuts created a “one,” and bypassing a given donut created a “zero” at the corresponding place in the code. Running a current through a donut reversed its magnetic polarity, which was otherwise stable, even when the machine was unpowered.
All else being equal, it takes less energy to bit-flip a smaller donut. As you might expect, that created an incentive to make things smaller. As components shrank, cores shrank along with them, from 0.1″ in the 1950s to 0.013″ in 1966 (or, as we’d say in 2024, 330,200 nanometers). Engineers joked that just like donut holes, new magnetic cores were made from the holes punched out of the previous generation of cores.
Diagram of a 4×4 plane of magnetic core memory in an X/Y line coincident-current setup. X and Y are drive lines, S is sense, Z is inhibit. Arrows indicate the direction of current for writing.
Credit: Tetromino/Wikimedia Commons
Core memory was cheap enough that it ended up serving as main memory for computers well into the 1960s, and for some of the same reasons, it also figured into the Space Race. Guidance software for the Apollo 11 lunar landing was physically woven into a version of magnetic core memory called core rope memory.
Weaving the memory was delicate work, entrusted to the steady hands of women from the Navajo community of Shiprock, New Mexico. Women of color skilled in traditional weavercraft were such an important part of the process that it was sometimes called the “LOL method,” after lead Apollo flight software designer Margaret Hamilton’s affectionate shorthand for the “Little Old Ladies” who built it.
Margaret Hamilton, standing beside the code she wrote for the Apollo missions.
Credit: NASA
Magnetic core memory did have a few drawbacks. Perhaps the worst was that reading from core memory was destructive: the act of reading a bit erased it. Core memory was a mainstay of high-end computing installations through the 1960s, in machines including the IBM 704 mainframe, but like vacuum tubes, the technology didn’t scale well in comparison with solid-state tech like DRAM and transistors.
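To make the destructive-read problem concrete, here is a minimal toy model in Python. It is only a sketch: the `CorePlane` class and its method names are hypothetical, and the read-then-rewrite cycle it shows is the commonly described workaround for destructive reads, not something taken from a specific machine's documentation.

```python
class CorePlane:
    """Toy model of a plane of magnetic cores, one bit per core (hypothetical)."""

    def __init__(self, size):
        self.bits = [0] * size

    def write(self, addr, value):
        # Store a 1 or a 0 in the addressed core.
        self.bits[addr] = 1 if value else 0

    def destructive_read(self, addr):
        # Reading forces the core to 0; whatever was stored is sensed once
        # and then lost from the core itself.
        sensed = self.bits[addr]
        self.bits[addr] = 0
        return sensed

    def read(self, addr):
        # Read-then-rewrite: sense the bit, then immediately write it back
        # so the stored value survives the read.
        value = self.destructive_read(addr)
        self.write(addr, value)
        return value


plane = CorePlane(16)
plane.write(5, 1)
first = plane.destructive_read(5)   # senses the stored 1
second = plane.destructive_read(5)  # the 1 is gone; senses 0

plane.write(5, 1)
kept = (plane.read(5), plane.read(5))  # rewrite cycle preserves the bit
```

In this model, a bare destructive read returns the bit once and leaves a zero behind, while the read-with-rewrite path can be repeated safely at the cost of an extra write on every access.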
When Bell Labs announced it had a working transistor, companies like General Electric, Honeywell, and IBM quickly moved to include transistors in their products. Smaller, cheaper, and more reliable, transistors began to take over from vacuum tubes, from military radar to industrial control tech and research mainframes.
Big Blue
IBM was prototyping point-contact transistors within three years of their announcement, and introduced its first fully transistorized product—the IBM 608 calculator—in 1955. The company bought its transistors for the first few years before bringing transistor production in-house in 1956 and has been involved in leading-edge semiconductor research ever since.
IBM was in a position to take a leading role in computer manufacturing and sales because of its prior corporate identity as the Computing-Tabulating-Recording Company (CTR), the collective enterprise of several different companies that manufactured a wide array of devices for measurement, calibration, and calculating. (In 1924, CEO Thomas Watson Sr. decided he’d never liked the “clumsy” hyphenated name and changed it to International Business Machines.)
IBM has had an outsized impact on computing history in the US, but it has also received criticism for being a patent troll, due in part to its record of buying out smaller businesses to obtain patents and other IP. The company’s familiar 80-column punch card format came to dominate computing well into the mid-1900s. (It also caught the attention of the Nazis, who made “extensive use” of the punch cards and tabulating machines produced by IBM’s Hollerith facility.)
The nickname “Big Blue” referred to both IBM’s ponderous size and its informal dress code of a dark blue suit, a white shirt, and a “sincere” tie. Watson introduced “generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker.”
Thanks to its business heritage, IBM’s portfolio included scales, meat and cheese slicers, industrial time recorders, tabulators, and the punched cards used to store digital data for processing by computers of the day. During the late 40s, IBM began to branch out from electromechanical tabulating designs to computers with stored programs. In 1952, the company debuted its first commercially available scientific computers: the vacuum-tube-powered IBM 700 series, starting with the IBM 701.
The next year, IBM engineer John Backus proposed developing a more practical alternative to the assembly language used to program the IBM 704 mainframe. Thus was born the Formula Translating System, or FORTRAN for short. The IBM 709 was a vacuum-tube computer introduced in 1958 with a new operating system, SOS, and we are delighted to inform you that SOS supported a new type of compressed binary program file, called SQUOZE. Also of the era, though written at Bell Labs for the 709’s transistorized successor, the 7090: a string-manipulation language called SNOBOL.
Front panel of the IBM 709
Credit: Arnold Reinhold, CC BY-SA 3.0 via Wikimedia Commons
IBM introduced its first transistorized mainframe, the 7090 (originally a transistorized successor to the 709 called the 709-T), in 1959, and its first attempt at a supercomputer, the 7030, in 1961.
IBM 7030 (‘Stretch’)
The 7030 (lovingly nicknamed “St. Retch” by the Illiac II team, in a fit of academic snark) started out as IBM’s entry in the LARC contract competition. Back in 1956, Lawrence Livermore had approached both IBM and UNIVAC, requesting bids for what would become the LARC contract.
But shortly before IBM delivered their proposal, a team of project engineers voiced urgent concerns that the 7030 design as a whole was a “mistake.” Among other issues, the 7030’s timing had an interrupt problem, and its transistor technology was about to be made obsolete by diffusion transistors, which had superior performance.
Credit: Rama, CC BY-SA 3.0 FR, via Wikimedia Commons
IBM representatives returned to Livermore and withdrew from the contract, and instead proposed another, “dramatically better” system: “We are not going to build that machine for you; we want to build something better! We do not know precisely what it will take but we think it will be another million dollars and another year, and we do not know how fast it will run but we would like to shoot for ten million instructions per second.”
UNIVAC won the contract.
Faced with the prospect that the Los Alamos National Lab might also order a LARC from UNIVAC, IBM submitted a proposal to Los Alamos for a new and improved version of the design Livermore had rejected. The company forged ahead with the 7030, but with the Stretch, IBM’s reach ultimately exceeded its grasp. The machine missed deadlines and performance targets, and in 1961, then-CEO Thomas J. Watson Jr. cut IBM’s asking price for the Stretch by half and withdrew the machine from further sales.
Still, in 1962, the 7030 was the fastest computer in the world—and it held that title until 1964, when the CDC 6600 came online.
First Supercomputer: CDC 6600
The CDC 6600, introduced by the Control Data Corporation, usually gets the nod as the first supercomputer. Chiefly designed by Seymour Cray, the CDC 6600 boasted up to triple the throughput of the now-infamous 7030, and with more than 80 units sold, it was a commercial success, to boot.
Where other supercomputers featured a front panel loaded with buttons and switches, Seymour Cray preferred a minimalist front panel for the CDC 6600 (above).
Credit: Public Domain
Few individuals have had so great an influence on supercomputing as Cray. His father, Seymour Cray Sr., was a civil engineer who encouraged his son’s boyhood science experiments. Cray Jr. followed in his father’s footsteps but opted for electrical engineering; after serving as a radio operator in the Pacific theater during World War II, he took a job at ERA (Engineering Research Associates), the science division of the Remington Rand company that built UNIVAC.
Cray worked in ERA’s pure sciences division, as opposed to the business division that was UNIVAC, and you can see a kind of absolutism reflected in the computers he built—and where he chose to build them. Unsatisfied with ERA, he and coworker William Norris spun off into the Control Data Corporation, or CDC, where Cray designed the CDC 1604 and contributed to the CDC 3000 series, before his work on the iconic 6600.
“Silicon” as a shorthand for a processor has roots in the CDC 6600, where CDC switched away from germanium transistors to silicon to take advantage of silicon’s vastly higher switching speed. At its peak, the 6600 could execute more than two million instructions per second (2 MIPS). The first CDC 6600s were delivered in 1965 to the Lawrence Livermore and Los Alamos National Labs.
At launch, the CDC 6600 had a 60-bit processor that ran at 10MHz and up to 982KB of memory. Outside of innovations in material science, its blazing speed largely came from its use of pipelining: feeding data from one operation directly into the next like a relay race, instead of staging it in a memory bank from which it must then be retrieved.
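The payoff from pipelining can be sketched with the standard textbook timing model. The Python below is an idealized illustration, not a description of the 6600's actual hardware: the function names are hypothetical, the four-stage depth is an arbitrary assumption, and real pipelines lose cycles to stalls and dependencies that this model ignores.

```python
def unpipelined_cycles(n_ops, stages):
    # Each operation runs through every stage before the next one starts.
    return n_ops * stages


def pipelined_cycles(n_ops, stages):
    # The first operation takes `stages` cycles to fill the pipeline;
    # after that, one operation completes on every cycle.
    if n_ops == 0:
        return 0
    return stages + (n_ops - 1)


def speedup(n_ops, stages):
    # Ratio of unpipelined to pipelined time for the same work.
    return unpipelined_cycles(n_ops, stages) / pipelined_cycles(n_ops, stages)


# Hypothetical workload: 1,000 operations on an assumed 4-stage pipeline.
slow = unpipelined_cycles(1000, 4)   # 4000 cycles
fast = pipelined_cycles(1000, 4)     # 1003 cycles
gain = speedup(1000, 4)              # approaches 4x as the workload grows
```

Under these assumptions, the speedup approaches the number of stages for long runs of independent operations, which is why keeping the pipeline full mattered so much.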
It is an understatement of British proportion to say that it rustled some jimmies at IBM when the 6600 stole the supercomputing crown. Watson wrote a frustrated memo to his employees on Aug. 28, 1963:
Last week, Control Data […] announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers […] Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world’s most powerful computer.
Cray’s brief reply: “It seems like Mr. Watson has answered his own question.”
Cray-1
After the CDC 6600, Cray went on to design the 7600 (a resounding success) and 8600 (an expensive, unreliable, overheating failure). But he again grew dissatisfied, this time with what he perceived as interference from CDC management, its allocation of funds during the development of the CDC STAR-100, and the company’s mundane focus on “business and commercial” data processing.
In 1972, he more-or-less amicably split from CDC and spun off again into his own startup/fiefdom, Cray Research, with funding including $250K from Norris. Starting with the 6600’s smash success, Cray designs dominated the supercomputer market so thoroughly until the 80s that the period is sometimes known as the “Cray era.” These machines filled the ranks of the world’s fastest computers, significantly advancing fields such as weather forecasting and molecular dynamics. Meanwhile, IBM struggled to build a profitable supercomputer, which meant that Cray Research and the Control Data Corporation were each other’s chief competitors.
As a designer, Seymour Cray maintained a singular focus on the “iron triangle” of supercomputing: latency, power, and heat. He relentlessly miniaturized components and shortened wires to decrease speed-of-light delays, using his ever-present quad pad to plan out integrated circuits much smaller than those built by CDC’s competitors. However, with silicon’s improved clock speed came new challenges. The shrinking circuits were increasingly vulnerable to thermal buildup—a problem that has never quite gone away.
The Cray-1, introduced in 1976, relied on vector processing, significantly enhancing computational speed and efficiency compared with the CDC 6600’s scalar processing design. While it wasn’t the first computer that implemented vector processing, it’s generally viewed as the first computer to do so successfully. It also used a new, high-speed type of circuit called emitter-coupled logic.
At its peak, Cray-1 hit 160 megaflops (mega = million; flops = FLoating point (i.e. decimal) Operations Per Second). Modern CPUs from Intel, AMD, and ARM all implement vector processing via specialized instruction extensions, and the Cray-1 can be seen as the great-great-grandfather for this type of feature.
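The difference between scalar and vector processing can be sketched by counting how many instructions each style has to issue for the same work. This is a toy Python model, not Cray code: the function names are hypothetical, and the 64-element vector length is an assumption chosen for illustration.

```python
VECTOR_LENGTH = 64  # assumed elements handled per vector instruction


def scalar_add(a, b):
    # Scalar style: one add instruction issued per pair of elements.
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def vector_add(a, b, vl=VECTOR_LENGTH):
    # Vector style: one instruction operates on a whole chunk of elements.
    out, issued = [], 0
    for start in range(0, len(a), vl):
        chunk_a = a[start:start + vl]
        chunk_b = b[start:start + vl]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        issued += 1
    return out, issued


a = list(range(1000))
b = list(range(1000, 2000))
scalar_result, scalar_issued = scalar_add(a, b)   # 1000 instructions
vector_result, vector_issued = vector_add(a, b)   # 16 instructions
```

Both functions produce the same sums; the vector version simply amortizes instruction fetch and decode over many elements, which is the core idea behind the Cray-1's design and today's SIMD extensions.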
Seymour Cray, peeking out from behind a Cray-1 supercomputer.
Credit: Michael Hicks, CC BY 2.0, via Wikimedia Commons
Like the CDC 6600, the Cray-1 relied on an unusual physical geometry. Cray himself referred to the system as “the world’s most expensive love seat” due to its 270-degree circular shape. This design both allowed for necessary cable and wire routing and disguised some of the power supplies and piping for the Freon cooling system.
A 1978 analysis of the Cray-1 notes that “With a full-size memory, this compact package weights over five tons and consumes about 150,000 watts of electricity,” which gives you some idea how far our collective idea of a “compact” computer installation has come. There’s some ambiguity on Cray-1 power consumption; 114kW is also frequently listed.
Cray-2
Its successor, the Cray-2 introduced in 1985, also became an important and influential supercomputer, but its legacy isn’t as straightforward as the Cray-1’s. The Cray-2 was the first supercomputer Seymour Cray built that successfully used multiple processors, and it captured the public imagination through its innovative Fluorinert cooling system. But it wasn’t an easy machine to build. The initial system design proved unworkable. Cray (the engineer) had to invent a new method of stacking circuit boards together, with each board connected to the one on top of it via pogo pins.
Pogo pins in the Cray-2 logic module.
Credit: Alan Killian (Public Domain)
This ultra-tight packing solved the integrated circuit density issue that had bedeviled the system. Still, there was no way to conventionally cool the chips if they were packed so closely together. This is where the Fluorinert comes into the picture.
The Cray-2, which consumed an easy 150-200 kW, ended up nicknamed “Bubbles” for the tanks of bubbling liquid Fluorinert required for its “waterfall” cooling system. (Cue plastic fish in the heat exchange tanks, cardboard Loch Ness Monsters taped to the side, etc.) Powerful though it was, the cooling system still had one critical weakness. Researchers at Livermore discovered in the 90s that a small but dangerous amount of the Fluorinert would, when exposed to the high-heat marathon of supercomputer number-crunching, degrade into perfluoroisobutylene (PFIB): a gas so toxic it’s scheduled and regulated by the Chemical Weapons Convention. Simple filters weren’t enough to neutralize the hazard; instead, inline catalytic scrubbers were installed to eliminate the gas entirely.
The Cray-2’s performance versus the Cray-1 depended heavily on the kind of code you were running. Vectorized code ran substantially faster compared with Cray-1, while the scalar performance improvement was much smaller. An archived comparison between the Cray-2 and SX-2, presented by Tor Bloch at the 1st International Conference on Computing in High-Energy and Nuclear Physics (CHEP) in 1985, explains the relatively small scalar performance improvement thusly:
This is because the basic speed-up of the clock period (4.1ns instead of 12.5ns) is obtained only partly (factor of 1.5) by a real speed up due to size and circuits. The rest is gained by having fewer gates per clock period. This penalizes scalar execution speed because more of each clock period ends up being used for the latching of intermediate results in the pipeline.
Translation: The clock speed improvement (from 80MHz to roughly 244MHz) was partially offset by the Cray-2 performing less work per clock cycle than the Cray-1.
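The arithmetic behind that translation can be checked in a few lines of Python. This sketch uses only the clock periods and the 1.5x circuit speed-up quoted above; the variable and function names are made up for illustration, and attributing the remainder entirely to fewer gates per clock period follows the quoted explanation.

```python
def period_ns_to_mhz(period_ns):
    # Frequency in MHz is 1000 divided by the clock period in nanoseconds.
    return 1000.0 / period_ns


cray1_mhz = period_ns_to_mhz(12.5)   # 80.0 MHz
cray2_mhz = period_ns_to_mhz(4.1)    # a little under 244 MHz

# Overall clock speed-up, from the two periods quoted above.
clock_ratio = 12.5 / 4.1             # roughly 3.05x

# The quoted analysis credits a factor of 1.5 to faster circuits and
# smaller size; whatever is left comes from doing less per clock period.
circuit_gain = 1.5
gate_reduction_gain = clock_ratio / circuit_gain   # roughly 2.03x
```

In other words, about two-thirds of the headline clock gain came from shortening each pipeline step rather than from faster hardware, which is why scalar code saw so much less benefit than vector code.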
The Fluorinert-filled heat exchange fountain of a Cray-2.
Credit: Marcin Wichary (CC-BY 2.0 via Wikimedia Commons)
A NASA performance comparison of the Cray-2 versus the rival Cray X-MP (also built by Cray Research, but based more directly on the Cray-1 and designed by Steven Chen) found that the Cray-2 was typically not as fast as the Cray X-MP, but that the particulars of the comparison varied depending on the test. Tests that were specifically optimized for the Cray-2 showed much higher performance and it beat the Cray X-MP in these scenarios.
The NASA report also highlights the Cray-2’s massive memory pool. The Cray-2 shipped with 64 to 512 “megawords” of memory, where a word is 64 bits. Converted into modern parlance, that’s between 512MB and 4GB of memory. That much memory dwarfed other supercomputers of its day, as NASA notes. “It should be re-emphasized that the principal advantage of the Cray-2 is its very large memory, which allows jobs that previously could only be run using massive disk or solid state I/O to now run in main memory. This is a MAJOR advantage, and it should not be allowed to be overshadowed by the slightly lower performance of the Cray-2 on some codes.” (emphasis original)
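The megaword conversion is easy to verify. The short Python sketch below assumes binary prefixes (one "mega" = 2^20), which is what makes the round 512MB and 4GB figures fall out; the function name is hypothetical.

```python
BITS_PER_WORD = 64
BYTES_PER_WORD = BITS_PER_WORD // 8   # 8 bytes per 64-bit word
MEGA = 2 ** 20                        # assumed binary "mega"
GIGA = 2 ** 30


def megawords_to_bytes(megawords):
    # Total bytes = number of words times bytes per word.
    return megawords * MEGA * BYTES_PER_WORD


smallest = megawords_to_bytes(64)     # 512 * 2**20 bytes, i.e. 512MB
largest = megawords_to_bytes(512)     # 4 * 2**30 bytes, i.e. 4GB

smallest_mb = smallest // MEGA        # 512
largest_gb = largest // GIGA          # 4
```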
Attack of the Killer Micros: The Parallel Era
Optimizations for latency and heat dissipation are, and always will be, mandatory, but they’re still subject to diminishing returns. If the “Cray era” of computing was about min-maxing a single core, the parallel era began when supercomputer builders started including multiple processing units in a supercomputer—and it hit its stride with multiple computing cores within each processor.
Early supercomputers were characterized by a relatively low number of cores and a single, shared memory space. Later builds saw a shift towards “multicomputers,” defined as a network of independent computers that communicate over a common interface and form a single system.
The other major characteristic of this in-between era is the so-called “Attack of the Killer Micros.” The title echoes both a talk given by Eugene Brooks at Lawrence Livermore Labs in 1990 and a 1991 New York Times article dealing with the same topic. The article forecasts the eventual death of specialized supercomputing-specific architectures at the hands of the so-called “killer microprocessors” of the consumer market. While such chips were still well behind the supercomputers of their day, they were improving far more quickly than the specialized architectures they would eventually all but replace.
Intel’s own supercomputing efforts in the 1990s exemplify this transition. The company’s first productized supercomputers, such as the Intel Paragon, were based on the Intel i860 microprocessor. The i860 was the world’s first million-transistor processor, and it was based on a RISC VLIW (Very Long Instruction Word) architecture. The Paragon combined 2,048 of these processors initially, with the project later expanded to 4,096 chips.
An Intel Paragon supercomputer in its cabinet, on display at the Computer History Museum
Credit: Carlo Nardone/Computer History Museum
The thing about multicomputers, though, is that scaling up from a handful of processors sharing a common memory space to arrays of system cabinets introduced its own cooling challenges.
Cluster Computing and the Rise of Beowulf
According to thermodynamics, cooling is automatic, but effective heat management isn’t free. This was already a problem during the Cray era; supercomputers of the day consumed well over 100 kilowatts, and they required liquid cooling systems that demanded frequent babysitting. The cooling problem only escalated as companies put more and more processors in a single machine, and still the need for computing power grew.
Researchers are usually strapped for resources. At the same time, streamline though we may, it seems like the workload only gets bigger while the budget lines stay (at best) the same. Seeing the writing on the proverbial wall, scientists tried a very different approach. Supercomputers already distribute a problem across many processors. So, what if you could distribute the cooling, too? Instead of the traditional, monolithic computer mainframe, researchers networked “clusters” of hundreds or thousands of inexpensive, off-the-shelf devices called “nodes.”
Structure of a Beowulf cluster
Credit: Public Domain
It didn’t take long for high-end computing enthusiasts with modest budgets to adapt the concept of a supercomputing node to low-cost networks of commodity hardware. These sprawling systems came to be known as Beowulf clusters. In the original Beowulf Cluster “how-to,” Jacek Radajewski and Douglas Eadline explain:
In most cases, client nodes in a Beowulf system are dumb, the dumber the better. Nodes are configured and controlled by the server node, and do only what they are told to do. In a disk-less client configuration, a client node doesn’t even know its IP address or name until the server tells it.
Beowulf clusters use commodity hardware and software, so they’re cheap, simple, and trivially reproducible. That made a Beowulf cluster the chosen structure for a 2003 University of Illinois Urbana-Champaign project that made national news by assembling several dozen PlayStation 2 units into a teraflop-scale supercomputer. A May 2003 New York Times article by John Markoff remarked, “Perhaps the most striking aspect of the project, which uses the open source Linux operating system, is that the only hardware engineering involved was placing 70 of the individual game machines in a rack and plugging them together with a high-speed Hewlett-Packard network switch.”
“It took a lot of time because you have to cut all of these things out of the plastic packaging,” senior research scientist Craig Steffen told the paper.
Even as systems like Paragon were proving the potential of interconnected compute nodes at the top of the stack, the invention of Beowulf clusters made high-performance computing more affordable than ever. It foreshadowed the deployment of top-performing x86 machines a few years later.
ASCI Red: 1 TFLOPS
If Beowulf clusters showed what commodity x86 hardware could do for research organizations, the ASCI Red supercomputer, built under the Accelerated Strategic Computing Initiative (ASCI) and brought online in early 1997, demonstrated that the “attack of the killer micros” era had truly arrived. ASCI Red was based on Intel’s Paragon design but utilized x86 Pentium Pro and later Pentium II OverDrive processors. It was the first supercomputer capable of reaching more than 1 TFLOPS in the Linpack benchmark and was designed to provide much higher disk I/O than previous supercomputers had typically allowed.
All four rows of the fully operational ASCI Red supercomputer (end view) inside Sandia National Laboratories.
Credit: Sandia National Laboratories
This is not to imply that supercomputing converted to x86 overnight. There are still a handful of POWER-based supercomputers in the world (six in the TOP500), and x86 isn’t the only commodity ISA represented: Fujitsu’s A64FX architecture, which powers the Fugaku supercomputer, is derived from the ARM ISA. Some POWER-based supercomputers, like the IBM Blue Gene series, would go on to break power and efficiency records as each generation of improvements debuted.
But the arrival of ASCI Red marks the moment when x86 demonstrated it could win the entire performance stack—from the commodity PCs where it initially debuted to the dizzying heights of computing Intel could have only dreamed about when Seymour Cray unveiled the Cray-1 in 1975.
From the 1990s through to the early 2010s, supercomputing focused on questions of ramping up system density at every level. The number of cores per node grew, as did the number of nodes per cabinet and the number of cabinets per total supercomputer installation. Blue Gene/L began with a system of 131,072 CPUs; the eventual Blue Gene/Q-based system, named Sequoia, would reach up to 96 racks and 1.6 million processor cores.
High-Performance Computing in the 21st Century
The concept of a historical century doesn’t always map neatly to a period of exactly 100 years. Historians have argued, for example, that the 20th century truly begins with the outbreak of World War I in 1914, with the period 1900-1913 serving more as an extended good-bye to the worldviews, values, and alignments of the Great Powers that characterized Victorian-era England. One could similarly argue that supercomputing in the 21st century has been defined more by trends that kicked off around 2009-2010 as opposed to anything happening in 2000.
As always, the boundaries between eras can be a bit fuzzy. What defines the modern supercomputing market is the addition of GPUs to what had previously been a very CPU-centric set of tasks.
The idea of using a GPU-like architecture for supercomputing isn’t new; the IBM Cell microprocessor used in the Sony PlayStation 3 and the closely related IBM PowerXCell 8i have design characteristics that echo the programmable GPU architectures that would arrive a few years down the road. The 2009 iteration of China’s Tianhe-1 combined a mixture of Xeon E5540 and Xeon E5450 processors alongside 2,560 ATI Radeon HD 4870X2 dual-GPU graphics processors. Tianhe-1A, which debuted roughly a year later, swapped the 4870X2 cards for 7,168 Nvidia Tesla M2050 graphics processors.
I don’t want to give the impression that we’ve ever stopped stacking processor cores up like cordwood. For example, IBM’s Watson (2011) employed 90 Power 750 servers built around its POWER7 processors. Supercomputers like the modern Frontier can deploy CPUs with high core count densities (64-core Epyc 7713s, in Frontier’s case) to hit very high core counts (606,208) in a relatively low number of sockets (just 9,472).
There are still CPU-only supercomputers on the TOP500. But the adoption of GPUs has both boosted supercomputer performance and allowed for an overall increase in system efficiency when considered across the entire TOP500. That’s according to an analysis of efficiency published on arXiv earlier this year. In the graph below, heterogeneous systems are supercomputers that include GPUs while homogeneous systems are CPU-only.
Figure 5: Maximum Energy efficiency growth Green500 systems by date distinguished by architecture type (homogeneous vs heterogeneous)
Credit: Benhari, Denneulin, Desprez et al
The authors write: “This figure (above) definitely shows that homogeneous systems are less efficient than heterogeneous ones by an order of magnitude and that the gap between the two categories is increasing.”
The trend hasn’t been altogether positive, as the paper also notes that an observed decrease in supercomputer efficiency over the past decade may partly reflect the greater difficulty in extracting maximum performance from complex architectures that blend both CPUs and GPUs together. Nevertheless, the net impact of GPUs on supercomputing processing and efficiency has been large enough to call their adoption an era in its own right.
As for the future of supercomputing, that’s difficult to predict. The history of supercomputing is the history of finding new ways to reduce latencies and power consumption, to pack more processing cores and compute elements more tightly together, and to keep the system cool enough to operate. Future supercomputers will undoubtedly innovate in these areas. But it’s also fair to ask what the absolute limit is regarding large-scale computer installations. Any effort to reach zettascale computing could require 100MW or more of power.
The current leading supercomputer, the AMD-powered El Capitan, offers 1.74 exaFLOPS and consumes 35MW of electricity at peak operation. Zettascale computing in 100MW assumes tremendous improvements in energy efficiency relative to any system available today. Supercomputer industry leaders have expressed concern about the industry’s ability to continue scaling up to even 10 exaFLOPS unless new methods of boosting growth are found, a problem that is readily visible if you look at recent performance scaling in the TOP500. Still, the sky’s the limit. Before Livermore’s El Capitan came online, it was projected to hit 2 exaFLOPS.
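To put a number on "tremendous improvements," here is a back-of-the-envelope Python sketch using only the figures quoted above (1.74 exaFLOPS at 35MW for El Capitan, and a hypothetical zettaFLOPS machine in 100MW). The function name is made up, and the result is a rough ratio, not a forecast.

```python
def gflops_per_watt(flops, watts):
    # Energy efficiency expressed in billions of operations per second per watt.
    return flops / watts / 1e9


# Figures quoted in the text above.
el_capitan = gflops_per_watt(1.74e18, 35e6)     # roughly 50 GFLOPS per watt

# Hypothetical zettascale system: 10**21 FLOPS in a 100MW power budget.
zetta_target = gflops_per_watt(1e21, 100e6)     # 10,000 GFLOPS per watt

# How much more efficient the zettascale machine would need to be.
required_gain = zetta_target / el_capitan       # roughly 200x
```

By this estimate, a 100MW zettascale machine would need to be about two hundred times more energy-efficient than today's leader, which is why efficiency rather than raw speed dominates the conversation about what comes next.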
So what’s your favorite supercomputer? Are you a Cray loyalist, or a cluster evangelist? Or have we just sold our souls to Big Blue? (We’re certainly spoiled for options.) We hope you’ve enjoyed this look at the history of supercomputing and how it’s changed from the days of “100 mathematicians” to the modern, heterogeneous era. Here’s to zettascale and beyond.