So…You Want to Build a Data Center

Additional content provided by Tucker Beale, Sr. Analyst, Research.

As the capital expenditure (capex) race for compute continues, we thought it would be worthwhile to briefly outline the current state of the well-publicized data center buildout. To understand why so much capex is needed to support artificial intelligence (AI), we must first understand how data centers are built and operated.

Let’s start with the basics. What are data centers and why has the AI race sparked such demand for them? A data center is essentially a warehouse-sized computer. Just like your home computer, data centers need chips to carry out the computations that power our lives. These chips require sophisticated software, data storage, fast connections, reliable power, and no small amount of cooling to operate properly. The difference between our devices and these data centers is the scale and, in the case of the data centers powering AI, chip specialization.

The Almighty Chips (Semiconductors)

Most of the chips powering your devices are called central processing units, or CPUs. CPUs handle a wide range of computations and can be thought of as generalists powering most of the work that our computers do. While CPUs are great for carrying out a wide range of computations, occasionally the need arises for your computer to do a lot of very simple and very similar computations as quickly as possible. The traditional use case for these types of computations was graphics rendering for programs such as video games. The equation to change a pixel on a screen from one color to another may be very simple, but there are a lot of pixels that need to be updated, and they need to be updated many times per second. This is where graphics processing units, or GPUs, come in.

Think of CPUs as painting a picture with a paintbrush, whereas GPUs create the same image using 1,000 paintball guns all firing at the same time. As luck would have it, parallel processing of simple computations at lightning speed was exactly what was needed to power AI. GPUs are not the only option: tensor processing units, or TPUs, are application-specific integrated circuits (ASICs) that are more efficient than GPUs for some AI applications but must be tailored to a more limited set of use cases. Efficiency gains are incredibly important when delivering compute at scale, but GPUs remain the primary driver of AI.
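For readers who like to see the idea in code, the paintbrush and paintball analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (brighten, paintbrush, paintball_guns), the brightening rule, and the worker count are all hypothetical, and Python threads stand in for GPU cores purely to show the structure of the work. They do not deliver GPU-like speed.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(pixel, amount=10):
    """Apply one very simple rule to one (r, g, b) pixel, capping each channel at 255."""
    return tuple(min(255, channel + amount) for channel in pixel)


def paintbrush(pixels):
    """CPU-style: visit the pixels one after another."""
    return [brighten(p) for p in pixels]


def paintball_guns(pixels, workers=8):
    """GPU-style in spirit: hand the same simple rule to many workers at once.

    No pixel depends on any other pixel, which is what makes the work
    easy to split up. pool.map returns results in the original order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(brighten, pixels))


# A hypothetical "frame" of 1,000 pixels.
frame = [(x % 256, 100, 250) for x in range(1000)]

# Both approaches paint exactly the same picture.
assert paintbrush(frame) == paintball_guns(frame)
```

The point of the sketch is that the rule itself is trivial and every pixel can be handled independently, so adding more workers (or, on a real GPU, thousands of cores) lets more of the picture be painted at the same time.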

The continued dominance of GPUs is due in part to the proliferation of Nvidia’s CUDA parallel computing platform and software used in the development of AI models. The tight integration of the software development layer (CUDA) and hardware (GPUs) has created an AI platform with obvious switching costs as well as network economies: as the community of AI developers all “speaking the same language” grows larger, more data center customers prefer the platform with the largest, most productive developer ecosystem. Tradeoffs between efficiency and flexibility have led data center operators to consider a mix of general-purpose GPUs and task-specific TPUs in the compute stack. When building data centers, deciding which chips to fill them with matters, as continuous semiconductor innovation can drive obsolescence over shorter-than-expected timelines.