How Many Calculations Do Graphics Cards Perform?
00:00:00 Ultra-realistic video games demand graphics cards that perform about 36 trillion calculations per second, far surpassing the computational needs of older titles. To visualize that power, picture every person on Earth doing one long multiplication each second: matching a single card would take roughly 4,400 Earths working in unison. The sophisticated design and architecture of modern GPUs let them process vast amounts of data efficiently, a capability that not only produces lifelike gaming graphics but also powers technologies like Bitcoin mining, neural networks, and artificial intelligence.
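As a rough check on that analogy (the world-population figure of about 8.1 billion is an assumption, not stated above):

$$\frac{36 \times 10^{12}\ \text{calculations/s}}{8.1 \times 10^{9}\ \text{people}} \approx 4{,}400\ \text{calculations per person per second},$$

so matching one card would take roughly 4,400 Earths' worth of people, each performing a single calculation every second.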
The Difference between GPUs and CPUs
00:02:15 A graphics card's GPU packs more than 10,000 cores that churn through massive volumes of simple arithmetic, while the CPU on the motherboard has only about 24 cores that execute complex tasks at high speed. The GPU resembles a colossal cargo ship, methodically moving enormous amounts of data, whereas the CPU behaves like a speedy jumbo jet: nimble, versatile, and fast. The distinction is that GPUs excel at performing huge numbers of simple calculations concurrently, whereas CPUs are designed for flexible, rapid processing across a wide variety of applications.
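A minimal CUDA C++ sketch of that contrast, assuming a simple element-wise addition task (the array size and launch configuration are illustrative, not from the video): the CPU walks the data one element at a time on a single fast core, while the GPU assigns one lightweight thread per element and lets thousands of cores run them at once.

```cuda
#include <cuda_runtime.h>
#include <vector>

// CPU version: one core, one element at a time.
void add_cpu(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// GPU version: the same simple arithmetic, spread across many threads.
__global__ void add_gpu(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // each thread picks one element
    if (i < n) out[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                           // about one million elements
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hout(n);
    add_cpu(ha.data(), hb.data(), hout.data(), n);   // sequential baseline

    float *da, *db, *dout;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dout, n * sizeof(float));
    cudaMemcpy(da, ha.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    add_gpu<<<(n + 255) / 256, 256>>>(da, db, dout, n); // thousands of threads in flight
    cudaMemcpy(hout.data(), dout, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```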
GPU GA102 Architecture
00:04:56 A state-of-the-art GPU is built around a GA102 chip, mounted on a printed circuit board, that contains 28.3 billion transistors, most of them devoted to processing cores. The chip is divided into 7 graphics processing clusters, each holding 12 streaming multiprocessors; every multiprocessor in turn splits into 4 processing blocks, each with 32 CUDA cores and a dedicated Tensor core, and carries one ray tracing core. In total the GPU integrates 10,752 CUDA cores for basic arithmetic operations, 336 Tensor cores for matrix computations and AI, and 84 ray tracing cores for realistic light simulation, efficiently balancing performance across gaming, computational, and rendering workloads.
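The quoted totals follow directly from that hierarchy:

$$7\ \text{GPCs} \times 12\ \text{SMs} \times 4 \times 32 = 10{,}752\ \text{CUDA cores},\qquad 84\ \text{SMs} \times 4 = 336\ \text{Tensor cores},\qquad 84\ \text{SMs} \times 1 = 84\ \text{ray tracing cores}.$$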
GPU GA102 Manufacturing
00:06:59 The same GA102 chip design is shared by several graphics cards, such as the 3080, 3090, 3080 Ti, and 3090 Ti, despite their differences in price and release timing. Manufacturing imperfections like patterning errors or dust create localized defects, which are isolated by deactivating only the affected streaming multiprocessor. Chips are then sorted by the number of functional CUDA cores, with the flawless chips going to the highest-end models and the remaining differences coming down to clock speed and memory configuration.
CUDA Core Design
00:08:48 A single CUDA core is a compact calculator built from about 410,000 transistors, and its main job is the fused multiply-add operation that dominates graphics workloads. It supports 32-bit floating-point and integer arithmetic and also handles functions such as bit shifting, masking, instruction queuing, and output accumulation. Each core completes one fused multiply-add per clock cycle, which is how thousands of cores together deliver trillions of calculations per second. Separate special function units handle more complex operations like division, square roots, and trigonometric functions, while additional chip components coordinate memory control and data flow.
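A minimal sketch of that core operation using CUDA's fused multiply-add intrinsic (the data and launch size are illustrative): each thread computes out = a * b + c in a single instruction with one rounding step, the same work a CUDA core completes every clock cycle.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void fmaKernel(const float* a, const float* b, const float* c,
                          float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaf(a[i], b[i], c[i]);   // single-rounding multiply-add
}

int main() {
    const int n = 4;
    float ha[n] = {1, 2, 3, 4}, hb[n] = {5, 6, 7, 8};
    float hc[n] = {0.5f, 0.5f, 0.5f, 0.5f}, hout[n];
    float *da, *db, *dc, *dout;
    cudaMalloc(&da, sizeof(ha)); cudaMalloc(&db, sizeof(hb));
    cudaMalloc(&dc, sizeof(hc)); cudaMalloc(&dout, sizeof(hout));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);
    cudaMemcpy(dc, hc, sizeof(hc), cudaMemcpyHostToDevice);

    fmaKernel<<<1, 32>>>(da, db, dc, dout, n);    // one warp is plenty here
    cudaMemcpy(hout, dout, sizeof(hout), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%.1f\n", hout[i]);  // 5.5 12.5 21.5 32.5

    cudaFree(da); cudaFree(db); cudaFree(dc); cudaFree(dout);
    return 0;
}
```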
Graphics Cards Components
00:11:09 Graphics cards integrate the essential ports for displays, power, and motherboard connectivity into a compact design. A voltage regulator module on the PCB converts the 12-volt supply into a precise 1.1 volts, delivering hundreds of watts to the GPU. That intense power flow generates significant heat, which is removed by a robust heatsink whose four heat pipes carry thermal energy from the GPU and memory chips out to the radiator fins.
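To see why the regulator and heatsink matter, consider an assumed 400-watt GPU load (the wattage is an illustrative assumption, not a figure from this section). Delivering that power at 1.1 volts means

$$I = \frac{P}{V} = \frac{400\ \text{W}}{1.1\ \text{V}} \approx 360\ \text{A},$$

hundreds of amperes flowing across the board, which is why the heat pipes and radiator fins are essential.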
Graphics Memory GDDR6X GDDR7
00:12:04 Modern graphics systems rely on rapid memory transfers that move 3D game environments from storage into 24 gigabytes of specialized graphics memory, so the GPU can keep performing trillions of calculations per second without starving for data. Multiple memory chips work simultaneously, like cranes loading cargo, delivering a combined bandwidth of over a terabyte per second that far exceeds typical CPU main memory. Data is continuously shuttled between these chips and the GPU's small on-chip cache to render high-definition scenes without delay. Advances in encoding, which pack data into multiple voltage levels and are moving from the PAM-4 scheme of GDDR6X to the more efficient PAM-3 scheme of next-generation GDDR7, further boost data transfer rates.
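A hedged note on the signalling (general encoding arithmetic, not figures from this section): PAM-4 uses four voltage levels and PAM-3 uses three, so per symbol they carry

$$\text{PAM-4}: \log_2 4 = 2\ \text{bits}, \qquad \text{PAM-3}: \log_2 3 \approx 1.58\ \text{bits (in practice 3 bits per pair of symbols)}.$$

PAM-3 carries slightly fewer bits per symbol, but its wider voltage margins are generally reported to allow faster, cleaner, lower-power signalling, which is where the net transfer-rate gain in GDDR7 comes from.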
All about Micron
00:15:11 Micron pushes the limits of data transfer and chip design with advanced high bandwidth memory. Its HBM technology stacks DRAM chips connected by through-silicon vias to form compact cubes of AI memory. The latest HBM3E delivers 24 to 36 gigabytes per cube, surrounding AI chips with a total of 192 gigabytes of high-speed memory while using 30% less power than competing systems.
Single Instruction Multiple Data Architecture
00:16:51 GPUs use a single instruction, multiple data (SIMD) approach to efficiently handle problems that divide cleanly into parallel tasks, such as video game rendering and Bitcoin mining. These are so-called embarrassingly parallel problems, ones that split into independent pieces with minimal effort. The same instructions run concurrently across thousands or even millions of data points, which is what delivers the enormous performance gains.
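A minimal sketch of an embarrassingly parallel task under this model (the image size and gain are illustrative choices): every thread runs the same brightness instruction on a different pixel, and no pixel depends on any other.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per pixel; the identical instruction stream runs across millions of
// independent data points at once.
__global__ void brighten(unsigned char* pixels, int n, float gain) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = pixels[i] * gain;                        // same arithmetic, different data
        pixels[i] = v > 255.0f ? 255 : (unsigned char)v;   // clamp to the valid range
    }
}

int main() {
    const int n = 1920 * 1080;                             // one HD frame of grayscale pixels
    std::vector<unsigned char> img(n, 100);

    unsigned char* d_img;
    cudaMalloc(&d_img, n);
    cudaMemcpy(d_img, img.data(), n, cudaMemcpyHostToDevice);
    brighten<<<(n + 255) / 256, 256>>>(d_img, n, 1.5f);    // roughly two million threads
    cudaMemcpy(img.data(), d_img, n, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    return 0;
}
```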
Why GPUs run Video Game Graphics, Object Transformations
00:17:49 GPUs build 3D worlds by converting vertices from each model's local space into a shared world space using parallel SIMD operations. A cowboy hat composed of 14,000 vertices, for example, is repositioned by adding the hat's world-space coordinates to every vertex concurrently. Because each addition is independent, the calculation scales across millions of vertices and thousands of objects, so every object ends up correctly positioned relative to the camera. The process demonstrates how GPUs harness massive parallelism to perform tens of millions of additions for every frame of real-time graphics.
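A minimal sketch of that model-to-world step (the vertex count matches the hat example; the struct and launch parameters are illustrative): each thread adds the object's world-space position to one vertex, and no thread waits on any other.

```cuda
#include <cuda_runtime.h>

struct Vec3 { float x, y, z; };

// Move 14,000 hat vertices from local model space into world space in parallel:
// world = local + object position, one vertex per thread.
__global__ void modelToWorld(const Vec3* local, Vec3* world, Vec3 objectPos, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        world[i].x = local[i].x + objectPos.x;   // independent additions,
        world[i].y = local[i].y + objectPos.y;   // so every vertex can be
        world[i].z = local[i].z + objectPos.z;   // processed at the same time
    }
}

int main() {
    const int n = 14000;                          // vertices in the cowboy hat
    Vec3 *dLocal, *dWorld;
    cudaMalloc(&dLocal, n * sizeof(Vec3));
    cudaMalloc(&dWorld, n * sizeof(Vec3));
    cudaMemset(dLocal, 0, n * sizeof(Vec3));      // placeholder mesh data

    Vec3 hatPosition = {10.0f, 1.8f, -3.0f};      // where the hat sits in the world
    modelToWorld<<<(n + 255) / 256, 256>>>(dLocal, dWorld, hatPosition, n);

    cudaFree(dLocal); cudaFree(dWorld);
    return 0;
}
```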
Thread Architecture
00:20:53 CUDA cores execute individual threads of instructions organized into warps, thread blocks, and grids, all scheduled by the GigaThread Engine. Traditional SIMD architecture requires all 32 threads in a warp to operate in strict lockstep, executing the same instruction at the same time. Modern SIMT architecture instead gives each thread its own program counter while the threads share a 128 KB L1 cache, which adds flexibility and makes conditional branches easier to handle. The term 'warp' comes from weaving and the Jacquard loom, a nod to the historical connection between looms and modern GPU design.
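A minimal sketch of that hierarchy and of a divergent branch under SIMT (the data, sizes, and the branch itself are illustrative): threads are launched in blocks on a grid, the hardware runs them 32 at a time as warps, and threads within the same warp may take different paths through an if/else.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid -> blocks -> warps -> threads, plus a data-dependent branch that SIMT
// handles by letting lanes of the same warp follow different paths.
__global__ void classify(const int* data, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i >= n) return;
    // Threads 0-31 of each block form one warp; with 256 threads per block
    // there are 8 warps per block for the scheduler to juggle.
    if (data[i] % 2 == 0)
        out[i] = data[i] / 2;        // some lanes of a warp take this path...
    else
        out[i] = 3 * data[i] + 1;    // ...while other lanes take this one
}

int main() {
    const int n = 1 << 16;
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    classify<<<(n + 255) / 256, 256>>>(dIn, dOut, n);  // grid of 256 blocks x 256 threads
    cudaMemcpy(h.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```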
Help Branch Education Out!
00:23:31 This segment sets the stage for the remaining topics: bitcoin mining, tensor cores, and neural networks. Branch Education is dedicated to creating free, visually engaging educational videos that dive deeply into science, engineering, and technology. Their mission is to compile many detailed videos into a comprehensive engineering curriculum for high school and college students. They encourage active community support through likes, comments, shares, and subscriptions, with additional in-depth content available on Patreon.
Bitcoin Mining
00:24:29 Bitcoin mining involves running the SHA-256 algorithm on transaction data, a timestamp, and a changing nonce to generate an effectively random 256-bit output, much like printing lottery tickets. Each change to the nonce produces a new ticket, and mining succeeds when a ticket whose hash begins with roughly 80 zero bits is found, awarding a reward and restarting the process. GPUs were initially used to run millions of these attempts in parallel, but modern ASICs now perform trillions of hashes per second, escalating the competition that secures the blockchain.
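A minimal, runnable sketch of the GPU-era search structure. The real algorithm is double SHA-256 over the block header, which is far too long to include here, so a toy mixing function stands in for it; only the one-nonce-per-thread "lottery ticket" pattern is faithful to actual mining, and the header digest, thresholds, and launch sizes are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Toy stand-in for SHA-256: NOT the real hash, just something deterministic to search over.
__device__ uint64_t toyHash(uint64_t headerDigest, uint32_t nonce) {
    uint64_t x = headerDigest ^ (0x9E3779B97F4A7C15ULL * (nonce + 1));
    x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL;
    x ^= x >> 33; x *= 0xC4CEB9FE1A85EC53ULL;
    x ^= x >> 33;
    return x;
}

// Each thread tries one nonce (one "lottery ticket") and checks the leading zero bits.
__global__ void mine(uint64_t headerDigest, uint32_t baseNonce, int zeroBits,
                     unsigned int* winningNonce) {
    uint32_t nonce = baseNonce + blockIdx.x * blockDim.x + threadIdx.x;
    uint64_t h = toyHash(headerDigest, nonce);
    if (__clzll(h) >= zeroBits)                    // enough leading zero bits: a winner
        atomicMin(winningNonce, nonce);            // record the smallest winning nonce
}

int main() {
    unsigned int* dWin;
    cudaMalloc(&dWin, sizeof(unsigned int));
    unsigned int init = 0xFFFFFFFFu;               // sentinel meaning "no winner yet"
    cudaMemcpy(dWin, &init, sizeof(init), cudaMemcpyHostToDevice);

    // Launch about one million tickets; with a 20-bit target, roughly one wins per batch.
    // Real miners repeat this loop endlessly with fresh nonce ranges.
    mine<<<4096, 256>>>(0x0123456789ABCDEFULL, 0, 20, dWin);

    unsigned int win;
    cudaMemcpy(&win, dWin, sizeof(win), cudaMemcpyDeviceToHost);
    if (win != 0xFFFFFFFFu) printf("winning nonce: %u\n", win);
    else                    printf("no winner in this batch\n");
    cudaFree(dWin);
    return 0;
}
```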
Tensor Cores
00:26:50 Tensor cores execute a single concurrent operation in which two matrices are multiplied and a third is added, with every output value formed as a row-column dot product plus the corresponding element of the third matrix. This one arithmetic operation is the fundamental building block behind the vast computational demands of neural networks and generative AI. Because all three matrices are available simultaneously, the hardware can carry out trillions to quadrillions of these calculations concurrently, and the approach scales up to much larger matrix operations.
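A minimal sketch of that arithmetic written out element by element on ordinary CUDA cores (the 16x16 tile size is illustrative): every output D[row][col] is the dot product of a row of A with a column of B plus the matching element of C. A real Tensor core performs the whole tile's worth of this work in a single hardware instruction, but the math is the same: D = A x B + C.

```cuda
#include <cuda_runtime.h>

constexpr int N = 16;   // small square tile, similar in spirit to a tensor core tile

// One thread per output element: dot product of a row and a column, plus C.
__global__ void matMulAdd(const float* A, const float* B, const float* C, float* D) {
    int row = threadIdx.y;
    int col = threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];   // row-column dot product
    D[row * N + col] = acc + C[row * N + col];    // plus the associated C element
}

int main() {
    const int bytes = N * N * sizeof(float);
    float *dA, *dB, *dC, *dD;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes); cudaMalloc(&dD, bytes);
    cudaMemset(dA, 0, bytes); cudaMemset(dB, 0, bytes); cudaMemset(dC, 0, bytes);

    matMulAdd<<<1, dim3(N, N)>>>(dA, dB, dC, dD);  // one 16x16 block of threads

    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dD);
    return 0;
}
```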
Outro
00:27:58 Ray tracing cores exemplify a breakthrough in graphics card technology, showcasing advanced design and performance. Community backing from Patreon and YouTube Membership sponsors is highlighted as essential in powering creative output. Branch Education transforms these technical insights into immersive 3D animations that demystify modern technology. The narrative invites viewers to explore additional content and join in supporting this educational mission.