Future of Special Purpose Hardware Part 1: General Purpose

07 Jul 2021 | 10 minutes to read
In the 90s, to play the newest games with the best sound you needed a sound card, like an AdLib or Sound Blaster, because motherboards of that era didn’t include dedicated sound hardware. Back then, hardware was much weaker than today, and the processing power required to generate good sound was beyond what the machine could spare. By moving sound generation to a separate add-in card, users could choose which card they wanted from a variety of price and performance points, and the processing required to generate the sound was offloaded onto the card, leaving the CPU to simply coordinate things.
Current performance outlook #
Thanks to Moore’s Law and other performance improvements, we have reached a point where we no longer need task-specific hardware for most computing tasks. For tasks like gaming or video editing, it’s usually best to have a graphics card to offload the costly graphics processing onto hardware specifically designed for it. While a graphics card’s special-purpose processor might be less powerful than the host CPU, the careful pairing of software and hardware enables it to outperform the general-purpose CPU on graphics and parallel compute tasks.
As we push performance and compute tasks further and further, it can be hard to develop and work on these tasks from a single home computer. To run large workloads, process large amounts of data, or train a large AI, you need more power. Cloud services have filled this niche, allowing a user with enough capital to farm out work onto hundreds or thousands of CPUs or special hardware at once.
This gives us two ways of allocating resources for a task: run it locally on your own general-purpose hardware, or in the cloud, where a larger amount of computing power is available for a price. This mirrors computing in the past, where work was done on a central mainframe and dumb terminals were used to interface with it.
Instead of spending days or months training a machine learning model against a dataset, you can rent many large GPU instances in the cloud and train through terabytes of data much more quickly. Software design has taken advantage of parallel computing, where a calculation can be spread out across many “worker nodes”, allowing the user to arrive at a result faster. For example, Aran Komatsuzaki trained the ML model GPT-J over a period of five weeks, using a total of 1.5e22 FLOPs on a Google Cloud TPU v3-256 (Google’s 3rd-gen Tensor Processing Unit, 256 cores, 4 TiB memory).1 2
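To make the “worker nodes” idea concrete, here is a minimal Python sketch (my own illustration, not the GPT-J training setup): one summation is split into chunks, and each worker process handles a chunk in parallel.

```python
# Toy illustration of "worker node" parallelism: the same summation is
# split across processes, so wall-clock time shrinks roughly with the
# number of workers for CPU-bound work.
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_sum_of_squares(n, workers=4):
    # Split [0, n) into `workers` contiguous chunks.
    step = n // workers
    chunks = [(w * step, (w + 1) * step if w < workers - 1 else n)
              for w in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    n = 1_000_000
    assert parallel_sum_of_squares(n) == sum(i * i for i in range(n))
```

The same pattern scales from a handful of local processes to thousands of cloud machines; only the transport between coordinator and workers changes.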
To get a sense of scale, I found benchmarks for current processors to compare against.3 At the top of the site’s scoreboard was the Intel Core i9-11900K, with a score of 851.2 GFLOPS. Using this number, I calculated how long that processor would take to complete the same training. This ignores many factors like memory, bandwidth, and storage, so it’s not a rigorous comparison, but it gives us an idea of the work required to train the model.
```
# Convert training FLOPs to GFLOPs
1.5e22 / 1e9 = 1.5e13

# (Training FLOPs) / (CPU GFLOPS) = seconds the processor takes to train
1.5e13 / 851.2 = 1.76e10 seconds; ~29,137.20 weeks or ~558.41 years
```
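The same back-of-the-envelope arithmetic as a runnable snippet, using the numbers above (total training FLOPs and the benchmarked CPU throughput):

```python
# Back-of-the-envelope: how long would one CPU take to do the GPT-J training?
SECONDS_PER_WEEK = 7 * 24 * 3600        # 604,800
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # ~3.156e7

training_flops = 1.5e22   # total FLOPs used to train GPT-J
cpu_gflops = 851.2        # benchmarked i9 throughput, in GFLOPS

seconds = training_flops / (cpu_gflops * 1e9)
print(f"{seconds:.2e} s, or about {seconds / SECONDS_PER_WEEK:,.0f} weeks, "
      f"or about {seconds / SECONDS_PER_YEAR:.0f} years")
```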
The fact that the model was trained in only 5 weeks is a testament to the performance gained by parallelizing the task and using task-specific hardware. This sort of task would have been difficult, if not impossible, to perform on a local computer. Leveraging additional computing resources from cloud providers can be useful, but you pay a large overhead cost for that performance.
To bring that cloud-level performance back down to the local computer, we need to insert an additional compute layer between the local CPU and servers in the cloud. Just like we used sound cards for generating sound and graphics cards to offload graphics-intensive work from the processor, we might start seeing additional add-in cards that assist the computer with certain types of computations. I figure these cards would fall into two categories: “general purpose” cards, whose job is to offload general purpose tasks from the primary processor, and “task-specific” cards, which are designed with a single, specific task in mind.
Expansion Card Foundations #
These expansion cards would be similar to graphics cards, but would allow the OS to farm out tasks to free up capacity on the primary CPU. General purpose cards would be integrated at the kernel level, and assist the kernel in most if not all computing tasks, such as virtualization, not just graphics-type work. They would provide additional resources to the computer, such as processing, memory, and storage. Task-specific cards would have hardware geared toward a specific task like neural networks, machine learning, or individual programs.
The best place for this general purpose card to go is the PCIe slot. It provides a standard, high-performance interface that our GP card can use to talk to the main CPU. It already has a defined API (PHI? Program-hardware interface?) and already has standardized hardware templates designers can use.
These, of course, already exist. One good example is network cards, which manage network communication, allowing the CPU to focus on other things while the network card does what it does best: moving packets. Network cards are single-purpose, though; you can’t offload anything but networking onto them unless you write custom drivers.
As our desktop computers begin to hit a performance plateau, we will need to look for performance gains in other areas. Separate hardware that can be added to an existing computer improves performance, keeps existing computers useful for longer, and creates less e-waste in landfills. Users would no longer need to purchase an entirely new computer when they need to upgrade, only an add-in card. The card’s modular package and concentrated performance would let users improve their system without putting a lot of time and effort into it.
General Purpose Cards #
A general purpose add-in card should act as an additional processor for the computer, allowing it to offload nearly anything to free up CPU capacity for other tasks.
An example from the past is the CDC 6600. It was a mainframe released in 1964, and is considered the first successful supercomputer. It had an interesting hardware design, similar to what we are positing here: a single primary processor and 10 “peripheral processors” that work could be farmed out to. The CPU had a simplified instruction set, sort of a forefather of RISC architecture. For more complex tasks, like memory access and I/O, the primary processor would farm work out to the dedicated peripheral processors. This allowed everything to operate in parallel and improved the throughput of the machine.
This paradigm of a single primary and n peripheral processors in the CDC 6600 is exactly what we want to achieve with add-in cards. The additional compute power could be tapped into, allowing the primary processor to focus on important tasks while the add-in card covers less important things. A general purpose card would bring extra processing power to a system, and potentially additional system resources like RAM and disk as well.
This card could be integrated into the host hardware/OS at different levels. One option would have it present itself to the OS simply as additional hardware, just like adding another RAM stick or hard drive. It would be integrated at the kernel level, outside of user space, and the host OS would manage the hardware resources on the card. The card itself wouldn’t make many decisions; it would simply be more raw resources for the system. Cards like this already somewhat exist, targeting single, specific components like a hard disk.4 I couldn’t find a dedicated RAM card, but I did find a PCI card that a user mounts RAM on and uses as a hard disk.5 I figure a RAM-only card wouldn’t perform too well, since RAM depends on tight latencies with the processor, so RAM by itself may work better as an additional layer of cache rather than as memory. If the card ends up being a “system on a card”, with its own CPU, RAM, and disk, I could see it being more feasible, since calculations would take place primarily on the card rather than across the host hardware and the card.
Another option could be to have the card act as a distinct computer. This would be like sticking a Raspberry Pi on a PCI card. In this scenario, it would present itself to the host as a completely separate computer, with its own hardware and OS. Instead of accessing the card’s hardware directly, the host would interface with the card’s OS, offloading larger, packaged computing tasks to it. The host wouldn’t interact with the card’s resources directly either; it might be aware of card statistics, but it wouldn’t be able to control the card’s resources. The card OS would manage the incoming work and how best to apply the card’s resources to the task. The host wouldn’t have to worry about the operation of the card itself, just the tasks it’s sending to the card.
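A rough sketch of that interface, with a child process standing in for the card and a pipe standing in for the PCIe link; every name here is a hypothetical stand-in, not a real driver API:

```python
# Hypothetical "card as a separate computer" model: the host only sends
# packaged (task, payload) messages and reads back results. A child
# process plays the card; a Pipe plays the PCIe link.
from multiprocessing import Process, Pipe

def card_main(conn):
    # Card-side "OS": receives tasks, decides how to run them, and ships
    # results back. The host never touches the card's resources directly.
    handlers = {
        "sum": sum,
        "sort": sorted,
    }
    while True:
        task, payload = conn.recv()
        if task == "shutdown":
            break
        conn.send(handlers[task](payload))

def offload(conn, task, payload):
    # Host-side helper: package a task, send it to the card, await result.
    conn.send((task, payload))
    return conn.recv()

if __name__ == "__main__":
    host_end, card_end = Pipe()
    card = Process(target=card_main, args=(card_end,))
    card.start()
    print(offload(host_end, "sum", [1, 2, 3]))   # work runs on the "card"
    host_end.send(("shutdown", None))
    card.join()
```

The key property this models is the opacity of the card: the host sees a message interface and statistics, not raw hardware.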
One last option I could imagine is an add-in FPGA card. An FPGA (field-programmable gate array) allows a user to “program hardware”: the hardware is described using a language like Verilog and implemented in configurable logic blocks. Because it can be configured for a specific task, an FPGA will usually perform that task faster and more efficiently than a general purpose CPU, since the algorithms and operations are implemented in hardware instead of software. An FPGA add-in card could be dynamically programmed for certain tasks as the host CPU requires. As the host needs to compute different things, it could re-program the FPGA on the fly and stream data through it. This has a much higher barrier to entry for users to do themselves, so it would probably require pre-compiled task configurations that would be available to download and use. While not as easy to slot into existing system architectures, it could provide a large, configurable performance boost for pre-designed tasks. This type of card already exists, but can be prohibitively expensive, especially the top-of-the-line cards.6
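A hypothetical host-side dispatch loop for such a card might look like the following; all the names and the `kernels` table are stand-ins of my own, since real FPGA toolchains each have their own programming flow:

```python
# Hypothetical host-side dispatcher for a reprogrammable FPGA card.
# Pre-compiled task configurations ("bitstreams") are selected per task,
# and the card is only reprogrammed when the task type changes.
# Nothing here is a real FPGA API; it only models the control flow.

class FpgaCard:
    def __init__(self):
        self.loaded = None  # name of the currently programmed task

    def program(self, task):
        # In reality: push a pre-compiled bitstream over PCIe, then wait
        # for reconfiguration. Here we just record the task name.
        self.loaded = task

    def run(self, task, data):
        if self.loaded != task:      # reprogram only on task change
            self.program(task)
        # Stand-in compute: a real card would stream `data` through the
        # configured logic blocks instead of running Python.
        kernels = {"square": lambda xs: [x * x for x in xs]}
        return kernels[task](data)

card = FpgaCard()
result = card.run("square", [1, 2, 3])
```

The reprogram-on-change check matters because reconfiguring an FPGA is slow relative to streaming data through it, so batching same-task work is where the win comes from.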
Specialized hardware #
These cards would provide a stable platform to build on top of, adding significant performance at a lower cost, since you don’t have to re-buy an entire computer, only the card. They also give us the chance to make specialized hardware available to a normal computer. One example is photonic computing (https://spectrum.ieee.org/computing/hardware/the-future-of-deep-learning-is-photonic), which performs matrix multiplications using light instead of transistors. That is very specialized hardware you would never find integrated into a CPU or motherboard, but on an add-in card it becomes much easier to sell, since buyers only need an existing computer, not an entirely new setup.
GPT-J model training blog post: https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/ ↩
Google TPU info: https://cloud.google.com/tpu/docs/types-zones#europe ↩
Intel Core i9-9900K benchmark: https://gadgetversus.com/processor/intel-core-i9-9900ks-gflops-performance/ ↩
Normal, run-of-the-mill PCI-e hard drive: https://www.newegg.com/western-digital-1tb-black-an1500-nvme/p/N82E16820250159 ↩
Use RAM as a hard drive: https://www.newegg.com/gigabyte-gc-ramdisk-others/p/N82E16815168001 ↩