Recap of the 2022 AI Hardware Summit Conference

Fine-tuning innovations from many vendors

Dan McCreary
10 min read · Sep 17, 2022
One of the many sessions at the 2022 AI Hardware Summit held in Santa Clara, CA. Photo by the author.

This week I attended the AI Hardware Summit in Santa Clara, California. This blog will recap key machine-learning (ML) and graph hardware acceleration trends. I will also talk about some unexpected findings.

Nothing in this blog should be considered an endorsement of any vendor or product. All opinions are my own.

Focus of the AI Hardware Summit Conference

This conference is for anyone working on ML and graph accelerators for training and inference. About half the vendors are building dedicated hardware for ML training acceleration, and half are building devices that specialize in inference. Of the inference vendors, about half are data-center focused, and the other half are building low-power, low-cost edge inference hardware that sits near cameras and sensors. These edge devices are trained to do tasks such as monitoring equipment and reporting anomalies back to central data centers.

Mythic showed an awesome demo of real-time object and pose detection that could be used for fall detection in a senior care facility. Photo by the author.

Many of the edge device vendors had booths that ran real-time object detection on participants who walked by. The “Edge AI Summit” conference was also running concurrently with this conference since there is a natural overlap.

There was also representation from the companies that help you build ML-acceleration chips. Companies like Synopsys and Cadence touted not only their own IP for performing ML functions but also how they use ML to accelerate the placement of ML components on System-on-Chip products. Many companies, such as AMD/Xilinx, also had examples of FPGA components that accelerate ML and graph analytics workloads. And many of these libraries can now be accessed directly from your Python Jupyter notebooks. More on that later.

Finally, there was a much larger showing of RISC-V IP and vendors at this conference than in prior years; ARM didn’t have a presence. There are now many new RISC-V ecosystem companies selling IP that targets the new generation of RISC-V cores, focusing on the circuits that surround ML acceleration, such as memory management and network protocol hardware.

There were also a few talks on industry-specific AI problems such as automotive and healthcare. I presented on the key issues of AI in Healthcare. These talks were attended by investors trying to match AI hardware startups with hard business problems.

Find, Fine-Tune, and Publish with Python

This year's key trend was adopting “The HuggingFace Way.” HuggingFace is a 160-person company and has a valuation of over $2B. Everyone is trying to catch up to HuggingFace.

In the past, many presentations focused on the “speeds and feeds” of new AI hardware for training and inference workloads. This year many vendors followed the precedent set by HuggingFace’s main machine learning workflow:

  1. Quickly find the right model using faceted search.
  2. Quickly fine-tune the model with your own data using a few lines of Python (see the sketch after this list).
  3. Publish a mini web app using Python tools like Streamlit or Gradio, for example, with HuggingFace Spaces.
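
To make step 2 concrete, here is a minimal sketch of fine-tuning a Hub model with the transformers Trainer. The model and dataset names are illustrative stand-ins, not anything a particular vendor showed:

```python
# A minimal sketch of step 2: fine-tune a Hub model on "your own data."
# bert-base-uncased and imdb are illustrative choices, nothing more.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # stand-in for your own labeled data
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(1000)),
)
trainer.train()
```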

This is a marked movement away from companies that focused on trying to explain how their hardware and software were different, better, and lower cost than standard NVIDIA GPU hardware. This solution focus is typical of companies Crossing the Chasm of technology adoption. They focus on faster and lower-cost solutions, and they realize that the “how does it work” really only concerns the innovators and early adopters (like me).

The focus here is Transformer-based large language models such as BERT and GPT-3. I will also discuss the newly popular text-to-image Stable Diffusion models later in the blog. These are large and powerful models that solve many tasks. Building these models from scratch costs tens of millions of dollars; fine-tuning them can be done for pennies.

Streamlit and Gradio are now heavily favored tools for data scientists who want to rapidly create web applications using only Python. Data scientists don’t want to learn JavaScript, React, or NodeJS. They want everything they do to “just work” in their Jupyter notebooks, and they want to generate all the user interface controls from Python. This is why these open-source tools have become incredibly popular in a short time.
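
To see why, here is roughly what a complete Gradio app looks like, assuming an off-the-shelf sentiment pipeline; these few lines are the entire web application:

```python
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default Hub model

def predict(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.2f})"

# One call wires up the text box, submit button, and output pane.
gr.Interface(fn=predict, inputs="text", outputs="text").launch()
```

Push the same file to HuggingFace Spaces and it becomes a shareable demo, with no JavaScript written anywhere.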

HuggingFace

Our three instructors for the HuggingFace course on using Optimum to fine-tune models with Graphcore and Habana. Photo by the author.

HuggingFace is by far the leader in the true democratization of machine learning. No other vendor comes close right now. They have more models and more tools for fine-tuning than anyone. With over 70K ML models (and growing), they have more models than Google, Azure, and AWS combined.
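
Even the “find the right model” step can be scripted. A small sketch using the huggingface_hub client; the parameter names follow my reading of the library and may shift between versions:

```python
from huggingface_hub import HfApi

api = HfApi()
# Shortlist the most-downloaded models for one task facet.
for model in api.list_models(filter="text-classification",
                             sort="downloads", direction=-1, limit=5):
    print(model.modelId)
```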

Their team started the conference with a class on using their new Optimum library to fine-tune ML models. From the HuggingFace site:

Optimum aims at providing more diversity towards the kind of hardware users can target to train and finetune their models.

This conference was all about innovations in the new diversity of AI accelerators. In the past, many ML developers were forced to use rigid and inflexible SIMD hardware such as GPUs that were optimized for video game rendering. Now we have many custom silicon chips tuned to a variety of ML workloads. HuggingFace is perhaps the only company enabling this diversity with their Optimum libraries.

Intel Habana — Gaudi2

My biggest surprise was the rapid rise of Intel’s Habana Gaudi2 hardware. In the past, Habana didn’t even publish their ML benchmarks on MLCommons. Now things have changed. Not only are their training and inference benchmarks published, but they are also very strong.

Yet benchmarking is not everything. Habana has also joined Graphcore as one of the hardware workhorses behind the HuggingFace site via the Habana Optimum library. This will make it easier for teams to use Gaudi2 hardware to fine-tune their models.
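
Based on my reading of the optimum-habana documentation, the flow looks roughly like the sketch below: swap the standard Trainer for a GaudiTrainer and point it at a Gaudi configuration. Treat the specific names as illustrative rather than definitive:

```python
from datasets import load_dataset
from optimum.habana import GaudiTrainer, GaudiTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = load_dataset("imdb", split="train[:1000]").map(
    lambda b: tok(b["text"], truncation=True, padding="max_length"),
    batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
args = GaudiTrainingArguments(
    output_dir="out",
    use_habana=True,     # target Gaudi devices instead of CPU/GPU
    use_lazy_mode=True,  # Habana's graph-compilation execution mode
    gaudi_config_name="Habana/bert-base-uncased",  # precision/fusion settings
)
GaudiTrainer(model=model, args=args, train_dataset=ds).train()
```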

AMD/Xilinx — FPGA and Graph Accelerator Innovation

I connected with the amazing Kumar Deepak, a Senior Fellow at AMD/Xilinx. Kumar led the team that built graph hardware accelerators that allow us to create ultra-fast FPGA recommendation engines in Python.

If you think that AI is only about ML, you are sadly mistaken. Even the most advanced deep-neural network ML systems can’t yet do simple tasks such as generalization and explanation. To build general AI, we will need both ML and symbolic reasoning working together using large-scale enterprise knowledge graphs. All the ML in the world will not help you if you have no strategy to represent knowledge and perform reasoning, abstraction, and planning on this knowledge.

AMD/Xilinx was the only company that had FPGA graph acceleration hardware on the conference floor. AMD’s Alveo U50 packs 50 billion transistors that can be rewired in two seconds. It can execute important graph algorithms like Cosine Similarity and return results on millisecond time scales over datasets of 10 million embeddings stored in HBM. Being able to build a recommendation engine directly from your Python Jupyter Notebook is a huge step. Every e-commerce company should use these boards to increase the speed and precision of their recommendations.
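
For intuition, here is a plain NumPy reference for the computation the FPGA accelerates: scoring one query embedding against millions of stored embeddings by cosine similarity. The data here is random filler; on the Alveo card, the embeddings sit in HBM and the full scan returns in milliseconds:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1M random 64-dim embeddings as a stand-in; the U50 holds ~10M in HBM.
embeddings = rng.standard_normal((1_000_000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)

# Normalize once so cosine similarity reduces to a single dot product.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
scores = unit @ (query / np.linalg.norm(query))

top10 = np.argsort(scores)[-10:][::-1]  # indices of the best matches
print(top10, scores[top10])
```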

Graphcore — MIMD Innovation

Graphcore has been the leader in flexible, low-cost MIMD chips tuned for ML workloads. Photo by the author.

Graphcore has consistently been the leader in low-cost, high-performance MIMD architectures customized to work well on AI workloads. Graphcore also announced that they are early partners on HuggingFace’s Optimum libraries. This partnership will allow users to quickly try out the cost and performance of fine-tuning large-language models and compare the MIMD approach to the more traditional SIMD GPU approach. I believe that in the long run, Graphcore will frequently be the fast-turnaround, low-cost option for many problems that involve sparse data.
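
The Graphcore integration follows the same drop-in pattern. A sketch based on the documented optimum-graphcore names (again illustrative; I have not verified it on IPU hardware):

```python
from datasets import load_dataset
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = load_dataset("imdb", split="train[:1000]").map(
    lambda b: tok(b["text"], truncation=True, padding="max_length"),
    batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
ipu_config = IPUConfig.from_pretrained("Graphcore/bert-base-ipu")  # IPU layout hints
args = IPUTrainingArguments(output_dir="out", num_train_epochs=1)

IPUTrainer(model=model, ipu_config=ipu_config, args=args,
           train_dataset=ds).train()
```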

SambaNova — DataFlow Architecture Innovator

Many of my readers know I am a huge fan of understanding the tradeoffs of radically new chip architectures. SambaNova is one of my favorites. Just as Graphcore has been the champion of MIMD, SambaNova is the champion of DataFlow chip design. In the past, SambaNova promoted themselves to people who really understood how DataFlow circuits could outperform traditional clocked circuits for highly parallel computations such as training deep neural networks. This year we started to see a new message. Their focus is promoting the fact that they have most of the large-language models like GPT pre-trained and ready to fine-tune…just like the message from HuggingFace.

In some sense, I am a little disappointed that these wonderful DataFlow chips are now being hidden inside a black box. But to be honest, most of the people I talk to don’t really have a deep appreciation of how the engineers from SambaNova have attacked the difficult problems of getting these chips to work reliably under so many complex race conditions.

Cerebras — WSI Packaging Innovator

Cerebras packages thousands of traditional CPUs and memory on a wafer-scale device that can also be used for machine learning workloads. Photo by the author.

Many companies underappreciate the importance of how silicon chips are manufactured and packaged. Cutting chips apart, adding lots of slow I/O, and then reconnecting them has many disadvantages. Cerebras takes the approach of keeping all the chips on a single large wafer and then exploiting redundancy in software to switch out chips that have defects. As a result, they have incredibly good cost/performance numbers. Today Cerebras is mostly used in the large national labs and big pharma, but they continue to build new tools to make their hardware more accessible to ML developers.

Other Datacenter ML Accelerators

Although these four vendors have the most mature data-center ML accelerators, there were also many new companies that I was not familiar with. Most of them seem to be pushing SIMD architectures similar to GPUs, but some had interesting innovations. AMD is not only building GPUs but also promoting something called “Computational Storage” that moves CPUs closer to solid-state memory. These architectures, combined with the movement to chiplet designs, will keep these startups pushing the performance/cost boundaries of innovation.

Keep an eye out for both AMD and Qualcomm to continue building innovative ML hardware.

Many FPGA-based Edge Devices

Achronix demonstrated an FPGA card that could simultaneously convert 1,000 audio streams to text. Photo by the author.

The number of “smart” devices in the field, and of devices customized for specific workloads, is predicted to keep growing at Moore’s Law rates. For example, converting speech to text is a universal problem that now has many custom hardware solutions. The photo above shows Achronix demonstrating a PCI card that can convert over 1,000 speech streams into text in real time. The cost is now a small fraction of what it was just a few years ago, and the quality of transcription is also higher.
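
In software terms, each of those thousand hardware streams is doing something like the following single-stream pass with an off-the-shelf ASR model (the model choice and file path are illustrative):

```python
from transformers import pipeline

# One stream's worth of work: transcribe a single audio file.
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")
print(asr("meeting_audio.wav")["text"])  # hypothetical file path
```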

Hardware for Fine-Tuning Stable Diffusion

Many people at the conference discussed the incredibly fast rise of the Stable Diffusion model on HuggingFace. Many organizations want to fine-tune this popular model on their own personal style or company branding. Some of the presenters admitted staying up late trying to get Stable Diffusion to work for their presentations. One presenter even showed a short video of his talk in which all the text on the slides was rendered with Stable Diffusion. Not all of the results were good, but all of them were interesting! There were comparisons with OpenAI’s DALL-E 2, which usually did a better job, but not everyone could get an account.
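
For anyone who wants to reproduce those late-night experiments, here is a minimal sketch of loading Stable Diffusion from the Hub with the diffusers library; fine-tuning on a house style starts from this same checkpoint. The prompt and file names are mine, not a presenter's:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the checkpoint that was current at the time; the Hub may
# require you to accept the model license before downloading.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # or whichever accelerator your vendor supports

image = pipe("a conference badge illustrated in watercolor").images[0]
image.save("badge.png")
```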

In a nutshell, what is happening with Stable Diffusion today is what the industry is moving toward: a very large model that everyone wants to build upon to meet their own needs. Now the question comes up: what hardware architecture is the best fit for this task? This is where the one-size-GPU-for-all model breaks down. We need continuing diversity and innovation to handle this task and future AI workloads.

Missing: Low-Cost Graph Acceleration ASICs

Both AMD/Xilinx and DataVortex have been building graph acceleration hardware for several years using FPGAs. Yet the market for high-end graph analytics is not yet large enough for any company to publicly publish Graph-500 benchmarks on fully custom ASIC hardware. Although this market is predicted to grow exponentially over the next few years, many companies seem distracted by the sexiness of the overcrowded ML acceleration market. I could not find a single person who seemed to understand the long-term market opportunity here.

As the 2018 DARPA HIVE work tells us, graph accelerators like the Intel PIUMA have been on the drawing board and in research labs for years. But they have yet to reach the larger communities that need them the most. Who knows? Perhaps government agencies don’t want really advanced neuro-symbolic AI hardware accessible to the general public. So it looks like we will wait until 2023 to see real results for Hardware Optimized Graph, or HOG, Heaven. Let’s just hope the vendors get this working before I retire!

Will New Entrants Saturate the Machine Learning Training Marketplace?

Today NVIDIA dominates the ML-training market. Although I have covered some of the new competitors to NVIDIA, there were dozens more new entrants at this conference offering alternatives to GPUs that I didn’t have time to cover. We know that Google, Amazon, and Tesla are all building their own custom ASIC chips for ML acceleration in their own data centers. They would not be doing this if low-cost video cards worked for their workloads. Despite the collapse of the crypto industry and the drop in GPU prices, GPUs are not right for everything.

I fear that too many companies are chasing the same AI training market. This is a really hard problem where only companies with $100M budgets for 3nm TSMC fabrication will have the speed and memory management to stay competitive. Not all of the new startups will survive. But one thing is for sure: incredibly smart people are testing many innovative approaches to lowering the cost of building and fine-tuning large language models. In any mature industry, one size will not fit all.


Dan McCreary

Distinguished Engineer who loves knowledge graphs, AI, and Systems Thinking. Fan of STEM, microcontrollers, robotics, PKGs, and the AI Racing League.