Growing memory demands
As AI and ML performance demands continue to grow rapidly, memory is becoming increasingly important.
Memory for artificial intelligence faces several new requirements, in particular:
- Larger capacity – Model sizes are enormous and growing rapidly, potentially reaching tens of terabytes. Such data scales require increasing DDR main memory capacity.
- More bandwidth – As large amounts of data must be moved, all types of DRAM continue to compete to increase data rates to provide more memory bandwidth.
- Lower latency – Another aspect of the demand for speed is lower latency so processor cores do not idle waiting for data.
- Lower power consumption – Power has become a limiting factor in AI systems as engineering approaches physical limits, and the demand for higher data rates pushes power consumption up further. To mitigate this, IO voltages are being reduced, but lower voltages shrink voltage margins and increase the chance of errors, which in turn drives the need for higher reliability.
- Higher reliability – To address rising error rates at higher speeds, lower voltages, and smaller process nodes, there is increasing use of on-die ECC and advanced signal techniques for compensation.
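To make the ECC idea above concrete, here is a minimal sketch of single-error correction using a classic Hamming(7,4) code. On-die DRAM ECC uses far stronger codes over much wider words, so this is only an illustration of the principle; all function names are invented for the example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based error position; 0 means clean
    if syndrome:
        c[syndrome - 1] ^= 1         # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]

# A single bit flip in transit is silently corrected on decode.
codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1                     # simulate a one-bit error
recovered = hamming74_decode(codeword)
```

The tradeoff this illustrates is the one the bullet list describes: as voltage margins shrink, the memory spends extra bits and logic on redundancy so that the system-visible error rate stays low.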
New memory technologies: opportunities
Another important topic is the challenges and opportunities of new memory technologies in AI. New technologies offer several potential advantages, including:
- Optimization of capacity, bandwidth, latency, and power for a targeted set of use cases. AI is a large market with significant funding, which can drive new memory development. Historically, GDDR (for graphics), LPDDR (for mobile), and HBM (for high-bandwidth applications such as AI) were created to address use cases unmet by existing memory.
- CXL – CXL provides opportunities to greatly expand memory capacity and increase bandwidth while abstracting memory types from the processor. In this way, CXL offers a useful interface for integrating new memory technologies. CXL memory controllers provide a translation layer between processor and memory, allowing new memory tiers to be inserted after locally attached memory.
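The tiering behavior described above can be sketched as a simple spill-over allocator: the controller's translation layer hides which tier backs an allocation, so a new CXL-attached tier can sit behind locally attached DRAM. Capacities, tier names, and the allocation policy here are invented purely for illustration.

```python
class TieredMemory:
    """Toy model of capacity expansion behind a translation layer."""

    def __init__(self, local_gb, cxl_gb):
        # Nearest tier first: locally attached DDR, then the CXL expander.
        self.free = {"local-DDR": local_gb, "CXL-expander": cxl_gb}

    def allocate(self, size_gb):
        """Place an allocation in the nearest tier with enough room."""
        for tier in ("local-DDR", "CXL-expander"):
            if self.free[tier] >= size_gb:
                self.free[tier] -= size_gb
                return tier
        raise MemoryError("no tier can satisfy the request")

mem = TieredMemory(local_gb=64, cxl_gb=256)
first = mem.allocate(48)    # fits in local DDR
second = mem.allocate(48)   # local DDR is nearly full, spills to the CXL tier
```

The point of the sketch is that the application simply asks for memory; which tier serves it is a placement decision made behind the interface, which is what lets new memory types be inserted without processor changes.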
Adoption challenges
Although new memory types optimized for specific use cases benefit many applications, they face additional challenges:
- For the foreseeable future, DRAM, on-die SRAM, and flash will continue to coexist, so nothing should be expected to completely replace them. Annual R&D and capital investment in these technologies, combined with decades of high-yield manufacturing experience, make it essentially impossible to replace any of them in the short term. Any new memory technology must interoperate well with these memories to be adopted.
- The scale of AI deployments and the risks associated with developing new memory technologies make adopting entirely new memories difficult. Memory development timelines are typically 2–3 years, but AI evolves so quickly that it is hard to predict which specific features will be needed in the future. The development risk is high, as is the risk that requirements will have shifted by the time a new technology is enabled and available.
- Any performance advantage of a new technology must be sufficient to offset added cost and risk. Given the demands on infrastructure engineering and deployment teams, this means new memory technologies must overcome a very high barrier.
Conclusion
Memory will remain a key enabler for future AI systems. Continued innovation will be needed for future systems to deliver faster and more capable AI, and the industry is responding.