Recommenders personalize our experiences just about everywhere you can think of. They help you choose a movie for Saturday night, or discover a new artist when you've looped over your go-to playlist one too many times. They are one of the most important applications of deep learning, yet as it stands today, recommenders remain some of the most challenging models to accelerate due to their data requirements.

In this article, we'll discuss which bottlenecks are typically observed with recommender workloads in practice, and how they can be identified and alleviated. This doesn't just mean speeding up inference, but also training workflows, so developers can iterate quickly.

NVIDIA GPUs are great at handling parallelized computation, and have been successful in deep learning domains like Computer Vision (CV) and Natural Language Processing (NLP), where computation itself is usually the dominant factor in throughput compared to the time it takes to bring the data to the model. However, modern recommenders tend to be memory- and I/O-bound as opposed to compute-bound.

Take a "userID" feature, for example: it isn't too hard to imagine a hundred million distinct users. Modern recommenders can have hundreds of features, with many categorical features whose cardinalities are on the order of hundreds of millions! On occasion, the cumulative embedding tables may become so large that they are hard to fit in a single GPU's memory.
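To make that memory pressure concrete, here is a back-of-envelope sketch. The embedding dimension and datatype below are illustrative assumptions, not figures from the article; only the hundred-million-user cardinality comes from the example above.

```python
def embedding_table_bytes(cardinality: int, embedding_dim: int,
                          bytes_per_element: int = 4) -> int:
    """Memory footprint of one dense embedding table (fp32 by default)."""
    return cardinality * embedding_dim * bytes_per_element

# Hypothetical "userID" feature: 100 million distinct users,
# each mapped to a 64-dimensional fp32 embedding vector.
size_bytes = embedding_table_bytes(100_000_000, 64)
print(f"{size_bytes / 1e9:.1f} GB")  # 25.6 GB for this single feature
```

A single feature at this scale already consumes a large share of a typical GPU's memory, and a model with many such categorical features multiplies this cost, which is why the cumulative tables can exceed one device's capacity.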