Parallel Examples - Search News

Low latency OPT inference

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large ...

IEEE

TOP: Task-Based Operator Parallelism for Asynchronous Deep Learning Inference on GPU

Abstract: Current deep learning compilers have made significant strides in optimizing computation graphs for single- and multi-model scenarios. However, they lack specific optimizations for ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results

Low latency OPT inference

TOP: Task-Based Operator Parallelism for Asynchronous Deep Learning Inference on GPU

Trending now