NVIDIA and Google have announced optimizations for Gemma, Google's new family of lightweight, state-of-the-art open language models in 2 billion- and 7 billion-parameter sizes, across all NVIDIA AI platforms, reducing costs and speeding up work on domain-specific use cases.
Teams from the two companies worked closely to accelerate the performance of Gemma, which was built with the same research and technology as the Gemini models, using NVIDIA TensorRT-LLM, an open-source library for optimizing large language model inference on NVIDIA GPUs in the data center, in the cloud, and on PCs with NVIDIA RTX GPUs.
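The announcement doesn't include code, but for illustration, here is a minimal sketch of what serving Gemma through TensorRT-LLM can look like, assuming a recent TensorRT-LLM release that ships the high-level `LLM` Python API; the model ID, prompt, and sampling settings are example values, not details from the announcement.

```python
from tensorrt_llm import LLM, SamplingParams

# Load Gemma and let TensorRT-LLM build an optimized inference engine for the
# local GPU. "google/gemma-2b" is the Hugging Face model ID, used as an example.
llm = LLM(model="google/gemma-2b")

# Example sampling settings; tune these for your use case.
params = SamplingParams(max_tokens=64, temperature=0.8)

outputs = llm.generate(
    ["Explain retrieval-augmented generation in one sentence."], params
)
print(outputs[0].outputs[0].text)
```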
Developers can now target the installed base of more than 100 million NVIDIA RTX GPUs in high-performance AI PCs worldwide.
Gemma also runs on Google Cloud A3 instances based on the H100 Tensor Core GPU and on NVIDIA H200 Tensor Core GPUs, which feature 141GB of HBM3e memory at 4.8 terabytes per second and which Google plans to deploy this year.
Enterprise developers can fine-tune Gemma and deploy the optimized model in production applications using NVIDIA's broad ecosystem of tools, including NVIDIA AI Enterprise with the NeMo framework and TensorRT-LLM.
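The article names NeMo and TensorRT-LLM as the toolchain but doesn't show the fine-tuning workflow itself. As a stand-in for that refinement step, here is a generic sketch of parameter-efficient fine-tuning (LoRA) using Hugging Face Transformers and PEFT rather than NeMo; the model ID and LoRA hyperparameters are chosen purely for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Example model ID; the 2B variant keeps the footprint small for experiments.
model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Wrap the attention projections with low-rank adapters; only these small
# adapter weights are trained, not the full set of base-model parameters.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # reports the small trainable fraction

# From here, the wrapped model drops into a standard training loop or
# transformers.Trainer over a domain-specific dataset.
```

The trained adapter weights can later be merged into the base model, and the merged checkpoint handed to TensorRT-LLM for optimized deployment.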
Developers can learn more about how TensorRT-LLM accelerates inference for Gemma, along with additional resources, including several Gemma model checkpoints and an FP8-quantized version of the model, all optimized with TensorRT-LLM.
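The post doesn't describe how the FP8 checkpoint was produced. Purely as an assumption-laden sketch, post-training FP8 quantization with NVIDIA's TensorRT Model Optimizer (the `modelopt` package) typically follows this pattern, where `model` is a loaded PyTorch LLM and `calib_loader` is a hypothetical calibration dataloader you would supply.

```python
import modelopt.torch.quantization as mtq

# calib_loader (hypothetical) yields a small set of representative prompts
# used to calibrate the quantization scales.
def forward_loop(model):
    for batch in calib_loader:
        model(batch)

# Quantize weights and activations to FP8 in place; the resulting checkpoint
# can then be converted for engine building with TensorRT-LLM.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```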
To experience Gemma 2B and Gemma 7B directly from your browser, visit the NVIDIA AI Playground.
Gemma Coming to Chat With RTX
Gemma support will soon be added to Chat with RTX, an NVIDIA tech demo that gives users generative AI capabilities on their local, RTX-powered Windows PCs by combining TensorRT-LLM software with retrieval-augmented generation (RAG).
Chat with RTX lets users easily connect local files on a PC to a large language model, so they can personalize a chatbot with their own data.
Because the model runs locally, user data stays on the device and results arrive quickly. Rather than relying on cloud-based LLM services, Chat with RTX lets users handle sensitive data on a local PC without an internet connection and without sharing it with a third party.
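Chat with RTX's internals aren't shown here, but the retrieval-augmented generation pattern it relies on is straightforward to sketch: embed local documents, retrieve the ones most relevant to a query, and prepend them to the model's prompt. The snippet below is a generic illustration using the sentence-transformers library; the documents, embedding model, and final generation step are stand-ins, not Chat with RTX code.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical local documents standing in for a user's files.
docs = [
    "Meeting notes: the Q3 launch moved to October.",
    "Recipe: sourdough needs a 12-hour cold proof.",
    "Travel plan: the flight departs Tuesday at 9 a.m.",
]

# Embed the documents once; normalized vectors make cosine similarity a dot product.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [docs[i] for i in np.argsort(-scores)[:k]]

query = "When does my flight leave?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# The prompt would then be sent to a locally running LLM, such as a
# TensorRT-LLM-served Gemma, rather than to a cloud endpoint.
print(prompt)
```

Because both retrieval and generation happen on the device, the user's files never leave the machine, which is the privacy property the demo emphasizes.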