LM Studio Accelerates LLM Performance With NVIDIA GeForce RTX GPUs and CUDA 12.8
The latest update to the desktop application delivers improved model controls and developer tools, along with better RTX GPU performance.
As AI use cases continue to expand, from document summarization to custom software agents, developers and enthusiasts are seeking faster, more flexible ways to run large language models (LLMs).
Running models locally on PCs equipped with NVIDIA GeForce RTX GPUs enables high-performance inference, stronger data privacy, and full control over AI deployment and integration. Thanks to free tools like LM Studio, users can easily explore and work with LLMs on their own hardware.
LM Studio has become one of the most popular applications for local LLM inference. Built on the fast llama.cpp runtime, it runs models entirely offline and can also serve them as OpenAI-compatible API endpoints for integration into custom workflows.
With the release of LM Studio 0.3.15, CUDA 12.8 boosts RTX GPU performance, significantly improving model load and response times. The update also adds developer-focused features, including a revamped system prompt editor and finer control over tool use via the “tool_choice” parameter.
The latest LM Studio updates improve both usability and speed, delivering the highest throughput yet on RTX AI PCs. That translates into faster responses, snappier interactions, and better tools for building and integrating AI locally.
Where AI Acceleration Meets Common Apps
LM Studio's flexibility makes it suitable for everything from casual experimentation to deep integration into custom workflows. Models can be used through the desktop chat interface or, in developer mode, through OpenAI-compatible API endpoints. This makes it easy to connect local LLMs to custom desktop agents or workflows in applications such as Visual Studio Code.
For example, LM Studio can be integrated with Obsidian, the popular markdown-based knowledge management app. Using community-developed plug-ins like Text Generator and Smart Connections, users can query their own notes, generate content, and summarise research with local LLMs running through LM Studio. Because these plug-ins connect directly to LM Studio's local server, they enable fast, private AI interactions without relying on the cloud.
New developer features in the 0.3.15 release include an enhanced system prompt editor to handle longer or more complex prompts and more precise control over tool usage through the “tool_choice” option.
The tool_choice parameter lets developers control how models interact with external tools: they can require a tool call, disable tool calls entirely, or let the model decide on its own. This flexibility is especially valuable for building structured interactions, retrieval-augmented generation (RAG) workflows, and agent pipelines. Together, these updates broaden what developers can do with LLMs in both experimentation and production.
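As a rough sketch of how this looks in practice, the snippet below sends a request to LM Studio's OpenAI-compatible endpoint with tool_choice set. The port is LM Studio's default, but the API key value, the model name, and the search_notes tool are illustrative assumptions, not values prescribed by LM Studio:

```python
# Sketch: controlling tool use via tool_choice against a local LM Studio
# server. Port 1234 is LM Studio's default; the model name and the
# search_notes tool are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

tools = [{
    "type": "function",
    "function": {
        "name": "search_notes",  # hypothetical tool for a RAG pipeline
        "description": "Search the user's local notes for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="local-model",  # placeholder; use the identifier of a loaded model
    messages=[{"role": "user", "content": "Find my notes on CUDA graphs."}],
    tools=tools,
    tool_choice="auto",   # "required" forces a tool call; "none" disables tools
)
print(response.choices[0].message)
```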
LM Studio supports numerous open models, including Gemma, Llama 3, Mistral, and Orca, along with a range of quantisation formats from 4-bit to full precision.
Common use cases include RAG, document-based Q&A, multi-turn chat with long context windows, and local agent pipelines. On RTX AI PCs, users can easily integrate local LLMs through local inference servers powered by the NVIDIA RTX-accelerated llama.cpp software library.
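For instance, a multi-turn chat against the local inference server needs nothing beyond a standard OpenAI-style request. The sketch below assumes LM Studio's default port and uses a placeholder model name:

```python
# Sketch: multi-turn chat through LM Studio's local OpenAI-compatible
# server. Assumes the server is running on the default port 1234 and a
# model is loaded; "local-model" is a placeholder identifier.
import requests

history = [
    {"role": "system", "content": "You are a concise research assistant."},
    {"role": "user", "content": "Summarise the key idea of flash attention."},
]

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={"model": "local-model", "messages": history},
    timeout=120,
)
reply = resp.json()["choices"][0]["message"]
history.append(reply)  # keep the assistant turn in context for follow-ups
print(reply["content"])
```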
Whether optimising for efficiency on a compact RTX-powered PC or maximising throughput on a powerful desktop, LM Studio delivers full control, speed, and privacy on RTX.
Get the Most Out of RTX GPU Throughput
LM Studio's acceleration is built on llama.cpp, an open-source runtime designed for efficient inference on consumer hardware. NVIDIA collaborated with the LM Studio and llama.cpp communities to integrate several enhancements that maximise RTX GPU performance.
Key optimisations include:
- CUDA graph enablement: combines multiple GPU operations into a single CPU call, reducing CPU overhead and improving model throughput by up to 35%.
- Flash attention CUDA kernels: boost throughput by up to 15% by improving how LLMs process attention, a critical operation in transformer models. This optimisation enables longer context windows without requiring additional memory or compute.
- Support for the latest RTX architectures: LM Studio's CUDA 12.8 upgrade covers the full range of RTX AI PCs, from GeForce RTX 20 Series to NVIDIA Blackwell-class GPUs, letting users scale their local AI workflows from laptops to high-end desktops.
With a compatible driver, LM Studio automatically upgrades to the CUDA 12.8 runtime, enabling significantly faster model load times and better overall performance.
Together, these enhancements deliver smoother inference and faster response times across the full range of RTX AI PCs, from thin-and-light laptops to high-performance desktops and workstations.
Start Using LM Studio
LM Studio is free to download for Windows, macOS, and Linux. With the latest 0.3.15 release and ongoing optimisations, users can expect continued improvements in performance, customisation, and usability, making local AI faster, more flexible, and more accessible.
Models can be loaded through the desktop chat interface, and developer mode exposes an OpenAI-compatible API.
Download the most recent version of LM Studio and launch it to get started right away.
- Click the magnifying glass icon on the left panel to open the Discover menu.
- Select the Runtime tab on the left panel, find the CUDA 12 llama.cpp (Windows) runtime in the availability list, and click “Download and Install.”
- Once installation is complete, set LM Studio to use this runtime by default by selecting CUDA 12 llama.cpp (Windows) from the Default Selections dropdown.
- To further optimise CUDA execution, load a model in LM Studio and open the Settings menu by clicking the gear icon to the left of the loaded model.
- In the menu that appears, toggle “Flash Attention” on and drag the “GPU Offload” slider all the way to the right to offload every model layer onto the GPU.
With these settings enabled, the local setup is ready for accelerated inference on an NVIDIA GPU.
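One way to sanity-check the setup is to query the server's OpenAI-style models endpoint, assuming the local server has been started in developer mode on the default port 1234:

```python
# Sketch: verify the local server responds by listing available models.
import requests

models = requests.get("http://localhost:1234/v1/models", timeout=10).json()
for m in models.get("data", []):
    print(m["id"])  # identifiers you can pass as "model" in requests
```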
LM Studio supports model presets, multiple quantisation formats, and developer controls such as tool_choice for fine-grained inference. For anyone who wants to contribute, the llama.cpp GitHub project is actively maintained and continues to evolve with performance improvements driven by the community and NVIDIA.
LM Studio 0.3.15: RTX 50-series GPUs and enhanced tool use in the API
The stable release of LM Studio 0.3.15 is now available. It adds support for NVIDIA RTX 50-series GPUs (CUDA 12) and UI improvements, including a redesigned system prompt editor. It also improves the API's support for tool use (the tool_choice parameter) and adds a new option to log each generated fragment to the API server logs.
RTX 50-series GPU compatibility with CUDA 12
LM Studio now supports RTX 50-series GPUs with CUDA 12.8 llama.cpp engines on both Windows and Linux. As expected, this change significantly speeds up first-time model load on RTX 50-series GPUs. If your NVIDIA driver is compatible, LM Studio will automatically upgrade to CUDA 12 for RTX 50-series GPUs.
The minimum driver versions are:
- Windows: 551.61 or newer
- Linux: 550.54.14 or newer
If your driver is compatible with your RTX 50-series GPU, LM Studio will automatically upgrade to CUDA 12. If your driver is not compatible, LM Studio will fall back to CUDA 11 so your RTX 50-series GPU keeps working. You can manage this from the runtimes pane (Cmd/Ctrl + Shift + R).
New UI for the System Prompt Editor
System prompts are a powerful way to customise a model's behaviour, and they can range from a few words to several pages. LM Studio 0.3.15 introduces a much larger visual space for editing long prompts. The smaller prompt editor in the sidebar remains available as well.
Better Support for Tool Use APIs
The OpenAI-like REST API now supports the tool_choice parameter, which lets you control how the model uses tools. The parameter accepts three values:
- “tool_choice”: “none” means the model will not call any tools.
- “tool_choice”: “auto” lets the model decide whether or not to call tools.
- “tool_choice”: “required” forces the model to output only tool calls (llama.cpp engines only).
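As a minimal illustration, the same request body can switch between the three behaviours just by changing the tool_choice field; the get_weather tool and the model name below are hypothetical placeholders:

```python
# Sketch: a raw REST request to LM Studio's OpenAI-like API with
# tool_choice set. Swap "required" for "auto" or "none" to change how
# the model may use the (hypothetical) get_weather tool.
import requests

payload = {
    "model": "local-model",  # placeholder identifier
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "required",  # the model must respond with a tool call
}

resp = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
print(resp.json()["choices"][0])
```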
Additionally, this release fixes a bug in LM Studio's OpenAI-compatibility mode that prevented the chunk's “finish_reason” from being set to “tool_calls” when it should have been.
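In a streaming client, that fix is what makes the usual finish_reason check behave as expected. A rough sketch, again with placeholder model and tool names:

```python
# Sketch: with the fix, the final streamed chunk of a tool call reports
# finish_reason == "tool_calls". Model name and tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
tools = [{"type": "function", "function": {
    "name": "get_weather",  # hypothetical tool
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}}]

stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Weather in Oslo?"}],
    tools=tools,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].finish_reason:
        print("finish_reason:", chunk.choices[0].finish_reason)
```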
Community Presets (Preview)
System prompts and model settings can be conveniently packaged together using presets.
Starting with LM Studio 0.3.15, you can download presets made by other users and share your own with the community. You can also like and fork other people's configurations.
Go to Settings > General > Enable publishing and downloading presets to activate this feature.
Once enabled, right-clicking on a preset in the sidebar will reveal a new “Publish” button. You can then share your preset with the community.