I recently got into running large language models locally, something I had wanted to do for a while. The problem was that my laptop's discrete GPU, an RTX 3050 Laptop, only has 4GB of VRAM, which made it hard to get started. Then I found out that inference performance on Apple devices is quite good, and I happened to have an M2 Mac Mini on hand, which is how this article came about.
When we talk about user-friendliness, we usually mean something that works right out of the box, preferably with a graphical interface. Docker and Ollama arguably work out of the box, but they don't really have anything to do with a graphical interface. What I want to highly recommend is LM Studio.
Preparation
Why do I recommend it? Because it's good. Open its download page, and wow, it looks modern. Just download the client that matches your system; on Apple devices, an M-series chip is required.

Install it as usual, and once you open it you will see the main interface (of course, it won't look like this the first time you open it).

Click the gear icon in the lower right corner to open the settings, where you can switch the language to Chinese. The translation is incomplete, but it's better than nothing.

Alright, the preparation is mostly done, and we can bring out our large model.
Downloading and Loading the Large Model
What makes LM Studio good is how convenient it makes downloading models.

Click the magnifying glass icon (the fourth one from the top) to search for models. Since these models come from Hugging Face, you need a relatively clean IP that can reach it in order to download them.

We can choose based on model size. Apple's M-series chips use a unified memory architecture, so system memory and VRAM share the same pool. If I remember Apple's recent information correctly (I'm not sure of the exact figure), VRAM can take up to 75% of total memory. A model also consumes some extra memory beyond its own size while running, so a model whose size is around half of total memory should be able to run.
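To make this rule of thumb concrete, here is a rough sketch in Python. The function names are hypothetical, the 0.5 budget is simply the "around half of total memory" estimate from the paragraph above, and the size formula (parameters times bits per weight) is only an approximation: real model files are somewhat larger because of quantization metadata and because nominal parameter counts are rounded.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes).

    Assumption: every weight takes exactly `bits_per_weight` bits,
    with no extra metadata. Real files are a bit larger.
    """
    return params_billion * bits_per_weight / 8


def fits_in_memory(
    params_billion: float,
    bits_per_weight: float,
    total_memory_gb: float,
    budget_fraction: float = 0.5,  # "around half of total memory" rule of thumb
) -> bool:
    """Return True if the estimated weights fit within the memory budget."""
    budget_gb = total_memory_gb * budget_fraction
    return estimate_weight_gb(params_billion, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    # A nominal 7B model at 4 bits on an 8GB machine.
    print(estimate_weight_gb(7, 4))   # 3.5
    print(fits_in_memory(7, 4, 8))    # True
    # The same model at 8 bits no longer fits the budget.
    print(fits_in_memory(7, 8, 8))    # False
```

By this estimate, a 4-bit 7B model is right at the edge of what an 8GB machine can handle, so treat the result as a first filter rather than a guarantee.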
It's also worth mentioning that LM Studio supports Apple's MLX deep learning framework, which has lower data transfer overhead than PyTorch and suits M-series chips better than the common GGUF format. So it's best to pick MLX models when choosing.
Once the model is downloaded, you can load it. After repeated experiments, the best model my Mac Mini with 8GB of memory can run is Qwen2-7B-Instruct-4bit. It not only supports the full 32k context but also runs at a decent speed, and its Chinese is better than that of foreign large models.

To be honest, since the launch of the Qwen models, my impression of Alibaba Cloud has completely reversed. Although Alibaba Cloud's Singapore data center caught fire and there was almost no disaster recovery, training Qwen with native support for Japanese and Korean is great. It helps with manga translation and deserves praise. Jack Ma can be said to have "washed away the dust of ages."
Then you can chat with Qwen2-7B. Generation speed will depend on your own setup and needs, but you can use my M2 as a reference.

It's about 19.9 tokens/s, which is usable. Compared to Phi 3's nonsensical output, Gemma 2's lack of understanding of Chinese, DeepSeek's bulkiness, and Mistral's self-questioning, Qwen2 comes across as cute and calm. I love it. As for RAG and local API calls, let's talk about those next time.
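For a feel of what 19.9 tokens/s means in practice, here is a tiny back-of-the-envelope sketch. The function name and the 500-token reply length are made up for illustration, and it only counts generation time, ignoring the time spent processing the prompt.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate `num_tokens` at a steady rate, ignoring prompt processing."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical 500-token reply at the speed measured on my M2.
    seconds = generation_seconds(500, 19.9)
    print(f"{seconds:.1f} s")  # 25.1 s
```

So a fairly long answer takes under half a minute, which matches the "usable" impression.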

Indeed, 4-bit quantization is still too clumsy. I'll try Qwen2.5 another day to see if it's just as clumsy. I still love it and won't call it a fool.
This article is synchronized and updated to xLog by Mix Space. The original link is https://www.actorr.cn/posts/default/usingLMStudio