/Users/../llama-b8740/llama-bench --no-warmup -m /Users/../Qwen3.5-9B-UD-Q4_K_XL.gguf -p 128 -n 256 -t 1,2,3,4
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_init_from_source: error compiling source
ggml_metal_device_init: - the tensor API is not supported in this environment - disabling
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10 (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 12713.12 MB
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 1 | pp128 | 110.11 ± 4.18 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 1 | tg256 | 9.50 ± 0.30 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 2 | pp128 | 107.58 ± 6.53 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 2 | tg256 | 9.45 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 3 | pp128 | 110.79 ± 1.14 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 3 | tg256 | 9.29 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 4 | pp128 | 110.78 ± 1.42 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 4 | tg256 | 8.75 ± 0.95 |

With the default mode (battery saving off):

/Users/../llama-b8740/llama-bench --no-warmup -m /Users/../Qwen3.5-9B-UD-Q4_K_XL.gguf -p 128 -n 256 -t 1,2,3,4
[..]
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 1 | pp128 | 216.04 ± 8.47 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 1 | tg256 | 20.17 ± 0.04 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 2 | pp128 | 209.43 ± 0.35 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 2 | tg256 | 20.21 ± 0.03 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 3 | pp128 | 196.55 ± 5.79 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 3 | tg256 | 18.34 ± 3.22 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 4 | pp128 | 193.50 ± 2.13 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | MTL,BLAS | 4 | tg256 | 19.57 ± 0.18 |

Quote from: en.wikipedia.org/wiki/Apple_M5#Performance
Peak GPU AI compute: over 4× faster

That claim does not apply to running third-party LLMs; only the increase in RAM/unified-memory bandwidth speeds up token generation (+28%: 153.6 GB/s (M5) / 120 GB/s (M4) = 1.28). (153.6 GB/s = 128 bit × 9600 MT/s / 8 / 1000.)
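The bandwidth arithmetic above can be double-checked with a few lines of Python (the figures are the ones quoted in the post; the helper name is mine):

```python
def bandwidth_gbs(bus_bits, mt_per_s):
    """Peak memory bandwidth: bytes per transfer x transfers/s -> GB/s (decimal)."""
    return bus_bits / 8 * mt_per_s / 1000

m5 = bandwidth_gbs(128, 9600)  # 153.6 GB/s (M5)
m4 = 120.0                     # GB/s (M4)
speedup = m5 / m4              # 1.28, i.e. ~28% faster token generation
print(m5, speedup)
```

Since token generation is memory-bandwidth bound for dense models, the ~28% bandwidth gain is the realistic upper bound on the tg speedup, regardless of the GPU's peak AI-compute figure.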
Quote from: youtube.com/watch?v=HKxIGgyeISM
Apple's Energy Model - Deconstructed
In this video, I reverse engineer Apple's Energy Model on the Mac Studio M4 Max. In the process I explain why and how measured DC power can appear up to 3 times higher than reported M4 Max GPU power.
[..]
Quote from: juri on March 13, 2026, 03:09:11
and still no matte option for the screen, so idiotic without any reason.

It's not 12W, look again. Maybe the matte coating on your screen played a prank on you ;)
Quote from: not_anton on March 11, 2026, 17:16:44
Good to know, but will those quants fit (if I had to guess I'd say yes)? There's also sysctl iogpu.wired_limit_mb=<MB> (I know not to assign too much to the VRAM (leaving 1-3 GB may be ok), as the OS may start to write/swap to the SSD).

Quote from: Will MBAir fit 27B quants on March 11, 2026, 12:23:33
Will the 24 GB RAM option fit Qwen3.5-27B-UD-Q4_K_XL.gguf (17.6 GB) or Qwen3.5-27B-UD-Q5_K_XL.gguf (20.2 GB)? (huggingface.co/unsloth/Qwen3.5-27B-GGUF) (I know there's mlx-community/Qwen3.5-27B-4bit (16.1 GB) too, but I don't know if its perplexity is good)

I have a 15" M2 Air with 24GB RAM, but it can only run 3B models max because of overheating. Work is fine, gaming is fine, but LLMs throttle it to 0.4 GHz on the GPU within a minute. Sorry, you would need something with a fan or two to make those models useful.
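The fit question above can be sketched as back-of-the-envelope arithmetic. A minimal sketch, assuming the quoted sizes and the 1-3 GB OS headroom mentioned for the raised iogpu.wired_limit_mb; the function name, headroom, and KV-cache allowance are my assumptions, not measured values:

```python
def fits_in_wired_limit(model_gib, total_ram_gib, os_headroom_gib=3.0, kv_cache_gib=1.0):
    """Hypothetical check: does model weights + a small KV cache fit under a
    wired limit of (total RAM - OS headroom) on a unified-memory Mac?"""
    wired_limit_gib = total_ram_gib - os_headroom_gib
    return model_gib + kv_cache_gib <= wired_limit_gib

# 24 GB machine, quants from the quoted question:
print(fits_in_wired_limit(17.6, 24))  # Q4_K_XL: True (18.6 <= 21)
print(fits_in_wired_limit(20.2, 24))  # Q5_K_XL: False (21.2 > 21)
```

So by this rough estimate Q4_K_XL should fit with the wired limit raised, while Q5_K_XL is borderline and risks swapping to the SSD; longer contexts need a larger KV-cache allowance and tighten both cases.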
Quote from: dumb_oems on March 09, 2026, 11:44:29
This is not emphasized enough
...
the actual user experience, especially on battery power
Quote from: JimD on March 09, 2026, 12:11:02
Fanless is not really an attraction for me.