AI coding assistance discussion

Page 2 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.
Jul 27, 2020
24,268
16,925
146

Pretty CRAZY model.

Don't believe me? Ask it something and watch it go. Like really, really go.

It doesn't stop until it runs against some sort of limit. Keeps going through different code possibilities.
 
Speculative decoding now allows a larger 231B model to verify the draft work of the smaller 13B model, resulting in improved response times.
 
It's LIVE!


Kids can now create their own CPU benchmarks!

(yes, I'm a 44 year old kid...)
 
RAM latency checker: https://www.overclock.net/posts/29439133/

As described there, it doesn't measure absolute latency, but the results seem fairly consistent.

Tested to work on Haswell and onwards. I don't think I can try it on my Epyc today, so would someone volunteer to test it on their Ryzen? Thanks!

EDIT: Tested and working as intended on Tiger Lake. The average latency deviation isn't wild, which means it can be useful.
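For the curious, the core trick behind latency checkers like that one is pointer chasing: each load's address depends on the previous load, so the prefetcher can't help. A rough Python sketch of the idea (the absolute number here is dominated by interpreter overhead, so, as with the tool above, only the run-to-run consistency means anything):

```python
# Minimal pointer-chasing sketch of how a RAM latency checker works:
# walk a random permutation so the next index depends on the current
# load, defeating hardware prefetchers. Pure Python, so the absolute
# ns/hop is inflated by interpreter overhead -- only consistency
# across runs is meaningful here.
import random
import time

def chase_ns(n=1 << 16, hops=200_000):
    # Build one big cycle: nxt[i] points somewhere "random".
    order = list(range(n))
    random.shuffle(order)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b
    i = 0
    t0 = time.perf_counter()
    for _ in range(hops):
        i = nxt[i]          # each step is a dependent load
    dt = time.perf_counter() - t0
    return dt / hops * 1e9  # ns per hop

samples = [chase_ns() for _ in range(3)]
print([round(s, 1) for s in samples])
```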
 
A disappointment to report, in the hope it dissuades someone else from investing in expensive hardware (good thing the LLM wasn't the only reason I bought the laptop).

So my ThinkPad now has 128GB RAM and an RTX 5000 16GB dGPU. I was hoping I would be able to run Llama 3.3 70B. It loads at a context length of 16384 and consumes 71GB of system RAM plus all of the VRAM. Unfortunately, the calculations are not offloaded to the GPU: despite lowering the CPU core count to 1 and offloading all 80 layers to the GPU, it stays at 0% utilization. The processing happens on the CPU, and even when set to a max of 6 cores (HT not supported by LM Studio, I guess), CPU utilization does not go beyond 17%. It does give a response, at the horrible speed of something like 0.05 tokens per second or even lower. I gave up on it and am now downloading another 8B LLM at F16 and Q8, to take advantage of speculative decoding. If I still don't get any GPU utilization, I will need to troubleshoot (maybe a driver issue?).
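Some back-of-the-envelope math on why a 70B model eats memory like that (rule-of-thumb bits-per-weight and a rough KV-cache formula; real loaders add overhead on top, and the layer/head counts below are Llama 3 70B's published ones):

```python
# Rough memory estimate for a quantized LLM: weights + KV cache.
# Rule-of-thumb figures only; real loaders add runtime overhead.

def model_gb(params_b, bits_per_weight):
    # params_b: parameter count in billions
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(ctx, layers, kv_heads, head_dim, bytes_per_elem=2):
    # K and V per layer: ctx * kv_heads * head_dim elements each (fp16).
    return 2 * ctx * layers * kv_heads * head_dim * bytes_per_elem / 1e9

# Llama 3 70B: 80 layers, 8 KV heads (GQA), head_dim 128.
weights = model_gb(70, 4.5)          # ~Q4_K-class average bits/weight
kv = kv_cache_gb(16384, 80, 8, 128)  # 16k context, fp16 cache
print(round(weights, 1), round(kv, 1))  # ~39.4 GB weights, ~5.4 GB KV
```

At ~4.5 bits/weight the weights alone are ~39GB; at Q8 they'd be ~70GB, which lines up more closely with the 71GB RAM + 16GB VRAM usage above.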
 
LM Studio can't use both GPUs in parallel, so one of them is doing the hard work while the other is chilling, just holding some data in its VRAM.
 
Tried the same prompt with and without GPU offloading; in the CPU-only scenario, it generated twice as many tokens to arrive at the solution. I may have to repeat this a number of times to verify that the behavior is consistent, but it raises the question of why the "thinking" is better with the GPU involved.
 
Was planning to benchmark my 9950X3D using the LG ExaOne Deep F16 model. On the Xeon 6248R, I got roughly 3.7 tokens per second. Tried it at home and, first, it loads up only the bottom half of the threads in Task Manager. Second, it keeps processing and never gets to the "thinking" stage; it just wastes a whole lot of power for nothing. So I suspect:

1) LM Studio isn't optimized for 9950X3D or getting confused by the CCD crap.

2) LM Studio is secretly co-owned by Intel or AMD or both and it only works flawlessly on server CPUs.

Extremely annoyed since the Xeon was only 41% utilized with max speeds of 3.9 GHz on 24 cores while the 9950X3D was hitting 5+ GHz on 16 cores and still failed to progress to the thinking stage.
 

MS_AT

Senior member
Jul 15, 2024
614
1,262
96
Extremely annoyed since the Xeon was only 41% utilized with max speeds of 3.9 GHz on 24 cores while the 9950X3D was hitting 5+ GHz on 16 cores and still failed to progress to the thinking stage.
The prompt-processing part is compute-heavy, and during that phase, on 16 threads, your clocks should be sinking low if the code is well optimized. SMT will by definition be useless in this case.

The token-generation part in the single-user case is dominated by memory bandwidth; during that phase the clocks will be high, and you could get by with even fewer than 16 threads (it takes 2 threads, pinned to different CCDs, to maximize memory-bandwidth usage; whether more threads help depends on the model's compute needs).
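That bandwidth point is easy to sanity-check: in the single-user case, each generated token streams essentially the whole weight file through memory once, so bandwidth divided by model size gives a ceiling on tokens/sec. A sketch with illustrative (not measured) bandwidth figures:

```python
# Upper bound on single-user token generation: every token reads the
# whole set of weights once, so tokens/s <= bandwidth / model size.
# The bandwidth numbers below are illustrative assumptions.

def max_tokens_per_s(model_gb, mem_bw_gbs):
    return mem_bw_gbs / model_gb

configs = {
    "9950X3D, DDR5 dual channel (~85 GB/s)": 85,
    "Xeon 6248R, DDR4-2933 six channel (~140 GB/s)": 140,
    "A770, GDDR6 (~560 GB/s)": 560,
}
model = 39.4  # e.g. a 70B model at ~Q4, in GB
for name, bw in configs.items():
    print(f"{name}: <= {max_tokens_per_s(model, bw):.1f} tok/s")
```

Which is why a 24-core server chip with six memory channels can beat a much faster-clocked desktop part at this workload.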

The discussions in https://github.com/ikawrakow/ik_llama.cpp are quite insightful, as are those in https://github.com/ggml-org/llama.cpp
 
Athene V2 Chat IQ4_XS 73B model

9950X3D

6200C52 FCLK 2133 UCLK 3100 CO -37

Pretty impressive that it's maintaining a solid 5.35 GHz speed, even with my less than stellar 240mm AIO cooler.
 

dank69

Lifer
Oct 6, 2009
36,937
32,133
136
I wouldn't trust AI for anything beyond scaffolding. The wettest of the wet jr. coders.
 
The token-generation part in the single-user case is dominated by memory bandwidth; during that phase the clocks will be high, and you could get by with even fewer than 16 threads (it takes 2 threads, pinned to different CCDs, to maximize memory-bandwidth usage; whether more threads help depends on the model's compute needs).
I found out why I've been getting much lower token/s than what people report in Reddit threads: they are using low-parameter models (in other words, almost crap and useless).

I tried StarCoder 10.7B.

Results:

9950X3D 7200C34 ~10.5 tokens/sec
9950X3D 7600C36 ~11.5 tokens/sec

A770 16GB ~25 tokens/sec

This is the only model where the A770 is able to shine so far in my testing (not that I've been able to try that many).

Seems large-parameter-count models bring even GPUs to their knees.

But the response quality of the Athene V2 Chat model was much higher: it made more of an effort, producing an elaborate solution to the problem (I asked it to provide code for the best sorting algorithm), whereas StarCoder kept its response short and less comprehensive.
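If anyone wants to compare setups apples to apples, timing a token stream yourself is trivial; here's a tiny helper, with a fake generator standing in for a real model's streaming output:

```python
# Measure tokens/sec over any token iterator. The fake_model generator
# is a stand-in; in a real run you'd iterate an LLM's streaming output.
import time

def measure_tps(token_stream):
    """Consume a token iterator; return (n_tokens, tokens_per_second)."""
    t0 = time.perf_counter()
    n = sum(1 for _ in token_stream)
    dt = time.perf_counter() - t0
    return n, n / dt

def fake_model(n=500, delay=0.0005):
    for i in range(n):
        time.sleep(delay)  # pretend each token takes ~0.5 ms
        yield i

n, tps = measure_tps(fake_model())
print(n, round(tps), "tok/s")
```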
 

MS_AT

They are using low parameter models (in other words, almost crap and useless).
You don't tell us the quantization you're using; for programming, for example, AMD suggested Q6 in their materials. Also, for programming, try Qwen 2.5 Coder; it should do better in theory.
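Quantization changes the footprint a lot. Rough average bits-per-weight for common GGUF quant levels (rule-of-thumb values; the K-quants vary a bit per tensor) make the size trade-off concrete:

```python
# Approximate average bits/weight for common GGUF quant levels.
# Rounded rule-of-thumb values; actual files vary slightly per tensor.
BITS = {"IQ4_XS": 4.25, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def size_gb(params_b, quant):
    # params_b in billions of parameters -> approximate file size in GB
    return params_b * BITS[quant] / 8

for q in BITS:
    print(f"10.7B at {q}: {size_gb(10.7, q):.1f} GB")
```

So going from IQ4_XS to Q6_K costs roughly 55% more memory for the same parameter count, which is the price of the quality AMD's Q6 recommendation is after.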
 
I found an extremely lovely (I don't say that lightly because it made me really happy) tiny model: tiny-llama-R1.

Just watched it go crazy fast using only the 9950X3D at 67.5 tokens/sec @ IQ4 quant!!!

I'm almost too giddy to try it on the GPU because it might break 100 tokens/sec there
 