Just got a 9654 Genoa ** Motherboard /ram/heatsink/CPU are here !!


cellarnoise

Senior member
Mar 22, 2017
721
399
136
I think Genoa has been pretty well covered as far as benchmarking goes... Maybe not completely for the totally crazy AVX-512 stuff, but mostly, yes...

New generations come out soon, but the Genoa generation is great so far and won't be outdone on price to performance for like 6 to 12 months?
 

cellarnoise

Senior member
Mar 22, 2017
721
399
136
So far I think a 9554 E.S. is better than 3 x 7950X at full-core stuff, at least until a 7950X goes above 180 W or so... over 150-ish watts, a 7950X goes downhill on performance per watt / burn.

4th gen does a bit better on performance per watt in AVX-512 tasks, and on other stuff too; sometimes better than 3x the 7950X, and mostly better across the board.
 
Last edited:

cellarnoise

Senior member
Mar 22, 2017
721
399
136
A 7950X power constrained some... like under 160 or 150 W, or even hard locked with a set Vcore... is the best-performing sillycone on hard AVX-512 tasks... But price to performance comes into play, along with how many puters you wish to support...
 
Jul 27, 2020
16,816
10,755
106
A 7950X power constrained some... like under 160 or 150 W, or even hard locked with a set Vcore... is the best-performing sillycone on hard AVX-512 tasks
Does the 7950X3D perform better in AVX-512? Do AVX-512 workloads benefit greatly from V-cache? I don't think anyone has investigated that much. Best I can do is this:



Yes, these are different versions, but notice that the newer version 2022.3 provides a slight uplift in performance; when V-cache comes into the picture, though, all bets are off.


phoronix said:
the Ryzen 9 7950X3D was pulling 118 Watts on average compared to 192 Watts with the 7950X. The power efficiency improvements here were phenomenal.
Maybe the DC folks are missing out on even better performance at much lower power consumption and heat output?
 

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
Do AVX-512 workloads benefit greatly from V-cache?
The AVX-512 workloads in distributed computing require less than 32 MB of level 3 cache. (Actually, I haven't checked the recently introduced AVX-512 version of Asteroids@Home's application yet. However, its AVX2 variant does not look like a heavy user of the vector units to begin with. Also, I never had the impression that it was memory-bound even when all SMT threads were loaded on AMD and Intel CPUs; the power meters of my computers would give hints of such application behavior.)

Now, one could use the total of 96 MB cache per CCX of V-cache equipped CPUs to run more instances of the mentioned applications at once, giving fewer threads to each instance than if merely one or two instances were running concurrently on a CCX. But this would prolong the task durations a lot, while not reducing the applications' internal thread synchronization overhead to a degree which would matter.
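
If somebody wanted to try that, here is a rough sketch of how one could start one instance per CCX on Linux, pinned via taskset. The core ranges and the application command are only placeholders; check the actual core-to-CCX layout with lscpu -e first.

Code:
#!/usr/bin/env python3
# Rough sketch: start one application instance per CCX, each pinned to that
# CCX's cores via taskset, so each instance gets its own L3 slice.
# Assumptions (adjust to your CPU): 8 cores per CCX, SMT not used, cores
# numbered consecutively per CCX -- verify with `lscpu -e` first.
import subprocess

CORES_PER_CCX = 8
NUM_CCX = 2                                                 # e.g. a two-CCD desktop part
APP_CMD = ["./dc_app", "--nthreads", str(CORES_PER_CCX)]    # placeholder command

procs = []
for ccx in range(NUM_CCX):
    first = ccx * CORES_PER_CCX
    last = first + CORES_PER_CCX - 1
    cpu_list = f"{first}-{last}"                            # e.g. "0-7", "8-15"
    procs.append(subprocess.Popen(["taskset", "-c", cpu_list] + APP_CMD))

for p in procs:
    p.wait()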

It is possible that some of the simulations which the Rosetta@home project runs would profit from V-cache. But so far we haven't come up with a way to benchmark this. The payload in Rosetta@home workunits differs a lot between different simulation campaigns of theirs. (Rosetta@home's applications do not use AVX-512. But they might issue many memory accesses.)

Edit: another question related to Distributed Computing which hasn't been investigated yet, AFAIK, is whether or not GPGPU applications benefit from V-cache.

phoronix said:
the Ryzen 9 7950X3D was pulling 118 Watts on average compared to 192 Watts with the 7950X. The power efficiency improvements here were phenomenal.
Maybe the DC folks are missing out on even better performance at much lower power consumption and heat output?
Several of us who do have 5900x, 5950x, 7900x, 7950x and so on already operate them with reduced package power tracking limit, to greatly increase power efficiency versus stock.
 
Last edited:
Jul 27, 2020
16,816
10,755
106
It is possible that some of the simulations which the Rosetta@home project runs would profit from V-cache. But so far we haven't come up with a way to benchmark this.
How about monitoring two systems, one with a 7950X and the other with a 7950X3D, to see which one completes more work units in a given timeframe? Like in a month?
 

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
How about monitoring two systems, one with a 7950X and the other with a 7950X3D, to see which one completes more work units in a given timeframe? Like in a month?
In practice, the intermittent nature of Rosetta@home's work availability may make this too imprecise even if tracked for a long time frame like a month or so. There are periods during which the reception of work by a specific host is luck of the draw: one host might grab a bunch while another may only get a few, or miss out entirely, for days to weeks.

Second, it is more complicated than just counting completed workunits in the case of Rosetta@home. Their workunits have a fixed, user-configurable duration, within which several simulation runs with changing randomized starting values are performed. One would have to count the number of these simulations-within-the-simulation. But different workunit batches are not comparable at all; some get just a couple of these internal simulations done within one round, others do hundreds.

Hence, best for computer performance comparisons would be to pick out representative workunits (ones with large, medium, and small payloads), and bake them into fully repeatable standalone benchmarks.
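
To make that concrete, such a standalone benchmark could be as simple as the following sketch, assuming a representative workunit has been extracted and the science application can be invoked on it from the command line (the binary name and input file are made-up placeholders).

Code:
#!/usr/bin/env python3
# Minimal sketch of a repeatable benchmark: run the same extracted workunit
# several times and report min/median wall-clock time. The application name
# and input file are placeholders for whatever one has extracted.
import statistics
import subprocess
import time

CMD = ["./science_app", "benchmark_workunit.dat"]   # hypothetical invocation
RUNS = 5

times = []
for i in range(RUNS):
    start = time.perf_counter()
    subprocess.run(CMD, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    times.append(elapsed)
    print(f"run {i + 1}: {elapsed:.1f} s")

print(f"min {min(times):.1f} s, median {statistics.median(times):.1f} s")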
 

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
One more note on cache demand of science applications, especially in the Distributed Computing context. We've got basically three kinds of applications, plus a fourth which we don't encounter in DC (a rough way to probe an application's cache appetite yourself is sketched after the list):
  • The very special case of LLR2 and Genefer (also, Prime95 and similar):
    These implement one or another number-theoretical transform which sort of rotates a fixed set of coefficients through the vector pipelines of the CPU. The larger the investigated numbers are, the more coefficients are involved and, correspondingly, the more cache is desired. But as mentioned, we are getting by with 32 MB cache or less so far, depending on the progress of the respective project within its own search space.
    --> V-cache doesn't help, because we don't need as much... yet. Should these projects wander past 32 MB cache demand at some point in the future, then we also would want more cores per CCX. Or rather, a GPGPU implementation. (Which we already have in case of Genefer.)​
  • The more general case:
    The science code is well optimized in the parts which matter most to throughput. The optimization target is invariably a CPU with a normal amount of cache. Such optimizations happen if the code author knows his stuff, and/or uses standard code which was optimized accordingly by the originators of that base code.
    --> V-cache doesn't help because the numeric algorithms were optimized to work well on widely deployed hardware.​
  • Another general and not at all uncommon case:
    The science code is anything but optimized, and wastes a lot of CPU time needlessly. V-cache might help a very little with some luck in such cases, but what's really needed would be a thorough overhaul of the code. This type of science code exists because the authors don't have the necessary computer science background or the resources to hire such a workforce.
    --> Maybe, or maybe not, V-cache would reduce the wastefulness of applications like this, a little.​
  • Certain HPC cases with large working set sizes (CFD and the likes):
    We don't have these in Distributed Computing. Even the ClimatePrediction.net applications, which do work on relatively large datasets, are in fact optimized to utilize normal CPUs well.
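
For what it's worth, a crude first check of whether a given application even has an appetite for more cache is to look at its last-level cache miss ratio, e.g. with perf on Linux. A minimal sketch; the application command is a placeholder, and the generic cache-references/cache-misses events map to different hardware counters depending on the CPU, so treat the numbers as a rough indicator only.

Code:
#!/usr/bin/env python3
# Crude sketch: run an application under `perf stat` and show the generic
# cache-references / cache-misses counters as a first hint of whether a
# larger L3 could matter. How these generic events map to hardware counters
# varies by CPU, so the numbers are only a rough indicator.
import subprocess

APP_CMD = ["./dc_app", "benchmark_workunit.dat"]    # placeholder invocation

result = subprocess.run(
    ["perf", "stat", "-e", "cache-references,cache-misses"] + APP_CMD,
    capture_output=True,
    text=True,
)
print(result.stderr)    # perf stat writes its counter summary to stderr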

It is possible that some of the simulations which the Rosetta@home project runs would profit from V-cache.
Also, maybe (or maybe not) WCG's African Rainfall project could benefit. It is similar in nature to ClimatePrediction.net's applications, and I would hope that it is optimized similarly to these.

another question related to Distributed Computing which hasn't been investigated yet AFAIK, is whether or not GPGPU applications benefit from V-Cache.
On second thought, it probably doesn't make sense to cache the data which are transferred to and from the GPU. Except for pre- and post-processing by the CPU, but then, normal sized caches should suffice because the CPU ideally does not have to perform multiple passes over the data with interdependency between all the data; that's what the GPGPU part of the application is for.
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
@Markfw
Have you ever measured the power usage from the wall for the 9654 under a heavy AVX-512 load, like in PrimeGrid?

What's your estimate for the 9654's power usage at full load?
Package power tracking should get the same results as with the 9554. A difference between the 9654 and 9554, however, is that the latter runs into the f_max cap earlier, due to its obviously higher power budget per core. And the rest of the power consumption, which happens outside the package, is the same for all 9004 series EPYCs, naturally depending on mainboard make, memory population, and so on.

So far, I have never noticed my 9554P computer (config) pull more than 70 W over the PPT limit at the wall. Caveat: I have doubts that the BIOS switch between performance determinism and power determinism works right, or that I have used it right yet. Power determinism attempts to drive each individual processor closer to the PPT limit, whereas performance determinism attempts to equalize slight power efficiency differences between specimens of the same OPN. The BIOS default should be performance determinism.

I have seen more power overhead from PPT to at-the-wall consumption with dual-socket EPYC 7452 when in power determinism mode. I'd have to dig out notes to say how much more.
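
As an aside, on Linux the package-side number can also be cross-checked against the wall meter via the RAPL energy counter in the powercap interface, at least on recent kernels with recent Zen parts. Whether the sysfs path below exists, and whether it is readable without root, depends on kernel and platform, so this is only a sketch.

Code:
#!/usr/bin/env python3
# Sketch: estimate average package power from the RAPL energy counter,
# for comparison against a wall meter. Requires a kernel that exposes
# /sys/class/powercap for the CPU; the exact domain path may differ and
# reading it usually needs root.
import time

ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"   # package 0 on many systems
INTERVAL = 10.0                                          # seconds

def read_uj():
    with open(ENERGY) as f:
        return int(f.read())

e0, t0 = read_uj(), time.time()
time.sleep(INTERVAL)
e1, t1 = read_uj(), time.time()
# Note: the counter wraps around eventually; wraparound is ignored in this sketch.
print(f"average package power: {(e1 - e0) / 1e6 / (t1 - t0):.1f} W")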

BTW, my M97 hsf is keeping it at 66c with 62 processes avx-512 !
I am still doubting that Process Lasso, at least with the configs which I have seen posted, aligns tasks with caches as much as desired.
Edit: Though this concern doesn't apply to single-threaded PPS-LLR with SMT not used.

Edit 2:
The only thing that's holding me back is the availability of the heatsink I want. (AMD SP5 M99)
@emoga, this cooler is similar to the one in the HP Z6 G5 A (Threadripper Pro workstation), except that I suspect that the fans in the HP workstation can go higher than 2200 RPM (M99 spec). And the number of heatpipes in HP's cooler isn't known. Anyway, look around for reviews of the HP workstation to get an impression of whether this style of cooler, or a 4U air cooler in general, would work for you.
 
Last edited:
Reactions: emoga

pututu

Member
Jul 1, 2017
148
224
116
Maybe the DC folks are missing out on even better performance at much lower power consumption and heat output?
For DCing, if possible I always try to optimize the CPU or GPU voltage (set lower than the default curve) for less heat output and better computational efficiency in terms of PPD (points per day) per watt. The most dominant factor affecting power consumption/heat output is obviously the CPU and GPU core voltage. I generally try to google whether someone has already mapped out, say, the 7950X clock-versus-voltage curve, and use this as a starting guide to optimize the CPU clock/voltage for a particular DC project on my own system. I never had any great success with Ryzen Master (ECO mode or with a CO offset), so I go into the BIOS and adjust the clock and voltage directly. Once you get a baseline, it is easy to optimize your system for other projects.

Below is the 7900X voltage-versus-clock curve taken from here. I tried it on my 7950X and it is not that far off, though it requires slightly higher voltage due to higher power density.
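
As a rough way to turn a handful of points from such a curve into a starting guess for another clock target, something like the sketch below works; the numbers in it are made-up placeholders, not the actual 7900X curve.

Code:
#!/usr/bin/env python3
# Sketch: linearly interpolate a starting core voltage for a target clock
# from a few measured (clock, voltage) points. The points below are made-up
# placeholders -- substitute your own measurements.
measured = [            # (MHz, V), sorted by clock
    (3600, 0.95),
    (4200, 1.05),
    (4800, 1.20),
]

def v_for_clock(mhz):
    lo = measured[0]
    for hi in measured[1:]:
        if mhz <= hi[0]:
            frac = (mhz - lo[0]) / (hi[0] - lo[0])
            return lo[1] + frac * (hi[1] - lo[1])
        lo = hi
    return measured[-1][1]      # clamp above the highest measured point

target = 4500
print(f"~{v_for_clock(target):.3f} V as a starting point for {target} MHz")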
 
Reactions: igor_kavinski

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
@pututu, as another reference point, I happen to have notes from 9554P @ 400W running GFN-20:
average clock was 3.40 GHz (didn't take notes of max clocks or the spread, but the spread of core clocks wasn't large)
CPU_VDDCR0 = 0.967 V, CPU_VDDCR1 = 0.967 V, CPU_SOC = 0.811 V, CPU_VDDIO = 1.092 V (according to the BMC, via IPMI, I haven't researched yet if there is a driver for the CPU's own sensors, but at least the BMC's reading isn't subject to AGESA bugs... only subject to BMC firmware bugs)

Some projects accept results at a quorum of 1. Better not to undervolt the CPU for these projects; rather, make do with a PPT limit. (Of course, don't let the BIOS overvolt the CPU either...)
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,639
14,629
136
@pututu, as another reference point, I happen to have notes from 9554P @ 400W running GFN-20:
average clock was 3.40 GHz (didn't take notes of max clocks or the spread, but the spread of core clocks wasn't large)
CPU_VDDCR0 = 0.967 V, CPU_VDDCR1 = 0.967 V, CPU_SOC = 0.811 V, CPU_VDDIO = 1.092 V (according to the BMC, via IPMI, I haven't researched yet if there is a driver for the CPU's own sensors, but at least the BMC's reading isn't subject to AGESA bugs... only subject to BMC firmware bugs)

Some projects accept results at a quorum of 1. Better not to undervolt the CPU for these projects; rather, make do with a PPT limit. (Of course, don't let the BIOS overvolt the CPU either...)
Now that's odd... My QS 9554s run 3.5 GHz, all 4 of them! Faster than retail. Same motherboard, stock BIOS, no changes. One using Linux, the other 3 using Windows 10.
 

pututu

Member
Jul 1, 2017
148
224
116
Some projects accepts results at a quorum of 1. Better don't undervolt the CPU at these projects. Rather, make do with a PPT limit. (Of course, don't let the BIOS overvolt the CPU either...)
The non-AVX workloads are easier to undervolt with a fixed CPU core voltage. With AVX workloads, I have to add additional voltage for stability. I just use a simple formula to estimate how much voltage I need to add to the CPU. For example, if the AVX workload increases the CPU temp by 20°C over a "typical" CPU temp when running a non-AVX workload, I just add about 40 mV to compensate for the CPU voltage drop due to heating (assuming the typical temperature coefficient of silicon is about -2 mV/°C, add at least 2 mV/°C * 20°C = 40 mV). My 7950X is air-cooled with a Noctua NH-U9S (the smallest desktop Noctua CPU heatsink cooler, I think). I don't run at the default curve or overclock, so a single 92 mm fan is sufficient for my use case.
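
Written out as a tiny calculation (same rule-of-thumb numbers as above, nothing more):

Code:
# Rule-of-thumb offset for AVX loads, per the estimate above:
# add roughly 2 mV of core voltage per degree C of extra heating.
temp_coeff_mv_per_c = 2   # magnitude of the assumed -2 mV/°C silicon coefficient
delta_temp_c = 20         # extra heating seen under the AVX workload
extra_mv = temp_coeff_mv_per_c * delta_temp_c
print(f"add about {extra_mv} mV to the fixed core voltage")   # -> 40 mV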

If you have many systems to handle, using Ryzen Master with a CO offset is probably the best way to go. Anyway, I have only four main rigs (including one EPYC Rome system) that I undervolt all the time. Other than playing around with cTDP, does anyone know how to undervolt an EPYC Genoa system (as opposed to a Threadripper)? Maybe a future upgrade when the price is right. Yeah, I'll be waiting for a few years...
 

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
notes from 9554P @ 400W running GFN-20:
average clock was 3.40 GHz
My QS 9554's run 3.5 ghz, all 4 of them !
The clock speed of my EPYC, like that of all (production) EPYCs, depends on the workload.
(In workloads which don't spend a lot of energy in the cores, clock speed approaches f_max, of course.)

Another data point:
Running 64 SGS-LLR tasks at once [...with default 360 W PPT limit...] Cores are running at about 3.3 GHz.

And another:
MilkyWay@Home NBody, 4 threads/task, all SMT threads used, cache-aligned affinity, a light workload which needs merely 365 W at the wall at default 360 W PPT limit,
min/avg/max core clocks =
3.39/3.66/3.74 GHz (Genoa, f_base/f_max = 3.1/3.75 GHz),
Same but at 400 W PPT limit: 415 W at the wall, min/avg/max core clocks =
3.0/3.70/3.75 GHz (one take)
2.4/3.70/3.75 GHz (another take)

Also, this is a good occasion to remind everyone that high core clocks do not always mean high execution speed. Sometimes they merely mean that the cores aren't actually doing much besides waiting for memory accesses or for inter-thread synchronization.
 
Last edited: