Just got a 9654 Genoa ** Motherboard /ram/heatsink/CPU are here !!


cellarnoise

Senior member
Mar 22, 2017
721
399
136
I think Genoa has been pretty well covered as far as benchmarking goes... Maybe not completely for the totally crazy AVX-512 stuff, but mostly, yes...

New generations come out soon, but the Genoa generation is great so far and won't be outdone on price to performance for like 6 to 12 months?
 

cellarnoise

Senior member
Mar 22, 2017
721
399
136
So far I think a 9554 E.S. is better than 3 x 7950X at full-core stuff, at least until a 7950X goes above 180 W or so... over 150-ish watts, a 7950X goes downhill on performance per watt / burn.

4th gen does a bit better on performance per watt in AVX-512 tasks, and on other stuff too; sometimes better than 3x the 7950X, and mostly better across the board.
 
Last edited:

cellarnoise

Senior member
Mar 22, 2017
721
399
136
A 7950X power constrained some... like under 160 or 150 W, or even hard locked with a set Vcore... is the best-performing sillycone on hard AVX-512 tasks... But price to performance comes into play, along with how many puters you wish to support...
 
Jul 27, 2020
16,816
10,755
106
A 7950X power constrained some... like under 160 or 150 W, or even hard locked with a set Vcore... is the best-performing sillycone on hard AVX-512 tasks
Does the 7950X3D perform better in AVX-512? Do AVX-512 workloads benefit greatly from V-cache? I don't think anyone has investigated that much. Best I can do is this:



Yes, these are different versions, but notice that the newer version 2022.3 provides a slight uplift in performance; when V-cache comes into the picture, though, all bets are off.


phoronix said:
the Ryzen 9 7950X3D was pulling 118 Watts on average compared to 192 Watts with the 7950X. The power efficiency improvements here were phenomenal.
Maybe the DC folks are missing out on even better performance at much lower power consumption and heat output?
 

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
Do AVX-512 workloads benefit greatly from V-cache?
The AVX-512 workloads in distributed computing require less than 32 MB of level 3 cache. (Actually, I haven't checked the recently introduced AVX-512 version of Asteroids@Home's application yet. However, its AVX2 variant does not look like a heavy user of the vector units to begin with. Also, I never had the impression that it was memory-bound even when all SMT threads were loaded on AMD and Intel CPUs; the power meters of my computers would give hints of such application behavior.)

Now, one could use the total of 96 MB cache per CCX of V-cache equipped CPUs to run more instances of the mentioned applications at once, giving fewer threads to each instance than if merely one or two instances were running concurrently on a CCX. But this would prolong the task durations a lot, while not reducing the applications' internal thread synchronization overhead to a degree which would matter.
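
If somebody wanted to try that, here is a rough sketch of how one could start one instance per CCX on Linux, pinned via taskset. The core ranges and the application command are only placeholders; check the actual core-to-CCX layout with lscpu -e first.

Code:
#!/usr/bin/env python3
# Rough sketch: start one application instance per CCX, each pinned to that
# CCX's cores via taskset, so each instance gets its own L3 slice.
# Assumptions (adjust to your CPU): 8 cores per CCX, SMT not used, cores
# numbered consecutively per CCX -- verify with `lscpu -e` first.
import subprocess

CORES_PER_CCX = 8
NUM_CCX = 2                                                 # e.g. a two-CCD desktop part
APP_CMD = ["./dc_app", "--nthreads", str(CORES_PER_CCX)]    # placeholder command

procs = []
for ccx in range(NUM_CCX):
    first = ccx * CORES_PER_CCX
    last = first + CORES_PER_CCX - 1
    cpu_list = f"{first}-{last}"                            # e.g. "0-7", "8-15"
    procs.append(subprocess.Popen(["taskset", "-c", cpu_list] + APP_CMD))

for p in procs:
    p.wait()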

It is possible that some of the simulations which the Rosetta@home project runs would profit from V-cache. But so far we haven't come up with a way to benchmark this. The payload in Rosetta@home workunits differs a lot between different simulation campaigns of theirs. (Rosetta@home's applications do not use AVX-512. But they might issue many memory accesses.)

Edit: another question related to Distributed Computing which hasn't been investigated yet, AFAIK, is whether or not GPGPU applications benefit from V-cache.

phoronix said:
the Ryzen 9 7950X3D was pulling 118 Watts on average compared to 192 Watts with the 7950X. The power efficiency improvements here were phenomenal.
Maybe the DC folks are missing out on even better performance at much lower power consumption and heat output?
Several of us who do have 5900x, 5950x, 7900x, 7950x and so on already operate them with reduced package power tracking limit, to greatly increase power efficiency versus stock.
 
Last edited:
Jul 27, 2020
16,816
10,755
106
It is possible that some of the simulations which the Rosetta@home project runs would profit from V-cache. But so far we haven't come up with a way to benchmark this.
How about monitoring two systems, one with a 7950X and the other with a 7950X3D, to see which one completes more work units in a given timeframe? Like in a month?
 

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
How about monitoring two systems, one with a 7950X and the other with a 7950X3D, to see which one completes more work units in a given timeframe? Like in a month?
In practice, the intermittent nature of Rosetta@home's work availability may make this too imprecise even if tracked for a long time frame like a month or so. There are periods during which the reception of work by a specific host is luck of the draw: one host might grab a bunch while another may only get a few, or miss out entirely, for days to weeks.

Second, it is more complicated than just counting completed workunits in the case of Rosetta@home. Their workunits have a fixed, user-configurable duration, within which several simulation runs with changing randomized starting values are performed. One would have to count the number of these simulations-within-the-simulation. But different workunit batches are not comparable at all; some get just a couple of these internal simulations done within one round, others do hundreds.

Hence, best for computer performance comparisons would be to pick out representative workunits (ones with large, medium, and small payloads), and bake them into fully repeatable standalone benchmarks.
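
To make that concrete, such a standalone benchmark could be as simple as the following sketch, assuming a representative workunit has been extracted and the science application can be invoked on it from the command line (the binary name and input file are made-up placeholders).

Code:
#!/usr/bin/env python3
# Minimal sketch of a repeatable benchmark: run the same extracted workunit
# several times and report min/median wall-clock time. The application name
# and input file are placeholders for whatever one has extracted.
import statistics
import subprocess
import time

CMD = ["./science_app", "benchmark_workunit.dat"]   # hypothetical invocation
RUNS = 5

times = []
for i in range(RUNS):
    start = time.perf_counter()
    subprocess.run(CMD, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    times.append(elapsed)
    print(f"run {i + 1}: {elapsed:.1f} s")

print(f"min {min(times):.1f} s, median {statistics.median(times):.1f} s")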
 

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
One more note on cache demand of science applications, especially in the Distributed Computing context. We've got basically three kinds of applications, plus a fourth which we don't encounter in DC (a rough way to probe an application's cache appetite yourself is sketched after the list):
  • The very special case of LLR2 and Genefer (also, Prime95 and similar):
    These implement one or another number-theoretical transform which sort of rotates a fixed set of coefficients through the vector pipelines of the CPU. The larger the investigated numbers are, the more coefficients are involved and, correspondingly, the more cache is desired. But as mentioned, we are getting by with 32 MB cache or less so far, depending on the progress of the respective project within its own search space.
    --> V-cache doesn't help, because we don't need as much... yet. Should these projects wander past 32 MB cache demand at some point in the future, then we also would want more cores per CCX. Or rather, a GPGPU implementation. (Which we already have in case of Genefer.)​
  • The more general case:
    The science code is well optimized in the parts which matter most to throughput. The optimization target is invariably a CPU with a normal amount of cache. Such optimizations happen if the code author knows his stuff, and/or uses standard code which was optimized accordingly by the originators of that base code.
    --> V-cache doesn't help because the numeric algorithms were optimized to work well on widely deployed hardware.​
  • Another general and not at all uncommon case:
    The science code is anything but optimized, and wastes a lot of CPU time needlessly. V-cache might help a very little with some luck in such cases, but what's really needed would be a thorough overhaul of the code. This type of science code exists because the authors don't have the necessary computer science background or the resources to hire such a workforce.
    --> Maybe, or maybe not, V-cache would reduce the wastefulness of applications like this, a little.​
  • Certain HPC cases with large working set sizes (CFD and the likes):
    We don't have these in Distributed Computing. Even the ClimatePrediction.net applications, which do work on relatively large datasets, are in fact optimized to utilize normal CPUs well.
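
For what it's worth, a crude first check of whether a given application even has an appetite for more cache is to look at its last-level cache miss ratio, e.g. with perf on Linux. A minimal sketch; the application command is a placeholder, and the generic cache-references/cache-misses events map to different hardware counters depending on the CPU, so treat the numbers as a rough indicator only.

Code:
#!/usr/bin/env python3
# Crude sketch: run an application under `perf stat` and show the generic
# cache-references / cache-misses counters as a first hint of whether a
# larger L3 could matter. How these generic events map to hardware counters
# varies by CPU, so the numbers are only a rough indicator.
import subprocess

APP_CMD = ["./dc_app", "benchmark_workunit.dat"]    # placeholder invocation

result = subprocess.run(
    ["perf", "stat", "-e", "cache-references,cache-misses"] + APP_CMD,
    capture_output=True,
    text=True,
)
print(result.stderr)    # perf stat writes its counter summary to stderr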

It is possible that some of the simulations which the Rosetta@home project runs would profit from V-cache.
Also, maybe (or maybe not) WCG's African Rainfall project could benefit. It is similar in nature to ClimatePrediction.net's applications, and I would hope that it is optimized similarly to these.

another question related to Distributed Computing which hasn't been investigated yet AFAIK, is whether or not GPGPU applications benefit from V-Cache.
On second thought, it probably doesn't make sense to cache the data which are transferred to and from the GPU. Except for pre- and post-processing by the CPU, but then, normal sized caches should suffice because the CPU ideally does not have to perform multiple passes over the data with interdependency between all the data; that's what the GPGPU part of the application is for.
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
@Markfw
Have you ever measured the power usage from the wall for the 9654 under a heavy AVX-512 load, like in PrimeGrid?

What's your estimate for the 9654's power usage at full load?
Package power tracking should get the same results as with the 9554. A difference between the 9654 and 9554, however, is that the latter runs into the f_max cap earlier, due to its obviously higher power budget per core. And the rest of the power consumption, which happens outside the package, is the same for all 9004 series EPYCs, naturally depending on mainboard make, memory population, and so on.

So far, I have never noticed my 9554P computer (config) pull more than 70 W over the PPT limit at the wall. Caveat: I have doubts that the BIOS switch between performance determinism and power determinism works right, or that I have used it right yet. Power determinism attempts to drive each individual processor closer to the PPT limit, whereas performance determinism attempts to equalize slight power efficiency differences between specimens of the same OPN. The BIOS default should be performance determinism.

I have seen more power overhead from PPT to at-the-wall consumption with dual-socket EPYC 7452 when in power determinism mode. I'd have to dig out notes to say how much more.
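
As an aside, on Linux the package-side number can also be cross-checked against the wall meter via the RAPL energy counter in the powercap interface, at least on recent kernels with recent Zen parts. Whether the sysfs path below exists, and whether it is readable without root, depends on kernel and platform, so this is only a sketch.

Code:
#!/usr/bin/env python3
# Sketch: estimate average package power from the RAPL energy counter,
# for comparison against a wall meter. Requires a kernel that exposes
# /sys/class/powercap for the CPU; the exact domain path may differ and
# reading it usually needs root.
import time

ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"   # package 0 on many systems
INTERVAL = 10.0                                          # seconds

def read_uj():
    with open(ENERGY) as f:
        return int(f.read())

e0, t0 = read_uj(), time.time()
time.sleep(INTERVAL)
e1, t1 = read_uj(), time.time()
# Note: the counter wraps around eventually; wraparound is ignored in this sketch.
print(f"average package power: {(e1 - e0) / 1e6 / (t1 - t0):.1f} W")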

BTW, my M97 hsf is keeping it at 66c with 62 processes avx-512 !
I am still doubting that Process Lasso, at least with the configs which I have seen posted, aligns tasks with caches as much as desired.
Edit: Though this concern doesn't apply to single-threaded PPS-LLR with SMT not used.

Edit 2:
The only thing that's holding me back is the availability of the heatsink I want. (AMD SP5 M99)
@emoga, this cooler is similar to the one in the HP Z6 G5 A (Threadripper Pro workstation), except that I suspect that the fans in the HP workstation can go higher than 2200 RPM (M99 spec). And the number of heatpipes in HP's cooler isn't known. Anyway, look around for reviews of the HP workstation to get an impression of whether this style of cooler, or a 4U air cooler in general, would work for you.
 
Last edited:
Reactions: emoga

pututu

Member
Jul 1, 2017
148
224
116
Maybe the DC folks are missing out on even better performance at much lower power consumption and heat output?
For DCing, if possible I always try to optimize the CPU or GPU voltage (set lower than the default curve) for less heat output and better computational efficiency in terms of PPD (points per day) per watt. The most dominant factor affecting power consumption/heat output is obviously the CPU and GPU core voltage. I generally try to google whether someone has already mapped out, say, the 7950X clock-versus-voltage curve, and use this as a starting guide to optimize the CPU clock/voltage for a particular DC project on my own system. I never had any great success with Ryzen Master (ECO mode or with a CO offset), so I go into the BIOS and adjust the clock and voltage directly. Once you get a baseline, it is easy to optimize your system for other projects.

Below is the 7900X voltage-versus-clock curve taken from here. I tried it on my 7950X and it is not that far off, though it requires slightly higher voltage due to higher power density.
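
As a rough way to turn a handful of points from such a curve into a starting guess for another clock target, something like the sketch below works; the numbers in it are made-up placeholders, not the actual 7900X curve.

Code:
#!/usr/bin/env python3
# Sketch: linearly interpolate a starting core voltage for a target clock
# from a few measured (clock, voltage) points. The points below are made-up
# placeholders -- substitute your own measurements.
measured = [            # (MHz, V), sorted by clock
    (3600, 0.95),
    (4200, 1.05),
    (4800, 1.20),
]

def v_for_clock(mhz):
    lo = measured[0]
    for hi in measured[1:]:
        if mhz <= hi[0]:
            frac = (mhz - lo[0]) / (hi[0] - lo[0])
            return lo[1] + frac * (hi[1] - lo[1])
        lo = hi
    return measured[-1][1]      # clamp above the highest measured point

target = 4500
print(f"~{v_for_clock(target):.3f} V as a starting point for {target} MHz")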
 
Reactions: igor_kavinski

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
@pututu, as another reference point, I happen to have notes from 9554P @ 400W running GFN-20:
average clock was 3.40 GHz (didn't take notes of max clocks or the spread, but the spread of core clocks wasn't large)
CPU_VDDCR0 = 0.967 V, CPU_VDDCR1 = 0.967 V, CPU_SOC = 0.811 V, CPU_VDDIO = 1.092 V (according to the BMC, via IPMI, I haven't researched yet if there is a driver for the CPU's own sensors, but at least the BMC's reading isn't subject to AGESA bugs... only subject to BMC firmware bugs)

Some projects accept results at a quorum of 1. Better not to undervolt the CPU for these projects; rather, make do with a PPT limit. (Of course, don't let the BIOS overvolt the CPU either...)
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,639
14,629
136
@pututu, as another reference point, I happen to have notes from 9554P @ 400W running GFN-20:
average clock was 3.40 GHz (didn't take notes of max clocks or the spread, but the spread of core clocks wasn't large)
CPU_VDDCR0 = 0.967 V, CPU_VDDCR1 = 0.967 V, CPU_SOC = 0.811 V, CPU_VDDIO = 1.092 V (according to the BMC, via IPMI, I haven't researched yet if there is a driver for the CPU's own sensors, but at least the BMC's reading isn't subject to AGESA bugs... only subject to BMC firmware bugs)

Some projects accept results at a quorum of 1. Better not to undervolt the CPU for these projects; rather, make do with a PPT limit. (Of course, don't let the BIOS overvolt the CPU either...)
Now that's odd... My QS 9554s run 3.5 GHz, all 4 of them! Faster than retail. Same motherboard, stock BIOS, no changes. One using Linux, the other 3 using Windows 10.
 

pututu

Member
Jul 1, 2017
148
224
116
Some projects accepts results at a quorum of 1. Better don't undervolt the CPU at these projects. Rather, make do with a PPT limit. (Of course, don't let the BIOS overvolt the CPU either...)
The non-AVX workloads are easier to undervolt with a fixed CPU core voltage. With AVX workloads, I have to add additional voltage for stability. I just use a simple formula to estimate how much voltage I need to add to the CPU. For example, if the AVX workload increases the CPU temp by 20°C over a "typical" CPU temp when running a non-AVX workload, I just add about 40 mV to compensate for the CPU voltage drop due to heating (assuming the typical temperature coefficient of silicon is about -2 mV/°C, add at least 2 mV/°C * 20°C = 40 mV). My 7950X is air-cooled with a Noctua NH-U9S (the smallest desktop Noctua CPU heatsink cooler, I think). I don't run at the default curve or overclock, so a single 92 mm fan is sufficient for my use case.
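
Written out as a tiny calculation (same rule-of-thumb numbers as above, nothing more):

Code:
# Rule-of-thumb offset for AVX loads, per the estimate above:
# add roughly 2 mV of core voltage per degree C of extra heating.
temp_coeff_mv_per_c = 2   # magnitude of the assumed -2 mV/°C silicon coefficient
delta_temp_c = 20         # extra heating seen under the AVX workload
extra_mv = temp_coeff_mv_per_c * delta_temp_c
print(f"add about {extra_mv} mV to the fixed core voltage")   # -> 40 mV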

If you have many systems to handle, using Ryzen Master with a CO offset is probably the best way to go. Anyway, I have only four main rigs (including one EPYC Rome system) that I undervolt all the time. Other than playing around with cTDP, does anyone know how to undervolt an EPYC Genoa system (as opposed to a Threadripper)? Maybe a future upgrade when the price is right. Yeah, I'll be waiting for a few years...
 

StefanR5R

Elite Member
Dec 10, 2016
5,591
8,013
136
notes from 9554P @ 400W running GFN-20:
average clock was 3.40 GHz
My QS 9554's run 3.5 ghz, all 4 of them !
The clock speed of my EPYC, like that of all (production) EPYCs, depends on the workload.
(In workloads which don't spend a lot of energy in the cores, clock speed approaches f_max, of course.)

Another data point:
Running 64 SGS-LLR tasks at once [...with default 360 W PPT limit...] Cores are running at about 3.3 GHz.

And another:
MilkyWay@Home NBody, 4 threads/task, all SMT threads used, cache-aligned affinity, a light workload which needs merely 365 W at the wall at default 360 W PPT limit,
min/avg/max core clocks =
3.39/3.66/3.74 GHz (Genoa, f_base/f_max = 3.1/3.75 GHz),
Same but at 400 W PPT limit: 415 W at the wall, min/avg/max core clocks =
3.0/3.70/3.75 GHz (one take)
2.4/3.70/3.75 GHz (another take)

Also, this is a good occasion to remind everyone that high core clocks do not always mean high execution speed. Sometimes they merely mean that the cores aren't actually doing much besides waiting for memory accesses or for inter-thread synchronization.
 
Last edited: