The Athlons do floating point faster than a P4 on a per-clock basis. At some clock speed a P4 will be fast enough to beat a slower-clocked Athlon, but I don't know the ratio. The Athlon devotes a lot of resources to floating point in comparison to the P4. When an Athlon with a model number similar to a given P4's clock speed wins in some game, the advantage is often attributed to the game's heavy use of FP. OTOH, games that are thoroughly optimized for SSE2 instructions generally run faster on the P4 than on similar-model Athlons.
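To make the "at some clock speed" point concrete, here is a rough sketch of the break-even arithmetic. It assumes FP throughput is simply clock speed times FP work per clock, which ignores memory, cache, and everything else. The function name and the 1.25 per-clock figure are made up for illustration; as I said, I don't know the real ratio.

```python
def break_even_clock_mhz(athlon_mhz, athlon_fp_per_clock, p4_fp_per_clock):
    """Return the P4 clock (MHz) at which its FP throughput equals the Athlon's.

    Assumes throughput = clock * FP work per clock, nothing else.
    The per-clock figures are relative numbers the caller must supply.
    """
    if athlon_mhz <= 0 or athlon_fp_per_clock <= 0 or p4_fp_per_clock <= 0:
        raise ValueError("all inputs must be positive")
    return athlon_mhz * athlon_fp_per_clock / p4_fp_per_clock


# Hypothetical numbers: an Athlon doing 1.25x the FP work per clock of a P4.
print(break_even_clock_mhz(2000, 1.25, 1.0))  # 2500.0
```

With those invented figures, a 2000 MHz Athlon would be matched by a 2500 MHz P4 on pure FP work. Plug in measured per-clock numbers from a real benchmark to get a meaningful answer.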
Even programs that are primarily floating point use other instructions as well, so the liabilities of the P4's long pipeline can have some effect on execution speed. If the code is optimized for a P4, there should be no great penalty.
It was a conscious choice in the P4's design to give up per-clock performance in favor of a higher attainable clock speed. For a while, P4s had a hard time ramping up the clock fast enough to keep pace with the Athlon design. Now the fastest P4s are pulling away a little from the fastest Athlons. Naturally Intel is asking a (large) premium for the fastest obtainable. What else? Do the fastest P4s outperform the fastest Athlons on FP? I don't know.
The manufacturers as well as the pundits make all kinds of claims for why one processor will be faster than another, but generally all processors have design measures to mitigate the effects of possible slowdowns, and likewise all have unavoidable stalls when the available resources run out in particular situations. Some compromise on resources is necessary to keep the chip cost down. In the end, you cannot tell what will happen without an accurate simulation, or actually running real programs. The Athlon just has more resources available on average than the P4 for floating point. The FP execution units on an Athlon are more complete, more independent, and have fewer special limitations. In general, the resources which the Athlon has to prevent stalls and interlocks are gross overkill. The P4, OTOH, is very adeptly balanced and minimized. Intel depends on its market dominance (80% ?) to persuade programmers to program around its processors' pitfalls. AMD cannot adopt this type of strategy, obviously. The net result is that it is a lot easier to hand optimize the key FP speed loops for an Athlon, so the Athlon in general will easily outperform a P4.
There is nothing inherently cheaper about the Athlon's chip design, at least not to the naked eye. On the contrary, one would guess that the P4 has an advantage on chip cost. It appears to me that AMD gets high switching speed by the traditional method of high current. AMD has been mostly a step behind in applying leading-edge chip processes to its CPUs, but that older technology is also somewhat cheaper. Intel seems to use an unusual method to get chip speed up: lower temperatures. (Switching speed increases with decreasing temperature.) To stay in the performance range of the Athlons, this requires next-generation chip processes (which require lower operating potentials), but Intel has been able to do this. Intel seems to be spreading its transistors over a larger chip than one might expect, to get heat density down, and therefore lower temperatures. Still, to beat the Athlon by using higher clock speeds, Intel has pushed the P4's clock to the point where it draws similar currents. (Current is proportional to clock speed.) Now Intel has the P4 running multiple threads concurrently to boost the instructions per cycle, by using otherwise unemployed CPU resources. But using more resources also means drawing more current, and with that, a higher temperature.
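The current argument can be sketched with the usual first-order CMOS rule of thumb: dynamic current scales roughly as activity x capacitance x voltage x clock, and dynamic power as that times voltage again. This is a simplification (it ignores leakage and assumes the switched capacitance stays the same), and the function names and example ratios below are invented for illustration, not measurements of any real chip.

```python
def relative_dynamic_current(clock_ratio, voltage_ratio=1.0, activity_ratio=1.0):
    """Dynamic current relative to a baseline chip.

    First-order model only: I ~ activity * C * V * f, with the switched
    capacitance C assumed unchanged and leakage current ignored.
    """
    return activity_ratio * voltage_ratio * clock_ratio


def relative_dynamic_power(clock_ratio, voltage_ratio=1.0, activity_ratio=1.0):
    """Dynamic power relative to baseline: P = V * I ~ activity * C * V^2 * f."""
    current = relative_dynamic_current(clock_ratio, voltage_ratio, activity_ratio)
    return current * voltage_ratio


# Hypothetical ratios: 1.5x the clock at the same voltage.
print(relative_dynamic_current(1.5))                      # 1.5
# Same clock increase, but a newer process at 0.75x the voltage.
print(relative_dynamic_current(1.5, voltage_ratio=0.75))  # 1.125
print(relative_dynamic_power(1.5, voltage_ratio=0.75))    # 0.84375
```

The `activity_ratio` knob stands in for the multithreading point: keeping more of the chip busy each cycle raises the switching activity, and current and heat go up with it even at the same clock.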