OT: AMD and Cell MBs of the future

apit34356 wrote on 9/5/2006, 9:00 PM
A while back, I posted that an AMD chip and IBM's Cell together on one motherboard would be a real screamer. Well, it's been leaked to the media, so here it is:
"Los Alamos National Laboratory (LANL) will turn to both AMD's Opteron chip and IBM's Cell in an effort to breath new life into its supercomputing program. The lab will announce that IBM will build Roadrunner using a hybrid design that makes use of Opteron and Cell systems, according to a report from online rag CNET. The publication cites "sources familiar with the machine" as claiming that the National Nuclear Security Administration (NNSA), which oversees LANL, will reveal IBM's win "in the coming days."

There are other rumors, but it's a safe bet that IBM has designed a series of servers using the AMD and Cell combo. It's well known that IBM has demonstrated a Cell server farm to a few select clients. This should push many motherboard manufacturers to reconsider the volume of such a market. Encoding speeds for any encoder using the Cell should improve by at least an order of magnitude. Once the Cell's local memories are loaded with the program code for the SPUs, the only main-memory access required is the data stream itself.

One can see why IBM saw no benefit in Apple. Remember, it was IBM who refused to continue working with Apple. And when Adobe said they weren't interested in investing resources in a 5% market (their products on Apple), Apple spun a series of bad news into marketing gold, at least for now.

The real question is: who will package the AMD-and-Cell motherboard in a retail box, with all the glitter and software extras... the future media center for pros and prosumers. Will "A" kiss the shoes of IBM/Sony to get there first, or will "D" be first?

Comments

SHTUNOT wrote on 9/5/2006, 9:17 PM
Do you mean that on a dual mobo you would have an Opteron and a Cell? Or a dual core with one of each? How would they complement each other? What are each one's strengths/weaknesses over the other?

Ed.
apit34356 wrote on 9/5/2006, 9:37 PM
"dual mobo you would have a opteron and a cell" yes.
"on a dual core with one of each" AMD is multi core,....ie....x2,x4 or more. theCell is 9 core design, as in; 1-powerchip5 core and 8-specialized spus(ie vector) with each having it own "on chip" program and data memory.

"How would they complement each other?" The AMD and Cell both have better and more advancement memory subsystems, so they don't screw up the memory "cycling' required for Dym. ram causing extra delayed read and write times.
But AMD chips have a large market of 3party software already there. The Cell power is in the abiltiy to process data streams without requiring instructions being mixed with data coming the main memory. More to follow....
farss wrote on 9/5/2006, 9:37 PM
I really don't see the need for this technology for our kind of work.
From my perhaps limited experience, the limiting factor is not CPU speed but disk-system throughput. That probably changes somewhat with more advanced codecs that reduce data rates at the expense of CPU load.

Even so, when it comes to compositing multiple tracks of HD, again we're talking about shifting huge amounts of data.

Furthermore, the code is the other issue: not just the OS, but code that can make use of all that horsepower. Just look at the existing supercomputers: not exactly user-friendly beasts, probably still being coded in Fortran. Fine for crunching huge arrays for climate modeling, searching for ET, or ballistics, but video? I don't know. We've yet to even get GPU processing for video; audio is maybe on the horizon.

Bob.
apit34356 wrote on 9/5/2006, 10:13 PM
A current CPU design requests instructions and data from the cache; the cache checks whether they're there, and if not, makes a request to main memory, which is then stored in the cache for the CPU's requested address space and finally moved to the CPU. The memory-request stream for a small loop might look like this (I = instruction request, D = data request):

I,I,I,I,D,I,D,D,I,D,D,I,I,I,I,I

That's 11 instruction requests and 5 data requests per pass. Run the loop 100 times and you get 1,100 instruction requests and 500 data requests, so at best 1,600 memory requests produce only 500 data operations. In practice most loops come out closer to 200-300 useful data results at best.
The Cell is different. To run an SPU (take one of the eight as an example), you load the program into its on-chip local memory, which is not a cache, and then the SPU executes. The memory stream for the same loop is just:

D,D,D,D,D

So for a loop of 100, that's 100 x 5 = 500 data requests and no instruction requests from main memory at all.

Now think about having 8 SPUs, all executing the same or different programs, with none of them ever having to request an instruction from main memory; just data.
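Here's a minimal sketch of that streaming model in C, just to make the idea concrete. dma_get()/dma_put(), CHUNK, and encode_block() are all my own illustrative names (the real Cell SDK has its own DMA intrinsics); the point is only that the loop's code stays resident in on-chip local store, so the only main-memory traffic is data.

/* Sketch of the streaming loop described above.  dma_get()/dma_put() are
 * stand-ins for the SPU's DMA commands (here plain copies so the sketch
 * compiles anywhere), CHUNK is an arbitrary transfer size, and
 * encode_block() is a placeholder kernel. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 4096                          /* bytes per transfer (illustrative) */

static uint8_t local_in[CHUNK];             /* buffers in on-chip local store */
static uint8_t local_out[CHUNK];

/* Stand-ins for the DMA engine: move data between main memory and local store. */
static void dma_get(void *local, const uint8_t *main_mem, size_t n) { memcpy(local, main_mem, n); }
static void dma_put(const void *local, uint8_t *main_mem, size_t n) { memcpy(main_mem, local, n); }

/* Placeholder kernel; its instructions are already resident, never re-fetched. */
static void encode_block(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(in[i] ^ 0xFF);
}

/* The whole run is D,D,D,... -- data requests only, no instruction fetches. */
void stream_encode(const uint8_t *src, uint8_t *dst, size_t total)
{
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = (total - off < CHUNK) ? total - off : CHUNK;
        dma_get(local_in, src + off, n);    /* D: pull a block in  */
        encode_block(local_in, local_out, n);
        dma_put(local_out, dst + off, n);   /* D: push a block out */
    }
}

Run eight copies of that loop, one per SPU, and the instruction side of main-memory traffic never appears at all.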

Now, I have avoided discussing cache design and the instruction pipeline because these are usually the most misunderstood. Most current instruction pipelines and caches are hamstrung because they are trying to execute conditional jumps (and other instruction styles) that worked well in the '50s and early '60s. Those instructions just kill the performance of the pipeline and the caches, so massive extra circuitry was added, which adds power requirements and lowers simple-instruction execution speed, causing more headaches as clock speeds climb.
The Cell SPUs have a specially designed approach to conditional branching that removes much of the performance hit of conditional instructions. Put in real-world terms: keep it simple, do it right the first time.
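To illustrate with a toy example of my own (not taken from any particular encoder): the usual way to remove a conditional jump is to compute both candidates and merge them with a mask, which is essentially what the SPU's compare-and-select style of branching does in hardware.

#include <stdint.h>

/* Branchy version: the pipeline has to guess which way the 'if' will go,
 * and a wrong guess flushes the pipe. */
int32_t clamp_branchy(int32_t x, int32_t hi)
{
    if (x > hi)
        return hi;
    return x;
}

/* Branch-free version: build a mask from the comparison, then merge the
 * two candidates.  No jump, so nothing for the pipeline to mispredict. */
int32_t clamp_selected(int32_t x, int32_t hi)
{
    int32_t mask = -(int32_t)(x > hi);      /* all ones if x > hi, else all zeros */
    return (hi & mask) | (x & ~mask);
}

Both functions return the same result; the second one just never asks the pipeline to guess.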
apit34356 wrote on 9/5/2006, 10:34 PM
Bob, I don't agree. Look, we now have X2 CPUs and more coming down the pipeline.
We have Vegas users who are running dual dual-core AMDs and Intels.

A lot of people are happy with PP2 because of GPU rendering; the Cell will make the GPU look like a bicycle racing a motorcycle.

Software? The Cell is more for encoding/decoding/vector math. So, if IBM or Sony release their encoders/decoders for general apps (they already exist)...

Bob, the Cell's power is in the data stream. Pair the same memory design and speed with any current CPU and run encoders: the Cell absolutely wins every time, not by 5% or 10%, think 400 to 1000%. The bigger the data stream, the better it does.
Eight channels of HD each being encoded in real time? No problem (assuming 32 GB memory minimum, plus 10k rpm disks, with output per channel for broadcast).
grh wrote on 9/6/2006, 5:44 AM
> I really don't see the need for this technology for our kind of work.

That's not quite true. One need only look at the latest Boris package, which uses a GPU to perform 3-D titling computations in real time, to recognize that there is a huge need for compute resources in video compositing. The editing itself, not so much, but anything past that: effects, titles, 2-D and 3-D transformations, etc., would all benefit from additional CPU horsepower.

Yes, the disk drive is far away from the CPU, and there's a need to move a lot of data into/out of the CPU. Increased bus speeds and more physical memory assist with that, but once the data is available to the processor, there's only so much work that can be accomplished in a given amount of time.

Myself, I want a Cell-based system running Vegas (or comparable) with lots of CPU-hogging effects and tools. Looking forward to the PS3 demonstrating what can be done with a Cell in real time.
JohnnyRoy wrote on 9/6/2006, 7:33 AM
> I really don't see the need for this technology for our kind of work.

I brought this up back in February in this thread: OT: Could be a great rendering solution. The IBM Blades hold 14 Cell boards with each Cell having 9 cores. That’s 126 cores! Can you say, “Instant Render Farm?”

~jr