Posted by Super Nade, 11th February, 2006, 06:04 PM

Athlon64 and CPU quick reference

I'm attempting to put together a mini-reference which explains the architectural highlights of the Athlon64 along with the associated CPU paradigms. I will be adding new material as and when I learn something new. I am trying to make my learning experience profitable to all those who want to know more but don't have the time to do the research. Suggestions on how to organize or present the material are welcome. Many thanks to my colleagues yeha, emboss and Moto7451 (our conversations will be quoted extensively!), who inspired me to undertake this study.

This compilation owes its existence to various online sources (often quoted verbatim), which are mentioned in the reference section. Most of the material is courtesy of Jon Stokes @ ArsTechnica; I have referred to his article for most of these passages.

Processor Pipelining

We hear all this talk about the P4 having a higher clock frequency because it has longer pipelines. But what the heck is a pipeline? Does it carry water or oil (or sewage, for the unsavoury types who frequent this forum)?

Take, for example, a tennis racquet factory. Let's assume the following steps lead to a finished product.

-Paint the Frame
-Mark the Logos
-String it
-Package it

One could either hire a specialist crew for each job or have a single crew that does it all. Let us look at each case. With specialists and only one order in the factory, only one crew is working at any given time while the rest sit idle. However, if there are multiple orders for racquets, everybody is going to be busy. With a single know-it-all crew, the turnaround time is longer, but you don't have the expense of paying a lot in salaries. This paragraph holds many hints about what is to come, i.e. RISC, bottlenecks and latencies. We will tackle them soon.

A computer basically just repeats a few simple steps over and over again in order to execute a program:

1. Get the instruction at the address stored in the program counter.
2. Store that instruction in the instruction register, decode it, and move the program counter (the "pointer") on to the next address.
3. Execute the instruction in the instruction register.
4. Repeat steps 1-3.

A more complete way of putting this, with the write-back of results made explicit, is given below:

1. Fetch the next instruction from the address stored in the program counter.
2. Store that instruction in the instruction register and decode it, and increment the address in the program counter.
3. Execute the instruction currently in the instruction register.
4. Write the results of that instruction from the ALU back into the destination register.

In a simple processor these are the only steps, and they are repeated until the program is done executing. These are, in fact, the four stages of a basic RISC-style pipeline. So we can write the flow of an instruction over time as:
Fetch --> Decode --> Execute --> Write/Store

An ancient processor had only one instruction in the pipeline at a time, and as that instruction moved along, the stages not working on it remained idle. A picture of inefficiency. So, if each stage took x ms to complete, this processor could finish only one instruction every 4x ms.

With pipelining, the four stages act like four stages in a factory assembly line. When the pipeline is at full capacity, each stage is busy working on a different instruction and the whole pipeline is able to spit out one instruction right after the other. If each stage takes 10 ms to complete, then a full pipeline completes one instruction every 10 ms (instead of one every 40 ms). It is therefore crucial to keep the pipeline full. What we shall see later is how the CPU plays tricks with predictive algorithms, and by giving certain operations priority, to do so.
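To put numbers on the two cases, here is a minimal Python sketch. It assumes an idealized pipeline with no stalls at all; the function names are my own, and the 10 ms stage time is just the illustrative figure used in this post.

```python
def unpipelined_time(n_instructions, n_stages, stage_time):
    """Each instruction walks through every stage before the next one starts."""
    return n_instructions * n_stages * stage_time


def pipelined_time(n_instructions, n_stages, stage_time):
    """The first instruction needs n_stages steps to get through (fill-up);
    after that, one instruction completes every stage_time."""
    if n_instructions == 0:
        return 0
    return (n_stages + n_instructions - 1) * stage_time


STAGES = 4        # Fetch, Decode, Execute, Write
STAGE_TIME = 10   # ms per stage (illustrative value)

for n in (1, 4, 100):
    slow = unpipelined_time(n, STAGES, STAGE_TIME)
    fast = pipelined_time(n, STAGES, STAGE_TIME)
    print(f"{n:>3} instructions: {slow} ms unpipelined, {fast} ms pipelined")
```

For a single instruction the two designs take the same time; the gain only shows up once there are enough instructions to keep every stage busy.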

Allow me to quote yeha on the concept of branching:
Originally Posted by yeha
A pipeline is a list of instructions that the cpu is working on - if the first-most instruction is an add or something, the cpu can look further down the pipeline for a memory load and start working on that, since it can both add and load from memory at the same time.

Processors like to keep the pipeline full and go through as many instructions as possible, however branches make that tricky. Branching is when your code says "if x, do y, otherwise z", like "if this pixel is pure pink, copy the background over it, otherwise keep the foreground color" is a simple transparency hack. Processors try to guess which choice out of "y" and "z" will be correct and fill subsequent pipeline positions with the result of that guess, so that it can start working on the consequences of the answer, before the decision has even been made.

Now when you guess wrong, all that subsequent work in your pipeline has been invalidated so it has to be flushed out and started over, giving us a stall. To get around this, good media code minimizes branches by unrolling loops, replacing "if" decisions with tricky equivalent math or working with the branch predictor's subtleties so that pipelines stay full and keep churning out data.

Unrolling loops means instead of doing say "myloop is (store a 0), now do myloop 256 times" you code "store 0 store 0 store 0 store 0 store 0...." - looping gives you smaller code but every time you run through a loop you have to perform a branch to see whether your loop counter has expired yet. that's a bad example as branch predictors would get such a loop correct 99% of the time, but it's the general idea and helps shave valuable cycles off many inner loops.
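
To make the "replace an if with tricky equivalent math" idea concrete, here is a small Python sketch of yeha's transparency example. The key colour 0xFF00FF and both function names are my own assumptions, and Python itself will not run the second version any faster; the point is only to show the arithmetic that a compiler or hand-written media code would use to avoid a conditional jump per pixel.

```python
TRANSPARENT = 0xFF00FF  # "pure pink" key colour (assumed 24-bit RGB value)


def blend_branchy(fg, bg):
    """One if/else decision per pixel."""
    out = []
    for f, b in zip(fg, bg):
        if f == TRANSPARENT:
            out.append(b)
        else:
            out.append(f)
    return out


def blend_branchless(fg, bg):
    """Same result, but the choice is made with a bit mask instead of an if."""
    out = []
    for f, b in zip(fg, bg):
        is_key = int(f == TRANSPARENT)   # 1 if transparent, else 0
        mask = -is_key & 0xFFFFFF        # 0xFFFFFF if transparent, else 0
        out.append((b & mask) | (f & ~mask & 0xFFFFFF))
    return out


foreground = [0xFF00FF, 0x123456, 0xFF00FF]
background = [0x000001, 0x000002, 0x000003]
print(blend_branchy(foreground, background) == blend_branchless(foreground, background))
```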

Processor clock frequency and Pipelines

All of a CPU's parts are driven by a single clock signal. It comes from a precise oscillator circuit; a simple macroscopic analogue would be the Colpitts oscillator. This base clock frequency determines the "speed" of the computer. The amount of time that it takes to complete one pipeline stage is exactly one CPU clock cycle. Note that the clock period is therefore set by the longest time taken by any single pipeline stage.

Let's look at a few questions I posed, before getting back to the main topic.

Q. The length of the pipeline determines clock freq. Suppose we have a clock pulse say of 1ns width, does the Intel see more CPU activity (albeit inefficient, because of the probabilistic nature of the pre-emptive processing) than the A64 ?
Originally Posted by yeha
Kind of.. the longer the pipeline, the simpler each step is. simpler steps allow higher clocks. you can have a long pipeline with complex steps for a long-pipeline/slow-clock-speed combination, but i'm not sure why you'd want that. if the pipelines are full in both an a64 and prescott, and the Mhz are equal, there'd be more activity in the a64 since each stage in the pipeline is more complex than each stage in intel's.
Q. Am I right in concluding that the clock freq is not a constant and that the oscillator is "driven" to produce pulses of variable width? If so, what drives it? What kind of feedback mechanism are we looking at? I thought a stable sync pulse train is a must in any CPU?
Originally Posted by yeha
No it's constant, that part just meant that if you design a cpu, the kind of clock speeds you can expect out of it will be dependent on how complex the pipeline stages are. if you design simple stages (long pipeline) you'll be able to reach high clocks. if you design complex stages (short pipeline) your clocks won't be as high.
Q. What are we doing by overclocking? If the pipelines control the clock freq, are we increasing the length of the pipelines? Or am I incorrect in separating the clock oscillator from the pipelines, and should it be thought of as one whole feedback system?
Originally Posted by yeha
By overclocking we're just increasing the internal clock frequency, pushing more instructions per second through the pipeline.
If you think about it, saying that each pipeline stage can take at most one clock cycle to complete is equivalent to saying that the entire pipeline can only be as fast as its slowest stage. In other words, the amount of time it takes for the slowest stage in the pipeline to complete will be the length of the CPU's clock cycle, and thus of each pipeline stage. However, not all stages are created equal and their timelines differ. It takes longer to string a racquet than to box it. If we impose a time limit of 5 min on every step, there is going to be trouble, as one cannot string a tennis racquet in 5 min!

It is beneficial to have all the stages taking the same amount of time to execute an instruction. If one stage takes up too much time it slows down the entire pipeline and this is the classic bottleneck.
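
Here is the bottleneck idea as a small Python sketch, using the racquet factory. The stage times in minutes are numbers I made up, and split_stage assumes a slow stage can be cut into equal sub-steps with no overhead, which real hardware never quite manages.

```python
def clock_period(stages):
    """The clock can tick no faster than the slowest stage allows."""
    return max(stages.values())


def split_stage(stages, name, parts):
    """Replace one stage by `parts` equal sub-stages (idealized, no overhead)."""
    result = {}
    for stage, t in stages.items():
        if stage == name:
            for i in range(parts):
                result[f"{stage}_{i + 1}"] = t / parts
        else:
            result[stage] = t
    return result


# Hypothetical stage times in minutes
racquet = {"paint": 5, "logo": 5, "string": 20, "package": 5}
deeper = split_stage(racquet, "string", 4)

for label, line in (("4-stage", racquet), ("7-stage", deeper)):
    period = clock_period(line)
    print(f"{label}: one racquet every {period} min, {60 / period} per hour when full")
```

Splitting the slow stage makes the pipeline deeper (seven stages instead of four) but lets every stage, and so the clock, run four times faster. This is the same trade the long-pipeline designs make.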

Pipelining, in essence, allows the CPU to work on multiple instructions at the same time. A four-stage pipeline like the one described above gives the processor a "window" of four instructions (in terms of DSP terminology it is akin to a sliding filter window). All instructions within this window are in flight simultaneously, each in a different stage, and in a full pipeline the window slides forward by one instruction every clock cycle.

Not all pipelines have four stages. Rather, the four stages represent the minimum breakdown of labor found in a modern, pipelined processor. For many processors, these four stages are further subdivided into even smaller stages. Because of the aforementioned relationship between clock speed and the number of pipeline stages, more (and therefore shorter) pipeline stages mean a faster clock. (Note that the number of pipeline stages is referred to as the pipeline depth. So our four-stage pipeline has a pipeline depth of four.)

Video Encoding and CPU Cache:
A common misconception is that Video encoding benefits greatly from extra CPU cache. yeha helped me understand why it is not so.
Originally Posted by yeha
Video encoding isn't helped much by cache - for a well-written encoder the working set is very small. 16x16 pixel blocks are only 256 bytes, good motion searches only go through 5 or 6 block candidates before settling on one to refine, and all of that fits in L1 cache.

that's for block-based codecs anyway. Codecs which can't be fit into a slice system would see a large penalty if the entire image didn't fit in cache, wavelets could be a candidate there.

I could imagine audio codecs which perform extravagant analysis creating large data tables to be crunched down, however working with 2048-sample windows makes that unlikely too. i wasn't surprised at all that encoding was mostly clock-speed dependent.

Cache helps you when you have a large data-set that you're scanning all over. if you're just streaming in blocks of data, performing a permutation over all of it then dumping it back to memory, cache won't help you at all - encoding is a great cache-neutral example, something like lossless data compression is a good cache-favoring one.
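
A quick back-of-the-envelope check of yeha's numbers, as a Python sketch. The one byte per pixel (8-bit luma) and the 64 KB L1 data cache size are my assumptions, not something stated in the quote.

```python
BLOCK_W = BLOCK_H = 16
BYTES_PER_PIXEL = 1          # assumption: 8-bit luma samples
CANDIDATES = 6               # "5 or 6 block candidates" from the quote
L1_DATA_CACHE = 64 * 1024    # assumption: 64 KB of L1 data cache

block_bytes = BLOCK_W * BLOCK_H * BYTES_PER_PIXEL
working_set = block_bytes * (1 + CANDIDATES)   # current block plus its candidates

print(f"one block: {block_bytes} bytes")
print(f"working set: {working_set} bytes "
      f"({100 * working_set / L1_DATA_CACHE:.1f}% of L1)")
```

Even with generous assumptions the motion-search working set is a tiny fraction of L1, which is why extra L2 does so little for this kind of encoder.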

Superscalar Architecture

A superscalar CPU is one which has more than one ALU (Arithmetic and Logic Unit). Remember that CPUs have three basic units: the Control Unit, the Memory Unit and the ALU. Having more ALUs means that instructions can now be executed in parallel, stepping up the effective processing power of the CPU (if this is the sole consideration). As integer instructions were more common than floating-point instructions (which a separate unit processes), the added ALUs processed integers, or scalars. Hence this architecture is called superscalar architecture.

As the inner workings must remain transparent to the programmer, superscalar processing is inherently complex: it has to maintain this facade of transparency. The CPU has to divide one instruction stream into two, execute the instructions, and put the results back in order in a single stream as before. No wonder single-threaded applications don't see a massive benefit in this case (due to the overhead involved in maintaining this facade). The important thing to remember is that the main memory still sees one code stream, one data stream and one results stream. However, the code and data streams are carved up inside the computer and pushed through the two ALUs in parallel.

Q. What are CPU registers and how do they factor in?
Registers are high-speed memory blocks that sit right on the CPU chip (said to be "on-chip"). An example of how registers are used is adding two numbers. First the numbers are loaded into registers, then an ADD instruction is issued, and the sum is put in another register. Then that sum is moved to other memory as needed. Sets of registers are called "register files." A register file simply consists of equal-size registers one right after the other, like records in a file. In brief, one may say registers are temporary storage units within the CPU. Some registers, such as the program counter and instruction register, have dedicated uses. Other registers, such as the accumulator, are for more general-purpose use.

It is not practical to give each ALU its own dedicated registers, as that would greatly increase complexity. So the CPU's registers are grouped together into a special unit called a register file. This unit is a memory array, and it is accessed through an interface (a set of ports) that allows the ALU to read from or write to specific registers. The amount of die space that the register file takes up increases approximately with the square of the number of ports.

This is one of the reasons why modern CPUs use separate register files to store integer, floating-point, and vector numbers. The ALU executes integer and logical instructions only. The FPU (floating-point unit) executes only floating-point instructions, and the VPU (vector processing unit) executes only vector instructions. Since each type of math (integer, floating-point, vector) uses a different type of execution unit, attaching multiple integer, floating-point, and vector execution units to a single register file would result in quite a large file. Also, as register files grow larger, the accompanying access latencies increase, slowing down the whole CPU. This is another potential bottleneck within the CPU.
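
To get a feel for the "square of the number of ports" rule, here is a rough Python sketch. The three ports per execution unit (two reads, one write) and the unit counts are hypothetical, and the model only tracks port count; it ignores how many registers each file holds.

```python
def relative_area(ports):
    """Area grows roughly with the square of the port count (arbitrary units)."""
    return ports ** 2


PORTS_PER_UNIT = 3   # assumption: 2 read ports + 1 write port per execution unit
units = {"integer": 3, "floating_point": 3, "vector": 2}   # hypothetical counts

# One unified file has to offer ports to every execution unit
unified = relative_area(PORTS_PER_UNIT * sum(units.values()))

# Separate files only serve the units of their own type
split = sum(relative_area(PORTS_PER_UNIT * n) for n in units.values())

print(f"unified file: {unified} units of area")
print(f"three separate files: {split} units of area in total")
```

With these made-up counts the single shared file comes out close to three times the size of the three separate files put together.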

So instead of using one massive register file for all types of numerical data, computer architects use two or three register files, each connected to a few execution units of one type.

Running out of resources like registers isn't the only thing that can stop a superscalar processor from issuing multiple instructions in parallel. Sometimes, the instructions themselves are arranged in ways that lock them into a specific, sequential execution order. In such cases, the processor has to jump through some hoops in order to extract instruction-level parallelism (ILP) from the code stream.

GHz is not everything!
We often hear about the GHz wars being pointless and just a marketing gimmick by Intel. Let us find out if that is true by means of a logical discussion.

In general, a program's execution time is equal to the total number of instructions in the program divided by the processor's instruction completion rate (number of instructions completed per nanosecond).
Program execution time = number of instructions in program / instruction completion rate.
In the case of a non-pipelined, single-cycle processor, the instruction completion rate (1 instruction per X ns) is simply the inverse of the instruction execution time (X ns per instruction). With pipelined processors, this is not the case.
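
A quick sanity check on the units, as a Python sketch: execution time is the instruction count divided by the completion rate. The function names and the 4 ns figure are only illustrative.

```python
def completion_rate_single_cycle(instruction_time_ns):
    """Non-pipelined, single-cycle CPU: rate is the inverse of the
    instruction execution time (instructions per ns)."""
    return 1 / instruction_time_ns


def execution_time_ns(n_instructions, completion_rate_per_ns):
    """Program execution time = number of instructions / completion rate."""
    return n_instructions / completion_rate_per_ns


rate = completion_rate_single_cycle(4)     # 4 ns per instruction (made-up value)
print(rate, "instructions per ns")
print(execution_time_ns(1000, rate), "ns for a 1000-instruction program")
```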

What we have to consider here is the "fill-up time", i.e. the time taken by the pipeline to fill up with instructions; no instruction completes during this fill-up period. One may say there is a phase difference between when a single-cycle processor starts delivering results and when a pipelined processor does. Once full, however, the pipelined processor completes more instructions per unit time than the former. Still, this is a classic case of diminishing returns: program execution does not speed up in proportion to pipeline depth! The longer the pipeline, the longer the fill-up wait. Another issue to consider is a misprediction. If the predictive algorithm fails, the pipeline has to be flushed and refilled. So the longer the pipeline, the more costly a misprediction.
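
The diminishing returns can be sketched with a toy model in Python. Everything here is an assumption on my part: 12 ns of total logic per instruction, 20% branches, a 5% misprediction rate, and a full refill (depth minus one cycles) for every miss. Real CPUs resolve branches before the last stage and pay extra overhead per stage, so treat the numbers as illustration only.

```python
def average_cycles_per_instruction(depth, branch_fraction, mispredict_rate):
    """One cycle per instruction, plus a refill of (depth - 1) cycles
    for every mispredicted branch."""
    penalty = depth - 1
    return 1 + branch_fraction * mispredict_rate * penalty


def time_per_instruction(depth, total_logic_ns, branch_fraction, mispredict_rate):
    """Deeper pipeline = shorter cycle, but more cycles lost per misprediction."""
    cycle_ns = total_logic_ns / depth
    cpi = average_cycles_per_instruction(depth, branch_fraction, mispredict_rate)
    return cycle_ns * cpi


LOGIC_NS, BRANCHES, MISSES = 12.0, 0.20, 0.05   # all hypothetical

t12 = time_per_instruction(12, LOGIC_NS, BRANCHES, MISSES)
t31 = time_per_instruction(31, LOGIC_NS, BRANCHES, MISSES)
print(f"12 stages: {t12:.3f} ns per instruction")
print(f"31 stages: {t31:.3f} ns per instruction")
print(f"clock ratio {31 / 12:.2f}x, actual speedup {t12 / t31:.2f}x")
```

With these made-up figures the 31-stage design clocks about 2.6 times higher but only runs about 2.2 times faster, and the gap widens as the misprediction rate goes up.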

In the real world, a processor's pipeline can be found in more conditions than just the two described so far (i.e., a full pipeline or a pipeline that's being filled). Sometimes, instructions get hung up in one pipeline stage for multiple cycles. There are a number of reasons why this might happen, but when it happens, the pipeline is said to stall. When the pipeline stalls, or gets hung in a certain stage, all of the instructions in the stages below the one where the stall happened continue advancing normally, while the stalled instruction just sits in its stage and backs up all the instructions behind it. Pipeline stalls, or bubbles, reduce a pipeline's average instruction throughput, because they prevent the pipeline from attaining the maximum throughput of one finished instruction per cycle. This is the problem with Intel's CPUs: their long pipelines come at the price of inefficiency.

I'd like to quote Moto7451 here for some excellent input.
Originally Posted by Moto7451
The size of the pipeline is fixed. It's based on how they "wire" the CPU when they make it. Here's one way of looking at this...

The 31 stage pipeline takes 31 Cycles to complete an instruction. A 12 stage pipeline takes 12 Cycles to do the same operation.

This means that if the designs were exactly the same besides the number of pipeline stages (i.e. if you were to make a 12 stage Prescott to accompany the current ones or a 31 stage Venice to accompany the current model) you'd end up with something like this (if you have no pipeline misses & branch prediction errors & we're not caching anything, as these are issues that affect performance... I'll talk about this later):

A 31Hz 31 Stage A64 would be just as fast as a 12Hz A64 & this equality in performance would scale accordingly... if the world was perfect. There are some problems with long pipelines.

The old P4 based Celerons were awful because they had such a small L2 cache. The longer your pipeline the more L2 cache you need. This is because you need to keep the pipeline fed, & more importantly if you have a branch prediction miss & you have to throw away data, or you have an error in calculating data & you have to throw the data away, you're losing more work cycles than on a shorter pipeline processor.

Also, if your cache is full & this happens you may not have the data in it necessary to redo the task you just had an issue with unless you go back to system memory. That takes even more time. This is one reason why the L2 Cache size on A64s hasn't proved to make as huge of a difference as it does on P4 based chips. A 128KB Sempron is able to keep up with its 512KB & 1MB L2 A64 brethren. It is also why there isn't a huge difference in performance between the different cache sizes on A64s (Also the Integrated Memory Controller on the A64 helps in the case that the CPU must go to memory because its cache is filled.).

Look at the huge performance increase the Celeron D sees over the Celeron based on the Northwood & you'll see how much even a small increase in cache will improve performance on that architecture.

That said, these problems are not an issue overall if you put in a big enough cache & if you're able to achieve a clock speed thats high enough to give you the same performance as a smaller pipeline/lower speed design or if you're actually able to outperform the smaller pipeline design by being able to clock much higher than necessary to match the performance of the other.

A 62Hz 31 stage CPU would be twice as fast as the 12Hz 12 stage pipeline CPU for example (assuming there aren't any of the issues I spelled out above). Another nice thing about being able to put up big numbers is that you're able to spin this in your marketing department & make some money that way.
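
Moto7451's arithmetic can be reproduced in a few lines of Python. Note that this measures the time for one instruction to travel the whole pipeline (its latency) under his idealized assumptions; the function name is my own.

```python
def instruction_latency_s(stages, clock_hz):
    """Time for one instruction to pass through every stage, one stage per cycle."""
    return stages / clock_hz


for stages, hz in ((31, 31), (12, 12), (31, 62)):
    print(f"{stages}-stage CPU at {hz} Hz: "
          f"{instruction_latency_s(stages, hz)} s per instruction")
```

The 31 Hz/31-stage and 12 Hz/12-stage chips tie, and the 62 Hz part halves the time. Keep in mind that a pipeline which stays full also overlaps instructions, so this is only the simplified view used in the quote.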

The Single Instruction stream, Multiple Data streams (SIMD) paradigm
In this part, I'm trying to learn about the evolution of 3DNow! from certain basic principles.

Before on-chip FPUs, floating-point math required a separate math co-processor. A microprocessor is essentially a single-instruction, single-data-stream (SISD) device. What this means is that a single instruction acts on one data stream at a time. However, what if we have multiple data streams that all need the same operation performed on them? You get data parallelism when you have a large mass of data of a uniform type that needs the same instruction performed on it. This is exploited by SIMD.

I'd like to think of it in this way:

Many instructions -------> A single data stream >>>>>> Superscalar architecture and Instruction level Parallelism

Single instruction ------> Many data streams >>>>> SIMD.

SIMD implementations support saturated arithmetic. With wrap-around arithmetic, whenever you do a calculation whose result turns out to be bigger than what you can represent with whatever data format you're using (16-bit, 32-bit, etc.), the CPU stores a wrap-around number in the destination register and sets some sort of overflow flag to tell the program that the value exceeded its limits. This isn't really ideal for media applications though. If you add two 32-bit color pixel values together and get a number that's greater than what you can represent with 32 bits, you just want the result to come out as the maximum representable value (#FFFFFF, or white). You don't really care that the number was too big; you just want to represent the extreme value. It's sort of like turning up the volume on an amplifier past 10. You can keep on turning that knob, but the amp is already maxed out, which is what you want.
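
The difference between the two kinds of arithmetic is easy to show in Python. This sketch works on a single unsigned 8-bit value (one colour channel, say); the function names and the 8-bit width are my choices for illustration.

```python
def add_wraparound(a, b, bits=8):
    """Ordinary unsigned add: keep the low bits and report an overflow flag."""
    mask = (1 << bits) - 1
    total = a + b
    return total & mask, total > mask


def add_saturated(a, b, bits=8):
    """Saturating unsigned add: clamp at the largest representable value."""
    return min(a + b, (1 << bits) - 1)


print(add_wraparound(200, 100))   # wraps round to a small, dark value
print(add_saturated(200, 100))    # sticks at full brightness
```

With wrap-around, a bright channel plus a bit more light turns dark (300 becomes 44), which looks terrible on screen; with saturation it simply stays at 255.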

Most modern CPUs exploit this kind of data parallelism in addition to ILP, which brings us to MMX and its logical extensions, i.e. SSE and 3DNow!

The AltiVec G3/G4 story
Unlike AMD and Intel, Motorola took a dedicated hardware approach to SIMD. They added 32 new registers to the G4's die along with two dedicated AltiVec SIMD functional units, thus increasing the die size of the G4. Nevertheless, the G4's die is still under 1/3 the size of the PIII's, which is itself about half the size of the Athlon's. Since the G3 was so small to begin with (in comparison to Intel's and AMD's offerings), Motorola could afford to spend the extra transistors adding dedicated SIMD hardware.

All of the AltiVec calculations are done by one of two fully-pipelined, independent AltiVec execution units. The first unit is the Vector Permute Unit. It handles vector operations that involve rearranging the order of the elements in a vector. These are those inter-element operations, like pack, unpack, and permute. It also handles vector memory accesses -- the loading and storing of vectors into the registers.

The second piece of hardware is the Vector ALU. This unit handles all of the vector arithmetic (integer and FP multiply, add, etc.) and logic (AND, OR, XOR, etc.) operations. Most of these fall under the heading of intra-element operations, where you're combining two vectors (and possibly a control vector) to get a result.

Both of these execution units are fully pipelined and independent. This means that the G4 can execute two 128-bit vector operations per cycle (one ALU, one permute), and it can do so in parallel with regular floating-point operations and integer operations. The units are also pretty fast. The instruction latency is 1 cycle for simple operations, and 3-4 cycles for more complex ones.

Another important advantage of AltiVec that deserves to be pointed out is that there are no interrupts except on vector LOADs and STOREs. You have to have interrupts for LOADs and STOREs in case of, for instance, a cache miss. If AltiVec tries to LOAD some data from the L1 cache into a register, and that data isn't there, it throws an interrupt (stops executing) so that it can wait for the data to arrive.

AltiVec doesn't, however, have interrupts for things like overflows and such (remember the saturated arithmetic discussion). Furthermore, the peculiar implementation that 3DNow! and SSE use to do 128-bit single-precision FP means that a 128-bit fp calculation can throw an interrupt, saturation or no. More on that when we talk about SSE, though.

The upshot of all this is that AltiVec can keep up its single-cycle throughput as long as the L1 keeps the data stream full. The FP and integer instructions aren't going to hold up execution by throwing an interrupt.

The MMX story
Intel introduced MMX first as an integer-only SIMD solution. MMX doesn't support floating-point arithmetic at all! Even as MMX was being rolled out, Intel knew that they had to include FP support at some point.

Floating-point operations and MMX operations share a register space, so a programmer can't safely mix floating-point and MMX instructions in the same routine. Of course, since there's no mode bit for MMX or FP, there's nothing to prevent a programmer from pulling such a stunt and corrupting his floating-point data. The fact that you can't mix floating-point and MMX instructions normally isn't a problem, though. In most programs, floating-point calculations are used for generating data, while SIMD calculations are used for displaying it.

In all, MMX added 57 new instructions to the x86 ISA. The MMX instruction format is pretty much like the conventional x86 instruction format.

Without going into the details, it is sufficient to say that Intel shortchanged themselves by persisting with a different avatar of MMX in SSE. It is decidedly inferior to the AltiVec scheme (which came after MMX). Intel had its reasons, though. First and foremost, they wanted to conserve die space. As it is, MMX/SSE adds 10% to the size of the PIII's die. If they had gone ahead and implemented an independent SIMD multiplication unit, this percentage would have been higher. So they reused some FP hardware to keep the transistor count low. Furthermore, doing things this way allows them to use the existing 64-bit data paths inside the CPU to do 128-bit computation. Adding dedicated SIMD floating-point hardware and a 128-bit internal data path to push vectors down would have really eaten up transistor resources. Intel was also able to limit the changes to the PIII's instruction decoder by implementing 128-bit SIMD FP in this manner. Finally, because the SIMD adder and multiplier are independent and on two different ports, the PIII can dispatch a 64-bit add and a 64-bit multiply at the same time.

Remember that all of this 128-bit talk applies only to floating-point instructions. Old-school integer MMX calculations are still restricted to the world of 64-bits. Such is the price of backwards compatibility.

By breaking up the 128-bit ops into two 64-bit uops ("uop" = microinstruction) and running them either concurrently or sequentially, the PIII opens itself up to the possibility that one of the uops will encounter a snag and have to bail out ("throw an exception") after the other one has already been retired. If this were to happen, then only half of the 128-bit destination register would hold a valid result. Ooops.

To prevent this, the PIII includes special hardware in the form of a Check Next Micro-Operation (CNU) mechanism. What this hardware does is keep the first uop from retiring if the second one throws an exception. This means that once the re-order buffer (which keeps track of execution and retirement) gets the first, completed uop of a 128-bit instruction, it has to wait up for the second uop to finish before it can retire them both. This has the potential to slow things down.
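
Here is a loose software analogy of the half-updated register problem, in Python. It is only a sketch of the idea, not of how the PIII hardware actually works: the overflow threshold LIMIT, the list standing in for a 128-bit register, and all the function names are my own inventions.

```python
LIMIT = 1000.0   # hypothetical threshold that triggers an "exception"


def execute_uop(a, b):
    """Add two 2-element halves; raise if any lane goes past LIMIT."""
    result = tuple(x + y for x, y in zip(a, b))
    if any(abs(r) > LIMIT for r in result):
        raise OverflowError("lane overflow")
    return result


def add_128_naive(dest, a, b):
    """Retire each 64-bit half as soon as it is done."""
    dest[0:2] = execute_uop(a[0:2], b[0:2])   # first half is written right away
    dest[2:4] = execute_uop(a[2:4], b[2:4])   # may raise with dest half updated


def add_128_checked(dest, a, b):
    """CNU-style: hold back the first half until the second is known good."""
    low = execute_uop(a[0:2], b[0:2])
    high = execute_uop(a[2:4], b[2:4])        # raises before dest is touched
    dest[0:4] = low + high


a = [1.0, 2.0, 900.0, 4.0]
for add in (add_128_naive, add_128_checked):
    dest = [0.0, 0.0, 0.0, 0.0]
    try:
        add(dest, a, a)
    except OverflowError:
        pass
    print(add.__name__, dest)
```

The naive version leaves the destination with a valid low half and a stale high half after the exception; the checked version leaves it untouched, at the cost of always waiting for the second half.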

Intel got around this by taking advantage of a common case in multimedia processing. Often, as in the case of saturated arithmetic, exceptions like overflow and underflow are masked, which means that the programmer has told the processor to just ignore them. If an MMX or SSE instruction has its exceptions masked, then since the PIII would ignore the exception anyway it just doesn't bother having the re-order buffer (ROB) wait up for the second uop. In this case then, the ROB can go ahead and retire each uop individually. This is much faster, and since it's the common case it reduces the impact of exception handling on performance.
Clearly, the key here is saturated arithmetic. This shortcut, applied to non-saturated arithmetic, would yield serious errors!

Moving on to AMD's implementation, i.e. 3DNow!:

AMD's 3DNow!, as it's implemented on the Athlon, faces problems similar to those faced by SSE. Since 3DNow!, like SSE, incorporates MMX in all its 64-bit, x87 register-sharing glory, it has to deal with all the less desirable features of MMX (only 64-bit integer computation, the two-operand instruction format, etc.). 3DNow! takes the 57 MMX instructions and adds to them 21 unique instructions that handle floating-point arithmetic, for a total of 78 instructions. The Athlon's Advanced 3DNow! adds another 24 new, SSE-like instructions (for DSP, cache hinting, etc.), bringing the SIMD instruction count up to 102.

3DNow! achieves four-way single-precision (128-bit) FP throughput in much the same way that the PIII does: by executing a pair of 2-way operations in parallel on two different SIMD execution units. Like the PIII, the two units are independent of each other, and one does addition and the other multiplication. This means that for 3DNow! to be able to do sustained 128-bit computation, it has to issue a 2-way single-precision multiply and a 2-way single-precision add in parallel. However, unlike either the PIII or AltiVec, 3DNow! has no 128-bit instructions. So any "128-bit SIMD computation" that it does is purely the result of using two 64-bit instructions in parallel. Another big difference between the Athlon's SIMD implementation and the PIII's is that the Athlon has two independent, fully-pipelined floating-point functional units, and both of them do double duty as SIMD FUs. (Recall that the PIII has two FPUs that aren't fully pipelined, and only one of them does double duty as a SIMD FU.)

The final important difference between SSE and 3DNow! is the fact that all the 3DNow! operations, both integer and floating-point, share the same 8 registers with MMX and x87 operations. There are no added registers, like on the PIII. This is good and bad. It's good in that you can switch between 3DNow! and MMX instructions without a costly change of state. It's bad insofar as that's very few registers for the compiler to be working with. (The Athlon has a load of internal, microarchitectural 3DNow!/MMX/FP registers, so it can use register renaming to help alleviate some of the register starvation. The PIII also has microarchitectural rename registers for this purpose, but the Athlon has more of them.)

3DNow! is a new CPU instruction set developed by AMD. As the name suggests, it offers enhanced visual performance by removing bottlenecks in multimedia and floating-point-intensive applications. In practice, this translates to faster frame rates on high-resolution scenes, much better physical modeling of real-world environments, sharper and more detailed 3D imaging, smoother video playback, and near-theater-quality audio.
The 3DNow! technology is compatible with today's existing x86 software and requires no operating system support, allowing 3DNow! applications to work with all existing operating systems. This technology is implemented by processors from AMD beginning with AMD-K6-2, AMD-K6-III, and AMD Athlon processors. Beginning with the AMD Athlon processor, 3DNow! technology has been enhanced to add five new 3DNow! digital signal processing (DSP) instructions and 19 MMX Extensions, including streaming functionality.
Today's 3D applications are facing limitations because only one floating-point execution unit exists in the most advanced x86 processors. Advanced graphics computations are very floating-point intensive and often limit the features and functionality of a 3D application. The CPU cannot always provide processed data fast enough. The result is a bottleneck; 3D graphics performance is limited by the earlier floating-point-intensive stages of the graphics pipeline. 3DNow! technology relieves the bottleneck by accelerating the floating-point and other calculations that occur early in the pipeline. Ultimately, 3DNow! technology and the 3D graphics accelerators complement one another.
The source of performance for the 3DNow! instructions originates from the single-instruction, multiple data (SIMD) implementation. With SIMD, each instruction not only operates on two single-precision, floating-point operands, but the microarchitecture within the processor can execute up to two 3DNow! instructions per clock through two register execution pipelines, which allows for a total of four floating-point operations per clock. In addition, because the 3DNow! instructions use the same floating-point registers as the MMX technology instructions, task switching between MMX and 3DNow! operations is eliminated.
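To make the "two operands per instruction, two instructions per clock" arithmetic concrete, here is a minimal Python sketch of what a 2-way packed single-precision operation does. PFADD and PFMUL are real 3DNow! instruction names, but the helpers below (pack2, unpack2, pfadd, pfmul, peak_flops_per_clock) are illustrative emulations written for this reference, not AMD code, and they ignore details such as rounding modes and overflow handling.

```python
import struct

LANES_PER_INSTRUCTION = 2    # two single-precision floats per 64-bit register
INSTRUCTIONS_PER_CLOCK = 2   # two execution pipelines, as described above


def pack2(lo, hi):
    """Pack two single-precision floats into one 64-bit register value."""
    return int.from_bytes(struct.pack("<2f", lo, hi), "little")


def unpack2(reg):
    """Split a 64-bit register value back into its two float lanes."""
    return struct.unpack("<2f", reg.to_bytes(8, "little"))


def pfadd(a, b):
    """Emulate a packed add: both lanes are added by one instruction."""
    a_lo, a_hi = unpack2(a)
    b_lo, b_hi = unpack2(b)
    return pack2(a_lo + b_lo, a_hi + b_hi)


def pfmul(a, b):
    """Emulate a packed multiply: both lanes are multiplied by one instruction."""
    a_lo, a_hi = unpack2(a)
    b_lo, b_hi = unpack2(b)
    return pack2(a_lo * b_lo, a_hi * b_hi)


def peak_flops_per_clock():
    """Two lanes per instruction times two instructions per clock."""
    return LANES_PER_INSTRUCTION * INSTRUCTIONS_PER_CLOCK


x = pack2(1.0, 2.0)
y = pack2(3.0, 4.0)
print(unpack2(pfadd(x, y)))     # (4.0, 6.0)
print(unpack2(pfmul(x, y)))     # (3.0, 8.0)
print(peak_flops_per_clock())   # 4
```

The point of the sketch is only that one instruction touches both lanes at once; the peak of four operations per clock is reached only when an add and a multiply can be issued together.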
The 3DNow! technology instruction set contains up to 26 instructions that support SIMD floating-point operations and includes SIMD integer operations, data prefetching, and faster MMX-to-floating-point switching. To improve MPEG decoding, the 3DNow! instructions include a specific SIMD integer instruction created to facilitate pixel-motion compensation. Because media-based software typically operates on large data sets, the processor often needs to wait for this data to be transferred from main memory. The extra time involved with retrieving this data can be avoided by using the 3DNow! instruction called PREFETCH. This instruction can ensure that data is in the level 1 cache when it is needed.

To improve the time it takes to switch between MMX and x87 code, the 3DNow! instructions include the fast entry/exit multimedia state (FEMMS) instruction, which eliminates much of the overhead involved with the switch. The addition of 3DNow! technology expands the capabilities of the AMD family of processors and enables a new generation of enriched user applications.

Enhanced 3DNow! technology instructions, first implemented on the AMD Athlon processor, provide 24 new 3DNow! instructions. The Enhanced 3DNow! technology consists of 5 new 3DNow! DSP instructions and 19 new MMX Extensions. Along with the new instructions, the AMD Athlon processor implements additional micro-architecture enhancements that enable more efficient operation of all these instructions, and programming can be simplified because there are fewer coding restrictions.

Hyperthreading and gaming performance:

Let us analyze Intel's marketing spin about hyperthreading and how it helps gaming. Again I'd take the liberty of using yeha's answer to my question.

Q. Well, if you are able to process many instructions simultaneously (multi-threading), wouldn't that in effect remove the extended waiting period when compared to a single queue/thread, thereby removing the bottleneck?
Originally Posted by yeha
Hyperthreading would help gaming if the game was multithreaded, but most aren't. hyperthreading helps intel cpus because it 'fills in the gaps' at times when intel's design has nothing to work on - if a section of the pipeline is empty because it's waiting for something else to complete before scheduling a task in there, it can fill that empty pipeline stage with work that another concurrent thread needs to have done. intel typically has more gaps in its pipelines than amd, which is better at keeping its pipelines full.

As for gaming performance, i'm not 100% sure. for whatever reason, the intel cannot keep its pipelines full - if they were full, amd would be getting creamed by the greater number of retired instructions per second that intel would have. in situations where intel can keep its pipelines full (encoding) it is visibly superior because of clock speed, so i guess we have to conclude that gaming is a pathological case of highly-branched code.

I've never looked at game programming before, but finding shortest paths and enemy ai are all i can think of that would be heavily branching. physics, geometry and collision detection all feel like highly-parallel operations to me that shouldn't be branching much, and a lot of them can be offloaded onto the gpu anyway. i'll do some searching to find an explanation of what most common games are doing and why it's penalizing intel so much.
Q. Now isn't it the job of the OS to split sequential instructions into two threads? How is this different from load sharing in SMP and why can't the same be implemented here? By hyperthreading, is the CPU executing a parallel instruction stream?
Originally Posted by yeha
The operating system doesn't split up code sections on a per-instruction basis, it just loads executable code into big blocks of memory, and when the os scheduler decides that program deserves some cpu time it tells the cpu "start executing code at this memory location".

When hyperthreading is in use, the os gives the cpu two memory locations to execute code from - one area will get 90% of the cpu's resources, and the second memory area full of code will be used to fill in gaps that the cpu encountered running the first thread's code.
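The "filling in the gaps" idea from yeha's answers can be sketched with a toy model. Everything here is an assumption made for illustration: one issue slot per cycle, a primary thread "A" that always gets priority, and a secondary thread "B" that only runs when A has nothing ready. Real hardware is far more complicated, and the 90% figure quoted above is not modelled.

```python
def issue_slots(primary_ready, secondary_ready):
    """Per cycle, issue from thread A if it has work, else from thread B."""
    schedule = []
    for a_ready, b_ready in zip(primary_ready, secondary_ready):
        if a_ready:
            schedule.append("A")
        elif b_ready:
            schedule.append("B")
        else:
            schedule.append("-")   # pipeline bubble: nobody had work
    return schedule


def utilisation(schedule):
    """Fraction of cycles in which some thread issued work."""
    return sum(slot != "-" for slot in schedule) / len(schedule)


# Hypothetical trace: thread A stalls in cycles 1, 4 and 5.
thread_a = [True, False, True, True, False, False, True, True]
no_second_thread = [False] * len(thread_a)
second_thread = [True] * len(thread_a)

print(issue_slots(thread_a, no_second_thread))
print(utilisation(issue_slots(thread_a, no_second_thread)))  # 0.625
print(issue_slots(thread_a, second_thread))
print(utilisation(issue_slots(thread_a, second_thread)))     # 1.0
```

If the program only has one thread, the second list stays empty and the bubbles remain, which is the situation yeha describes for most games.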
Q. Heavy branching is favoured by short pipelines and long pipelines favour monotonous, highly predictable tasks; is this conclusion right?
Originally Posted by yeha
The shorter your pipeline the less penalty a mispredicted branch incurs. the longer your pipeline the faster you can blast through instruction sequences that mostly stream data in & out with little branching, like encoding.
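A rough back-of-the-envelope model shows the trade-off. The assumptions are mine, not yeha's: a mispredicted branch costs about one full pipeline refill (pipeline_depth cycles), the base rate is one instruction per cycle, and the depths, clock speeds and branch statistics below are made-up illustrative numbers rather than measurements of any real CPU.

```python
def cycles_per_instruction(pipeline_depth, branch_fraction, mispredict_rate,
                           base_cpi=1.0):
    """Average cycles per instruction with a flush penalty of one pipeline refill."""
    return base_cpi + branch_fraction * mispredict_rate * pipeline_depth


def time_per_instruction_ns(pipeline_depth, clock_ghz, branch_fraction,
                            mispredict_rate):
    """Average wall-clock time per instruction in nanoseconds."""
    cpi = cycles_per_instruction(pipeline_depth, branch_fraction, mispredict_rate)
    return cpi / clock_ghz


# Hypothetical designs: a short pipeline at a lower clock, a long one at a higher clock.
SHORT = dict(pipeline_depth=12, clock_ghz=2.4)
LONG = dict(pipeline_depth=31, clock_ghz=3.6)

# Hypothetical workloads: (fraction of branches, fraction of those mispredicted).
workloads = {"streaming (encoding-like)": (0.02, 0.02),
             "branchy (game-like)": (0.25, 0.20)}

for name, (branches, misses) in workloads.items():
    t_short = time_per_instruction_ns(branch_fraction=branches,
                                      mispredict_rate=misses, **SHORT)
    t_long = time_per_instruction_ns(branch_fraction=branches,
                                     mispredict_rate=misses, **LONG)
    winner = "long pipeline" if t_long < t_short else "short pipeline"
    print(f"{name}: short={t_short:.3f} ns, long={t_long:.3f} ns -> {winner}")
```

With these made-up numbers the long, fast pipeline wins on the streaming workload and loses on the branchy one, which is the pattern described in the answer above.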

With regards to the On-die Memory controller on Athlon64's
I started a thread to investigate the role of a poor memory controller in overclocking and whether this problem actually existed. A lot of people said it did, but I was not quite convinced. Here is a very interesting question of mine answered by emboss.

Q. If you look at the block diagram for the K8 system, one particular area I'm looking at (still gathering info regarding the specifics) is the X-Bar unit/switch, which is a bridge between the memory controller and the other parts of the CPU. Also, there may be a problem with using one base clock generator for both the hypertransport bus and the memory controller (i.e. 200MHz). There could also be a problem with keeping more than 4 memory pages open at a time (I think it's 16 for an Athlon 64) as most chipsets seem to support only 4. So there could also be a problem with the NF4 chipset implementation of this feature.

Also, since the GART table is integrated into the controller, the issue you see could be a result of the video card (I also think there is an internal PCI bridge).

Originally Posted by emboss
Unfortunately there seems to be very little interest from sites like chip-architect on the hypertransport/crossbar/MCU area of the K8, so a lot of the below is from less reliable sources and guesses

First, the memory controller is the same across all K8 CPUs of the same core. So 939 FX's have the same controller as all other A64's. I'm not sure if the Opteron (and 940 FX's) and A64 lines share the same controller; I would suspect that this is the case but I haven't seen any evidence either way.

Most of the K8 core, including crossbar and memory controller, operates at the core frequency. Of course, there are parts of the HT and memory interfaces that operate at the interface speed, but most of it is at the core frequency. The hypertransport frequency only would affect the hypertransport interface components and possibly the crossbar. Ignoring clock synchronisation issues, the memory speed should only affect the memory controller. Note that the crossbar sits in the middle of the SRQ/CPU, the hypertransport link(s), and the memory controller.

I would not be surprised if the memory controller is a limiting factor in a certain sense. AMD would have designed it for 200MHz operation with some headroom. Quite possibly it just has (due to natural variation in the chips) insufficient drive strength to operate at higher frequencies. How far it goes depends on the CPU, the motherboard, the BIOS (how it configures it, etc), and the RAM. A badly-designed motherboard would require that the lines be driven harder, and if the CPU can't do it you're going to get errors. Likewise if the RAM you have is borderline.

So like most limits, it's a combination of things working against you, and it's pretty hard to say that {x} is the problem. Some motherboards just may not give you much headroom with particular revisions. Is that the fault of the motherboard or the CPU? Or the motherboard?

Finally, you have the problem of clock synchronisation. With the various parts of the CPU operating at different speeds, a small difference in (say) switching time can make a big difference if the clocks are at just the wrong ratios or phases. There is a reason why engineers hate multiple-clock (or, god forbid, completely async) designs; they're a nightmare to validate as you have switching noise at annoying times. In a completely sync design, everything switches together so you don't have to worry, for example, about latching at the wrong time.
Follow up question

Q. What is the likelihood of errors caused by the internal PCI bridge? Any clue on how this is related to the PCI-E freq we select via the BIOS?

You mentioned "drive strength"; from experience many users have concluded that TCCD based RAM likes weak drive strength. How would you account for this? Or are we talking about two different things here? I mean, is it so that the drive strength setting available in the BIOS has nothing to do with how the CPU drives the memory controller? Am I right in assuming that the X-Bar switch actually drives the mem-controller?

Originally Posted by emboss
I haven't personally come across anything that likes a weak drive strength, but then I haven't really dealt with much high-speed stuff where high drive strength can become a problem. The problem is that a high drive strength (and correspondingly fast slew rate) can create crosstalk issues. This is especially true if you have a large number of simultaneously switching lines (such as a RAM data bus or leads on a chip). Reflections also increase with a higher drive strength. So generally, with a higher drive strength stuff happens faster but with more noise.

Quite possibly TCCD is more sensitive to this noise than other types, or perhaps it has design problems (either in the RAM or on the PCB) that make it more likely to have crosstalk or reflection issues at higher drive strengths. As an aside, the motherboard would also have a significant impact on the noise to drive strength relationship.

The drive strength you select in the BIOS controls the drive strength of the RAM controller driving the RAM. I don't think there's a drive strength setting for the crossbar in the northbridge configuration space, so I would be surprised if a BIOS supports such a change. Corrections welcome, of course. The crossbar acts just like a fancy ethernet switch. It has five connections (3 hypertransport "PHYs" so to speak, the CPU, and the memory controller) and routes packets between these. This allows things like very fast IO<->memory transfers. I suspect it operates at the CPU core speed, but haven't seen any evidence for any particular operating frequency. Each of the 5 components is independent from the others (more or less).

I'm not exactly sure where the internal PCI bridge sits. AFAIK, it's only really a "virtual" PCI device that sits in the memory controller part of the crossbar. It's not like AMD slotted a HT-PCI tunnel and a PCI device in there. It is what is responsible for the CPU configuration, and "other" stuff like the GART and DMA. It certainly should be independent from any PCI-E settings, which are completely outside the chip. There's a virtual PCI connection that is tunnelled over the HT link, but that shouldn't have any problems unless your HT link has problems (in which case the PCI bridge is probably the least of your worries). As for the likelihood of errors, I would say it, like any part of the CPU, has a speed limit, but its speed of operation almost certainly is independent of the memory speed.
Hypertransport Technology

First off, let us look at what HTT means, through a Q&A session. I have used the FAQ from the HTT Consortium's webpage.

HyperTransport (HT), formerly known as Lightning Data Transport (LDT), is a bidirectional serial/parallel high-bandwidth, low-latency computer bus. The HyperTransport Technology Consortium is in charge of promoting and developing HyperTransport technology. The technology is used by AMD and Transmeta in x86 processors, PMC-Sierra and Broadcom in MIPS microprocessors, NVIDIA, VIA, SiS, ULi/ALi, and AMD in PC chipsets, Apple Computer and HP in Desktops and notebooks, HP, Sun, IBM, and IWill in servers, Cray in supercomputers, and Cisco Systems in routers.

HyperTransport runs at 200-1400 MHz (compared to PCI at either 33 or 66 MHz). It is also a DDR or "Double pumped" bus, meaning it sends data on both the rising and falling edges of the 1400 MHz clock signal. This allows for a maximum data rate of 2800 MTransfers/s per pair. The frequency is auto-negotiated.

HyperTransport supports auto-negotiated link widths, from 2 to 32 bits in each direction. The full-sized, full-speed 32-bit bus has an aggregate transfer rate of 22,400 MByte/s, making it much faster than existing standards. Busses of various widths can be mixed together in a single application, which allows for high speed busses between main memory and the CPU, and lower speed busses to peripherals, as appropriate. The technology also has much lower latency than other solutions.

HyperTransport is packet-based, with each packet always consisting of a set of 32-bit words, regardless of the physical width of the bus interconnect. The first word in a packet is always a command word. If a packet contains an address, the last 8 bits of the command word are chained with the next 32-bit word to make a 40-bit address. The remaining 32-bit words in a packet are the data payload. Transfers are always padded to a multiple of 32 bits, regardless of their actual length. HyperTransport revision 1.05 contains an option allowing an additional 32-bit control packet to be prepended when 64 bit addressing is required.

Hypertransport packets come out onto the bus in segments known as bit times. How many bit times it takes depends on the width of the bus. HT can be used for generating system management messages, signaling interrupts, issuing probes to adjacent devices or processors, and general I/O and data transactions. There are usually two different kinds of write commands that can be used, posted and non-posted. Posted writes are ones that do not require a response from the target. This is usually used for high bandwidth devices such as UMA traffic or DMA transfers. Non-posted writes require a response from the receiver in the form of a target done. Reads also cause the receiver to generate a read response.
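As a quick sanity check on the packet description above, here is a small Python sketch that pads a payload to whole 32-bit words and counts how many bit times the packet needs on a given link width. It follows the simplified layout given in this post (one command word, an optional extra word for the address, then the payload); the function names are made up for illustration and this is not taken from the HyperTransport specification.

```python
WORD_BITS = 32
VALID_LINK_WIDTHS = (2, 4, 8, 16, 32)   # bits per direction, per the FAQ


def packet_words(payload_bytes, has_address):
    """Total 32-bit words: command word, optional address word, padded payload."""
    header_words = 2 if has_address else 1
    payload_words = (payload_bytes * 8 + WORD_BITS - 1) // WORD_BITS  # round up
    return header_words + payload_words


def bit_times(total_words, link_width_bits):
    """Each bit time moves link_width_bits bits across the link."""
    if link_width_bits not in VALID_LINK_WIDTHS:
        raise ValueError(f"unsupported link width: {link_width_bits}")
    return total_words * WORD_BITS // link_width_bits


# Hypothetical example: a 6-byte write that carries an address.
words = packet_words(payload_bytes=6, has_address=True)
print(words)                      # 4 (2 header words + 6 bytes padded to 2 words)
for width in VALID_LINK_WIDTHS:
    print(width, bit_times(words, width))
```

The same packet takes 64 bit times on a 2-bit link and only 4 on a 32-bit link, which is why the packet format can stay the same regardless of how wide the physical link is.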

Hypertransport also greatly facilitates power management as it readily supports C-state specific messages for various architectures. Power management messages are transmitted in system management packets, prepended with a FDF91... For specific C-state messages, the HT specification employs the use of signals like the HTStop signal. This is to allow hypertransport controllers to disconnect end devices on the hypertransport chain when a processor is entering a C3/C4 sleep state or other state that requires a bus disconnect. This signal is typically controlled by an end device on the hypertransport chain that is responsible for initiating a C-state transition.

Its electrical interface uses 1.2 volt Low Voltage Differential Signaling (LVDS).

There has been confusion between the use of HT referring to HyperTransport and the use of HT to refer to Intel's Hyper-Threading feature of their Pentium 4 based microprocessors. Hyper-Threading is known as Hyper-Threading Technology (HTT) or HT-Technology. Because of this potential for confusion, the HyperTransport Consortium always uses the written out form: "HyperTransport".

What is HyperTransport technology?
HyperTransport chip-to-chip interconnect technology is a highly optimized, high performance and low latency board-level architecture for embedded and open-architecture systems. It provides up to 22.4 Gigabyte/second aggregate CPU to I/O or CPU to CPU bandwidth in a highly efficient chip-to-chip technology that replaces existing complex multi-level buses. In addition to delivering the industry's highest bandwidth, frequency scalability, and lowest implementation cost, the technology is software compatible with legacy Peripheral Component Interconnect (PCI) and PCI-X and emerging PCI Express technologies. HyperTransport technology delivers state-of-the-art bandwidth by means of easy-to-implement Low Voltage Differential Signaling (LVDS) point-to-point links, delivering increased data throughput while minimizing signal crosstalk and EMI. It employs a packet-based data protocol to eliminate many sideband (control and command) signals and supports asymmetric, variable width data paths.

What are the key characteristics of HyperTransport technology?
Key characteristics of the royalty-free HyperTransport technology include low latency, high bandwidth, excellent scalability, high integration, low power consumption, PCI software transparency, and small PCB footprint, with PCB manufacturing friendly electrical implementation.

How does HyperTransport technology compare to other bus technologies?
As compared to older multidrop, shared buses such as PCI, PCI-X or SysAD, HyperTransport provides a far simpler electrical interface, but with much greater bandwidth. Instead of a wide, address/data/control multidrop, shared bus such as implemented by PCI, PCI-X or SysAD technologies, HyperTransport deploys narrow, but very fast unidirectional links to carry both data and command information encoded into packets. Unidirectional links provide significantly better signal integrity at high speeds and enable much faster data transfers with low-power 1.2V LVDS signals. In addition, link widths can be asymmetrical, meaning that 2 bit wide links can easily connect to 8 bit wide links and 8 bit wide links can connect to 16 or 32 bit wide links and so on. Thus, the HyperTransport Technology eliminates the problems associated with high speed parallel buses with their many noisy bus signals (multiplexed data/address, and clock and control signals) while providing scalable bandwidth wherever it is needed in the system. As compared to newer serial I/O technologies such as RapidIO and PCI Express, HyperTransport shares some raw bandwidth characteristics, but is significantly different in some key characteristics. HyperTransport was designed to support both CPU-to-CPU communications as well as CPU-to-I/O transfers, thus, it features very low latency. Consequently, it has been incorporated into multiple x86 and MIPS architecture processors as an integrated front-side bus. Serial technologies such as PCI Express and RapidIO require serial-deserializer interfaces and have the burden of extensive overhead in encoding parallel data into serial data, embedding clock information, re-acquiring and decoding the data stream. The parallel technology of HyperTransport needs no serdes and clock encoding overhead making it far more efficient in data transfers.

How does HyperTransport technology performance compare to other bus technologies?
Performance comparisons between technologies can be problematical. Raw clock and data transfer speeds do not take into account raw bandwidth and "true" bandwidth (total data transfer minus overhead). PCI and PCI-X buses lag far behind any of the other newer technologies. For example, the traditional 32-bit/33MHz PCI bus transfers data at 133 Megabytes per second, while PCI-X transfers data at up to 1 gigabyte per second. RapidIO defines a data rate of 3.125 gigabit/second, while PCI Express defines a 2.5 gigabit/second data rate. The latest 2.0 HyperTransport specification defines a 2.8 gigatransfers/second data rate. However, gross bandwidth figures are less important than the net bandwidth available for data transfers. HyperTransport delivers 22.4 gigabytes/second of aggregate bandwidth with the lowest latency and least clocking overhead. This yields a bandwidth approximately 80 times faster than traditional PCI buses.

At what clock speeds does HyperTransport technology operate?
HyperTransport technology devices are designed to operate at multiple clock speeds from 200MHz up to 1.4 GHz, and utilize double data rate technology transferring two bits of data per clock cycle, for an effective transfer rate of up to 2.8 gigatransfer/sec in each direction. Since transfers can occur in both directions simultaneously, an aggregate transfer rate of 11.2 gigabytes per second in a 16 bit HyperTransport I/O Link and an aggregate transfer rate of 22.4 gigabytes per second in a 32-bit HyperTransport I/O Link can be achieved. To allow for system design optimization, the clocks of the receive and transmit links may be set at different rates.
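The bandwidth figures quoted in this answer can be reproduced with a few lines of Python. This is just the arithmetic from the FAQ (double data rate, width in bits per direction, both directions added together); the function names are my own.

```python
def transfers_per_second(clock_mhz):
    """Double data rate: one transfer on each clock edge."""
    return clock_mhz * 1e6 * 2


def link_bandwidth_gb_s(clock_mhz, width_bits, directions=2):
    """Bandwidth in gigabytes/second for a link of width_bits per direction."""
    bytes_per_transfer = width_bits / 8
    return transfers_per_second(clock_mhz) * bytes_per_transfer * directions / 1e9


print(transfers_per_second(1400) / 1e9)                          # 2.8 GT/s
print(round(link_bandwidth_gb_s(1400, 16), 1))                   # 11.2 aggregate
print(round(link_bandwidth_gb_s(1400, 32), 1))                   # 22.4 aggregate
print(round(link_bandwidth_gb_s(1400, 32, directions=1), 1))     # 11.2 one way
```

Note that the 22.4 gigabytes/second headline number counts both directions; a single direction of a 32-bit link carries half of that.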

What is the width of the HyperTransport I/O link bus?
The HyperTransport I/O Link is designed to allow very flexible implementations, allowing data widths of 2, 4, 8, 16, or 32-bits in each direction. Devices negotiate the bus width during initialization and operate accordingly thereafter. To allow for system design optimization, the clocks of the receiving and transmitting links may be set at different rates.

With what buses and I/O technologies is HyperTransport technology compatible?

HyperTransport technology is completely software compatible with PCI and PCI legacy I/O extensions such as PCI-X 1.0 and 2.0. Specification 2.0 includes mapping to PCI Express protocols. In addition, because of its bandwidth and packetized data/command protocol, it is easily integrated using HyperTransport bridge devices to any of today's advanced I/O technologies, such as AGP 8x, Firewire, USB, InfiniBand, PL-3, SPI-4.2, SPI-5.0, and gigabit Ethernet.

Why is HyperTransport technology compared to PCI technology?
PCI is the most pervasive bus in personal computing and is widely used in networking applications, servers and even in embedded systems. HyperTransport technology preserves the large investment that has already been made in PCI while providing a powerful combination of low-cost implementations and high bandwidth. HyperTransport solves many of the technical limitations of PCI while preserving the software infrastructure of this widely used technology.

CPU Clock and the System Bus:
In the pre-80486 days, everything (the motherboard, expansion slots, CPU, caches, etc.) was based off of one timing clock/oscillator. This presented an obvious problem: what happens when CPU frequencies increase with each generation? With the introduction of the DX2 flavour of the 80486, Intel made the CPU run at 2X the frequency of everything else. Now you see the need for the multiplier: they kept the bus frequency constant and introduced the multiplier.

However, bumping up the multiplier while keeping the bus frequency or "FSB" at a measly 66.6 MHz was useless, as there were too many idle cycles (the CPU was clocked at 4X the FSB). So they raised the bus frequency. The central idea here is to keep the idle cycles to a minimum. We see that this influenced other technologies like RAM (PC100 was born) and PCI frequencies. I'm not sure why PCI frequencies remain locked at 33.33 MHz and AGP at 66.66 MHz. Maybe it has something to do with SNR issues.
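The relationship between bus clock, multiplier and idle cycles is simple enough to put into a short Python sketch. The 66.6 MHz and 4X figures come from the paragraph above; the 100 MHz comparison and the "10 bus cycles" wait are made-up numbers, chosen only to show why a higher bus frequency wastes fewer core cycles.

```python
def core_clock_mhz(bus_mhz, multiplier):
    """Core frequency is the bus frequency times the multiplier."""
    return bus_mhz * multiplier


def core_cycles_spent_waiting(bus_cycles, multiplier):
    """While the bus takes one cycle, the core ticks `multiplier` times."""
    return bus_cycles * multiplier


# Figures from the text: a 66.6 MHz bus with a 4X multiplier.
print(round(core_clock_mhz(66.6, 4), 1))   # 266.4

# Hypothetical comparison: about the same core clock reached two different ways.
for bus_mhz, multiplier in ((66.6, 6), (100.0, 4)):
    clock = round(core_clock_mhz(bus_mhz, multiplier), 1)
    wasted = core_cycles_spent_waiting(bus_cycles=10, multiplier=multiplier)
    print(f"{bus_mhz} MHz x {multiplier} = {clock} MHz core, "
          f"{wasted} core cycles idle per 10-bus-cycle wait")
```

Both setups land near 400 MHz, but the one with the faster bus idles for 40 core cycles instead of 60 during the same wait.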

A bit of trivia here: early Celerons did not have an L2 cache, and the L2 caches on P2's were on separate chips off-die. The Celeron 300A's were the first chips with on-die L2 cache.

The multiplier info is said to be stored in special ROMs on the CPU. It is said to be impossible to change it beyond its upper bound (if not allowed). This gives rise to the term "locked". An excellent read on why the multiplier should be on the CPU die, by Sander Sassen of Hardware Central, is quoted here verbatim (see the reference section for the link).

Having the multiplier on the PCB has the following drawbacks:
- If the multiplier lock were on the PCB it could be circumvented by using some logic circuitry to bypass any input values before they were entering the actual CPU-core, thus setting a different multiplier. This would mean cutting the traces that derive the multiplier leading to the CPU’s core, and applying the desired input values to get the desired multiplier.

- Having the multiplier lock on the PCB requires active logic circuitry, so there is no trace routing or via selection which could set this multiplier value without actively responding to input values. So we can’t just use a different via layout, or trace pattern.

- Having the multiplier lock on the PCB needs different PCBs for each type of processor, so each CPU model would have a dedicated PCB; this drives up the cost pretty quickly.

Having the multiplier on the CPU-core has the following advantages:
- Changing the multiplier from the inside is impossible, because we can’t get into the CPU-core.

- Changing the multiplier from the factory can be as simple as writing the value into an on-die PROM (Programmable Read Only Memory), either using a similar setup to that described above (setting input values during the active phase of another signal) or using inputs listed as ‘reserved’ as the input values.

- Any CPU coming off of the production line can be ‘programmed’ to any desired value, permitting a great deal of flexibility; further, if the ROM is re-programmable (EEPROM, Electrically Erasable Programmable Read Only Memory), already manufactured and programmed CPUs can be re-programmed to adapt to market demand.

- The interface used to program the on-die multiplier lock does not have to adhere to any particular protocol used by the chipset or CPU; it can use a totally different and independent protocol.

- The implementation of a multiplier lock on the CPU-core only uses a few thousand transistors, and as we have millions available, this is easily accommodated.

- There are no added costs or components needed to implement the multiplier lock, as it is independent of the PCB and is part of the CPU-die. It is manufactured with the CPU-core and will thus not drive up costs.

- Interface with and configuration of the multiplier lock can be easily protected using a code or a bit value at the inputs, or can be a one-time operation (write once, read many times)--thus ruling out 99.9% of all attempts to change it from the outside.

- The multiplier factor is independent of the PCB used or of the CPU-core soldered onto the PCB; it is solely dependent on the programming of the multiplier lock.
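The "write once, read many times" idea from the list above can be modelled in a few lines of Python. This is purely a toy: the MultiplierLock class and its methods are hypothetical names invented for this sketch and say nothing about how any real CPU stores or protects its multiplier.

```python
class MultiplierLock:
    """Toy model of a write-once multiplier store (hypothetical, for illustration)."""

    def __init__(self):
        self._value = None          # nothing programmed yet

    def program(self, multiplier):
        """Factory step: may succeed only once."""
        if self._value is not None:
            raise PermissionError("multiplier already programmed (write once)")
        self._value = multiplier

    def read(self):
        """Reading is always allowed."""
        return self._value


lock = MultiplierLock()
lock.program(11.0)                  # set at the factory
print(lock.read())                  # 11.0

try:
    lock.program(13.0)              # a later attempt to change it
except PermissionError as err:
    print("rejected:", err)
print(lock.read())                  # still 11.0
```

Once the value is written, every later attempt is refused, which is the behaviour the article means by a "locked" multiplier.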
