The fetch-execute cycle and the impact of operation types


What is the fetch-execute cycle in a processor?

A conventional processor consists of five units: the ALU, the control unit, internal storage, internal interconnections, and the external interface. These units are designed to work together continuously. What the processor and these components should do at any moment is decided by the program currently being executed. The way the processor reads the next instruction, executes it, and produces and stores the result of the computation is called the fetch-execute cycle. Its steps can be described as follows:


The CPU has a special register named the Program Counter (PC) (also referred to as the Instruction Pointer), which is part of internal storage. The CPU copies the memory address stored in the Program Counter (PC) into the Memory Address Register (MAR). This step can be denoted as “[MAR] <= [PC]”.

As soon as the value of the PC is read into the MAR, the PC is incremented so that it points to the next instruction. If the instruction length is 32 bits, this increment is by 4, denoting 4 bytes. This step can be denoted as “[PC] <= [PC] + 4”. If the instruction length were 64 bits, the increment would be 8, denoting 8 bytes.

It should be mentioned that some CPUs do not have a MAR built in. In such cases the PC is still incremented at each step, but it is used directly instead of copying its contents into a MAR.

Once this step is complete, the address stored in the MAR is placed on the address bus. The data stored at that memory address is then fetched by the CPU with the help of the data bus and control bus. The data fetched from memory is stored in the Memory Buffer Register (MBR) (also referred to as the Memory Data Register (MDR)) of the processor. This can be denoted as “[MBR] <= [[MAR]]”.

The fetched instruction, now available in the MBR, is carried to the Instruction Register (IR) of the processor. This step can be denoted as “[IR] <= [MBR]”.

With the instruction now in the appropriate register of the processor, the fetch step is complete. These fetch steps are common to most instructions.
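The fetch steps above can be sketched as a small simulation. The register names mirror the register-transfer notation used here; the memory contents and instruction encodings are purely illustrative values, not real x86 encodings.

```python
# Minimal sketch of the fetch phase. Memory maps word-aligned addresses
# to 32-bit instruction words (hypothetical values for illustration).
memory = {0: 0xA1000010, 4: 0xB2000014}

PC = 0              # Program Counter points at the next instruction
MAR = PC            # [MAR] <= [PC]
PC = PC + 4         # [PC] <= [PC] + 4 (32-bit instruction length)
MBR = memory[MAR]   # [MBR] <= [[MAR]]
IR = MBR            # [IR] <= [MBR]

print(hex(IR))      # 0xa1000010 - instruction is ready for decoding
print(PC)           # 4 - PC already points at the next instruction
```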


During this step, the instruction in the Instruction Register (IR) is examined and the required operation is identified using its operation code (opcode). If any additional data from memory is required to carry out the operation, it is loaded into data registers with the help of the address bus, data bus and control bus, as mentioned before.


The control unit (CU) then passes the decoded information as signals to the various units of the processor mentioned before. Some instructions involve arithmetic operations, which are passed to the ALU. In such cases data from the registers is given to the ALU, and after processing the ALU writes the new value back to a register. This step is sometimes separated from the execution phase and called the “write back” step. Depending on the executed instruction, it is also possible to change the Program Counter to point at a different location.

Consider the example scenario of adding two values together. This can be written, using an assembly-style notation, as shown below:


“MOVE R0,A”, which moves the value stored at memory address A into register R0, results in the sequence of actions below:

[MAR] <= [PC]
[PC] <= [PC] + 4
[MBR] <= [[MAR]]
[IR] <= [MBR]

CU <= [IR]

[MAR] <= [A]
[MBR] <= [[MAR]]
[R0] <= [MBR]

“ADD R0,B”, which instructs the ALU to add the value of register R0 to the value stored at memory address B and store the result in register R0, results in the sequence of actions below:

[MAR] <= [PC]
[PC] <= [PC] + 4
[MBR] <= [[MAR]]
[IR] <= [MBR]

CU <= [IR]

[MAR] <= [B]
[MBR] <= [[MAR]]

ALU <= [R0]
ALU <= [MBR]

[R0] <= ALU

“MOVE C,R0”, which moves the value stored in register R0 into memory address C, results in the sequence of operations below:

[MAR] <= [PC]
[PC] <= [PC] + 4
[MBR] <= [[MAR]]
[IR] <= [MBR]

CU <= [IR]

[MAR] <= [R0]
[MBR] <= [[MAR]]
[C] <= [MBR]
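The execute portions of the three instructions above can be tied together in a short simulation: together they compute mem[C] = mem[A] + mem[B]. The addresses A, B and C are arbitrary illustrative locations, and the fetch/decode micro-steps are omitted to keep the sketch short.

```python
# Sketch of MOVE R0,A / ADD R0,B / MOVE C,R0 acting on a dict-based memory.
A, B, C = 100, 104, 108
memory = {A: 7, B: 5, C: 0}

# MOVE R0,A  ->  load the value at address A into R0
MAR = A; MBR = memory[MAR]; R0 = MBR
# ADD R0,B   ->  ALU adds [R0] and the value at address B, result to R0
MAR = B; MBR = memory[MAR]; R0 = R0 + MBR
# MOVE C,R0  ->  store R0 at address C
MBR = R0; memory[C] = MBR

print(memory[C])  # 12
```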

Nature of operations and their impact on the fetch-execute cycle

As discussed in the last section, the decode step loads any additional data from memory that is required to carry out the operation. There are operations that accept one operand, two operands, and up to four operands, as mentioned in the X86 opcode reference. The number and type of operands decide how many clock cycles are needed to request data on the address bus, receive it over the data bus, and store it in data registers. It is clear that the fetch-execute cycle depends on the type of operation, because each type may require a different number of data reads from or writes to memory.

Once all the data is read into the required registers, the control unit passes instructions to the various units using control signals that tell them what to do. The ALU, or Arithmetic Logic Unit, is responsible for arithmetic and logical operations, as the name suggests. As discussed in previous lessons of the subject, the ALU is a complex logic circuit, and a sequential logic circuit requires a clock signal to change its state and carry out operations. The logic circuit used to add two binary values is not the same as the one used to multiply two numbers, and these different circuits need different numbers of clock cycles to complete a given operation. This difference causes some operations to occupy the ALU for many more clock cycles than others, resulting in longer fetch-execute cycles.

Next we’ll discuss a few basic arithmetic operations and their logic circuits, to see in practice why different operations take different numbers of clock cycles to complete.

Flynn, M. J., 1995. Computer Architecture: Pipelined and Parallel Processor Design. 1st ed., page 671 [2] suggests ALU operation costs for basic arithmetic operations as mentioned below; however, these values depend on the type of processor, the manufacturer and the instruction set used, and they change as the technology develops.


Up-to-date information for Intel processors can be obtained by referring to the “Intel Architecture Optimization Manual” for the relevant processor technology. Manuals from other processor manufacturers are available under their own titles.

Integer Addition

As discussed in lessons, a full adder with carry-in can be designed using a few logic gates. For the moment, let us assume that our 4-bit ALU only knows how to add two 4-bit numbers. We can construct such a simple ALU using four full adders as shown below:


As you may notice, the final carry is not considered here. This is not good practice, and in actual ALUs techniques like carry lookahead are used to address this problem and flag the user if necessary. However, the most important point is that if both numbers can be made available on A0 to A3 and B0 to B3 in a single clock cycle, the addition can be performed at the same time and the output generated.
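The four-full-adder ALU described above can be sketched in software, with each full adder expressed through its standard gate equations. As in the simple ALU above, the final carry out is dropped.

```python
# Sketch of a 4-bit ripple-carry adder built from four chained full
# adders. Bit lists are least-significant-bit first: [A0, A1, A2, A3].
def full_adder(a, b, cin):
    s = a ^ b ^ cin                    # sum bit (XOR of all three inputs)
    cout = (a & b) | (cin & (a ^ b))   # carry out
    return s, cout

def add4(a_bits, b_bits):
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):   # carry ripples through each stage
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out                         # final carry dropped, as above

# 5 (0101) + 3 (0011) = 8 (1000), bits listed LSB first
print(add4([1, 0, 1, 0], [1, 1, 0, 0]))  # [0, 0, 0, 1]
```

In hardware all four stages settle within one clock cycle once the inputs are stable; the ripple of the carry through the chain is what carry-lookahead designs shorten.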

Floating Point Addition

When it comes to adding two floating-point numbers, it is not possible to follow simple logic like this. Considering a simple processor, it is first necessary to compare the exponents of the two floating-point numbers to determine how far to shift the mantissa of the number with the smaller exponent. Integer subtraction can be used to count the number of shifts required, and data latches can be used for the bit shifting. However, presenting the exponents for subtraction and comparing the resulting value requires additional clock cycles.
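The alignment step described above can be sketched as follows. Numbers are represented here as bare (mantissa, exponent) pairs purely for illustration; real hardware works on IEEE 754 fields and also handles signs, rounding and normalisation, which this sketch omits.

```python
# Sketch of floating-point addition: compare exponents, shift the
# mantissa of the smaller-exponent operand right, then add mantissas.
def fp_add(m1, e1, m2, e2):
    if e1 < e2:                        # make operand 1 the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    shift = e1 - e2                    # integer subtraction of exponents
    m2 >>= shift                       # align the smaller mantissa
    return m1 + m2, e1                 # add aligned mantissas

# 12*2^2 + 5*2^0: 5 is shifted right by 2, losing its low bits
print(fp_add(12, 2, 5, 0))  # (13, 2) i.e. 52 -- true value is 53
```

Even in this toy form, the compare, subtract and shift are distinct steps before the mantissa addition can begin, which is why floating-point addition costs more cycles than integer addition.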

When it comes to multiplication and division, the logic circuits get more complex, and floating-point arithmetic takes much longer still.

Practically Measuring Clock Cycles Required to Complete Operations

Modern general-purpose processors are multiprogramming processors: process schedulers run in the background, scheduling various processes using a number of scheduling algorithms. Under such conditions it is practically impossible to check exactly how many clock cycles a single operation takes.

However, the “Mississippi State University – Department of Electrical and Computer Engineering” suggests a reasonable method of doing this using assembly language [1]. The program uses two loops. The first loop is an empty loop which simply iterates a certain number of times. Around it, the “RDTSC” x86 instruction (opcode 0F 31) is used to Read the Time-Stamp Counter, a special register which is incremented on every clock cycle until it is reset. The approximate number of clock cycles taken to execute the empty loop is calculated by reading the Time-Stamp Counter at the beginning and end of the loop. The second loop is the same as the first, except that its body contains an arithmetic operation of choice; the clock cycles taken for this loop are calculated in the same way. The difference between the counts for the two loops is taken as the approximate number of clock cycles needed to execute the arithmetic operation. The code of these two loops can be checked using the linked reference material.
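The same measure-the-empty-loop-then-subtract idea can be sketched in a high-level language, with a monotonic nanosecond timer standing in for RDTSC. This is only an analogue of the referenced assembly program: it measures wall-clock time rather than raw cycle counts, so results are far noisier.

```python
import time

def timed(body, n=100_000):
    """Run body() n times; return elapsed nanoseconds, RDTSC-style:
    read the counter before and after, and take the difference."""
    start = time.perf_counter_ns()            # stands in for the first RDTSC
    for _ in range(n):
        body()
    return time.perf_counter_ns() - start     # second read minus first

empty = timed(lambda: None)                   # cost of the loop itself
mul = timed(lambda: 12345 * 6789)             # loop plus a multiplication

extra = mul - empty   # approximate cost attributable to the operation
print(extra)          # varies run to run; can even come out negative
```

As with the assembly version, a negative difference is possible when scheduling noise makes the empty loop run longer than the loaded one.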

As you may have already guessed, this will not give a 100% accurate result. Even this simple assembly program might be rescheduled a number of times during its few milliseconds of execution, as the operating system and other processes gain the processor and send this process to the waiting queue. However, the method is robust enough to show how the cost of operations changes.

Microsoft Macro Assembler (MASM) 6.1.1 was used to build and test this application using the following sequence of commands: MASM inittim.asm / ml intim.obj Irvine.lib. (Irvine.lib is a library written by Kip R. Irvine, used here to display values on standard output.)


Execution results for integer addition:



As you can see, it gives outputs like F, 3C and 2D, which mean the executions took 15, 60 and 45 clock cycles, leading to an average of 28.7 clock cycles. (FFFFFFFFFFFFFFFF1/3 is printed when the first loop takes longer than the second loop. This is possible with addition because it is actually an operation that requires only one clock cycle to complete.)

Execution results for integer multiplication:


The outputs were E1, 87, B4, A5, C3 and 4B, which mean the executions took 225, 135, 180, 165, 195 and 75 clock cycles, leading to an average of 64 clock cycles.

Execution results for integer division:



The outputs were 2A3, 285, 249 and 258, which mean the executions took 675, 645, 585 and 600 clock cycles, resulting in an average of 655 clock cycles.

Execution results for floating-point addition:


Floating-point operations were not supported in the original implementation. After providing the required implementation of floating-point addition in x86 instructions, the result was as shown below:


The outputs were 7224, 7233 and 727E, which mean the executions took 29220, 29235 and 29310 clock cycles, resulting in an average of 29233 clock cycles.

This value seems a bit large, but for a 2.0 GHz dual-core processor (2 × 2,000,000,000 = 4,000,000,000 clock cycles per second), 29233 clock cycles is just under 8 microseconds (0.00000730825 seconds). (Please note that multiplying by 2 to get the combined speed of both cores is not accurate in practical, real-life scenarios, because the speedup of parallel processing follows Amdahl’s law.)
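The cycle-to-time conversion above is simple arithmetic; here it is worked through, keeping the same optimistic both-cores-fully-parallel assumption that the text flags via Amdahl's law.

```python
# Convert the measured cycle count to wall-clock time at 2.0 GHz per core.
# Treating 2 cores as 2x the cycle rate is the text's simplification.
cycles_per_second = 2 * 2_000_000_000   # 4e9 cycles/second
seconds = 29233 / cycles_per_second

print(f"{seconds * 1e6:.2f} microseconds")  # 7.31 microseconds
```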

Summary of Test


As mentioned before, these values are not exact. This computer is currently running 93 processes and around 100 system services, so the scheduler is surely allocating the processor to various other processes while executing this small piece of code, which increases the clock cycles recorded for all the operations. Instruction pipelining and multi-core processors complicate this further. However, it clearly and practically shows that the clock cycles required to finish various operations differ according to the type of operation. The summarised results of the test are:

Integer addition: approx. 28.7 clock cycles
Integer multiplication: approx. 64 clock cycles
Integer division: approx. 655 clock cycles
Floating-point addition: approx. 29233 clock cycles

Techniques used to improve the fetch-execute cycle

Improved Design and Algorithms

As discussed before, the time taken to execute an instruction depends on various underlying execution units such as the ALU, and the logic circuits used in such units directly affect the number of clock cycles taken to complete a fetch-execute cycle. Processor manufacturers introduce new processors, and new processing techniques with improved logic circuits, to provide better performance. However, there are limits in areas like ALU design: a major architectural change to a logic circuit can take a long time to design and implement.

Consider an arithmetic operation. Once all the operands and related data are sent to the ALU, it performs the required operation in a predefined sequence. This sequence is itself an algorithm known to solve the given problem. Take binary multiplication as an example: older CPUs used shifters and accumulators to sum each partial product to produce the answer, normally taking one clock cycle per partial product, which is excessive. Because of that, newer processor architectures follow improved techniques such as Wallace trees, the Baugh–Wooley algorithm and Dadda multipliers to add the partial products together in far fewer clock cycles. It is clear that improved designs and algorithms can improve the fetch-execute cycle.
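The older shift-and-add scheme can be sketched as follows, counting the steps that would classically each take a clock cycle; Wallace and Dadda trees compress all the partial products in far fewer gate delays than this serial loop implies.

```python
# Sketch of shift-and-add binary multiplication: one pass (and,
# classically, one clock cycle) per bit position of the multiplier.
def shift_add_mul(a, b):
    result, steps = 0, 0
    while b:
        if b & 1:          # this bit contributes a partial product
            result += a
        a <<= 1            # shift multiplicand for the next bit position
        b >>= 1            # consume one multiplier bit
        steps += 1
    return result, steps

# 13 * 11: multiplier 11 = 0b1011, so 4 bit positions are processed
print(shift_add_mul(13, 11))  # (143, 4)
```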

Improved Memory Management

As discussed before, in the decode step of the fetch-execute cycle the control unit issues control signals to load the required operands from various memory locations. Some of these locations might have been swapped to secondary storage, which requires reloading the swapped data into memory. Such operations take many more clock cycles than expected to complete decoding, because of the heavy I/O involved. Modern processors normally have two or more levels of cache memory, which can hold data that will be required in the near future so that extraction of data during the decode phase does not take as long. What remains in the cache is decided by cache replacement algorithms, and what is swapped to secondary storage is decided by the swapping algorithm. If cache memory and main memory are managed effectively, the fetch-execute cycle takes fewer clock cycles to complete. Therefore, one way to improve the fetch-execute cycle is to introduce new, more efficient memory management algorithms.
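One of the cache replacement algorithms mentioned above, least-recently-used (LRU), can be sketched briefly. The capacity and access trace below are illustrative; real caches work on lines and sets rather than single addresses.

```python
from collections import OrderedDict

# Sketch of an LRU replacement policy deciding "what remains in cache".
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # address -> data, oldest first

    def access(self, addr):
        """Return True on a hit, False on a miss (line is then loaded)."""
        if addr in self.lines:
            self.lines.move_to_end(addr)    # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[addr] = "data"
        return False

cache = LRUCache(2)
trace = [0x100, 0x200, 0x100, 0x300, 0x200]
hits = sum(cache.access(a) for a in trace)
print(hits)  # 1: only the second access to 0x100 hits
```

Every miss in the trace stands for the slow path the paragraph describes: a trip to main memory or, worse, to swapped-out data on secondary storage.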

Parallel Processing

A single processor can execute only one instruction at a time, in the sense that critical units such as the ALU cannot handle multiple operations simultaneously. What if a computer had two processors? The ALUs of the two processors are independent, so they can handle two executions in parallel. Having two processors might be expected to double the processing speed, but the speedup is not exactly double and does not increase linearly; Amdahl’s law and Gustafson’s law are laws that were introduced to provide mathematical models of this speedup. Having multiple processors is not a problem for a large workstation or a server, but it is less suitable for a home computer due to complex hardware and increased space consumption. Because of this, processor manufacturers have introduced multi-core processors, which have multiple processing units embedded in a single processor package. Modern processors are designed with 2, 4 or even 6 cores built in. It is clear that parallel processing and multi-core processors help improve the fetch-execute cycle.

Instruction Pipelining

With instruction pipelining, the fetch-execute cycle is divided into four stages, namely “Fetch”, “Decode”, “Execute” and “Write Back”. As discussed before, a given unit in the processor cannot act on multiple operations at the same time, but the control unit is capable of sending control signals to multiple independent units in a single clock cycle, so those units can work independently of each other. Using this ability, the control unit can queue up several instructions and distribute control signals so that each acts on a different unit in the same clock cycle. For example, Instruction 1 might be in its “Write Back” stage while Instruction 2 is being executed, Instruction 3 is being decoded, and Instruction 4 is being fetched from memory. As “Fetch”, “Decode”, “Execute” and “Write Back” do not conflict with each other, all of them can be performed in a single clock cycle for different instructions. It is clear that instruction pipelining also improves the fetch-execute cycle by working on a number of different instructions in a single clock cycle.
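The pipeline timetable described above can be sketched as a schedule: instruction i enters Fetch at cycle i, so its stage number s happens at cycle i + s. This idealised sketch assumes no hazards or stalls.

```python
# Sketch of a 4-stage pipeline schedule: with no hazards, 4 instructions
# complete in 7 cycles instead of the 16 a non-pipelined design needs.
STAGES = ["Fetch", "Decode", "Execute", "WriteBack"]

def pipeline_schedule(n_instructions):
    schedule = {}                      # (instruction, stage) -> clock cycle
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            schedule[(i, stage)] = i + s
    return schedule

sched = pipeline_schedule(4)
print(max(sched.values()) + 1)         # 7 cycles in total
# Cycle 3 shows the overlap: four different stages busy at once
print(sched[(0, "WriteBack")], sched[(1, "Execute")],
      sched[(2, "Decode")], sched[(3, "Fetch")])   # 3 3 3 3
```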

[1] Reese, R., 2011. ECE 3724/CS 3124 Microprocessors, Lab 12 notes. [ONLINE] Available at: http://www.ece.msstate.edu/~reese/EE3724/labs/lab12/desc.pdf [Accessed 17 February 2011].

[2] Flynn, M. J., 1995. Computer Architecture: Pipelined and Parallel Processor Design. [ONLINE] Available at: http://books.google.lk/books?id=JS-01OTl9dsC&lpg=PP1&ots=sswvTR8Q-Q&dq=.%20Computer%20Architecture%3A%20Pipelined%20and%20Parallel%20Processor%20Design&pg=PA671#v=onepage&q&f=false [Accessed 17 February 2011].
