Floating Point
CPSC 252 Computer Organization
Ellen Walker, Hiram College

Representing Non-Integers
– Often represented in decimal format
– Some require infinite digits to represent exactly
– With a fixed number of digits (or bits), many numbers are approximated
– Precision is a measure of the degree of approximation

Scientific Notation (Decimal)
• Format: m.mmmm x 10^eeeee
– Normalized = exactly 1 digit before the decimal point
• Mantissa (m) represents the significant digits
– Precision limited by the number of digits in the mantissa
• Exponent (e) represents the magnitude
– Magnitude limited by the number of digits in the exponent
– Exponent < 0 for numbers between 0 and 1

Scientific Notation (Binary)
• Format: 1.mmmm x 2^eeeee
– Normalized = 1 before the binary point
• Mantissa (m) represents the significant bits
– Precision limited by the number of bits in the mantissa
• Exponent (e) represents the magnitude
– Magnitude limited by the number of bits in the exponent
– Exponent < 0 for numbers between 0 and 1

Binary Examples
• 1/16 = 1.0 x 2^-4 (mantissa 1.0, exponent -4)
• 32.5 = 1.000001 x 2^5 (mantissa 1.000001, exponent 5)

Quick Decimal-to-Binary Conversion (Exact)
1. Multiply the number by a power of 2 big enough to get an integer
2. Convert this integer to binary
3. Place the binary point the appropriate number of bits (based on the power of 2 from step 1) from the right of the number

Conversion Example
• Convert 32.5 to binary
1. Multiply 32.5 by 2 (result is 65)
2. Convert 65 to binary (result is 1000001)
3. Place the binary point 1 bit from the right (result is 100000.1)
• Convert to binary scientific notation (result is 1.000001 x 2^5)
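The three-step exact conversion can be sketched in a few lines of Python. This is a toy illustration, assuming the caller picks a power of 2 large enough to make the scaled value an integer; the function name is ours, not from the slides.

```python
def to_binary_exact(x, power):
    """Convert x to a binary string by scaling with 2**power (step 1),
    converting the resulting integer to binary (step 2), and placing
    the binary point 'power' bits from the right (step 3)."""
    scaled = x * 2**power
    assert scaled == int(scaled), "pick a larger power of 2"
    bits = format(int(scaled), 'b')
    bits = bits.rjust(power + 1, '0')   # ensure a digit left of the point
    return bits[:-power] + '.' + bits[-power:] if power else bits

print(to_binary_exact(32.5, 1))   # 100000.1
```

From 100000.1 the binary scientific form 1.000001 x 2^5 follows by moving the binary point 5 places.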
Floating Point Representation
• Mantissa - m bits (unsigned)
• Exponent - e bits (signed)
• Sign (separate) - 1 bit
• Total = 1 + m + e bits
– Tradeoff between precision and magnitude
– Total bits fit into 1 or 2 full words

Implicit First Bit
• Remember the mantissa must always begin with "1."
• Therefore, we can save a bit by not actually representing the 1 explicitly.
• Example:
– Mantissa bits: 0001
– Mantissa: 1.0001

Offset Exponent
• Exponent can be positive or negative, but it's cleaner (for sorting) to use an unsigned representation
• Therefore, represent exponents as unsigned, biased by (2^(bits-1))-1: stored = actual + bias, so decode by adding -((2^(bits-1))-1)
• Examples: 8-bit exponent (bias 127)
– 00000001 = 1 + (-127) = -126
– 10000000 = 128 + (-127) = 1

IEEE 754 Floating Point Representation (Single)
• Sign (1 bit), Exponent (8 bits), Magnitude (23 bits)
– What is the largest value that can be represented?
– What is the smallest positive value that can be represented?
– How many "significant bits" can be represented?
• Values can be sorted using integer comparison
– Sign first
– Exponent next (sorted as unsigned)
– Magnitude last (also unsigned)

Double Precision
• Floating point number takes 2 words (64 bits)
• Sign is 1 bit
• Exponent is 11 bits (vs. 8)
• Magnitude is 52 bits (vs. 23)
– Last 32 bits of the magnitude are in the second word
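The single-precision fields can be inspected by reinterpreting a float's bits as an unsigned integer. A minimal sketch, assuming the 32.5 example above: its stored exponent should be 5 + 127 = 132, and its fraction field holds 000001 followed by zeros (the leading 1 is implicit).

```python
import struct

def ieee754_fields(x):
    # Reinterpret the 32-bit single-precision pattern as an unsigned int.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF         # 23 bits, implicit leading 1 omitted
    return sign, exponent, fraction

sign, exp, frac = ieee754_fields(32.5)
print(sign, exp - 127, bin(frac))   # 0 5 0b100000000000000000
```

Because the fields are laid out sign, then biased exponent, then fraction, comparing two such bit patterns as unsigned integers orders positive floats correctly, which is the sorting property the slide mentions.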
Floating Point Errors
• Overflow
– A positive exponent becomes too large for the exponent field
• Underflow
– A negative exponent becomes too negative for the exponent field
• Rounding (not actually an error)
– The result of an operation has too many significant bits for the fraction field

Special Values
• Infinity
– Result of dividing a non-zero value by 0
– Can be positive or negative
– Infinity +/- anything = Infinity
• Not A Number (NaN)
– Result of an invalid mathematical operation, e.g. 0/0 or Infinity - Infinity

Representing Special Values in IEEE 754
• Exponent ≠ 00, Exponent ≠ FF
– Ordinary floating point number
• Exponent = 00, Fraction = 0
– Number is 0
• Exponent = 00, Fraction ≠ 0
– Number is denormalized (leading 0. instead of 1.)
• Exponent = FF, Fraction = 0
– Infinity (+ or -, depending on sign)
• Exponent = FF, Fraction ≠ 0
– Not a Number (NaN)

Double Precision in MIPS
• Each even register can be considered a register pair for double precision
– High-order word in the even register
– Low-order word in the odd register

Floating Point Arithmetic in MIPS
• add.s, add.d, sub.s, sub.d [rd] [rs] [rt]
– Single- and double-precision addition / subtraction
– rd = rs +/- rt
• 32 floating point registers $f0 - $f31
– Used in pairs for double precision
– Registers for add.d (etc.) must be even-numbered
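The special-value encoding table can be turned into a small classifier. A sketch for single precision, reusing the bit-field extraction idea; the exponent values in the table are hex, so FF = 255.

```python
import struct

def classify(x):
    # Pull out the exponent and fraction fields of the single-precision pattern.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:
        return 'NaN' if frac else 'infinity'
    if exp == 0x00:
        return 'denormalized' if frac else 'zero'
    return 'ordinary'

print(classify(1.5), classify(0.0), classify(float('inf')),
      classify(float('nan')), classify(1e-45))
# ordinary zero infinity NaN denormalized
```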
Why Separate Floating Point Registers?
• Twice as many registers using the same number of instruction bits
• Integer & floating point operations are usually on distinct data
• Increased parallelism possible
• Customized hardware possible

Load / Store Floating Point Numbers
• lwc1 - 32-bit word to FP register
• swc1 - FP register to 32-bit word
• ldc1 - 2 words to FP register pair
• sdc1 - FP register pair to 2 words
• (Note: the last character is the number 1)

Floating Point Addition
• Align the binary points (make the exponents equal)
• Add the revised mantissas
• Normalize the sum

Changing Exponents for Alignment and Normalization
• To keep the number the same:
– Left-shift the mantissa by 1 bit and decrement the exponent
– Right-shift the mantissa by 1 bit and increment the exponent
• Align by right-shifting the smaller number
• Normalize by
– Shifting the result to put 1 before the binary point
– Rounding the result to the correct number of significant bits

Addition Example
• Add 1.101 x 2^4 + 1.101 x 2^5 (26 + 52)
• Align binary points: 1.101 x 2^4 = 0.1101 x 2^5
• Add mantissas:
    0.1101 x 2^5
  + 1.1010 x 2^5
   10.0111 x 2^5

Addition Example (cont.)
• Normalize: 10.0111 x 2^5 = 1.00111 x 2^6 (78)
• Round to a 3-bit mantissa: 1.00111 x 2^6 ~= 1.010 x 2^6 (80)

Rounding
• At least 1 bit beyond the last bit is needed
• Rounding up could require renormalization
– Example: 1.1111 -> 10.000
• For multiplication, 2 extra bits are needed in case the product's first bit is 0 and it must be left-shifted (guard, round)
• For complete generality, add a "sticky bit" that is set whenever additional bits to the right would be > 0

Round to Nearest Even
• Most common rounding mode
• If the actual value is exactly halfway between two representable values, round to an even result
• Examples:
– 1.0011 -> 1.010
– 1.0101 -> 1.010
• If the sticky bit is set, round up, because the value isn't really halfway between!
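The align / add / normalize / round steps can be modeled in a few lines. This is a toy sketch, not the MIPS hardware: significands are exact Fractions in [1, 2), Python's round() supplies round-half-to-even, and re-normalization after rounding (the 1.1111 -> 10.000 case) is omitted for brevity.

```python
from fractions import Fraction

def fp_add(ma, ea, mb, eb, frac_bits=3):
    """Add two values given as (significand, exponent) pairs."""
    if ea < eb:                          # align: right-shift the smaller number
        ma, ea = ma / 2**(eb - ea), eb
    elif eb < ea:
        mb, eb = mb / 2**(ea - eb), ea
    m, e = ma + mb, ea                   # add the aligned significands
    while m >= 2:                        # normalize: shift right, bump exponent
        m, e = m / 2, e + 1
    scale = 2**frac_bits                 # round to frac_bits fraction bits
    m = Fraction(round(m * scale), scale)
    return m, e

# 1.101 x 2^4 + 1.101 x 2^5  (26 + 52 = 78, which rounds to 80)
m, e = fp_add(Fraction(13, 8), 4, Fraction(13, 8), 5)
print(m, e, float(m * 2**e))   # 5/4 6 80.0
```

Note that the rounded result 5/4 x 2^6 = 80 reproduces the worked example: 78 is not representable with a 3-bit mantissa, and round-to-nearest-even lands on 1.010 x 2^6.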
Floating Point Addition Hardware
[Figure: addition datapath - a small ALU computes the exponent difference, a shifter aligns the smaller significand, a big ALU adds the significands, shift/increment-decrement logic normalizes, and rounding hardware finishes, with overflow/underflow exception checks.]
1. Compare the exponents of the two numbers; shift the smaller number to the right until its exponent matches the larger exponent
2. Add the significands
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent
4. Round the significand to the appropriate number of bits (if the result is no longer normalized, normalize again)

Floating Point Multiplication
1. Calculate the new exponent by adding the exponents together
2. Multiply the significands (using shift & add)
3. Normalize the product
4. Round
5. Set the sign

Adding Exponents
• Remember that exponents are biased
– Adding exponents adds 2 copies of the bias!
(exp1 + 127) + (exp2 + 127) = (exp1 + exp2 + 254)
• Therefore, subtract the bias from the sum and the result is a correctly biased value

Multiplication Example
• Convert 2.25 x 1.5 to binary floating point (3-bit exponent, 3-bit mantissa)
• 2.25 = 10.01 x 2^0 = 1.001 x 2^1
– Exp = 100 (because the bias is 3), Mantissa = 001
– 2.25 = 0 100 001
• 1.5 = 1.100 x 2^0
– Exp = 011, Mantissa = 100
– 1.5 = 0 011 100

1. Add Exponents
• 0 100 001 x 0 011 100
• Add the exponents (and subtract the bias): 100 + 011 - 011 = 100
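Step 1's biased-exponent addition can be checked with a tiny helper. A sketch of ours, not from the slides; exponent fields are unsigned integers and the bias is 2^(e-1) - 1, so 3 for the 3-bit example and 127 for single precision.

```python
def add_biased(e1, e2, exp_bits=3):
    # Both inputs already contain one copy of the bias, so the raw sum
    # contains two copies; subtract one to get a correctly biased result.
    bias = 2**(exp_bits - 1) - 1
    return e1 + e2 - bias

# 100 + 011 - 011 = 100 in the 3-bit example (bias = 3)
print(format(add_biased(0b100, 0b011), '03b'))   # 100
```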
2. Multiply Significands
• 0 100 001 x 0 011 100
• Remember to restore the leading 1
• Remember that the number of binary places doubles
      1.001
    x 1.100
  ---------
    0.100100
  + 1.001000
  ---------
    1.101100 x 2^1

Finish Up
• Product is 1.1011 x 2^1
• Already normalized
• But there are too many bits, so we need to round
• The nearest even result (rounding up) is 1.110
• Result: 0 100 110
• Value is 1.75 x 2 = 3.5

Types of Errors
• Overflow
– Exponent too large for the number of bits allotted
• Underflow
– Negative exponent too negative to fit in the number of bits
• Rounding error
– Mantissa has too many bits

Overflow and Underflow
• Addition
– Overflow is possible when adding two positive or two negative numbers
• Multiplication
– Overflow is possible when multiplying two numbers of large absolute value
– Underflow is possible when multiplying two numbers very close to 0

Limitations of Finite Floating Point Representations
• Gap between 0 and the smallest nonzero number
• Gaps between values when the last bit of the mantissa changes
• Fixed number of values between 0 and 1
• Significant effects of rounding in mathematical operations

Implications for Programmers
• Mathematical rules are not always followed
– (a / b) * b does not always equal a
– (a + b) + c does not always equal a + (b + c)
• Use inequality comparisons instead of directly comparing floating point numbers (with ==)
– if ((x > -epsilon) && (x < epsilon)) instead of if (x == 0)
– Epsilon can be set based on the problem or knowledge of the representation (e.g. single vs. double precision)
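The near-zero test above can be wrapped as a helper; a sketch in Python rather than the C-style condition on the slide, and the default epsilon is an arbitrary choice for illustration.

```python
def is_near(a, b, epsilon=1e-9):
    # Compare floats by distance rather than with ==.
    return abs(a - b) < epsilon

x = (0.1 + 0.2) - 0.3     # not exactly 0 in binary floating point
print(x == 0.0)           # False
print(is_near(x, 0.0))    # True
```

Neither 0.1, 0.2, nor 0.3 is exactly representable in binary, so the associativity and exact-equality failures the slide warns about show up even in this one-line computation.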
The Pentium Floating Point Bug
• To speed up division, a lookup table was used
• It was assumed that 5 elements of the table would never be accessed (and the hardware was optimized to make them 0)
• These table elements occasionally caused errors in bits 12 to 52 of floating point significands
• (see Section 3.8 for more)

A Marketing Error
• July 1994 - Intel discovers the bug, decides not to halt production or recall chips
• September 1994 - A professor discovers the bug, posts to the Internet (after attempting to inform Intel)
• November 1994 - Press articles; Intel says it will affect "maybe several dozen people"
• December 1994 - IBM disputes the claim and halts shipment of Pentium-based PCs
• Late December 1994 - Intel apologizes

The "Big Picture"
• Bits in memory have no inherent meaning. A given sequence can contain
– An instruction
– An integer
– A string of characters
– A floating point number
• All number representations are finite
• Finite arithmetic requires compromises
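The "no inherent meaning" point can be demonstrated by reading the same 32 bits several ways. A sketch using Python's struct module; the pattern 0x42020000 is the single-precision encoding of 32.5 from earlier in the notes.

```python
import struct

raw = bytes.fromhex('42020000')            # one 32-bit pattern
as_int = struct.unpack('>I', raw)[0]       # read as an unsigned integer
as_float = struct.unpack('>f', raw)[0]     # read as an IEEE 754 single
as_chars = raw.decode('latin-1')           # read as four characters
print(as_int, as_float, repr(as_chars))    # 1107427328 32.5 'B\x02\x00\x00'
```

The same four bytes are a large integer, the value 32.5, or the letter 'B' followed by three control characters; only the interpretation chosen by the program gives them meaning.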
