3_Num_Rep

3.1. NUMBER SYSTEMS

3.1.1. A REVIEW OF BASE 10 (DECIMAL) NOTATION [1]

The decimal or base-10 number system is so common to everyday life that few people understand its evolution or functions. In the decimal number system, all numbers can be represented by combinations of 10 unique digits (0, 1, 2, 3, 4, 5, 6, 7, 8, and 9) located in specific locations or places. Each position to the left is one power of 10 greater than the position to its right.

To determine the value represented by a sequence of digits in a decimal number, multiply each digit by its place value and add the products.

For example:

345.2110 = (3 * 10²) + (4 * 10¹) + (5 * 10⁰) + (2 * 10^-1) + (1 * 10^-2)

Although trivial, this example of using the decimal number system can be applied to all positional number systems. All (positional) number systems are built on sequences of digits whose positions correspond to powers of the base value (baseⁿ). By understanding how the decimal system works, it is possible to create and use alternative number systems and to convert from one number system to any other.

3.1.2. A REVIEW OF BASE 16 (HEXADECIMAL) NOTATION

Hexadecimal or base-16 is used extensively in computing ("hexa" refers to six and "decimal" mean ten). In the hexadecimal system, numbers are made up of combinations of 16 unique digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F), and positions have place values corresponding to powers of sixteen (16ⁿ). In the hexadecimal system, the first left position from the hex point (as opposed to the decimal point in base-10) has a place value of 16⁰ (16⁰=1), the position to the left has a place value of 16¹ (16¹=16), and so on. Numbers in the hexadecimal system are written with a subscript of 16 after the rightmost digit to distinguish them from numbers in other systems.

For example:

35B2.116 = (3 * 16³) + (5 * 16²) + (B * 16¹) + (2 * 16⁰) + (1 * 16^-1) =

35B2.116 = (3 * 16³) + (5 * 16²) + (11 * 16¹) + (2 * 16⁰) + (1 * 16^-1) =

35B2.116 = 1228810 + 128010 + 17610 + 210 + 0.062510 =

35B2.116 = 13476.062510

3.1.3. A REVIEW OF BASE 2 (BINARY) NOTATION

Another number system of great importance to computers and computer programming is the binary or base-2 number system. Numbers in this system are made up of combinations of only two unique digits, 0 and 1. Positions or place values in the binary system correspond to powers of two (2ⁿ), just as place values in the decimal system correspond to powers of ten (10ⁿ). In a binary number, the first left position from the binary point (as opposed to a decimal point in base-10) has a place value of 2⁰ (2⁰=1), the next position to the left has a place value of 2¹ (2¹=2), and so on. Numbers in the binary system have a subscript of 2 after the rightmost digit to distinguish them from numbers in other number systems (101011 binary system = 1010112, 256 decimal system = 25610).

For example:

101011.100012 = (1 * 2⁵) + (1 * 2³) + (1 * 2¹) + (1 * 2⁰) + (1 * 2^-1) + (1 * 2^-5) =

101011.100012 = 32 + 8 + 2 + 1 + 0.5 + 0.03125 =

101011.100012 = 43.5312510

Note that 43.5312510 is an approximation of 101011.100012. This is a problem in converting numbers with fractional parts between different number systems. Numbers that can be exactly stated in one system (101011.100012) must be rounded off in other systems (43.5312510). Problems with rounding real numbers are not discussed in this appendix. Integer numbers are always exact when transferring between number systems.

3.1.4. CONVERSION BETWEEN BINARY AND HEXADECIMAL

Conversion between binary and hexadecimal and back is a simple operation. To convert from binary to hexadecimal, group the bits into chunks of four, then apply the following conversion table:

Binary Hexadecimal Binary Hexadecimal
0000 0 1000 8
0001 1 1001 9
0010 2 1010 A
0011 3 1011 B
0100 4 1100 C
0101 5 1101 D
0110 6 1110 E
0111 7 1111 F

To convert from hexadecimal to binary, replace the hexadecimal numbers with the four bit binary equivalent in the table above.

For example:

1110000100012 = 1110 0001 00012 = E1116

F4A516 = 1111 0100 1010 01012 = 11110100101001012

3.1.5. GENERALIZED BASE CONVERSIONS

The number systems that we use and that we are all familiar with are positional systems. In a positional system, each digit position has a value called a weight associated with it. To obtain the value of such a number, each digit value is multiplied by the weight of its position and the value of the number is realized by the summation of all such products. We can formally represent such a number as:

(1)

Its values can be expressed as

(2)

where V(X) denotes the numerical value of X. R is called the radix or base of the number systems and each digit is such that

For example, the number The radix R is 8 and by equation (1) the value of the number can be found by

V(X)8 = (7)(8)² + (6)(8)¹ +(5)(8)⁰

The above example is for numbers without fractional parts. If we include fractional parts in the number, the number X can be formally expressed as.

and the value of the number X can be expressed as

So if we have the number then the value of this number can be found by equation (2)

When working with numbers it is sometimes useful to transform the number from one base or radix to another. Working with numbers in different bases requires the conversion of numbers from one base to another and involves some arithmetic. Normally in such a conversion the arithmetic that one performs is in the familiar base. Therefore, conversion from a familiar base to a foreign base involves a different algorithm than the one used to convert from a foreign base to the familiar one.

Conversion of numbers from a foreign base to the familiar base is relatively simple. It involves the substitution of the appropriate values into equation (2).

For most of us, it is easy to work and do arithmetic with numbers with a base of 10. Therefore, a number system with a base of 10 is a familiar base. Most of us cannot perform arithmetic well in a number system of base 5. That would be considered a foreign base.

Consider the conversion of to its base 10 equivalent. By equation (2) we have

or consider the number

Further consider

In the last example note that some numbers that are exact in one radix representation cannot be represented exactly (that is, in a finite number of digits) in some other base. This occurs when the reciprocal of the foreign base itself cannot be represented in a finite number of digits in the familiar base (ie. cannot be represented in a finite number of digits).

Conversion to a foreign base can be accomplished using the following methods.

Consider the integer number in some familiar base R' and consider how to compute its base R representation using arithmetic in base R' If we have the number and we want to convert it to then what we really have is the expression:

If one takes and divides by R in the familiar base R', then the result is a remainder and a quotient. This can be expressed as

We denote as the remainder and to be the quotient Q of the division operation. This leads us to the expression

This suggests a repeated process of division by R where each remainder yields a digit of the representation and each successive quotient is the dividend of the next step. This process terminates when a quotient of 0 is produced. Consider converting the number to base 5 representation using base 10 arithmetic. So and R=5. We need to determine

By our method we have

l l 63/5 remainder

quotient

12/5 remainder

quotient

2/5 remainder

quotient

The conversion of fractional numbers likewise proceeds from an interpretation of the fractional portion of the number. Recall that the value of a fractional number can be written formally as

Consider the number X such that X is 0.6356 with radix R equal to 10. By the previous expression the value of the number can be expressed as

If we multiplied V(X) by R we would obtain

If we let F denote the fractional part such that

then we can say We can notice that has an integer part, namely, and a fractional part . This suggests a repetitive process. If we multiply the number by the base R we would yield and and so on. This process terminates when a fractional part of 0 is achieved.

Consider the example to convert the number to its hexadecimal equivalent. 16 X(0.63671875) = 10.1875

The integer portion of the result is 10 (or A in hexadecimal notation). This number is The fractional portion of the number is 1875. This is

Multiply by 16 to get and

16 x (0.1875) = 3.0

so = 3 and = 0 , so the hexadecimal number has been found.

0.63671875 =

3.1.6. FACTORING OF NUMBERS

The binary system currently is firmly established as a fundamental system for representing data in electronic digital computers. It is, however inconvenient to represent large binary numbers in positional binary because of the large number of digits required.

Consider the following binary number

This number can be broken up into parts by factoring

Note that if we group the digits in groups of four they represent a digit to the base 16. That is we could let and

We can then say

So by dividing by the base we now have the binary number expressed as a

Notice that the way we arrived at the hexadecimal number was by grouping bits together.

Consider another binary number that we wish to convert to an octal (base 8)number. We have such that

then

Consider the number 10110011.Then = 1011 and = 0011. By grouping the number into groups of four bits we can easily find the hexadecimal equivalent of the number. = B and = 3 so we can say = B3.

Consider another number X such that = 10100010111011010001011101 if we want to convert the number to its base eight in octal equivalent we can factor the number so that we divide it into groups of three. If we do this we have

So we can say

In general, we can group the bits into groups of B bits and each B bitgroup can be considered as a digit of a radix representation of the number.

3.1.7. NEGATIVE NUMBER REPRESENTATION

The most familiar way for dealing with negative numbers (ie. the additive inverse of a number) of value X is as follows:

where

You will recall that this is the definition of a number that we established earlier. The additive inverse of X will be denoted by and its representation is:

This is obviously the negative number representation of X since and the property is fulfilled. Note that we can also say (for notation purposes)

Treating negative numbers in the above method is often not very convenient for computers. Two other methods of representing negative numbers are the radix complement method and the diminished radix-complement method.

Assume that we have an n digit, radix R number, X. X is a positive number. We can choose the following representation for the negative number X.

where

Note that this representation of X does not equal -X. X in this discussion is the radix complement of X. For example if X = , then R=10 and n=2. We can find X in the following way

Note that = 3 and = 1. If we represent a negative number in this manner we can now rewrite equation (1) so that it applies to both the negative and positive numbers. We can say

where , is 0 for positive numbers; 1 for negative numbers.

We can show that this indeed represents a negative number.

This gives the correct value for negative numbers represented in the radix complement form according to our previous definition. That is, the additive inverse property holds:

Thus, a negative number in this representation is defined by a 1 in the leftmost digit position, that is = 1. To obtain the value of a negative number, we simply take the radix complement of the value of the given negative number and put a minus sign in front of it.

Let us examine a variation of the radix complement method. This variation is designed to make the conversion from positive to negative numbers easier than the radix complement (where subtraction is required). ie. - V(X). In our previous definition we said . When X was a positive number had the value 0.

In the diminished complement representation of the negative number , the i'th digit is denoted to be

for i = -m, ..., n-1 and

The criterion that the negative number system must fulfill is that the sum of and must be a representation of 0, namely + = 0. We observe that

Therefore the number is represented by

Recall that the general form for formally representing a number in radix representation was

If we substitute the digits into the expression (4) we obtain

Hence the expression for evaluating the value of a number can be generalized to

As a result of the above equation, we have a simpler method for obtaining the radix complement of a number than performing minus the magnitude of the number. We first form diminished radix-complement (digit by digit) and add 1 to the least significant position.

In the binary number system, the radix-complement is called the two's complement while the diminished radix complement is known as the one's complement.

As an example consider representing the number as a negative number. (4 bits in length)

So subtract 10000 - 0011 = 1101 = . That is 1101 represents the negative number .

Another quick way to find the representation of a negative number is to find the one's complement and then add one to realize the two's complement.

Consider the previous example of how to represent as a negative binary number.

0011 =

1100 = one's complement of 3

1101 = two's complement of 3

So, 1101 is the binary representation of -3.

3.2. NUMERIC DATA REPRESENTATIONS

The following examples refer to data representation of FORTRAN data types on VAX computers except where noted. The principles apply to data representation on all computers.

3.2.1. LOGICAL

Logical*1 (byte)

- Can store values true (non zero) and false (zero)

3.2.2. INTEGERS

3.2.2.1. THE SIGN BIT

The sign bit, or high order bit is used to indicate the sign of the number. This bit is set to one when the value is negative and cleared to 0 when the value is positive.

3.2.2.2. INTEGER*2

- Can store values in the range -32768 to 32767

- Stored in two's complement notation

3.2.2.3. INTEGER*4 (DEFAULT FOR INTEGER)

- Can store values in the range -2147483648 to 2147483647

3.2.3. FLOATING POINT NUMBERS

3.2.3.1. REAL NUMBER REPRESENTATION

Real numbers are represented using a sequence of coded bits. The sequence is usually either 32 bits for single precision real numbers or 64 bits for double precision numbers. The 32 bit version is described below.

Each sequence of bits consists of three parts:

- a sign magnitude representation for the fraction,

- a exponent in excess 64 notation,

- a normalized hexadecimal fraction.

The most significant bit is the sign, the next bits are the exponent, and the remainder are the fraction. Meena (VAX) uses eight bits for the exponent and Max (IBM) uses only seven bit.

3.2.3.2. EXCESS N BINARY EXPONENT

An excess N binary exponent uses N, added to the exponent before it is stored. In this way the sign of the exponent does not need to be stored, because N is the value which causes the smallest possible exponent to be mapped to zero. Meena uses excess 128 notation and Max uses excess 64 notation.

3.2.3.3. EXCESS 64 NOTATION

Excess 64 notation is a technique for decoding the exponent to form an exponent. Excess 64 notation is used so that a sign bit for the exponent is not necessary. The exponent is created by adding decimal 64 to the exponent.

Since the exponent is 7 bits, the maximum number that can be stored is 127 and the minimum number that can be stored is 0. The formula for calculating the exponent is: EXPONENT - 64 = EXPONENT.

The maximum exponent that can be stored is 127 - 64 = 63 and the minimum exponent is 0 - 64 = -64.

For example:

69 (exponent) 5 (exponent)

- 64 (Excess 64 Notation) + 64 (Excess 64 Notation)

5 (exponent) 69 (exponent)

The exponent is expressed as a power of 16, not as a power of 10.

The exponent of 5 in the preceding example would therefore be expressed as 16⁵or 16*16*16*16*16 or 1048576.

3.2.3.4. SIGN MAGNITUDE REPRESENTATION

Sign magnitude representation omits the sign of the exponent in storage, not the sign of the fraction. This can be done as excess binary notation is used to handle the sign of the exponent. Sign magnitude representation is used for all floating point formats.

3.2.3.5. NORMALIZED FRACTION

A normalized fraction is one where the fraction is shifted until all the leading zeros and the first set bit are removed. To allow as many significant bits as possible to be stored, the leading zeros are not stored. Since the fraction is always shifted so that the first bit is a one, this bit can also be dropped.

The value zero is represented in floating point notation by an exponent of zero.

Normalized fractions are used in all floating point formats.

3.2.3.6. NORMALIZED HEXADECIMAL FRACTION

A normalized decimal fraction is one where the first decimal digit IS NOT zero. For example: 0.11210 is normalized (remember that these are fractions so all numbers are less than 1) while 0.011210 is not normalized. Numbers are normalized in order to save bits (as starting zeros can be included in the exponent) so that more numbers can be stored in the fraction.

The fraction is expressed in hexadecimal. Therefore a normalized hexadecimal fraction is one where the first hexadecimal digit IS NOT zero. Since a hexadecimal number is 4 bits, an easy way of determining whether or not a binary fraction is normalized is to check if there is one or more "1"s in any of the first four bits (the first hex digit). If "1"s exist, then the first digit is greater than zero and is normalized.

For example:

(a) 0.4516 = 0000.010001012 (normalized)

(b) 0.004516 = 0000.0000010001012 (not normalized)

In example (a) the first four bits are not zero (.0100, remember that these are fractions) so the number is normalized. In example (b) the first four bits are all zero (.0000) so the number is not normalized.

Numbers in the fraction are always normalized, so the "0000." is dropped to save bits. PASCAL will automatically normalize all numbers stored in Max.

3.2.3.7. REAL*4 (F_FLOATING) (DEFAULT FOR REAL)

- Can store values in the range .29 x 10^-38to 1.7 x 10 ^38.

- Precision is approximately seven decimal digits.

3.2.3.8. REAL*8 (D_FLOATING) (DEFAULT FOR DOUBLE PRECISION)

- Can store values in the range .29 x 10^-38to 1.7 x 10 ³⁸

^{- Precision is approximately sixteen decimal digits}

3.2.3.9. DECODING EXAMPLE

The following is an example of taking a number and completely decoding it into decimal:

The original number is:

x(0)x(1000001)x(01010100)x(00000000)x(00000000)

1. Fraction

- since one or more of the first four bits is "1" the number is normalized

- The number can be broken up into chunks of four for conversion to hexadecimal

0000.0101 0100 0000 0000 0000 00002

- Convert the numbers from binary to hexadecimal

0.54000016

2. Sign Magnitude Bit

- Since the sign bit is 0 we have a positive number so the fraction is

+0.54000016

3. Excess 64 Notation

- The exponent is

10000012

- The exponent can be converted to decimal

1*2⁶ + 0*2⁵ + 0*2⁴ + 0*2³ + 0*2² + 0*2¹ + 1*2⁰ = 64 + 1 = 6510

- Subtract 64 from the exponent to obtain the exponent

65-64 = 110

- This is to the power 16 so the exponent is

decimal 16¹

4. Deriving the final number

- The number is currently:

+0.54000016 * dec 16¹

- Since the fraction is in hexadecimal (base-16) and the exponent is to the power 16, the power on the exponent specifies how many places to shift the fraction point to the left or right. A positive value moves the point to the right while a negative value moves the point to the left.

- Since the power is +1, the fraction point is moved one digit to the right.

The number is: +5.416

3.2.3.10. COMPLEX*8 (F_FLOATING) (DEFAULT FOR COMPLEX)

- Can store values in the range .29 x 10 ^-38 to 1.7 x 10 ³⁸

- Precision is approximately seven decimal digits

3.2.4. CHARACTERS

Character data are stored one character per byte in consecutive bytes of memory. Each character is stored using is ASCII value which is a number between 0 and 255. For example if the letter 'a' is to be stored in memory the value 97 decimal is stored.

3.2.5. ARRAYS

Arrays are stored in memory as a linear sequence of values. A one dimensional array is stored with the first element in the first memory location and the last element in the last memory location of the sequence. Multidimensional arrays are stored in either column-major order, with the left most subscript varying fastest, or in row-major order, with the right most subscript varying fastest. FORTRAN is column-major while C and Pascal are row-major order.

The amount of memory needed for each array element is dependent on the type of the array. The array can be declared to have elements of any of the preceding data types.

example: 2 dimensional array using FORTRAN

3.2.6. FORTRAN STORAGE CLASSES

3.2.6.1. COMMON VARIABLES

A COMMON statement declares a contiguous block of storage for the variables which are declared part of the common block. When a common block is used by more than one program unit, the same block of memory is shared by the units when they are combined into one executable unit. If the common blocks are named, then the block used in each unit is the one with the same name. If the blocks are unnamed then they are assumed to match in the order stated.

The amount of memory devoted to a common block is the sum of the memory required for the symbol types in the block. If the symbol types declared for the common block in one program unit do not correspond to the symbol types declared in another program unit, unpredictable results will occur.

3.2.6.2. EQUIVALENCE STATEMENTS

The equivalence statement is used to associate two or more data types and names with the same storage location.

For example if the following equivalence statement was used, the result would be three different definitions for the one 32 bit word of memory as shown in the diagram.

Equivalence (integer_variable,real_variable,logical)

3.3. MACHINE DEPENDENCIES

3.3.1. PRECISION AND ROUND-OFF ERRORS

Precision is the number of bits that a computer uses to represent its floating point numbers. The more bits that are used, the more accurate a representation is going to be made. If the represented value has more decimal places than the machine's precision can handle, then the value is rounded off. When calculations are done, they use these approximated values. The difference between the "real" answer and the calculated one is due to round-off error. A correlation can be made between the machine precision and the round-off error. The greater the precision a computer has the smaller the resulting round-off error.

3.3.2. MAXIMUM INTEGER

The maximum integer is the largest positive number that can be stored in an integer variable. This depends on the word size and is therefore machine dependent. The largest positive number is where all the bits of a word are set to 1's except the sign bit. Adding one to this number will cause the number to become negative ie: the number will "roll over" like an odometer.

One way to find the maximum integer is to add one until the number is smaller than the previous one. If your word size is large this takes a long time . There is a more efficient method as seen in the flowchart below:

3.3.3. MACHINE EPSILON

Machine epsilon (E) is the smallest value that can be stored in a floating point value and when added to one is still greater than one (i.e. E+1>1). This is one measure of accuracy of floating point arithmetic.

To find machine epsilon you can use the following method that is flowcharted below: