Floating Point (Real Numbers)


Chapter 4: DSP Software

The encoding scheme for floating point numbers is more complicated than for fixed point. The basic idea is the same as used in scientific notation, where a mantissa is multiplied by ten raised to some exponent. For instance, 5.4321 × 10⁶, where 5.4321 is the mantissa and 6 is the exponent. Scientific notation is exceptional at representing very large and very small numbers. For example: 1.2 × 10⁵⁰, the number of atoms in the earth, or 2.6 × 10^-23, the distance a turtle crawls in one second, compared to the diameter of our galaxy. Notice that numbers represented in scientific notation are normalized so that there is only a single nonzero digit left of the decimal point. This is achieved by adjusting the exponent as needed.

Floating point representation is similar to scientific notation, except everything is carried out in base two, rather than base ten. While several similar formats are in use, the most common is ANSI/IEEE Std. 754-1985. This standard defines the format for 32 bit numbers called single precision, as well as 64 bit numbers called double precision. As shown in Fig. 4-2, the 32 bits used in single precision are divided into three separate groups: bits 0 through 22 form the mantissa, bits 23 through 30 form the exponent, and bit 31 is the sign bit. These bits form the floating point number, v, by the following relation:

$$v = (-1)^{S} \times M \times 2^{E-127}$$

The term (-1)^S simply means that the sign bit, S, is 0 for a positive number and 1 for a negative number. The variable, E, is the number between 0 and 255 represented by the eight exponent bits. Subtracting 127 from this number allows the exponent term to run from 2^-127 to 2^128. In other words, the exponent is stored in offset binary with an offset of 127.
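The bit layout described above can be inspected directly. The following is a minimal sketch in Python, assuming the standard `struct` module is used to obtain the 32 bit pattern; the function name `split_single` is hypothetical and chosen only for illustration.

```python
import struct

def split_single(x):
    """Return (S, E, stored mantissa bits) for x encoded in single precision."""
    # Pack x as a big-endian IEEE 754 single precision number, then reread
    # the same four bytes as an unsigned 32 bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 23 through 30, a number from 0 to 255
    mantissa_bits = bits & 0x7FFFFF   # bits 0 through 22
    return sign, exponent, mantissa_bits

# -6.5 is -1.625 x 2^2, so S = 1 and E = 2 + 127 = 129.
s, e, m = split_single(-6.5)
print("S =", s)
print("E =", e, "-> exponent", e - 127)
print("stored mantissa bits =", format(m, "023b"))
```

Subtracting 127 from the stored E recovers the true exponent, which is the offset binary scheme described in the text.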

The mantissa, M, is formed from the 23 bits as a binary fraction. For example, the decimal fraction 2.783 is interpreted as 2 + 7/10 + 8/100 + 3/1000. The binary fraction 1.0101 means 1 + 0/2 + 1/4 + 0/8 + 1/16. Floating point numbers are normalized in the same way as scientific notation, that is, there is only one nonzero digit left of the decimal point (called a binary point in base 2). Since the only nonzero digit that exists in base two is 1, the leading digit in the mantissa will always be a 1, and therefore does not need to be stored. Removing this redundancy allows the number to have an additional one bit of precision. The 23 stored bits, referred to by the notation m₂₂, m₂₁, m₂₀, …, m₀, form the mantissa according to:

$$M = 1.m_{22}m_{21}m_{20}m_{19} \cdots m_{2}m_{1}m_{0}$$

In other words, M = 1 + m₂₂2^-1 + m₂₁2^-2 + m₂₀2^-3…. If bits 0 through 22 are all zeros, M takes on the value of one. If bits 0 through 22 are all ones, M is just a hair under two, i.e., 2-2^-23.
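As a rough illustration of how the mantissa sum and the sign and exponent terms combine, here is a small Python sketch. The function names `mantissa_value` and `decode_single` are hypothetical, and the decoder assumes a normalized number (E between 1 and 254); the special bit patterns are not handled.

```python
def mantissa_value(mantissa_bits):
    """M = 1 + m22*2^-1 + m21*2^-2 + ... + m0*2^-23 for the 23 stored bits."""
    total = 1.0                                   # the implied leading 1
    for k in range(23):
        bit = (mantissa_bits >> (22 - k)) & 1     # m22 first, m0 last
        total += bit * 2.0 ** -(k + 1)
    return total

def decode_single(bits):
    """Rebuild v = (-1)^S * M * 2^(E-127) from a 32 bit pattern.

    Assumes a normalized number; E = 0 and E = 255 are special cases
    that this sketch does not treat.
    """
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    m = mantissa_value(bits & 0x7FFFFF)
    return (-1) ** s * m * 2.0 ** (e - 127)

print(mantissa_value(0))           # all mantissa bits zero: M is one
print(mantissa_value(0x7FFFFF))    # all mantissa bits one: just under two
print(decode_single(0xC0D00000))   # S = 1, E = 129, M = 1.625
```

The pattern 0xC0D00000 has S = 1, E = 129 and stored bits 101 followed by zeros, so M = 1 + 1/2 + 1/8 = 1.625 and the value is -1.625 × 2² = -6.5.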

Using this encoding scheme, the largest number that can be represented is: ±(2-2^-23)×2¹²⁸ = ±6.8 × 10³⁸. Likewise, the smallest number that can be represented is: ±1.0 × 2^-127 = ±5.9 × 10^-39. The IEEE standard reduces this range slightly to free bit patterns that are assigned special meanings. In particular, the largest and smallest numbers allowed in the standard are ±3.4 × 10³⁸ and ±1.2 × 10^-38 respectively. The freed bit patterns allow three special classes of numbers: (1) ±0 is defined as all of the mantissa and exponent bits being zero. (2) ±∞ is defined as all of the mantissa bits being zero, and all of the exponent bits being one. (3) A group of very small unnormalized numbers between ±1.2 × 10^-38 and ±1.4 × 10^-45. These are lower precision numbers obtained by removing the requirement that the leading digit in the mantissa be a one. Besides these three special classes, there are bit patterns that are not assigned a meaning, commonly referred to as NANs (Not A Number).
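The special classes can be told apart from the exponent and mantissa fields alone. The sketch below is one possible way to do this in Python; `classify_single` and `from_bits` are hypothetical helper names, and the NaN rule used (all exponent bits one with a nonzero mantissa) is taken from the IEEE 754 layout rather than from the text above.

```python
import struct

def classify_single(bits):
    """Name the class of a 32 bit single precision pattern."""
    e = (bits >> 23) & 0xFF      # eight exponent bits
    frac = bits & 0x7FFFFF       # 23 mantissa bits
    if e == 0:
        # All exponent bits zero: either zero or an unnormalized number.
        return "zero" if frac == 0 else "unnormalized"
    if e == 255:
        # All exponent bits one: infinity, or a NaN if the mantissa is nonzero.
        return "infinity" if frac == 0 else "NaN"
    return "normalized"

def from_bits(bits):
    """Interpret a 32 bit integer as an IEEE 754 single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

for pattern in (0x7F7FFFFF,   # largest normalized number, about 3.4e38
                0x00800000,   # smallest normalized number, about 1.2e-38
                0x00000001,   # smallest unnormalized number, about 1.4e-45
                0x80000000,   # negative zero
                0x7F800000):  # positive infinity
    print(format(pattern, "08X"), classify_single(pattern), from_bits(pattern))
```

Note that the largest and smallest normalized patterns use E = 254 and E = 1, since E = 255 and E = 0 are the values set aside for the special classes.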

The IEEE standard for double precision simply adds more bits to the single precision format. Of the 64 bits used to store a double precision number, bits 0 through 51 are the mantissa, bits 52 through 62 are the exponent, and bit 63 is the sign bit. As before, the mantissa is between one and just under two, i.e., M = 1 + m₅₁2^-1 + m₅₀2^-2 + m₄₉2^-3…. The 11 exponent bits form a number between 0 and 2047, with an offset of 1023, allowing exponents from 2^-1023 to 2¹⁰²⁴. The largest and smallest numbers allowed are ±1.8 × 10³⁰⁸ and ±2.2 × 10^-308, respectively. These are incredibly large and small numbers! It is quite uncommon to find an application where single precision is not adequate. You will probably never find a case where double precision limits what you want to accomplish.
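The same field extraction carries over to the 64 bit format with wider masks. This is a minimal Python sketch; `split_double` is a hypothetical name, and the comparison with `sys.float_info` assumes that Python's built-in `float` is an IEEE 754 double, which holds on common platforms but is not guaranteed by the language.

```python
import struct
import sys

def split_double(x):
    """Return (S, E, stored mantissa bits) for x encoded in double precision."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                        # bit 63
    exponent = (bits >> 52) & 0x7FF          # bits 52 through 62, from 0 to 2047
    mantissa_bits = bits & ((1 << 52) - 1)   # bits 0 through 51
    return sign, exponent, mantissa_bits

s, e, m = split_double(-6.5)
print("S =", s, " E =", e, "-> exponent", e - 1023)

# Largest and smallest normalized doubles, as reported by the interpreter
# (assumes the platform float is an IEEE 754 double).
print(sys.float_info.max)   # about 1.8e308
print(sys.float_info.min)   # about 2.2e-308
```

For -6.5 the stored exponent is 2 + 1023 = 1025, mirroring the single precision case where it was 2 + 127 = 129.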

The Scientist and Engineer's Guide to
Digital Signal Processing
By Steven W. Smith, Ph.D.
