Question 1

Why can't computers represent 0.1 exactly?

Accepted Answer

0.1 in binary is the repeating fraction 0.0001100110011…, analogous to 1/3 = 0.333… in decimal. It has no finite binary representation.

• IEEE 754 stores only 23 mantissa bits (float32) or 52 bits (float64), so the pattern is truncated and rounded.
• The nearest float32 value is 0.100000001490116119, close, but not exact.
• This is why 0.1 + 0.2 === 0.30000000000000004 in JavaScript, Python, Java, and C.

The fix for financial code is to work in integer units (e.g., cents) rather than fractional dollars.

Question 2

What is the difference between float32 and float64?

Accepted Answer

Both formats follow IEEE 754, but they allocate bits differently:

• float32 (single): 1 sign + 8 exponent + 23 mantissa = 32 bits. Approximately 7 significant decimal digits, range up to ≈ 3.4 × 10³⁸.
• float64 (double): 1 sign + 11 exponent + 52 mantissa = 64 bits. Approximately 15–16 significant digits, range up to ≈ 1.8 × 10³⁰⁸.
• JavaScript's number type is always float64; float32 is only accessible via Float32Array.

GPUs and ML workloads often prefer float32 (or float16) to maximise throughput at the cost of precision.

Question 3

What is the exponent bias and why is it needed?

Accepted Answer

The exponent field stores an unsigned integer, and the bias converts it to a signed actual exponent: actual = stored − bias.

• float32 bias = 127: stored 127 → actual exponent 0; stored 130 → actual +3; stored 124 → actual −3.
• float64 bias = 1023: stored 1023 → actual exponent 0.
• Exponent 0 (all zeros) is reserved for subnormals and ±zero.
• Exponent 255 / 2047 (all ones) is reserved for ±Infinity and NaN.

The biased representation means floating-point magnitudes can be compared using plain integer comparison, a useful hardware optimisation.

Question 4

What are NaN and the two infinities?

Accepted Answer

IEEE 754 uses the all-ones exponent field to encode three special categories:

• +Infinity: exponent = all 1s, mantissa = 0, sign = 0. Produced by overflow or 1 / 0.
• -Infinity: same but sign = 1. Produced by -1 / 0 or large negative overflow.
• NaN: exponent = all 1s, mantissa ≠ 0. Result of 0/0, ∞ − ∞, or sqrt(−1). NaN is the only value not equal to itself: NaN !== NaN is always true.

There are also two zeros: +0 (sign = 0) and -0 (sign = 1). They compare equal, but 1/(+0) = +∞ while 1/(−0) = −∞.

Question 5

What is a subnormal (denormalized) number?

Accepted Answer

When the exponent field is all zeros, the number is subnormal. The formula changes: the implicit leading 1 becomes a 0, giving 0.mantissa × 2^(1−bias) instead of the normal 1.mantissa × 2^(e−bias) . • Subnormals allow gradual underflow : precision reduces smoothly near zero instead of abruptly jumping to 0. • Smallest positive float32 subnormal: 2⁻¹⁴⁹ ≈ 1.4 × 10⁻⁴⁵ . • Smallest positive float32 normal: 2⁻¹²⁶ ≈ 1.18 × 10⁻³⁸ .

Question 6

How do I check if two floats are equal in code?

Accepted Answer

Never use === to compare floating-point results that came from arithmetic, rounding errors make exact equality unreliable. • Relative tolerance: |a − b| < ε × max(|a|, |b|), scales with the magnitude of the values. • Absolute tolerance: |a − b| < ε, use when values are expected to be near zero. • JavaScript machine epsilon: Number.EPSILON ≈ 2.22 × 10⁻¹⁶ . A practical tolerance for most scientific work is 1e-9 . • For.

Question 7

Why do GPUs use float16 or bfloat16?

Accepted Answer

Modern ML accelerators trade precision for throughput, packing more multiply-accumulate operations per second into the same silicon:

• float16 (IEEE half precision): 1 sign + 5 exponent + 10 mantissa bits. About 3 decimal digits of precision, range up to 65504.
• bfloat16 (Brain Float 16): 1 sign + 8 exponent + 7 mantissa bits. Same exponent range as float32, so it avoids gradient overflow in neural networks despite lower mantissa precision.
• float8: Emerging format in transformer inference (e.g., FP8-E4M3, FP8-E5M2) used on H100 GPUs.

This converter focuses on the standard float32 and float64 formats defined by IEEE 754-2008.

Question 8

How does the converter extract the exact bits?

Accepted Answer

All conversions run live in your browser using JavaScript typed arrays, no server round-trip. • float32: A Float32Array and a Uint32Array share the same ArrayBuffer . Writing the float value and reading the integer gives the raw bit pattern, identical to a C union or memcpy . • float64: Two Uint32Array elements hold the high and low 32-bit words, avoiding JavaScript's 53-bit integer precision limit for the full 64-bit.

Symbol	Name	Description
s	Sign bit	0 = positive, 1 = negative; determines the sign of the value
e	Stored exponent	Unsigned integer stored in the exponent field; e = actual_exp + bias
bias	Exponent bias	127 for 32-bit; 1023 for 64-bit; centres the exponent range around 0
e − bias	Actual exponent	True power of 2 used in the value formula; ranges from −126 to +127 (float32)
1.mant	Significand	Implicit leading 1 plus the 23 (or 52) explicit mantissa bits
ULP	Unit in Last Place	Smallest difference between two adjacent representable numbers near a given value
ε	Machine epsilon	Smallest number such that 1 + ε ≠ 1; 2⁻²³ for float32, 2⁻⁵² for float64
NaN	Not a Number	Result of undefined operations like 0/0 or √(−1); not equal to any value including itself

Property	float32 (single)	float64 (double)
Bits	32	64
Sign bits	1	1
Exponent bits	8	11
Mantissa bits	23	52
Bias	127	1023
Max finite value	≈ 3.4 × 10³⁸	≈ 1.8 × 10³⁰⁸
Min positive normal	≈ 1.18 × 10⁻³⁸	≈ 2.23 × 10⁻³⁰⁸
Decimal precision	≈ 7 significant digits	≈ 15–16 significant digits
Machine epsilon ε	2⁻²³ ≈ 1.19 × 10⁻⁷	2⁻⁵² ≈ 2.22 × 10⁻¹⁶
Memory	4 bytes	8 bytes

Language	float32 type	float64 type
JavaScript / TypeScript	Float32Array elements	number (all floats)
Python	numpy.float32	float (built-in)
C / C++	float	double
Java	float	double
Rust	f32	f64
Go	float32	float64
GLSL / HLSL (GPU)	mediump float	highp float / double

Floating Point Converter | IEEE 754 Binary

What Is the Floating Point Converter | IEEE 754 Binary?

Formula

How to Use

Example Calculation

Understanding Floating Point Converter | IEEE 754 Binary

float32 vs float64, When Does It Matter?

Why 0.1 + 0.2 ≠ 0.3

Special Values in Practice

Rounding Modes

IEEE 754 in Programming Languages

Frequently Asked Questions