Data Representation: The Complete Guide

Master bits, bytes, character encoding (ASCII, Unicode, UTF-8), integer and floating-point representation, data types, and understand how computers store and process all information

Introduction

Welcome to the most comprehensive data representation guide for 2026. Every piece of information a computer processes—text, numbers, images, sound, video—ultimately reduces to sequences of bits (0s and 1s). Understanding how data is represented is fundamental to becoming an effective programmer, debugging issues, and optimizing performance.

At a glance: 8 bits make 1 byte, Unicode has room for more than 1.1 million code points, and most modern CPUs work with 64-bit words.

Whether you're debugging a character encoding issue, optimizing memory usage, or understanding why floating-point math sometimes gives surprising results, this guide will give you the foundational knowledge to work confidently with data at every level.

What You'll Learn

This comprehensive guide covers bits and bytes, character encoding (ASCII, Unicode, UTF-8, UTF-16), integer representation (signed/unsigned, two's complement), IEEE 754 floating-point standard, programming data types, endianness (big-endian vs little-endian), memory layout and alignment, and practical applications across programming, networking, and storage.

Bits, Bytes & Words

All digital data starts with the bit—the smallest unit of information. Understanding how bits combine into larger units is the foundation of data representation.

Data Size Hierarchy

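Unit | Abbreviation | Bits | Bytes | Range of Values
Bit | b | 1 | 1/8 | 0 or 1
Nibble | - | 4 | 1/2 | 0-15 (one hex digit)
Byte | B | 8 | 1 | 0-255 (unsigned) / -128 to 127 (signed)
Word (x86 naming) | - | 16 | 2 | 0-65,535
Double Word | DWORD | 32 | 4 | 0-4,294,967,295
Quad Word | QWORD | 64 | 8 | 0-18,446,744,073,709,551,615

The Word, DWORD, and QWORD sizes follow x86 and Windows naming; on other architectures a "word" is the CPU's native register width.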
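
The ranges in the table follow directly from the bit width: n bits give 2^n distinct patterns. A minimal sketch of that arithmetic (the function names are illustrative, not from any library):

```python
def unsigned_range(bits: int) -> tuple[int, int]:
    """Smallest and largest values an unsigned integer of this width can hold."""
    return 0, 2**bits - 1


def signed_range(bits: int) -> tuple[int, int]:
    """Smallest and largest values of a two's complement integer of this width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for name, bits in [("nibble", 4), ("byte", 8), ("word", 16), ("dword", 32), ("qword", 64)]:
    low, high = unsigned_range(bits)
    print(f"{name:>6} ({bits:2d} bits): {low} to {high:,}")
```

Each additional bit doubles the number of representable values, which is why the 64-bit row is so much larger than the 32-bit one.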

Real-World Byte Comparisons

    # How much data is stored?
    1 character    = 1 byte      # ASCII text (e.g., 'A')
    1 emoji        = 4 bytes     # UTF-8 (e.g., 😀)
    1 pixel        = 3-4 bytes   # RGB (3) or RGBA (4)
    1 second audio = ~176 KB     # CD quality (44.1 kHz, 16-bit stereo, uncompressed)
    1 second video = ~2 MB       # 1080p at 30 fps, compressed (varies widely with bitrate)

    # Storage sizes (binary units):
    1 KiB = 1,024 bytes              # ~1/2 page of text
    1 MiB = 1,048,576 bytes          # ~1 minute of MP3
    1 GiB = 1,073,741,824 bytes      # ~9 minutes of video at ~2 MB per second
    1 TiB = 1,099,511,627,776 bytes  # ~250,000 photos at ~4 MB each
KB vs KiB: Know the Difference

KB (kilobyte) = 1,000 bytes (decimal, used by storage manufacturers). KiB (kibibyte) = 1,024 bytes (binary; many operating systems and tools count this way but still print "KB"). A "500 GB" hard drive therefore shows up as about 465 GiB. Always state which convention you are using!
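
The 500 GB versus 465 GiB gap is a unit conversion, not lost space. A small sketch of the calculation (the helper name is hypothetical):

```python
def decimal_and_binary_gigabytes(size_bytes: int) -> dict[str, float]:
    """Express the same byte count in decimal GB and binary GiB."""
    return {
        "GB": size_bytes / 1000**3,   # powers of 1,000
        "GiB": size_bytes / 1024**3,  # powers of 1,024
    }


drive_bytes = 500 * 1000**3  # a drive sold as "500 GB"
sizes = decimal_and_binary_gigabytes(drive_bytes)
print(f"{sizes['GB']:.0f} GB = {sizes['GiB']:.2f} GiB")
```

The difference grows with each prefix: about 2.4% at the kilo level, about 7% at giga, and about 9% at tera.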

Character Encoding

How do computers represent letters, numbers, and symbols? Character encoding maps characters to numeric codes (which are then stored as binary).

ASCII: The Foundation

ASCII (American Standard Code for Information Interchange) uses 7 bits to represent 128 characters: uppercase/lowercase letters, digits, punctuation, and control characters.

Character | Decimal | Binary | Hex | Category
'A' | 65 | 01000001 | 0x41 | Uppercase letter
'a' | 97 | 01100001 | 0x61 | Lowercase letter
'0' | 48 | 00110000 | 0x30 | Digit
' ' | 32 | 00100000 | 0x20 | Space
'\n' | 10 | 00001010 | 0x0A | Newline (LF)
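
The rows of this table can be regenerated with Python's built-in ord() and format specifiers. A short sketch (the describe helper is illustrative):

```python
def describe(ch: str) -> str:
    """Show one character's code in decimal, 8-bit binary, and hex."""
    code = ord(ch)
    return f"{ch!r:>6}  dec={code:<3}  bin={code:08b}  hex=0x{code:02X}"


for ch in ["A", "a", "0", " ", "\n"]:
    print(describe(ch))

# Upper- and lowercase ASCII letters differ only in bit 5 (0x20).
case_bit = ord("a") ^ ord("A")
print(hex(case_bit))
```

That single-bit difference between 'A' and 'a' is why case conversion on ASCII text can be done with one bitwise operation.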

Extended ASCII & ANSI

ASCII was extended to 8 bits (256 characters) to support accented letters, currency symbols, and graphics. However, different regions used different extensions (ISO-8859-1 for Western Europe, ISO-8859-5 for Cyrillic, etc.), creating compatibility problems.
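
The compatibility problem is easy to reproduce: the same byte means different characters under different 8-bit extensions. A minimal sketch using Python's built-in codecs (the byte 0xE9 is just a convenient example):

```python
raw = b"\xe9"  # one byte above the 7-bit ASCII range

western = raw.decode("iso-8859-1")   # Western European reading
cyrillic = raw.decode("iso-8859-5")  # Cyrillic reading
print(western, cyrillic)

# The same byte is not valid UTF-8 on its own, so decoding raises.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print("utf-8 decode failed:", exc.reason)
```

Nothing in the byte itself says which reading is correct, so the encoding has to travel with the data or be agreed on in advance.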

    # Python: character encoding in action
    ord('A')        # Returns 65 (ASCII/Unicode code point)
    chr(65)         # Returns 'A' (reverse: code point → character)
    hex(ord('A'))   # Returns '0x41'

    # Encoding a string to bytes
    text = "Hello"
    bytes_utf8 = text.encode('utf-8')
    print(bytes_utf8)         # b'Hello'
    print(list(bytes_utf8))   # [72, 101, 108, 108, 111]

    # Decoding bytes back to a string
    text = bytes_utf8.decode('utf-8')   # "Hello"
Encoding Errors Are Common

"UnicodeDecodeError" and "Mojibake" (garbled text) happen when bytes are decoded with the wrong encoding. Always explicitly specify encoding when reading/writing files: open('file.txt', encoding='utf-8').

Unicode & UTF-8

Unicode is the universal character standard that aims to represent every character from every writing system in the world—over 150,000 characters and growing.

Unicode vs UTF-8 vs UTF-16

Encoding | Size per Character | Code Points Covered | Best For
ASCII | 7 bits | 128 | English text, legacy systems
UTF-8 | 1-4 bytes (variable) | 1,112,064 | Web, email, most modern systems
UTF-16 | 2 or 4 bytes (variable) | 1,112,064 | Windows, Java, JavaScript
UTF-32 | 4 bytes (fixed) | 1,112,064 | Internal processing (simple indexing, but wastes space)

How UTF-8 Works

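UTF-8 is a variable-width encoding in which the byte count depends on the code point: U+0000 to U+007F (ASCII) takes 1 byte, U+0080 to U+07FF (accented Latin, Greek, Cyrillic, Hebrew, Arabic) takes 2 bytes, U+0800 to U+FFFF (most Chinese, Japanese, and Korean characters) takes 3 bytes, and everything above U+FFFF (most emoji and rarer scripts) takes 4 bytes.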

UTF-8 Byte Breakdown
'A' (ASCII, U+0041) → 1 byte: 0x41
'é' (Latin-1 Supplement, U+00E9) → 2 bytes: 0xC3 0xA9
'中' (Chinese, U+4E2D) → 3 bytes: 0xE4 0xB8 0xAD
'😀' (Emoji, U+1F600) → 4 bytes: 0xF0 0x9F 0x98 0x80
UTF-8 is backward compatible with ASCII and can encode every Unicode character.
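
The byte sequences above come from a fixed bit layout: a lead byte announces the length, and each continuation byte carries 6 payload bits behind a 10 prefix. The following is a teaching sketch of that layout, not a replacement for str.encode; the function name is hypothetical:

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand, following the UTF-8 bit layout."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for ch in "A\u00e9\u4e2d\U0001F600":
    encoded = utf8_encode_code_point(ord(ch))
    # Cross-check the hand-rolled result against Python's own encoder.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```

Because ASCII code points fall in the first branch and are emitted unchanged, any pure-ASCII file is already valid UTF-8.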

Common Encoding Pitfalls
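
A few pitfalls that come up repeatedly, sketched in Python. The sample strings are illustrative and not taken from a real bug report:

```python
import unicodedata

# Pitfall 1: decoding UTF-8 bytes with a single-byte codec gives mojibake.
original = "caf\u00e9"
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)
# If the wrong codec is known, the damage is often reversible.
repaired = garbled.encode("latin-1").decode("utf-8")

# Pitfall 2: the number of characters is not the number of bytes.
text = "na\u00efve \U0001F600"
char_count = len(text)
byte_count = len(text.encode("utf-8"))
print(char_count, byte_count)

# Pitfall 3: cutting a byte string can split a multi-byte character.
truncated = text.encode("utf-8")[:3]          # ends halfway through the 2-byte i-diaeresis
salvaged = truncated.decode("utf-8", errors="ignore")

# Pitfall 4: visually identical strings can use different code points.
composed = "\u00e9"        # single code point
decomposed = "e\u0301"     # 'e' followed by a combining acute accent
same_raw = composed == decomposed
same_normalized = unicodedata.normalize("NFC", decomposed) == composed
print(same_raw, same_normalized)
```

Normalizing to one form (NFC is a common choice) before comparing or storing text avoids the last problem.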

Best Practice: Always Use UTF-8

UTF-8 is the dominant encoding on the web and the default in many modern languages, tools, and databases, but not all of them, so always specify it explicitly. When in doubt, UTF-8 is the answer.

Integer Representation

How do computers store whole numbers? The answer involves binary representation, signed vs unsigned integers, and two's complement.

Unsigned vs Signed Integers

Type | 8-bit Range | 16-bit Range | 32-bit Range | 64-bit Range
Unsigned | 0 to 255 | 0 to 65,535 | 0 to ~4.29 billion | 0 to ~18.4 quintillion
Signed (Two's Complement) | -128 to 127 | -32,768 to 32,767 | about -2.15 billion to 2.15 billion | about -9.22 quintillion to 9.22 quintillion

Two's Complement Explained

    # How two's complement works (8-bit example):
    # To represent -5:
    #   1. Start with +5:  00000101
    #   2. Invert bits:    11111010
    #   3. Add 1:          11111011
    # Result: -5 = 11111011 (0xFB in hex)

    # Python: limits of an 8-bit two's complement integer
    int8_max = 127    # 01111111
    int8_min = -128   # 10000000 (note: one more negative value than positive!)

    # The MSB (most significant bit) is the sign bit:
    #   0xxxxxxx = zero or positive
    #   1xxxxxxx = negative

    /* C: explicit signed/unsigned types */
    #include <stdint.h>
    int8_t  signed_val   = -5;    /* Signed: -128 to 127 */
    uint8_t unsigned_val = 250;   /* Unsigned: 0 to 255 */
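
The invert-and-add-one recipe is equivalent to masking the value to the target width. A minimal sketch of both directions (function names are illustrative):

```python
def to_twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's complement bit pattern of value at the given width."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise OverflowError(f"{value} does not fit in {bits} signed bits")
    # Masking maps negative values onto the upper half of the unsigned range.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern: str) -> int:
    """Interpret a string of 0s and 1s as a two's complement integer."""
    raw = int(pattern, 2)
    # A leading 1 means the value is negative: subtract 2**width.
    return raw - (1 << len(pattern)) if pattern[0] == "1" else raw


print(to_twos_complement(-5))            # 11111011
print(from_twos_complement("11111011"))  # -5
```

One practical benefit of this representation is that addition works the same way for signed and unsigned values, so hardware needs only one adder.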

Integer Overflow in Practice

    # Python: no overflow (arbitrary-precision integers)
    x = 2**1000   # Works! Python handles big integers automatically

    # Java: fixed-size integers wrap around silently
    #   Integer.MAX_VALUE     =  2,147,483,647
    #   Integer.MAX_VALUE + 1 = -2,147,483,648 (wraparound!)
    # C: unsigned overflow wraps, but signed overflow is undefined behavior

    // JavaScript: Number is a 64-bit float
    // Safe integer range: ±9,007,199,254,740,991 (2^53 - 1)
    Number.MAX_SAFE_INTEGER      // 9007199254740991
    BigInt("9007199254740993")   // Use BigInt (from a string or an "n" literal) for larger integers
Overflow Has Caused Real Disasters

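The first Ariane 5 rocket was destroyed about 37 seconds into its flight after software converted a 64-bit floating-point value to a 16-bit signed integer and the value did not fit ($370M loss). In 2015, the FAA warned that Boeing 787 generator control units could shut down after about 248 days of continuous power because an internal counter overflowed. Always validate input ranges and use safe arithmetic.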
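
Python's own integers never overflow, but fixed-width behavior can be simulated to see what a wrapping 32-bit type would do and what a checked alternative looks like. A sketch under that assumption (function names are hypothetical):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce a Python int to what a wrapping 32-bit signed type would hold."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return value - 2**32 if value >= 2**31 else value


def checked_add_int32(a: int, b: int) -> int:
    """Add two values, raising instead of wrapping when the result does not fit."""
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"{a} + {b} does not fit in 32 bits")
    return result


print(wrap_int32(INT32_MAX + 1))  # -2147483648
```

Checked arithmetic turns a silent wrong answer into a visible error, which is usually the safer failure mode.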

Floating-Point Numbers

Not all numbers are whole. Computers represent fractional numbers using the IEEE 754 floating-point standard.

IEEE 754 Structure

Format | Total Bits | Sign Bit | Exponent Bits | Mantissa Bits | Precision
Single (float) | 32 | 1 | 8 | 23 | ~7 decimal digits
Double (double) | 64 | 1 | 11 | 52 | ~15 decimal digits
Half (float16) | 16 | 1 | 5 | 10 | ~3 decimal digits
Quad (float128) | 128 | 1 | 15 | 112 | ~34 decimal digits
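
The three fields of a double can be pulled apart by reinterpreting its 8 bytes as a 64-bit integer. A minimal sketch using the standard struct module (the function names are illustrative, and the rebuild formula applies to normal numbers only):

```python
import struct


def decompose_double(x: float) -> tuple[int, int, int]:
    """Split a 64-bit float into (sign, biased exponent, mantissa) fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 explicit fraction bits
    return sign, exponent, mantissa


def rebuild_normal(sign: int, exponent: int, mantissa: int) -> float:
    """Inverse of decompose_double for normal numbers (exponent 1..2046)."""
    return (-1) ** sign * (1 + mantissa / 2**52) * 2.0 ** (exponent - 1023)


# -6.5 = -1.625 x 2^2, so the stored exponent is 2 + 1023 = 1025.
print(decompose_double(-6.5))
print(rebuild_normal(*decompose_double(0.1)) == 0.1)
```

The leading 1 of the significand is implied rather than stored, which is why 52 mantissa bits give 53 bits of precision.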

The Floating-Point Surprise

    # The classic floating-point "bug":
    0.1 + 0.2   # = 0.30000000000000004 (NOT 0.3!)

    # Why? 0.1 in binary is a repeating fraction:
    #   0.0001100110011001100110011001100110011... (repeats forever)
    # The value must be rounded to 53 significant bits, causing tiny precision errors

    # Python: solutions
    from decimal import Decimal
    Decimal('0.1') + Decimal('0.2')   # = Decimal('0.3') ✓

    import math
    math.isclose(0.1 + 0.2, 0.3)      # = True ✓

    # Avoid comparing computed floats with ==
    # Use a tolerance-based comparison instead (relative, with an absolute floor near zero):
    def float_eq(a, b, rel_tol=1e-9, abs_tol=1e-12):
        return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

Special Floating-Point Values
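
IEEE 754 reserves bit patterns for infinity, NaN (not a number), signed zero, and subnormal numbers. A short sketch of how they behave in Python (variable names are illustrative):

```python
import math

pos_inf = float("inf")
nan = float("nan")
neg_zero = -0.0

overflowed = 1e308 * 10        # too large for a double: becomes inf
undefined = pos_inf - pos_inf  # no meaningful result: becomes nan
underflowed = 5e-324 / 2       # below the smallest subnormal: becomes 0.0

print(overflowed, undefined, underflowed)
print(nan == nan)                     # False: NaN is not equal to anything, itself included
print(math.isnan(nan))                # True: the reliable way to test for NaN
print(neg_zero == 0.0)                # True: the two zeros compare equal
print(math.copysign(1.0, neg_zero))   # -1.0: but the sign bit is still there
```

Because NaN never compares equal, a check such as x == float("nan") is always false; math.isnan is the tool for that job.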

Floating-Point Best Practices

• Use double (64-bit) by default; use float (32-bit) only when memory or bandwidth is constrained
• Avoid == on computed floats; compare with a tolerance (for example, math.isclose in Python)
• For financial calculations, use fixed-point or decimal types
• Expect rounding errors to accumulate across repeated operations
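
A brief sketch of the last three points in Python, using only the standard library (the price is a made-up example value):

```python
import math
from decimal import Decimal

# Rounding errors accumulate: ten additions of 0.1 do not reach exactly 1.0.
total = 0.0
for _ in range(10):
    total += 0.1
print(total)                     # 0.9999999999999999
print(total == 1.0)              # False
print(math.isclose(total, 1.0))  # True

# math.fsum tracks the lost low-order bits and returns a correctly rounded sum.
exact_sum = math.fsum([0.1] * 10)
print(exact_sum)                 # 1.0

# Decimal keeps base-10 fractions exact, which suits money.
price = Decimal("19.99")         # hypothetical price
line_total = price * 3
print(line_total)                # 59.97
```

Constructing Decimal from a string matters: Decimal(0.1) would inherit the binary rounding error of the float literal.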

Data Types in Programming

Programming languages provide data types that determine how values are stored, what operations are allowed, and how much memory is used.

Common Data Types Across Languages

Type | Python | JavaScript | Java | C | Typical Size
Boolean | bool | boolean | boolean | _Bool | 1 byte in C; runtime-dependent elsewhere
Integer | int (arbitrary precision) | number (64-bit float) | int (32-bit) | int (typically 32-bit) | 2-8 bytes for fixed-width types
Floating-point | float (64-bit) | number (64-bit) | double (64-bit) | double (64-bit) | 4 bytes (float) or 8 bytes (double)
String | str (Unicode) | string (UTF-16) | String (UTF-16) | char* (null-terminated) | Variable
Array/List | list | Array | T[] (e.g., int[]) | T[] (fixed length) | Variable

Static vs Dynamic Typing

    # Dynamic typing (Python): type determined at runtime
    x = 42        # x is an integer
    x = "hello"   # x is now a string (no error!)

    // Static typing (Java): type declared and enforced at compile time
    int x = 42;      // x is always an int
    // x = "hello";  // Compile error! Type mismatch

    # Modern approach: type hints + static checking
    # Python type hints (checked by tools such as mypy, not enforced at runtime):
    def add(a: int, b: int) -> int:
        return a + b

    // TypeScript: static types for JavaScript
    let age: number = 25;   // age must be a number
    // age = "twenty";      // Compile error!

Memory Size of Common Types

    # Python: check memory usage (64-bit CPython; exact values vary by version)
    import sys
    sys.getsizeof(0)         # ~24-28 bytes (Python int object overhead)
    sys.getsizeof(10**100)   # 72 bytes (arbitrary precision: more digits, more bytes)
    sys.getsizeof("hello")   # ~46-54 bytes (string overhead + characters)

    # C: sizeof operator (typical values; only sizeof(char) == 1 is guaranteed)
    # sizeof(char)   = 1 byte
    # sizeof(short)  = 2 bytes
    # sizeof(int)    = 4 bytes (typically)
    # sizeof(long)   = 8 bytes (64-bit Linux/macOS; 4 bytes on 64-bit Windows)
    # sizeof(float)  = 4 bytes
    # sizeof(double) = 8 bytes
    # sizeof(void*)  = 8 bytes (64-bit pointer)
Choosing the Right Type

• Use the smallest type that fits your data (saves memory)
• Use signed for values that can be negative, unsigned for counts/indices
• Use double for floating-point unless memory is tight
• Use fixed-width types (int32_t, uint64_t) for portability
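
Fixed-width types can be explored from Python through the standard ctypes module, which exposes C types with guaranteed widths. A small sketch (the FIXED_WIDTH table and variable names are illustrative):

```python
import ctypes

# Labels are spelled out because ctypes aliases these types to platform
# names (for example, c_int32 may report itself as c_int).
FIXED_WIDTH = {
    "int8": ctypes.c_int8,
    "uint8": ctypes.c_uint8,
    "int16": ctypes.c_int16,
    "int32": ctypes.c_int32,
    "uint64": ctypes.c_uint64,
    "float": ctypes.c_float,
    "double": ctypes.c_double,
}

sizes = {label: ctypes.sizeof(ctype) for label, ctype in FIXED_WIDTH.items()}
for label, size in sizes.items():
    print(f"{label:<7} {size} byte(s)")

# ctypes performs no range checking, so out-of-range values wrap silently.
wrapped_unsigned = ctypes.c_uint8(256).value  # 0
wrapped_signed = ctypes.c_int8(200).value     # -56
print(wrapped_unsigned, wrapped_signed)
```

The silent wrap in the last two lines is the cost of picking the smallest type: the range check becomes the programmer's job.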

Endianness: Byte Order Matters

When storing multi-byte values (like a 32-bit integer), which byte goes first? This is endianness.

Big-Endian vs Little-Endian

Format | Byte Order | Example (0x12345678) | Used By
Big-Endian | Most significant byte first | 12 34 56 78 | Network protocols, Java (class files, DataOutputStream), Motorola 68k
Little-Endian | Least significant byte first | 78 56 34 12 | x86/x64 CPUs, Windows, ARM (configurable, usually little-endian)

Why Endianness Matters

    # The value 0x12345678 stored differently:
    # Big-endian:    [12] [34] [56] [78]   ← Reads left-to-right like humans
    # Little-endian: [78] [56] [34] [12]   ← "Backwards" but efficient for CPU

    # Python: check your system's endianness
    import sys
    print(sys.byteorder)   # 'little' on most modern systems

    # Python: explicit byte order conversion
    import struct
    value = 0x12345678
    big_endian_bytes = struct.pack('>I', value)      # b'\x12\x34\x56\x78'
    little_endian_bytes = struct.pack('<I', value)   # b'\x78\x56\x34\x12'

    # Network byte order is ALWAYS big-endian
    # Use htonl(), htons(), ntohl(), ntohs() in C for network programming

Endianness in Practice
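
A sketch of what byte order looks like in everyday code: converting integers to bytes, the effect of reading them back with the wrong order, and a length-prefixed message in network byte order. The frame and unframe helpers are a hypothetical framing scheme, not a real protocol:

```python
import struct

value = 0x12345678
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")
print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12

# Reading big-endian bytes as little-endian silently yields a different number.
misread = int.from_bytes(big, byteorder="little")
print(hex(misread))     # 0x78563412


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length as a big-endian (network order) uint16."""
    return struct.pack("!H", len(payload)) + payload


def unframe(message: bytes) -> bytes:
    """Read the uint16 length prefix and return exactly that many payload bytes."""
    (length,) = struct.unpack_from("!H", message, 0)
    return message[2:2 + length]


print(frame(b"hello"))
```

Stating the byte order in every pack and unpack call ("!" or ">" or "<") keeps the code correct regardless of the machine it runs on.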

Endianness Bugs

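Reading little-endian data as if it were big-endian (or vice versa) produces garbage values, typically because the code assumed the host's native byte order. Prefer serialization formats that define byte order for you (Protocol Buffers, MessagePack) or sidestep it by being text-based (JSON), and state the byte order explicitly whenever you pack or unpack binary data yourself.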

Memory Layout & Alignment

How data is arranged in memory affects performance. Alignment means each value starts at an address that is a multiple of its type's alignment requirement, which for primitive types usually equals the type's size.

Memory Alignment Rules

Data Type | Size | Alignment Requirement | Valid Addresses
char | 1 byte | 1-byte aligned | Any address
short | 2 bytes | 2-byte aligned | Even addresses (0, 2, 4...)
int / float | 4 bytes | 4-byte aligned | Addresses divisible by 4
double / long (on LP64 systems) | 8 bytes | 8-byte aligned | Addresses divisible by 8

Struct Padding Example

    /* C: struct padding and packing */
    struct BadLayout {
        char a;   /* 1 byte */
                  /* 3 bytes padding (to align b) */
        int b;    /* 4 bytes */
        char c;   /* 1 byte */
                  /* 3 bytes padding (to make struct size a multiple of 4) */
    };            /* Total: 12 bytes (only 6 bytes of data!) */

    struct GoodLayout {
        int b;    /* 4 bytes */
        char a;   /* 1 byte */
        char c;   /* 1 byte */
                  /* 2 bytes padding */
    };            /* Total: 8 bytes (same data, 33% less memory!) */

    # Python: the struct module aligns members in native mode, but it adds
    # trailing padding only when the format ends with a zero-count code
    import struct
    struct.calcsize('cic0i')   # = 12 (BadLayout)
    struct.calcsize('icc0i')   # = 8  (GoodLayout)
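
The padding arithmetic in the two structs above can be reproduced with a small layout calculator. This is a sketch of the usual C layout rule (align each member, then round the total up to the largest alignment); real compilers may differ, and the layout function is hypothetical:

```python
import ctypes


def layout(fields: list[tuple[str, int, int]]) -> tuple[dict[str, int], int]:
    """Compute member offsets and total size from (name, size, alignment) triples."""
    offset = 0
    max_align = 1
    offsets = {}
    for name, size, align in fields:
        offset = (offset + align - 1) // align * align  # round up to the member's alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align  # trailing padding
    return offsets, total


bad = [("a", 1, 1), ("b", 4, 4), ("c", 1, 1)]
good = [("b", 4, 4), ("a", 1, 1), ("c", 1, 1)]
print(layout(bad))   # ({'a': 0, 'b': 4, 'c': 8}, 12)
print(layout(good))  # ({'b': 0, 'a': 4, 'c': 5}, 8)


# Cross-check against the platform's C ABI (typically 12 where int is 4 bytes).
class BadLayout(ctypes.Structure):
    _fields_ = [("a", ctypes.c_char), ("b", ctypes.c_int), ("c", ctypes.c_char)]


print(ctypes.sizeof(BadLayout))
```

Trailing padding exists so that every element of an array of structs stays aligned, not just the first one.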

Cache Lines & Performance

Memory Optimization Tips

• Order struct members by size (largest first) to minimize padding
• Use __attribute__((packed)) (a GCC/Clang extension) in C when an exact binary layout matters, but expect a performance cost from unaligned access
• Prefer arrays over linked lists for cache-friendly access
• Profile memory usage before optimizing—don't guess

Practical Applications

Data representation concepts appear everywhere in programming. Here are common scenarios where this knowledge matters.

Common Use Cases

File I/O

Reading binary files requires understanding byte order, padding, and data types.

Example: Parsing PNG, JPEG, or custom binary formats

Network Protocols

Network byte order (big-endian), serialization, and protocol parsing.

Example: TCP/IP headers, HTTP, Protocol Buffers

Database Storage

How databases store numbers, strings, and dates on disk.

Example: Fixed-width vs variable-length fields

Cryptography

Encryption operates on bytes. Understanding byte-level operations is essential.

Example: AES, RSA, hash functions all work on byte arrays

Real-World Example: Parsing a Binary File

    # Python: reading a BMP file header (little-endian format)
    import struct

    with open('image.bmp', 'rb') as f:
        # BMP file header: 14 bytes
        signature = f.read(2)                             # b'BM' (file signature)
        file_size = struct.unpack('<I', f.read(4))[0]     # Little-endian uint32
        reserved = f.read(4)                              # Reserved (ignore)
        data_offset = struct.unpack('<I', f.read(4))[0]

        # DIB header: 40 bytes (BITMAPINFOHEADER)
        header_size = struct.unpack('<I', f.read(4))[0]
        width = struct.unpack('<i', f.read(4))[0]         # Signed int32
        height = struct.unpack('<i', f.read(4))[0]
        planes = struct.unpack('<H', f.read(2))[0]        # uint16
        bits_per_pixel = struct.unpack('<H', f.read(2))[0]

    print(f"BMP: {width}x{height}, {bits_per_pixel}bpp, {file_size} bytes")

    # Key lessons:
    # - '<' means little-endian, '>' means big-endian
    # - 'I' = unsigned int (4 bytes), 'i' = signed int (4 bytes)
    # - 'H' = unsigned short (2 bytes)
    # - Always read binary files in 'rb' mode
Serialization Best Practices

Don't manually parse binary formats unless necessary. Use established libraries: json for text, struct or construct for binary, Protocol Buffers or MessagePack for efficient cross-platform serialization.
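
The field-by-field reads in the BMP example can also be expressed as two precompiled struct.Struct formats, which validates sizes up front and is easy to test without a file. A sketch under stated assumptions: the parse_bmp_header function, the dictionary keys, and the synthetic sample bytes are all illustrative:

```python
import struct

# '<' selects little-endian with no padding, so sizes are exactly 14 and 16 bytes.
FILE_HEADER = struct.Struct("<2sIHHI")  # signature, file size, 2 reserved fields, data offset
INFO_HEADER = struct.Struct("<IiiHH")   # header size, width, height, planes, bits per pixel


def parse_bmp_header(data: bytes) -> dict:
    """Parse the start of a BMP file held in memory; raise ValueError if malformed."""
    if len(data) < FILE_HEADER.size + INFO_HEADER.size:
        raise ValueError("data too short for a BMP header")
    signature, file_size, _res1, _res2, data_offset = FILE_HEADER.unpack_from(data, 0)
    if signature != b"BM":
        raise ValueError("not a BMP file")
    header_size, width, height, planes, bpp = INFO_HEADER.unpack_from(data, FILE_HEADER.size)
    return {
        "file_size": file_size,
        "data_offset": data_offset,
        "header_size": header_size,
        "width": width,
        "height": height,
        "planes": planes,
        "bits_per_pixel": bpp,
    }


# Synthetic header for a 2x2, 24-bit image: 54 header bytes + 2 rows of 8 bytes.
sample = FILE_HEADER.pack(b"BM", 70, 0, 0, 54) + INFO_HEADER.pack(40, 2, 2, 1, 24)
info = parse_bmp_header(sample)
print(info)
```

Checking the signature and length before unpacking gives a clear error for truncated or mislabeled files instead of a confusing struct.error later on.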

Conclusion

Data representation is the invisible foundation of all computing. From the single bit that stores a boolean to the complex UTF-8 encoding that represents emojis, understanding how data is stored and processed makes you a more effective programmer.

Key Takeaways

Next Steps

  1. Practice encoding: Experiment with encode()/decode() in Python
  2. Read binary files: Try parsing a simple format (BMP, WAV, or PNG header)
  3. Understand your language: Check how your language handles integers, floats, and strings
  4. Explore serialization: Compare JSON, Protocol Buffers, and MessagePack
  5. Profile memory: Use tools to understand your program's memory usage

Programs must be written for people to read, and only incidentally for machines to execute.

— Harold Abelson, Structure and Interpretation of Computer Programs
Try This Now

Open your terminal. Type python3 -c "print('😀'.encode('utf-8'))". See how a single emoji becomes 4 bytes. That's data representation in action.

Thank you for reading this comprehensive data representation guide. Whether you're parsing binary files, debugging encoding issues, or optimizing memory usage, understanding how data is represented will make you a more confident and capable developer. Keep learning, keep experimenting, and keep building!