Introduction
Welcome to the most comprehensive data representation guide for 2026. Every piece of information a computer processes—text, numbers, images, sound, video—ultimately reduces to sequences of bits (0s and 1s). Understanding how data is represented is fundamental to becoming an effective programmer, debugging issues, and optimizing performance.
Whether you're debugging a character encoding issue, optimizing memory usage, or understanding why floating-point math sometimes gives surprising results, this guide will give you the foundational knowledge to work confidently with data at every level.
This comprehensive guide covers bits and bytes, character encoding (ASCII, Unicode, UTF-8, UTF-16), integer representation (signed/unsigned, two's complement), IEEE 754 floating-point standard, programming data types, endianness (big-endian vs little-endian), memory layout and alignment, and practical applications across programming, networking, and storage.
Bits, Bytes & Words
All digital data starts with the bit—the smallest unit of information. Understanding how bits combine into larger units is the foundation of data representation.
Data Size Hierarchy
| Unit | Abbreviation | Bits | Bytes | Range of Values |
|---|---|---|---|---|
| Bit | b | 1 | 1/8 | 0 or 1 |
| Nibble | - | 4 | 1/2 | 0-15 (one hex digit) |
| Byte | B | 8 | 1 | 0-255 (unsigned) / -128 to 127 (signed) |
| Word (x86 convention; varies by architecture) | - | 16 | 2 | 0-65,535 |
| Double Word | DWORD | 32 | 4 | 0-4,294,967,295 |
| Quad Word | QWORD | 64 | 8 | 0-18,446,744,073,709,551,615 |
Real-World Byte Comparisons
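KB (kilobyte) = 1,000 bytes (decimal, used by storage manufacturers). KiB (kibibyte) = 1,024 bytes (binary, used for memory sizes and by operating systems such as Windows, which often still label it "KB"). A "500 GB" hard drive holds 500 × 10^9 bytes, which a system counting in binary units reports as about 465 GiB. Always clarify which standard you're using!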
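The 500 GB versus 465 GiB gap is plain arithmetic. Here is a minimal sketch; the helper name `decimal_gb_to_gib` is made up for illustration:

```python
GB = 10**9     # decimal gigabyte, as printed on drive labels
GIB = 2**30    # binary gibibyte (1,073,741,824 bytes)


def decimal_gb_to_gib(gigabytes: float) -> float:
    """Convert a capacity given in decimal GB to binary GiB."""
    return gigabytes * GB / GIB


print(f"{decimal_gb_to_gib(500):.2f} GiB")  # 465.66 GiB
```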
Character Encoding
How do computers represent letters, numbers, and symbols? Character encoding maps characters to numeric codes (which are then stored as binary).
ASCII: The Foundation
ASCII (American Standard Code for Information Interchange) uses 7 bits to represent 128 characters: uppercase/lowercase letters, digits, punctuation, and control characters.
| Character | Decimal | Binary | Hex | Category |
|---|---|---|---|---|
| 'A' | 65 | 01000001 | 0x41 | Uppercase letter |
| 'a' | 97 | 01100001 | 0x61 | Lowercase letter |
| '0' | 48 | 00110000 | 0x30 | Digit |
| ' ' | 32 | 00100000 | 0x20 | Space |
| '\n' | 10 | 00001010 | 0x0A | Newline (LF) |
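The rows in this table can be reproduced with Python's built-in `ord()` and `chr()`. This is a small sketch; `ascii_row` is an illustrative helper, not a standard function:

```python
def ascii_row(ch: str) -> tuple[int, str, str]:
    """Return (decimal, 8-bit binary, hex) for a single ASCII character."""
    code = ord(ch)                      # character -> numeric code
    return code, format(code, "08b"), f"0x{code:02X}"


for ch in ["A", "a", "0", " ", "\n"]:
    print(repr(ch), *ascii_row(ch))

# Uppercase and lowercase letters differ only in one bit (value 0x20 = 32).
print(chr(ord("A") | 0x20))             # a
```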
Extended ASCII & ANSI
ASCII was extended to 8 bits (256 characters) to support accented letters, currency symbols, and graphics. However, different regions used different extensions (ISO-8859-1 for Western Europe, ISO-8859-5 for Cyrillic, etc.), creating compatibility problems.
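To see what such a mismatch looks like in practice, the sketch below encodes a string one way and decodes it another. The sample word is arbitrary. Depending on the byte values, the result is either silently garbled text or an exception:

```python
text = "caf\u00e9"                        # "café"

utf8_bytes = text.encode("utf-8")         # b'caf\xc3\xa9'
# A single-byte decoder accepts any byte, so the wrong guess "works" but garbles.
garbled = utf8_bytes.decode("latin-1")
print(garbled)                            # cafÃ©

latin1_bytes = text.encode("latin-1")     # b'caf\xe9'
error = None
try:
    latin1_bytes.decode("utf-8")          # 0xE9 starts a multi-byte sequence that never completes
except UnicodeDecodeError as exc:
    error = exc
print(type(error).__name__)               # UnicodeDecodeError
```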
"UnicodeDecodeError" and "Mojibake" (garbled text) happen when bytes are decoded with the wrong encoding. Always explicitly specify encoding when reading/writing files: open('file.txt', encoding='utf-8').
Unicode & UTF-8
Unicode is the universal character standard that aims to represent every character from every writing system in the world—over 150,000 characters and growing.
Unicode vs UTF-8 vs UTF-16
| Encoding | Size per Character | Max Characters | Best For |
|---|---|---|---|
| ASCII | 7 bits | 128 | English text, legacy systems |
| UTF-8 | 1-4 bytes (variable) | 1,112,064 | Web, emails, most modern systems |
| UTF-16 | 2 or 4 bytes | 1,112,064 | Windows, Java, JavaScript |
| UTF-32 | 4 bytes (fixed) | 1,112,064 | Internal processing (wastes space) |
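The size trade-offs in the table are easy to check. This sketch measures the same text in each encoding; `encoded_sizes` is an illustrative helper, and the `-le` codec variants are used so that no byte order mark is added to the count:

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of `text` in three Unicode encodings (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


print(encoded_sizes("hello"))         # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
print(encoded_sizes("\u4e2d\u6587"))  # 中文 -> {'utf-8': 6, 'utf-16': 4, 'utf-32': 8}
```

ASCII-heavy text is smallest in UTF-8, while text made mostly of CJK characters can be smaller in UTF-16.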
How UTF-8 Works
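UTF-8 is a variable-width encoding that stays backward compatible with ASCII: ASCII characters use 1 byte, most accented Latin letters plus Greek, Cyrillic, Hebrew, and Arabic use 2 bytes, most other commonly used characters (including Chinese, Japanese, and Korean) use 3 bytes, and code points beyond the Basic Multilingual Plane (such as most emoji) use 4 bytes.
'A' → 1 byte: 0x41
'é' → 2 bytes: 0xC3 0xA9
'中' → 3 bytes: 0xE4 0xB8 0xAD
'😀' → 4 bytes: 0xF0 0x9F 0x98 0x80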
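The byte sequences listed above can be reproduced with `str.encode()`. This is a minimal sketch; `utf8_hex` is an illustrative helper, and the characters are written as escapes so the source file's own encoding cannot interfere:

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of `ch` as space-separated hex values."""
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))


samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]    # A, é, 中, 😀
for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
```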
Common Encoding Pitfalls
- BOM (Byte Order Mark): UTF-8 files sometimes start with 0xEF 0xBB 0xBF. This can cause issues in config files and scripts.
- Windows vs Unix line endings: Windows uses \r\n (CRLF), Unix uses \n (LF). Mixing them causes problems.
- Emoji length confusion: A simple emoji such as 😀 is 1 code point but 4 bytes in UTF-8. String length ≠ byte length.
- Collation/sorting: Different languages sort characters differently. "Å" comes after "Z" in both Swedish and Norwegian but is treated as a variant of "A" in English collation.
UTF-8 is the default for the web, modern programming languages, and databases. Always specify UTF-8 encoding explicitly. When in doubt, UTF-8 is the answer.
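Several of these pitfalls can be handled with the standard library alone. The sketch below uses a made-up `key=value` line as sample data; `utf-8-sig` is the standard Python codec that drops a leading BOM when one is present:

```python
import codecs

raw = codecs.BOM_UTF8 + "key=value\r\n".encode("utf-8")

with_bom = raw.decode("utf-8")           # BOM survives as an invisible '\ufeff'
without_bom = raw.decode("utf-8-sig")    # leading BOM is stripped

normalized = without_bom.replace("\r\n", "\n")   # CRLF -> LF

emoji = "\U0001F600"                     # 😀
print(len(emoji))                        # 1 (code points)
print(len(emoji.encode("utf-8")))        # 4 (bytes)
```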
Integer Representation
How do computers store whole numbers? The answer involves binary representation, signed vs unsigned integers, and two's complement.
Unsigned vs Signed Integers
| Type | 8-bit Range | 16-bit Range | 32-bit Range | 64-bit Range |
|---|---|---|---|---|
| Unsigned | 0 to 255 | 0 to 65,535 | 0 to 4.29 billion | 0 to 18.4 quintillion |
| Signed (Two's Complement) | -128 to 127 | -32,768 to 32,767 | about -2.15 billion to 2.15 billion | about -9.22 quintillion to 9.22 quintillion |
Two's Complement Explained
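In an n-bit two's complement system, a negative value -x is stored as the bit pattern of 2^n - x, which is the same as inverting every bit of x and adding 1. A minimal sketch for 8-bit values follows; both helper names are illustrative:

```python
def to_twos_complement(value: int, bits: int = 8) -> int:
    """Return the unsigned bit pattern that stores `value` in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise OverflowError(f"{value} does not fit in {bits} signed bits")
    return value & ((1 << bits) - 1)


def from_twos_complement(pattern: int, bits: int = 8) -> int:
    """Interpret an unsigned bit pattern (0 .. 2**bits - 1) as a signed value."""
    sign_bit = 1 << (bits - 1)
    return (pattern ^ sign_bit) - sign_bit


print(format(to_twos_complement(5), "08b"))     # 00000101
print(format(to_twos_complement(-5), "08b"))    # 11111011
print(from_twos_complement(0b11111011))         # -5
```

Because the same adder circuit works for both signed and unsigned patterns, hardware does not need separate logic for subtraction.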
Integer Overflow in Practice
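The Ariane 5 rocket was destroyed about 37 seconds after launch in 1996 when a 64-bit floating-point value was converted to a 16-bit signed integer and overflowed (a loss commonly put at $370M). The Boeing 787 had a software bug in which an internal counter could overflow after about 248 days of continuous power. Always validate input ranges and use safe arithmetic.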
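Python integers never overflow, so the sketch below imitates a 16-bit signed variable by hand. It does not reproduce either incident; it only contrasts silent wraparound with a checked conversion, and both function names are illustrative:

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap_to_int16(value: int) -> int:
    """Mimic an unchecked narrowing conversion: keep only the low 16 bits."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def checked_to_int16(value: int) -> int:
    """Reject values outside the 16-bit signed range instead of wrapping."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} is outside the int16 range")
    return value


print(wrap_to_int16(32767))    # 32767
print(wrap_to_int16(32768))    # -32768 (silent wraparound)
print(wrap_to_int16(40000))    # -25536
```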
Floating-Point Numbers
Not all numbers are whole. Computers represent fractional numbers using the IEEE 754 floating-point standard.
IEEE 754 Structure
| Format | Total Bits | Sign Bit | Exponent Bits | Mantissa Bits | Precision |
|---|---|---|---|---|---|
| Single (float) | 32 | 1 | 8 | 23 | ~7 decimal digits |
| Double (double) | 64 | 1 | 11 | 52 | ~15 decimal digits |
| Half (float16) | 16 | 1 | 5 | 10 | ~3 decimal digits |
| Quad (float128) | 128 | 1 | 15 | 112 | ~34 decimal digits |
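The three fields of a single-precision value can be pulled apart with the standard `struct` module. This is a sketch; `float32_fields` is an illustrative helper, and the exponent it returns is the stored value, which carries a bias of 127:

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Split x (rounded to single precision) into sign, exponent, mantissa bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                   # 1 bit
    exponent = (bits >> 23) & 0xFF      # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF          # 23 stored fraction bits
    return sign, exponent, mantissa


print(float32_fields(1.0))     # (0, 127, 0)
print(float32_fields(-2.5))    # (1, 128, 2097152)
```

For -2.5 the stored exponent 128 means 2^(128-127) = 2, and the mantissa 2097152 / 2^23 = 0.25 gives the significand 1.25, so the value is -1.25 × 2.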
The Floating-Point Surprise
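The classic surprise is that 0.1 + 0.2 does not equal 0.3. Neither 0.1 nor 0.2 has an exact binary representation, so each is stored as the nearest double and the rounding errors show up in the sum. A short Python demonstration:

```python
from decimal import Decimal

total = 0.1 + 0.2
print(total)             # 0.30000000000000004
print(total == 0.3)      # False

# Decimal(float) reveals the exact value the double actually holds.
print(Decimal(0.1))
```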
Special Floating-Point Values
- NaN (Not a Number): Result of undefined operations (0/0, √-1, ∞ - ∞)
- +Infinity / -Infinity: Result of overflow or of dividing a nonzero number by zero (1/0 = +∞)
- -0 (Negative Zero): Compares equal to +0 but keeps its sign bit, so 1/-0 = -∞
- Subnormal numbers: Very small numbers near zero (gradual underflow)
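These special values can be inspected from Python. One caveat for this sketch: Python raises `ZeroDivisionError` for `1 / 0.0` instead of returning infinity, so infinity is produced here by overflow:

```python
import math

nan = float("nan")
inf = float("inf")

print(nan == nan)                    # False: NaN is unequal to everything, itself included
print(math.isnan(inf - inf))         # True
print(1e308 * 10)                    # inf (overflow)
print(-0.0 == 0.0)                   # True
print(math.copysign(1.0, -0.0))      # -1.0: the sign bit is still there
print(5e-324 > 0.0)                  # True: smallest positive subnormal double
print(5e-324 / 2)                    # 0.0: nothing smaller can be represented
```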
• Use double (64-bit) by default; only use float (32-bit) for memory-constrained scenarios
• Never use == for float comparison; use epsilon-based comparison
• For financial calculations, use fixed-point or decimal types
• Be aware of precision loss in repeated operations (rounding errors accumulate)
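In Python these recommendations map onto standard-library tools. The sketch below uses illustrative tolerance values and prices; pick tolerances that suit your own data:

```python
import math
from decimal import Decimal


def nearly_equal(a: float, b: float) -> bool:
    """Tolerance-based comparison; the tolerances here are illustrative."""
    return math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)


print(nearly_equal(0.1 + 0.2, 0.3))           # True

# Decimal types keep money amounts exact when built from strings.
price = Decimal("0.10") + Decimal("0.20")
print(price == Decimal("0.30"))               # True

# Repeated addition accumulates rounding error; math.fsum compensates for it.
running = 0.0
for _ in range(10):
    running += 0.1
print(running == 1.0)                         # False
print(math.fsum([0.1] * 10) == 1.0)           # True
```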
Data Types in Programming
Programming languages provide data types that determine how values are stored, what operations are allowed, and how much memory is used.
Common Data Types Across Languages
| Type | Python | JavaScript | Java | C | Size |
|---|---|---|---|---|---|
| Boolean | bool | boolean | boolean | _Bool | 1 byte |
| Integer | int (arbitrary) | number (64-bit float) | int (32-bit) | int (typically 32-bit) | 2-8 bytes |
| Floating-point | float (64-bit) | number (64-bit) | double (64-bit) | double (64-bit) | 4-8 bytes |
| String | str (Unicode) | string (UTF-16) | String (UTF-16) | char* (null-terminated) | Variable |
| Array/List | list | Array | array[] | array[] | Variable |
Static vs Dynamic Typing
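In a statically typed language such as Java or C, each variable has a declared type and mismatches are rejected at compile time. In a dynamically typed language such as Python, types belong to values and are checked only when an operation runs. A small sketch of the dynamic side, where the function `add` is just an example:

```python
value = 42              # the name is bound to an int object
value = "forty-two"     # rebinding to a str is allowed: the type follows the value


def add(a: int, b: int) -> int:
    """Type hints document intent but are not enforced at runtime."""
    return a + b


print(add(1, 2))        # 3
print(add("a", "b"))    # ab  (the hints did not stop this call)

try:
    add(1, "b")         # fails only when the + operation actually executes
except TypeError as exc:
    print("TypeError:", exc)
```

A static type checker such as mypy can flag the last two calls before the program runs, which recovers part of the static-typing safety net.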
Memory Size of Common Types
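Sizes of C-style types can be queried from Python through `ctypes` and `struct`, while `sys.getsizeof` shows the much larger footprint of Python objects. This is a sketch: the `getsizeof` numbers depend on the interpreter version and platform, so no output is shown for them:

```python
import ctypes
import struct
import sys

# Fixed-width C types have the same size on every mainstream platform.
c_sizes = {
    "int8": ctypes.sizeof(ctypes.c_int8),
    "int16": ctypes.sizeof(ctypes.c_int16),
    "int32": ctypes.sizeof(ctypes.c_int32),
    "int64": ctypes.sizeof(ctypes.c_int64),
    "float": ctypes.sizeof(ctypes.c_float),
    "double": ctypes.sizeof(ctypes.c_double),
}
print(c_sizes)

# struct's "standard" sizes (formats prefixed with <, >, = or !) are fixed too.
print(struct.calcsize("<h"), struct.calcsize("<i"), struct.calcsize("<q"))   # 2 4 8

# Python objects carry header overhead on top of the raw value.
for obj in (True, 0, 1.0, "", []):
    print(type(obj).__name__, sys.getsizeof(obj))
```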
• Use the smallest type that fits your data (saves memory)
• Use signed for values that can be negative, unsigned for counts/indices
• Use double for floating-point unless memory is tight
• Use fixed-width types (int32_t, uint64_t) for portability
Endianness: Byte Order Matters
When storing multi-byte values (like a 32-bit integer), which byte goes first? This is endianness.
Big-Endian vs Little-Endian
| Format | Byte Order | Example (0x12345678) | Used By |
|---|---|---|---|
| Big-Endian | Most significant byte first | 12 34 56 78 | Network protocols, Java, Motorola |
| Little-Endian | Least significant byte first | 78 56 34 12 | x86/x64 CPUs, Windows, ARM (configurable) |
Why Endianness Matters
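The same four bytes mean different numbers depending on the order in which they are read. A short sketch using the 0x12345678 example from the table:

```python
import sys

value = 0x12345678

big = value.to_bytes(4, "big")
little = value.to_bytes(4, "little")
print(big.hex(" "))        # 12 34 56 78
print(little.hex(" "))     # 78 56 34 12

# Reading little-endian bytes as if they were big-endian gives a different number.
misread = int.from_bytes(little, "big")
print(hex(misread))        # 0x78563412

print(sys.byteorder)       # byte order of the machine running this script
```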
Endianness in Practice
- Network protocols: TCP/IP uses big-endian ("network byte order")
- File formats: PNG uses big-endian, BMP uses little-endian
- Unicode BOM: UTF-16 files may start with the bytes 0xFE 0xFF (big-endian) or 0xFF 0xFE (little-endian)
- Cross-platform code: Always be explicit about byte order when serializing data
Reading a little-endian file on a big-endian system (or vice versa) without swapping bytes produces garbage values. Prefer serialization formats that define their own byte order or are text-based (Protocol Buffers, MessagePack, JSON), so the library handles endianness for you.
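When you do have to serialize by hand, spell out the byte order in the format string. The sketch below uses Python's `struct` module, where `!` selects network (big-endian) order; the two-field header with a port and a length is a made-up layout for illustration:

```python
import struct

HEADER = struct.Struct("!HI")    # network order: uint16 port, uint32 length


def pack_header(port: int, length: int) -> bytes:
    """Serialize the hypothetical header in network byte order."""
    return HEADER.pack(port, length)


def unpack_header(data: bytes) -> tuple[int, int]:
    """Parse the first HEADER.size bytes back into (port, length)."""
    return HEADER.unpack(data[:HEADER.size])


wire = pack_header(8080, 1024)
print(wire.hex())                # 1f9000000400
print(unpack_header(wire))       # (8080, 1024)
```

Because the format string fixes the byte order, the same bytes are produced on little-endian and big-endian machines.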
Memory Layout & Alignment
How data is arranged in memory affects performance. Memory alignment means each value starts at an address that is a multiple of its type's alignment requirement, which for primitive types usually equals the type's size.
Memory Alignment Rules
| Data Type | Size | Alignment Requirement | Valid Addresses |
|---|---|---|---|
| char | 1 byte | 1-byte aligned | Any address |
| short | 2 bytes | 2-byte aligned | Even addresses (0, 2, 4...) |
| int / float | 4 bytes | 4-byte aligned | Addresses divisible by 4 |
| double / long (on LP64 platforms such as 64-bit Linux and macOS) | 8 bytes | 8-byte aligned | Addresses divisible by 8 |
Struct Padding Example
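Padding can be observed without a C compiler, because `ctypes.Structure` follows the platform's C layout rules. The field names below are hypothetical, and the sizes in the comments assume a mainstream platform where a 32-bit integer is 4-byte aligned:

```python
import ctypes


class Padded(ctypes.Structure):
    # 1-byte field, then 3 padding bytes so `count` lands on a 4-byte boundary,
    # then 1 byte plus 3 trailing padding bytes to keep arrays of structs aligned.
    _fields_ = [
        ("flag", ctypes.c_char),
        ("count", ctypes.c_int32),
        ("tag", ctypes.c_char),
    ]


class Reordered(ctypes.Structure):
    # Largest member first: the two 1-byte fields share one 4-byte slot.
    _fields_ = [
        ("count", ctypes.c_int32),
        ("flag", ctypes.c_char),
        ("tag", ctypes.c_char),
    ]


print(ctypes.sizeof(Padded))       # 12 on typical platforms
print(ctypes.sizeof(Reordered))    # 8 on typical platforms
print(Padded.count.offset)         # 4: three padding bytes precede `count`
```

Both structs hold 6 bytes of data; only the member order changes how much padding the compiler has to insert.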
Cache Lines & Performance
- Cache line: Typically 64 bytes. CPUs fetch memory in cache-line-sized chunks.
- False sharing: Two variables on the same cache line modified by different threads cause cache invalidation.
- Data locality: Accessing data sequentially (array traversal) is much faster than random access.
- Structure of Arrays vs Array of Structures: SoA is often better for SIMD/vectorization.
• Order struct members by size (largest first) to minimize padding
• Use __attribute__((packed)) (a GCC/Clang extension) in C when an exact binary layout matters (but expect a performance cost from unaligned access)
• Prefer arrays over linked lists for cache-friendly access
• Profile memory usage before optimizing—don't guess
Practical Applications
Data representation concepts appear everywhere in programming. Here are common scenarios where this knowledge matters.
Common Use Cases
File I/O
Reading binary files requires understanding byte order, padding, and data types.
Network Protocols
Network byte order (big-endian), serialization, and protocol parsing.
Database Storage
How databases store numbers, strings, and dates on disk.
Cryptography
Encryption operates on bytes. Understanding byte-level operations is essential.
Real-World Example: Parsing a Binary File
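As one possible illustration, the sketch below reads the start of a BMP header with Python's `struct` module. BMP stores its fields in little-endian order, hence the `<` prefix. The function name and the synthetic sample bytes are made up for this example, and only the first few header fields are decoded:

```python
import struct

FILE_HEADER = struct.Struct("<2sIHHI")    # signature, file size, 2 reserved, pixel offset
INFO_PREFIX = struct.Struct("<IiiHH")     # header size, width, height, planes, bits/pixel


def parse_bmp_header(data: bytes) -> dict:
    """Decode the leading fields of a BMP file held in memory."""
    if len(data) < FILE_HEADER.size + INFO_PREFIX.size:
        raise ValueError("data too short for a BMP header")
    signature, file_size, _, _, pixel_offset = FILE_HEADER.unpack_from(data, 0)
    if signature != b"BM":
        raise ValueError("missing 'BM' signature")
    header_size, width, height, planes, bpp = INFO_PREFIX.unpack_from(
        data, FILE_HEADER.size
    )
    return {
        "file_size": file_size,
        "pixel_offset": pixel_offset,
        "header_size": header_size,
        "width": width,
        "height": height,
        "bits_per_pixel": bpp,
    }


# Synthetic header for a 2x2, 24-bit image (no pixel data attached).
sample = FILE_HEADER.pack(b"BM", 70, 0, 0, 54) + INFO_PREFIX.pack(40, 2, 2, 1, 24)
print(parse_bmp_header(sample))
```

For a real file, read the first 30 bytes in binary mode (`open(path, "rb")`) and pass them to the function.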
Don't manually parse binary formats unless necessary. Use established libraries: json for text, struct or construct for binary, protobuf or messagepack for efficient cross-platform serialization.
Conclusion
Data representation is the invisible foundation of all computing. From the single bit that stores a boolean to the complex UTF-8 encoding that represents emojis, understanding how data is stored and processed makes you a more effective programmer.
Key Takeaways
- Bits and bytes: 8 bits = 1 byte; understand the size hierarchy
- Always use UTF-8: It's the universal standard for text encoding
- Two's complement: How computers store negative integers
- IEEE 754: Floating-point has precision limits—never use == for comparison
- Endianness: Big-endian for networks, little-endian for most CPUs
- Memory alignment: Matters for performance and binary compatibility
- Choose types wisely: Right type = correct behavior + efficient memory usage
Next Steps
- Practice encoding: Experiment with encode()/decode() in Python
- Read binary files: Try parsing a simple format (BMP, WAV, or PNG header)
- Understand your language: Check how your language handles integers, floats, and strings
- Explore serialization: Compare JSON, Protocol Buffers, and MessagePack
- Profile memory: Use tools to understand your program's memory usage
Programs must be written for people to read, and only incidentally for machines to execute.
Open your terminal. Type python3 -c "print('😀'.encode('utf-8'))". See how a single emoji becomes 4 bytes. That's data representation in action.
Thank you for reading this comprehensive data representation guide. Whether you're parsing binary files, debugging encoding issues, or optimizing memory usage, understanding how data is represented will make you a more confident and capable developer. Keep learning, keep experimenting, and keep building!