Complete Character Encoding Guide | ASCII, UTF-8, UTF-16, EUC-KR

Key Takeaways

This guide covers the principles of, and differences between, the major character encodings: ASCII, ANSI, Unicode, UTF-8, UTF-16, UTF-32, EUC-KR, and CP949. With practical examples, it takes you from fixing corrupted Korean text all the way to understanding BOMs and endianness.

Introduction: Why Should You Know Character Encoding?

Sooner or later every developer runs into corrupted Korean text, unreadable files, or garbled API responses. The root cause of all of these problems is character encoding.

What This Article Covers:

  • History of ASCII, ANSI, Unicode
  • UTF-8, UTF-16, UTF-32 encoding methods
  • Korean encoding (EUC-KR, CP949)
  • BOM, Endian, encoding detection
  • Practical problem solving

Reality in Practice

When learning development, everything seems clean and theoretical. But practice is different. You wrestle with legacy code, chase tight deadlines, and face unexpected bugs. The content covered in this article was initially learned as theory, but it was through applying it to actual projects that I realized “Ah, this is why it’s designed this way.”

What stands out in my memory is the trial and error from my first project. I did everything by the book but couldn’t figure out why it wasn’t working, spending days struggling. Eventually, through a senior developer’s code review, I discovered the problem and learned a lot in the process. In this article, I’ll cover not just theory but also the pitfalls you might encounter in practice and how to solve them.

Table of Contents

  1. History of Character Encoding
  2. ASCII: 7-bit Character Set
  3. ANSI and Code Pages
  4. Unicode: Global Character Integration
  5. UTF-8: Variable Length Encoding
  6. UTF-16 and UTF-32
  7. Korean Encoding (EUC-KR, CP949)
  8. BOM and Endian
  9. Practical Problem Solving
  10. Programming Language-specific Handling

1. History of Character Encoding

Timeline

timeline
    title Character Encoding Evolution
    1963 : ASCII established\n7-bit, 128 chars
    1987 : ISO-8859-1 (Latin-1)\n8-bit, 256 chars
    1991 : Unicode 1.0\n16-bit unified charset
    1992 : UTF-8 invented\nVariable length encoding
    1996 : UTF-16\nSurrogate pairs
    2003 : UTF-8 web standardization
    2008 : UTF-8 most used\non the web
    2020s : UTF-8 ~98% of the web

Why Do Multiple Encodings Exist?

flowchart TB
    Problem["Problem: Computers\nonly understand numbers"]
    
    ASCII["ASCII\n128 English chars"]
    Extended["Extended ASCII\n256 chars per language"]
    Unicode["Unicode\nGlobal character integration"]
    
    Problem --> ASCII
    ASCII --> Extended
    Extended --> Unicode
    
    ASCII --> Issue1["Problem: Cannot express\nKorean, Chinese"]
    Extended --> Issue2["Problem: Different\ncode pages per country"]
    Unicode --> Solution["Solution: Assign unique\nnumber to all characters"]

2. ASCII: 7-bit Character Set

What is ASCII?

ASCII (American Standard Code for Information Interchange) represents English alphabet, numbers, and special characters with 7 bits (0-127).

ASCII Table

Dec  Hex  Char  |  Dec  Hex  Char  |  Dec  Hex  Char
-------------------------------------------------
 32  20   Space |  64  40   @      |  96  60   `
 33  21   !     |  65  41   A      |  97  61   a
 34  22   "     |  66  42   B      |  98  62   b
 35  23   #     |  67  43   C      |  99  63   c
...
 48  30   0     |  80  50   P      | 112  70   p
 49  31   1     |  81  51   Q      | 113  71   q
...
 57  39   9     |  90  5A   Z      | 122  7A   z

ASCII Control Characters

# Main control characters
NUL = 0x00  # Null
LF  = 0x0A  # Line Feed (\n)
CR  = 0x0D  # Carriage Return (\r)
ESC = 0x1B  # Escape
DEL = 0x7F  # Delete

# Line break methods
# Unix/Linux: LF (\n)
# Windows: CR+LF (\r\n)
# Mac (Classic): CR (\r)

ASCII Examples

# Character → Code
ord('A')  # 65
ord('a')  # 97
ord('0')  # 48

# Code → Character
chr(65)   # 'A'
chr(97)   # 'a'

# Check ASCII range
def is_ascii(text):
    return all(ord(c) < 128 for c in text)

is_ascii("Hello")  # True
is_ascii("안녕")    # False

3. ANSI and Code Pages

What is ANSI?

So-called "ANSI" encodings extend ASCII to 8 bits (0-255) to support each country's language. The name is a misnomer (these are really vendor code pages), and the meaning of the 128-255 range differs per Code Page.

Major Code Pages

Code Page   | Name             | Region         | Features
------------|------------------|----------------|------------------
CP437       | OEM-US           | USA            | DOS default
CP850       | Latin-1          | Western Europe | DOS multilingual
CP949       | Extended Wansung | Korea          | Windows Korean
CP932       | Shift-JIS        | Japan          | Windows Japanese
CP936       | GBK              | China          | Windows Chinese
ISO-8859-1  | Latin-1          | Western Europe | Unix/Web
ISO-8859-15 | Latin-9          | Western Europe | Adds Euro (€)

Code Page Problems

# Same byte value, different meaning depending on the code page
byte_value = 0xC7

# ISO-8859-1 (Latin-1): a complete character on its own
text_latin = bytes([byte_value]).decode('latin-1')  # 'Ç'

# CP949 (Korean): 0xC7 is only a lead byte, so decoding it alone fails
try:
    bytes([byte_value]).decode('cp949')
except UnicodeDecodeError:
    print("0xC7 needs a trailing byte in CP949")

# In CP949, the pair 0xC7 0xD1 decodes to '한'
print(bytes([0xC7, 0xD1]).decode('cp949'))  # 한

# Reading the same bytes with a different code page corrupts the text!

4. Unicode: Global Character Integration

What is Unicode?

Unicode is a character set that assigns unique Code Points to all characters worldwide.

Unicode Structure

U+0000 ~ U+10FFFF (1,114,112 code points)

U+0000 ~ U+007F   : ASCII (128 chars)
U+0080 ~ U+00FF   : Latin-1 Supplement
U+0100 ~ U+017F   : Latin Extended-A
U+0370 ~ U+03FF   : Greek
U+0400 ~ U+04FF   : Cyrillic
U+0600 ~ U+06FF   : Arabic
U+0E00 ~ U+0E7F   : Thai
U+3040 ~ U+309F   : Hiragana (Japanese)
U+30A0 ~ U+30FF   : Katakana (Japanese)
U+4E00 ~ U+9FFF   : CJK Unified Ideographs (Chinese/Japanese/Korean)
U+AC00 ~ U+D7AF   : Hangul Syllables (Korean 11,172 chars)
U+1F600 ~ U+1F64F : Emoticons (Emoji)

Korean Unicode Range

# Korean syllables (가-힣)
print(f"가: U+{ord('가'):04X}")  # U+AC00
print(f"힣: U+{ord('힣'):04X}")  # U+D7A3

# Korean letters (ㄱ-ㅎ, ㅏ-ㅣ)
print(f"ㄱ: U+{ord('ㄱ'):04X}")  # U+3131
print(f"ㅎ: U+{ord('ㅎ'):04X}")  # U+314E
print(f"ㅏ: U+{ord('ㅏ'):04X}")  # U+314F
print(f"ㅣ: U+{ord('ㅣ'):04X}")  # U+3163

# Emoji
print(f"😀: U+{ord('😀'):04X}")  # U+1F600
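The Hangul Syllables block is laid out algorithmically: code point = 0xAC00 + (initial × 21 + medial) × 28 + final. The sketch below decomposes a syllable using that formula (the helper and table names are mine, not from a library):

```python
# Decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into jamo,
# using the arithmetic layout of the Hangul Syllables block.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initials
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"       # 21 medials
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals (incl. none)

def decompose(syllable: str):
    index = ord(syllable) - 0xAC00          # offset into the block
    initial, rest = divmod(index, 21 * 28)  # 21 medials x 28 finals per initial
    medial, final = divmod(rest, 28)
    return CHOSEONG[initial], JUNGSEONG[medial], JONGSEONG[final]

print(decompose("한"))  # ('ㅎ', 'ㅏ', 'ㄴ')
```

This is why the 11,172 syllables fit in one contiguous block: 19 × 21 × 28 = 11,172.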

Unicode vs Encoding

Unicode: Character Set
         Assigns number (code point) to each character
         Example: '한' = U+D55C

UTF-8/UTF-16/UTF-32: Encoding
                      Method to convert code points to bytes
                      Example: U+D55C → UTF-8: ED 95 9C (3 bytes)
                                      → UTF-16: D5 5C (2 bytes)
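This distinction is easy to verify in Python: one character, one code point, but several different byte serializations:

```python
ch = '한'                                # one character, one code point
print(f"U+{ord(ch):04X}")                # U+D55C (the Unicode code point)

# The same code point, serialized three different ways
print(ch.encode('utf-8').hex())          # ed959c   (3 bytes)
print(ch.encode('utf-16-be').hex())      # d55c     (2 bytes)
print(ch.encode('utf-32-be').hex())      # 0000d55c (4 bytes)
```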

5. UTF-8: Variable Length Encoding

What is UTF-8?

UTF-8 encodes Unicode with 1-4 byte variable length. It’s the web standard and perfectly compatible with ASCII.

UTF-8 Encoding Rules

Code Point Range       | Bytes | Encoding Pattern
U+0000   ~ U+007F     | 1     | 0xxxxxxx
U+0080   ~ U+07FF     | 2     | 110xxxxx 10xxxxxx
U+0800   ~ U+FFFF     | 3     | 1110xxxx 10xxxxxx 10xxxxxx
U+10000  ~ U+10FFFF   | 4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
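The table above maps directly to code. Here is a minimal sketch of a UTF-8 encoder that follows those bit patterns (illustrative only; in real code just use str.encode('utf-8')):

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the UTF-8 bit patterns above."""
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 11110xxx + 3 continuation bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

print(utf8_encode(ord('한')).hex())  # ed959c — matches '한'.encode('utf-8')
```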

UTF-8 Encoding Examples

English ‘A’ (U+0041)

Code Point: U+0041 (65)
Binary: 0100 0001

UTF-8 Encoding:
0100 0001 = 0x41 (1 byte)

Memory: 41

Korean ‘한’ (U+D55C)

Code Point: U+D55C (54,620)
Binary: 1101 0101 0101 1100

UTF-8 Encoding (3 bytes):
1110xxxx 10xxxxxx 10xxxxxx
1110 1101  10 010101  10 011100
   E   D      9   5      9   C

Memory: ED 95 9C

Emoji ‘😀’ (U+1F600)

Code Point: U+1F600 (128,512)
Binary: 0001 1111 0110 0000 0000

UTF-8 Encoding (4 bytes):
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
11110 000  10 011111  10 011000  10 000000
   F   0      9   F      9   8      8   0

Memory: F0 9F 98 80

UTF-8 Advantages

flowchart TB
    UTF8[UTF-8]
    
    Adv1["✅ ASCII compatible\nEnglish is 1 byte"]
    Adv2["✅ Self-synchronizing\nCan read from middle"]
    Adv3["✅ Byte order independent\nNo Endian issues"]
    Adv4["✅ Web standard\n98% market share"]
    
    UTF8 --> Adv1
    UTF8 --> Adv2
    UTF8 --> Adv3
    UTF8 --> Adv4
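The self-synchronizing property is easy to demonstrate: continuation bytes always match the pattern 10xxxxxx, so a reader dropped into the middle of a stream can skip forward to the next lead byte. A small sketch (the helper name is mine):

```python
data = "한글".encode('utf-8')        # ed 95 9c ea b8 80

def next_char_start(buf: bytes, pos: int) -> int:
    """Advance pos past continuation bytes (10xxxxxx) to the next lead byte."""
    while pos < len(buf) and buf[pos] & 0xC0 == 0x80:
        pos += 1
    return pos

# Start mid-character: index 1 is inside '한'
start = next_char_start(data, 1)
print(start)                         # 3
print(data[start:].decode('utf-8'))  # 글
```

Legacy multi-byte encodings like EUC-KR lack this property: a trail byte can look like a lead byte, so losing your place can corrupt everything after it.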

UTF-8 Encoding with Python

# String → Bytes
text = "Hello 한글 😀"

# UTF-8 encoding
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
# b'Hello \xed\x95\x9c\xea\xb8\x80 \xf0\x9f\x98\x80'

# Byte analysis
for i, byte in enumerate(utf8_bytes):
    print(f"{i:2d}: 0x{byte:02X} ({byte:3d}) {chr(byte) if byte < 128 else '?'}")

# Output:
#  0: 0x48 ( 72) H
#  1: 0x65 (101) e
#  2: 0x6C (108) l
#  3: 0x6C (108) l
#  4: 0x6F (111) o
#  5: 0x20 ( 32)  
#  6: 0xED (237) ?  ← '한' start
#  7: 0x95 (149) ?
#  8: 0x9C (156) ?
#  9: 0xEA (234) ?  ← '글' start
# 10: 0xB8 (184) ?
# 11: 0x80 (128) ?
# 12: 0x20 ( 32)  
# 13: 0xF0 (240) ?  ← '😀' start
# 14: 0x9F (159) ?
# 15: 0x98 (152) ?
# 16: 0x80 (128) ?

# Bytes → String
decoded = utf8_bytes.decode('utf-8')
print(decoded)  # "Hello 한글 😀"

6. UTF-16 and UTF-32

UTF-16

UTF-16 encodes with 2 or 4 bytes. Used internally in Windows, Java, and JavaScript.

UTF-16 Encoding Rules

Code Point Range       | Bytes | Method
U+0000   ~ U+FFFF     | 2     | Direct encoding
U+10000  ~ U+10FFFF   | 4     | Surrogate pair

Surrogate Pair

# Encode emoji '😀' (U+1F600) to UTF-16

# 1. U+1F600 - 0x10000 = 0xF600
# 2. High 10 bits: 0x3D (61)
# 3. Low 10 bits: 0x200 (512)
# 4. High Surrogate: 0xD800 + 0x3D = 0xD83D
# 5. Low Surrogate: 0xDC00 + 0x200 = 0xDE00

text = "😀"
utf16_bytes = text.encode('utf-16-le')
print(utf16_bytes.hex())  # '3dd800de' (Little-Endian: D83D → 3d d8, DE00 → 00 de)

# UTF-16 BE (Big-Endian)
utf16_be = text.encode('utf-16-be')
print(utf16_be.hex())  # 'd83dde00'
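The arithmetic in the comments above generalizes to any supplementary-plane code point. A small helper sketch (the function name is mine, not a standard API):

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    offset = code_point - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)           # top 10 bits → high surrogate
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits → low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```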

UTF-16 Example

text = "Hello 한글"

# UTF-16 LE (Little-Endian)
utf16_le = text.encode('utf-16-le')
print(utf16_le.hex())
# 48 00 65 00 6c 00 6c 00 6f 00 20 00 5c d5 00 ae  ('한' = 5c d5, '글' = 00 ae)

# UTF-16 BE (Big-Endian)
utf16_be = text.encode('utf-16-be')
print(utf16_be.hex())
# 00 48 00 65 00 6c 00 6c 00 6f 00 20 d5 5c ae 00

UTF-32

UTF-32 encodes all characters with fixed 4-byte length.

text = "A한😀"

# UTF-32 LE
utf32 = text.encode('utf-32-le')
print(utf32.hex())
# 41 00 00 00  5c d5 00 00  00 f6 01 00

# Each character is exactly 4 bytes
# 'A':  0x00000041
# '한': 0x0000D55C
# '😀': 0x0001F600

Encoding Comparison

text = "Hello 한글 😀"

encodings = ['utf-8', 'utf-16-le', 'utf-16-be', 'utf-32-le']

for enc in encodings:
    encoded = text.encode(enc)
    print(f"{enc:12s}: {len(encoded):2d} bytes | {encoded.hex()[:40]}...")

# Output:
# utf-8       : 17 bytes | 48656c6c6f20ed959ceab88020f09f9880...
# utf-16-le   : 22 bytes | 480065006c006c006f0020005cd500ae20003dd8...
# utf-16-be   : 22 bytes | 00480065006c006c006f0020d55cae000020d83d...
# utf-32-le   : 40 bytes | 41000000650000006c0000006c0000006f000000...

7. Korean Encoding (EUC-KR, CP949)

Korean Encoding History

timeline
    title Korean Encoding Evolution
    1987 : KS X 1001\nWansung, 2,350 chars
    1992 : EUC-KR\nWansung standard
    1996 : CP949 (MS)\nExtended Wansung, 11,172 chars
    2000s : UTF-8\nUnicode based

EUC-KR

EUC-KR represents the 2,350 Korean syllables of the KS X 1001 (Wansung) set with 2 bytes each; ASCII characters remain 1 byte.

# EUC-KR encoding
text = "한글"

euckr_bytes = text.encode('euc-kr')
print(euckr_bytes.hex())  # c7d1 b1db

# '한': 0xC7D1
# '글': 0xB1DB

# Problem: Characters like '똠', '쀍' cannot be represented
try:
    "똠".encode('euc-kr')
except UnicodeEncodeError as e:
    print(f"❌ Cannot encode to EUC-KR: {e}")

CP949 (Extended Wansung)

CP949 extends EUC-KR to support all 11,172 characters.

# CP949 encoding
text = "똠방각하"

cp949_bytes = text.encode('cp949')
print(cp949_bytes.hex())

# Can represent characters not in EUC-KR
text2 = "쀍똠뙠"
print(text2.encode('cp949').hex())

UTF-8 vs EUC-KR Comparison

text = "Hello 한글"

# UTF-8: English 1 byte, Korean 3 bytes
utf8 = text.encode('utf-8')
print(f"UTF-8:   {len(utf8)} bytes | {utf8.hex()}")
# UTF-8:   12 bytes | 48656c6c6f20ed959ceab880

# EUC-KR: English 1 byte, Korean 2 bytes
euckr = text.encode('euc-kr')
print(f"EUC-KR:  {len(euckr)} bytes | {euckr.hex()}")
# EUC-KR:  10 bytes | 48656c6c6f20c7d1b1db

8. BOM and Endian

BOM (Byte Order Mark)

The BOM is a special byte sequence at the start of a file that indicates the encoding and, for UTF-16/UTF-32, the byte order.

Encoding   | BOM (hex)      | Size
UTF-8      | EF BB BF       | 3 bytes
UTF-16 LE  | FF FE          | 2 bytes
UTF-16 BE  | FE FF          | 2 bytes
UTF-32 LE  | FF FE 00 00    | 4 bytes
UTF-32 BE  | 00 00 FE FF    | 4 bytes

BOM Example

# UTF-8 with BOM
text = "Hello"
with open('file_with_bom.txt', 'wb') as f:
    f.write(b'\xef\xbb\xbf')  # BOM
    f.write(text.encode('utf-8'))

# File content (hex):
# EF BB BF 48 65 6C 6C 6F
# ^^^^^^^^ BOM
#          ^^^^^^^^^^^^^^ "Hello"

# UTF-8 without BOM (recommended)
with open('file_no_bom.txt', 'wb') as f:
    f.write(text.encode('utf-8'))

# File content (hex):
# 48 65 6C 6C 6F
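For UTF-8 specifically, Python's built-in 'utf-8-sig' codec can manage the BOM for you, which is less error-prone than writing the bytes by hand:

```python
# The 'utf-8-sig' codec writes the UTF-8 BOM on encode...
with open('file_with_bom.txt', 'w', encoding='utf-8-sig') as f:
    f.write("Hello")

with open('file_with_bom.txt', 'rb') as f:
    print(f.read())            # b'\xef\xbb\xbfHello' (BOM present on disk)

# ...and strips it transparently on decode
with open('file_with_bom.txt', 'r', encoding='utf-8-sig') as f:
    print(f.read())            # Hello

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character
with open('file_with_bom.txt', 'r', encoding='utf-8') as f:
    print(repr(f.read()[:1]))  # '\ufeff'
```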

BOM Detection and Removal

def detect_and_remove_bom(data):
    """Detect and remove BOM"""
    bom_signatures = [
        (b'\xef\xbb\xbf', 'utf-8-sig'),
        (b'\xff\xfe\x00\x00', 'utf-32-le'),
        (b'\x00\x00\xfe\xff', 'utf-32-be'),
        (b'\xff\xfe', 'utf-16-le'),
        (b'\xfe\xff', 'utf-16-be'),
    ]
    
    for bom, encoding in bom_signatures:
        if data.startswith(bom):
            return data[len(bom):], encoding
    
    return data, None

# Usage
with open('file.txt', 'rb') as f:
    data = f.read()

data, encoding = detect_and_remove_bom(data)
if encoding:
    print(f"✅ BOM detected: {encoding}")
    text = data.decode(encoding.replace('-sig', ''))
else:
    print("ℹ️  No BOM, assuming UTF-8")
    text = data.decode('utf-8')

Endian (Byte Order)

# Big-Endian: Large byte first
# Little-Endian: Small byte first

# Example: Store 0x1234 in memory
# Big-Endian:    12 34
# Little-Endian: 34 12

# Important in UTF-16
text = "한"  # U+D55C

# UTF-16 BE (Big-Endian)
be = text.encode('utf-16-be')
print(be.hex())  # d5 5c

# UTF-16 LE (Little-Endian)
le = text.encode('utf-16-le')
print(le.hex())  # 5c d5

# UTF-8 is byte-based, so Endian independent
utf8 = text.encode('utf-8')
print(utf8.hex())  # ed 95 9c (always same)

9. Practical Problem Solving

Problem 1: Korean Character Corruption (���)

Cause

# ❌ Saved as UTF-8 but read as EUC-KR
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write("한글")

# Incorrect reading: the UTF-8 bytes are not valid EUC-KR,
# so this raises UnicodeDecodeError (a laxer codec would yield mojibake instead)
with open('file.txt', 'r', encoding='euc-kr') as f:
    text = f.read()
    print(text)

Solution

# ✅ Read with correct encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    print(text)  # '한글' (correct)

# ✅ Auto-detect encoding
import chardet

with open('file.txt', 'rb') as f:
    raw_data = f.read()
    result = chardet.detect(raw_data)
    encoding = result['encoding']
    confidence = result['confidence']
    
    print(f"Detected: {encoding} ({confidence*100:.1f}% confidence)")
    
    text = raw_data.decode(encoding)
    print(text)

Problem 2: UnicodeDecodeError

# ❌ Decode with wrong encoding
utf8_bytes = "한글".encode('utf-8')

try:
    text = utf8_bytes.decode('ascii')
except UnicodeDecodeError as e:
    print(f"❌ {e}")
    # 'ascii' codec can't decode byte 0xed in position 0

# ✅ Error handling options
# 1. Ignore
text = utf8_bytes.decode('ascii', errors='ignore')
print(text)  # "" (non-ASCII bytes dropped)

# 2. Replace
text = utf8_bytes.decode('ascii', errors='replace')
print(text)  # "������" (each bad byte becomes U+FFFD)

# 3. Backslash escapes (note: 'xmlcharrefreplace' only works when encoding)
text = utf8_bytes.decode('ascii', errors='backslashreplace')
print(text)  # "\xed\x95\x9c\xea\xb8\x80"

# For XML/HTML numeric references, use xmlcharrefreplace on encode:
print("한글".encode('ascii', errors='xmlcharrefreplace'))
# b'&#54620;&#44544;'

Problem 3: Korean Character Corruption on Web

import requests

# ❌ Wrong method
response = requests.get('https://example.com/korean-page')
print(response.text)  # May be corrupted

# ✅ Check Content-Type header
response = requests.get('https://example.com/korean-page')
content_type = response.headers.get('Content-Type', '')
print(f"Content-Type: {content_type}")
# Content-Type: text/html; charset=euc-kr

# ✅ Decode with correct encoding
if 'euc-kr' in content_type.lower():
    text = response.content.decode('euc-kr')
else:
    text = response.text  # requests uses the header charset (or guesses if absent)

# ✅ Or auto-detect with chardet
import chardet
detected = chardet.detect(response.content)
text = response.content.decode(detected['encoding'])

Problem 4: CSV File Encoding

import csv

# ❌ CSV saved from Windows Excel (CP949)
with open('data.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)  # UnicodeDecodeError!

# ✅ Correct encoding
with open('data.csv', 'r', encoding='cp949') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

# ✅ Auto-detect encoding
import chardet

with open('data.csv', 'rb') as f:
    raw_data = f.read()
    detected = chardet.detect(raw_data)
    encoding = detected['encoding']

with open('data.csv', 'r', encoding=encoding) as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

10. Programming Language-specific Handling

Python

# Default encoding: UTF-8
text = "Hello 한글 😀"

# Encoding
utf8 = text.encode('utf-8')
utf16 = text.encode('utf-16')
# euckr = text.encode('euc-kr')  # UnicodeEncodeError: 😀 is not in EUC-KR

# Decoding
text = utf8.decode('utf-8')

# File I/O
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write(text)

with open('file.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Byte string literal
utf8_bytes = b'\xed\x95\x9c\xea\xb8\x80'
text = utf8_bytes.decode('utf-8')  # "한글"

JavaScript/Node.js

// JavaScript internal: UTF-16
const text = "Hello 한글 😀";

// String length (caution: surrogate pairs)
console.log(text.length);  // 11 (😀 counted as 2)

// Correct length
console.log([...text].length);  // 10

// UTF-8 encoding (Node.js)
const buffer = Buffer.from(text, 'utf-8');
console.log(buffer);  // <Buffer 48 65 6c 6c 6f 20 ...>

// Decoding
const decoded = buffer.toString('utf-8');
console.log(decoded);  // "Hello 한글 😀"

// Supported encodings
// utf-8, utf-16le, latin1, base64, hex, ascii

Java

// Java internal: UTF-16
// (assumes imports: java.nio.charset.StandardCharsets, java.nio.file.*, java.util.Arrays)
String text = "Hello 한글 😀";

// UTF-8 encoding
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(utf8Bytes));

// Decoding
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(decoded);

// File I/O
// Write as UTF-8
Files.writeString(
    Path.of("file.txt"), 
    text, 
    StandardCharsets.UTF_8
);

// Read as UTF-8
String content = Files.readString(
    Path.of("file.txt"), 
    StandardCharsets.UTF_8
);

C++

#include <iostream>
#include <fstream>
#include <string>
#include <codecvt>
#include <locale>

int main() {
    // UTF-8 string (C++11)
    std::string utf8_str = u8"Hello 한글 😀";
    
    // UTF-16 string
    std::u16string utf16_str = u"Hello 한글 😀";
    
    // UTF-32 string
    std::u32string utf32_str = U"Hello 한글 😀";
    
    // UTF-8 → UTF-16 conversion (note: std::codecvt is deprecated since C++17)
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    std::u16string utf16 = converter.from_bytes(utf8_str);
    
    // Write file (UTF-8)
    std::ofstream file("file.txt", std::ios::binary);
    file << utf8_str;
    file.close();
    
    // Read file
    std::ifstream input("file.txt", std::ios::binary);
    std::string content((std::istreambuf_iterator<char>(input)),
                        std::istreambuf_iterator<char>());
    
    std::cout << content << std::endl;
    
    return 0;
}

Go

package main

import (
    "fmt"
    "unicode/utf8"
    "golang.org/x/text/encoding/korean"
    "golang.org/x/text/transform"
    "io"
    "strings"
)

func main() {
    // Go internal: UTF-8
    text := "Hello 한글 😀"
    
    // Byte length vs character (rune) length
    fmt.Println("Bytes:", len(text))           // 17
    fmt.Println("Runes:", utf8.RuneCountInString(text))  // 10
    
    // UTF-8 → EUC-KR conversion
    encoder := korean.EUCKR.NewEncoder()
    euckrBytes, _, _ := transform.Bytes(encoder, []byte(text))
    fmt.Printf("EUC-KR: %x\n", euckrBytes)
    
    // EUC-KR → UTF-8 conversion
    decoder := korean.EUCKR.NewDecoder()
    utf8Text, _, _ := transform.String(decoder, string(euckrBytes))
    fmt.Println(utf8Text)
}

Advanced Topics

Normalization

import unicodedata

# Two ways to represent Korean '가'
# 1. Composed (NFC): U+AC00
nfc = "가"
print(f"NFC: {len(nfc)} chars, {nfc.encode('utf-8').hex()}")
# NFC: 1 chars, eab080

# 2. Decomposed (NFD): U+1100 + U+1161 (conjoining jamo ᄀ + ᅡ)
nfd = unicodedata.normalize('NFD', nfc)
print(f"NFD: {len(nfd)} chars, {nfd.encode('utf-8').hex()}")
# NFD: 2 chars, e18480e185a1

# Comparison
print(nfc == nfd)  # False (different byte sequence)

# Compare after normalization
print(unicodedata.normalize('NFC', nfc) == 
      unicodedata.normalize('NFC', nfd))  # True

Encoding Detection

import chardet

def detect_encoding(file_path):
    """Auto-detect file encoding"""
    with open(file_path, 'rb') as f:
        raw_data = f.read()
    
    result = chardet.detect(raw_data)
    
    return {
        'encoding': result['encoding'],
        'confidence': result['confidence'],
        'language': result.get('language', '')
    }

# Usage
info = detect_encoding('unknown.txt')
print(f"Encoding: {info['encoding']}")
print(f"Confidence: {info['confidence']*100:.1f}%")

# Read with correct encoding
with open('unknown.txt', 'r', encoding=info['encoding']) as f:
    content = f.read()

Encoding Conversion

def convert_file_encoding(input_file, output_file, from_enc, to_enc):
    """Convert file encoding"""
    # Read original
    with open(input_file, 'r', encoding=from_enc) as f:
        content = f.read()
    
    # Save with new encoding
    with open(output_file, 'w', encoding=to_enc) as f:
        f.write(content)
    
    print(f"✅ Converted: {from_enc} → {to_enc}")

# EUC-KR → UTF-8 conversion
convert_file_encoding('old.txt', 'new.txt', 'euc-kr', 'utf-8')

Encoding in Web Development

HTML

<!DOCTYPE html>
<html>
<head>
    <!-- ✅ UTF-8 declaration (required) -->
    <meta charset="UTF-8">
    <title>Korean Page</title>
</head>
<body>
    <h1>안녕하세요</h1>
</body>
</html>

HTTP Headers

from flask import Flask, Response

app = Flask(__name__)

@app.route('/korean')
def korean_page():
    content = "<h1>안녕하세요</h1>"
    
    # ✅ Specify charset in Content-Type
    return Response(
        content,
        mimetype='text/html; charset=utf-8'
    )

# ❌ Without charset, browser guesses (may corrupt)

JSON

import json

data = {"name": "홍길동", "message": "안녕하세요"}

# JSON is UTF-8 by default
json_str = json.dumps(data, ensure_ascii=False)
print(json_str)
# {"name": "홍길동", "message": "안녕하세요"}

# ensure_ascii=True (default)
json_str_ascii = json.dumps(data, ensure_ascii=True)
print(json_str_ascii)
# {"name": "\ud64d\uae38\ub3d9", "message": "\uc548\ub155\ud558\uc138\uc694"}

URL Encoding

from urllib.parse import quote, unquote

# URL with Korean
text = "한글 검색"

# URL encoding (UTF-8 based)
encoded = quote(text)
print(encoded)
# %ED%95%9C%EA%B8%80%20%EA%B2%80%EC%83%89

# URL decoding
decoded = unquote(encoded)
print(decoded)  # "한글 검색"

# Complete URL
url = f"https://example.com/search?q={encoded}"
print(url)
# https://example.com/search?q=%ED%95%9C%EA%B8%80%20%EA%B2%80%EC%83%89

Database Encoding

MySQL

-- Create database (UTF-8)
CREATE DATABASE mydb
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

-- utf8mb4: 4-byte UTF-8 (emoji support)
-- utf8: 3-byte UTF-8 (no emoji, deprecated)

-- Create table
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100) CHARACTER SET utf8mb4
);

-- Set encoding on connection
SET NAMES utf8mb4;

PostgreSQL

-- Create database
CREATE DATABASE mydb
ENCODING 'UTF8'
LC_COLLATE 'ko_KR.UTF-8'
LC_CTYPE 'ko_KR.UTF-8';

-- Check client encoding
SHOW client_encoding;

-- Change encoding
SET client_encoding TO 'UTF8';

Python + DB

import psycopg2

# PostgreSQL connection
conn = psycopg2.connect(
    host='localhost',
    database='mydb',
    user='user',
    password='pass',
    client_encoding='utf8'  # ✅ Explicit specification
)

cursor = conn.cursor()

# Insert Korean data
cursor.execute(
    "INSERT INTO users (name) VALUES (%s)",
    ("홍길동",)
)

# Query
cursor.execute("SELECT name FROM users")
name = cursor.fetchone()[0]
print(name)  # "홍길동"

Practical Tools

Command Line Tools

# 1. Check encoding with file command
file -i file.txt
# file.txt: text/plain; charset=utf-8

# 2. Convert encoding with iconv
iconv -f EUC-KR -t UTF-8 old.txt > new.txt

# 3. Batch convert multiple files
find . -name "*.txt" -exec iconv -f EUC-KR -t UTF-8 {} -o {}.utf8 \;

# 4. Check bytes with hexdump
echo "한글" | hexdump -C
# 00000000  ed 95 9c ea b8 80 0a

# 5. Remove BOM
tail -c +4 file_with_bom.txt > file_no_bom.txt  # UTF-8 BOM (3 bytes)

Python Script

#!/usr/bin/env python3
"""
Batch file encoding conversion tool
"""
import os
import sys
import chardet
from pathlib import Path

def convert_directory(directory, from_enc=None, to_enc='utf-8'):
    """Convert encoding of all text files in directory"""
    for file_path in Path(directory).rglob('*.txt'):
        try:
            # Read original
            with open(file_path, 'rb') as f:
                raw_data = f.read()
            
            # Detect encoding
            if from_enc is None:
                detected = chardet.detect(raw_data)
                source_enc = detected['encoding']
                confidence = detected['confidence']
                
                if confidence < 0.7:
                    print(f"⚠️  {file_path}: Low confidence ({confidence:.2f})")
                    continue
            else:
                source_enc = from_enc
            
            # Skip if already UTF-8
            if source_enc.lower().replace('-', '') == 'utf8':
                print(f"✓ {file_path}: Already UTF-8")
                continue
            
            # Convert
            text = raw_data.decode(source_enc)
            
            # Save
            with open(file_path, 'w', encoding=to_enc) as f:
                f.write(text)
            
            print(f"✅ {file_path}: {source_enc} → {to_enc}")
            
        except Exception as e:
            print(f"❌ {file_path}: {e}")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: python convert_encoding.py <directory>")
        sys.exit(1)
    
    convert_directory(sys.argv[1])

Encoding Comparison Table

Storage Space Comparison

text = "Hello 한글 😀"

encodings = {
    'ASCII (English only)': 'ascii',
    'UTF-8': 'utf-8',
    'UTF-16 LE': 'utf-16-le',
    'UTF-16 BE': 'utf-16-be',
    'UTF-32 LE': 'utf-32-le',
    'EUC-KR': 'euc-kr',
    'CP949': 'cp949',
}

print(f"Original text: {text}\n")
print(f"{'Encoding':20s} | {'Bytes':6s} | Hex")
print("-" * 60)

for name, enc in encodings.items():
    try:
        encoded = text.encode(enc)
        hex_str = encoded.hex()[:30] + ('...' if len(encoded) > 15 else '')
        print(f"{name:20s} | {len(encoded):4d}B | {hex_str}")
    except UnicodeEncodeError:
        print(f"{name:20s} | {'N/A':6s} | (Cannot encode)")

# Output:
# Original text: Hello 한글 😀
# 
# Encoding             | Bytes  | Hex
# ------------------------------------------------------------
# ASCII (English only) | N/A    | (Cannot encode)
# UTF-8                |   17B | 48656c6c6f20ed959ceab88020f09f...
# UTF-16 LE            |   20B | 480065006c006c006f0020005cd500...
# UTF-16 BE            |   20B | 00480065006c006c006f0020d55cae...
# UTF-32 LE            |   36B | 48000000650000006c0000006c0000...
# EUC-KR               | N/A    | (Cannot encode)
# CP949                | N/A    | (Cannot encode)

Feature Comparison

| Encoding | Bytes/Char | ASCII Compatible | Korean Efficiency | Emoji | Main Usage |
|----------|------------|------------------|-------------------|-------|------------|
| ASCII    | 1          | ✅               | ❌                | ❌    | English only |
| EUC-KR   | 1-2        | ✅               | ✅ (2 bytes)      | ❌    | Korean legacy |
| CP949    | 1-2        | ✅               | ✅ (2 bytes)      | ❌    | Windows Korean |
| UTF-8    | 1-4        | ✅               | △ (3 bytes)      | ✅    | Web, Linux, modern standard |
| UTF-16   | 2-4        | ❌               | ✅ (2 bytes)      | ✅    | Windows, Java internal |
| UTF-32   | 4          | ❌               | ❌ (4 bytes)      | ✅    | Internal processing |
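The ASCII-compatibility column can be verified directly: pure ASCII text produces byte-for-byte identical output under ASCII and UTF-8, while UTF-16 does not, and per-character costs match the table. A quick sketch:

```python
text = "Hello"

# ASCII and UTF-8 produce identical bytes for ASCII-only text
assert text.encode('ascii') == text.encode('utf-8') == b'Hello'

# UTF-16 adds a zero byte per ASCII character (plus a BOM with plain 'utf-16')
print(text.encode('utf-16-le'))  # b'H\x00e\x00l\x00l\x00o\x00'

# A Korean syllable costs 3 bytes in UTF-8 but 2 in UTF-16 and EUC-KR
for enc in ('utf-8', 'utf-16-le', 'euc-kr'):
    print(enc, len('한'.encode(enc)))
```

This is why UTF-8 won on the web: every existing ASCII file is already valid UTF-8.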

Real-World Scenarios

Scenario 1: Legacy System Integration

Legacy bank systems often still respond in EUC-KR. If `requests` guesses the wrong charset, `response.text` comes out corrupted; decode the raw bytes yourself or set the encoding explicitly:

# Problem: Bank API responds with EUC-KR
import requests

response = requests.get('http://legacy-bank-api.com/account')

# ❌ Auto-decode (assumes UTF-8)
# print(response.text)  # Corrupted

# ✅ Correct handling
content = response.content  # Bytes
text = content.decode('euc-kr')
print(text)

# ✅ Or provide hint to requests
response.encoding = 'euc-kr'
print(response.text)
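The corruption mechanism can be reproduced offline: EUC-KR bytes decoded with the wrong codec yield mojibake without raising any error, while the correct codec round-trips cleanly. A minimal sketch, no network required:

```python
original = "계좌 잔액"            # what a legacy Korean API might send
payload = original.encode('euc-kr')

# ❌ The wrong codec "succeeds" silently but produces mojibake
garbled = payload.decode('latin-1')
print(garbled)                    # unreadable accented characters, no exception

# ✅ The correct codec round-trips cleanly
assert payload.decode('euc-kr') == original
```

Note that the failure mode is silent: latin-1 can decode any byte sequence, so no exception warns you that the text is garbage.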

Scenario 2: Multilingual Application

For a multilingual application, first confirm what encoding stdout and the system locale use, then verify that UTF-8 handles text from several scripts:

import locale
import sys

def setup_encoding():
    """Setup system encoding"""
    # Check stdout encoding
    print(f"stdout encoding: {sys.stdout.encoding}")
    
    # System locale
    print(f"System locale: {locale.getpreferredencoding()}")
    
    # Force UTF-8 (Python 3.7+); encoding names may differ in case ('UTF-8')
    if sys.stdout.encoding.lower() != 'utf-8':
        sys.stdout.reconfigure(encoding='utf-8')

# Handle multilingual text
texts = {
    'en': "Hello",
    'ko': "안녕하세요",
    'ja': "こんにちは",
    'zh': "你好",
    'ar': "مرحبا",
    'ru': "Здравствуйте",
    'emoji': "👋🌍"
}

for lang, text in texts.items():
    utf8 = text.encode('utf-8')
    print(f"{lang:5s}: {text:15s} | {len(utf8):2d} bytes | {utf8.hex()[:30]}")

Scenario 3: File Upload Handling

Uploaded files arrive in whatever encoding the user's system produced. This Flask endpoint reads the file as bytes, detects the encoding with chardet, and normalizes the content to UTF-8:

from flask import Flask, request
import chardet

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload_file():
    file = request.files['file']
    
    # Read as binary
    content = file.read()
    
    # Detect encoding
    detected = chardet.detect(content)
    encoding = detected['encoding']
    confidence = detected['confidence']
    
    print(f"Detected: {encoding} ({confidence*100:.1f}%)")
    
    # Convert to UTF-8 (chardet returns None when detection fails entirely)
    if encoding is None:
        return {'status': 'error', 'message': 'Could not detect encoding'}, 400
    if encoding.lower() != 'utf-8':
        try:
            text = content.decode(encoding)
            utf8_content = text.encode('utf-8')
            
            return {
                'status': 'converted',
                'from': encoding,
                'to': 'utf-8',
                'content': text
            }
        except Exception as e:
            return {'status': 'error', 'message': str(e)}, 400
    
    return {
        'status': 'ok',
        'encoding': 'utf-8',
        'content': content.decode('utf-8')
    }

Best Practices

1. Always Use UTF-8

Declare UTF-8 explicitly at every layer of the stack:

# ✅ File I/O
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write("한글")

# ✅ Source code encoding declaration (Python 2)
# -*- coding: utf-8 -*-

# ✅ HTML
# <meta charset="UTF-8">

# ✅ HTTP header
# Content-Type: text/html; charset=utf-8

# ✅ Database
# CREATE DATABASE mydb CHARACTER SET utf8mb4;

2. Read in Binary Mode and Decode Explicitly

Reading in text mode without an `encoding` argument silently uses the platform default, which differs between machines. Read bytes and decode explicitly:

# ✅ Safe method
with open('file.txt', 'rb') as f:
    raw_data = f.read()

# Decode after checking encoding
text = raw_data.decode('utf-8')

# ❌ Risky method (uses system default encoding)
with open('file.txt', 'r') as f:  # encoding not specified
    text = f.read()

3. Error Handling

When the source encoding is unknown, try the most likely candidates in order and only then fall back to lossy decoding:

# ✅ Error handling strategy
def safe_decode(data, encodings=['utf-8', 'cp949', 'euc-kr', 'latin-1']):
    """Try multiple encodings"""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    
    # If all fail, decode ignoring errors
    return data.decode('utf-8', errors='replace'), 'utf-8'

# Usage
with open('unknown.txt', 'rb') as f:
    data = f.read()

text, encoding = safe_decode(data)
print(f"Decoded as {encoding}: {text}")

4. BOM Handling

The `utf-8-sig` codec strips a leading BOM transparently on read; when writing, prefer plain `utf-8` so no BOM is added:

# ✅ Auto-handle UTF-8 BOM
with open('file.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()  # Automatically removes BOM if present

# ✅ Save without BOM (recommended)
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write(text)

# ❌ Save with BOM (avoid)
with open('file.txt', 'w', encoding='utf-8-sig') as f:
    f.write(text)
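Concretely, `utf-8-sig` prepends the three BOM bytes EF BB BF on encode and strips them on decode; a quick check of what it actually does:

```python
# The utf-8-sig codec adds the BOM on encode and removes it on decode
with_bom = "한글".encode('utf-8-sig')
print(with_bom[:3].hex())  # efbbbf

assert with_bom == b'\xef\xbb\xbf' + "한글".encode('utf-8')
assert with_bom.decode('utf-8-sig') == "한글"

# Plain utf-8 decode keeps the BOM as U+FEFF, the usual source of
# an invisible stray character at the start of a file
assert with_bom.decode('utf-8')[0] == '\ufeff'
```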

Problem Solving Checklist

When Korean Characters Are Corrupted

Work through it in three steps: detect the file's actual encoding, read with that encoding, then re-save as UTF-8:

# 1. Check file encoding
import chardet

with open('file.txt', 'rb') as f:
    result = chardet.detect(f.read())
    print(result)

# 2. Read with correct encoding
with open('file.txt', 'r', encoding='cp949') as f:
    text = f.read()

# 3. Re-save as UTF-8
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write(text)

When Korean Characters Are Corrupted on Web

Check which encoding `requests` guessed from the response headers, then override it or decode the raw bytes explicitly:

# 1. Check HTTP header
import requests

response = requests.get('https://example.com')
print(response.encoding)  # ISO-8859-1 (wrong guess)

# 2. Set correct encoding
response.encoding = 'utf-8'
print(response.text)

# 3. Check Content-Type header
print(response.headers.get('Content-Type'))
# text/html; charset=euc-kr

# 4. Explicit decoding
text = response.content.decode('euc-kr')

When Korean Characters Are Corrupted in Database

Specify `utf8mb4` on the connection, verify the table's character set, and convert it if necessary:

# 1. Check connection encoding
import pymysql

conn = pymysql.connect(
    host='localhost',
    user='user',
    password='pass',
    database='mydb',
    charset='utf8mb4'  # ✅ Explicit specification
)

# 2. Check table encoding
cursor = conn.cursor()
cursor.execute("SHOW CREATE TABLE users")
print(cursor.fetchone())

# 3. Convert encoding
# ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4;

Summary

Encoding Selection Guide

The decision flow for choosing an encoding:

flowchart TD
    Start[Start new project] --> Q1{Language?}
    
    Q1 -->|English only| ASCII["ASCII\nor UTF-8"]
    Q1 -->|Multilingual| UTF8["✅ UTF-8\nRecommended"]
    Q1 -->|Legacy integration| Q2{System?}
    
    Q2 -->|Windows Korean| CP949[CP949]
    Q2 -->|Unix Korean| EUCKR[EUC-KR]
    Q2 -->|Japanese| SJIS[Shift-JIS]
    
    UTF8 --> Best["✅ Best choice\n- Web standard\n- All characters supported\n- ASCII compatible"]
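The same decision flow can be sketched as a small helper (the function name and argument values are illustrative, not a library API):

```python
def choose_encoding(language, legacy_system=None):
    """Mirror the flowchart: UTF-8 unless a legacy system dictates otherwise."""
    if legacy_system == 'windows-korean':
        return 'cp949'
    if legacy_system == 'unix-korean':
        return 'euc-kr'
    if legacy_system == 'japanese':
        return 'shift-jis'
    # English-only or multilingual: UTF-8 covers both
    return 'utf-8'

print(choose_encoding('multilingual'))              # utf-8
print(choose_encoding('korean', 'windows-korean'))  # cp949
```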

Core Principles

The core principles condensed into code:

# 1. Always use UTF-8
encoding = 'utf-8'

# 2. Specify encoding
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write(text)

# 3. Binary mode + explicit decoding
with open('file.txt', 'rb') as f:
    data = f.read()
text = data.decode('utf-8')

# 4. Error handling
try:
    text = data.decode('utf-8')
except UnicodeDecodeError:
    text = data.decode('utf-8', errors='replace')

# 5. Test
assert "한글 😀".encode('utf-8').decode('utf-8') == "한글 😀"

Encoding Summary

| Encoding | Bytes | Advantages | Disadvantages | When to Use |
|----------|-------|------------|---------------|-------------|
| UTF-8    | 1-4   | Web standard, ASCII compatible | Korean takes 3 bytes | All new projects |
| UTF-16   | 2-4   | Korean in 2 bytes | ASCII incompatible | Windows/Java internal |
| UTF-32   | 4     | Fixed length | Wastes space | Internal processing |
| EUC-KR   | 1-2   | Korean in 2 bytes | Some Korean unsupported | Legacy systems |
| CP949    | 1-2   | All Korean supported | Windows only | Windows Korean |

Debugging Tools

Python Encoding Debugger

A small diagnostic that checks for a BOM, runs chardet, attempts several decodings, and prints a hex dump of the file's first bytes:

import chardet

def analyze_encoding(file_path):
    """Detailed file encoding analysis"""
    with open(file_path, 'rb') as f:
        raw_data = f.read()
    
    print(f"📄 File: {file_path}")
    print(f"📊 Size: {len(raw_data)} bytes\n")
    
    # Check BOM (test UTF-32 LE before UTF-16 LE: its BOM FF FE 00 00
    # begins with the UTF-16 LE BOM FF FE)
    if raw_data.startswith(b'\xef\xbb\xbf'):
        print("🔖 BOM: UTF-8")
    elif raw_data.startswith(b'\xff\xfe\x00\x00'):
        print("🔖 BOM: UTF-32 LE")
    elif raw_data.startswith(b'\xff\xfe'):
        print("🔖 BOM: UTF-16 LE")
    elif raw_data.startswith(b'\xfe\xff'):
        print("🔖 BOM: UTF-16 BE")
    else:
        print("🔖 BOM: None")
    
    # Detect encoding
    detected = chardet.detect(raw_data)
    print(f"\n🔍 Detected encoding: {detected['encoding']}")
    print(f"📈 Confidence: {detected['confidence']*100:.1f}%")
    
    # Try multiple encodings
    print("\n🧪 Decoding test:")
    encodings = ['utf-8', 'cp949', 'euc-kr', 'utf-16', 'latin-1']
    
    for enc in encodings:
        try:
            text = raw_data.decode(enc)
            preview = text[:50].replace('\n', '\\n')
            print(f"  ✅ {enc:10s}: {preview}")
        except UnicodeDecodeError as e:
            print(f"  ❌ {enc:10s}: {e}")
    
    # Hex dump (first 100 bytes)
    print(f"\n🔢 Hex Dump (first 100 bytes):")
    for i in range(0, min(100, len(raw_data)), 16):
        hex_str = ' '.join(f'{b:02x}' for b in raw_data[i:i+16])
        ascii_str = ''.join(chr(b) if 32 <= b < 127 else '.' for b in raw_data[i:i+16])
        print(f"  {i:04x}: {hex_str:48s} | {ascii_str}")

# Usage
analyze_encoding('mystery.txt')

One-line Summary: Use UTF-8 for all new projects, consider EUC-KR/CP949 only for legacy system integration, and always explicitly specify encoding to prevent Korean character corruption.
