Complete Character Encoding Guide | ASCII, UTF-8, UTF-16, EUC-KR
이 글의 핵심
The principles and differences of the major character encodings: ASCII, ANSI, Unicode, UTF-8, UTF-16, UTF-32, EUC-KR, and CP949. A complete guide, from fixing corrupted Korean text to BOM and endianness, with practical examples.
Introduction: Why Should You Know Character Encoding?
When developing, you experience Korean characters getting corrupted, files being unreadable, or API responses appearing strange. The root cause of all these problems is character encoding.
What This Article Covers:
- History of ASCII, ANSI, Unicode
- UTF-8, UTF-16, UTF-32 encoding methods
- Korean encoding (EUC-KR, CP949)
- BOM, Endian, encoding detection
- Practical problem solving
Reality in Practice
When learning development, everything seems clean and theoretical. But practice is different. You wrestle with legacy code, chase tight deadlines, and face unexpected bugs. The content covered in this article was initially learned as theory, but it was through applying it to actual projects that I realized “Ah, this is why it’s designed this way.”
What stands out in my memory is the trial and error from my first project. I did everything by the book but couldn’t figure out why it wasn’t working, spending days struggling. Eventually, through a senior developer’s code review, I discovered the problem and learned a lot in the process. In this article, I’ll cover not just theory but also the pitfalls you might encounter in practice and how to solve them.
Table of Contents
- History of Character Encoding
- ASCII: 7-bit Character Set
- ANSI and Code Pages
- Unicode: Global Character Integration
- UTF-8: Variable Length Encoding
- UTF-16 and UTF-32
- Korean Encoding (EUC-KR, CP949)
- BOM and Endian
- Practical Problem Solving
- Programming Language-specific Handling
1. History of Character Encoding
Timeline
timeline
title Character Encoding Evolution
1963 : ASCII established\n7-bit, 128 chars
1987 : ISO-8859-1 (Latin-1)\n8-bit, 256 chars
1991 : Unicode 1.0\n16-bit unified charset
1992 : UTF-8 invented\nVariable length encoding
1996 : UTF-16\nSurrogate pairs
2003 : UTF-8 web standardization
2008 : UTF-8 most used\non the web
2020s : UTF-8 ~98% share\non the web
Why Do Multiple Encodings Exist?
flowchart TB
Problem["Problem: Computers\nonly understand numbers"]
ASCII["ASCII\n128 English chars"]
Extended["Extended ASCII\n256 chars per language"]
Unicode["Unicode\nGlobal character integration"]
Problem --> ASCII
ASCII --> Extended
Extended --> Unicode
ASCII --> Issue1["Problem: Cannot express\nKorean, Chinese"]
Extended --> Issue2["Problem: Different\ncode pages per country"]
Unicode --> Solution["Solution: Assign unique\nnumber to all characters"]
2. ASCII: 7-bit Character Set
What is ASCII?
ASCII (American Standard Code for Information Interchange) represents English alphabet, numbers, and special characters with 7 bits (0-127).
ASCII Table
Dec Hex Char | Dec Hex Char | Dec Hex Char
-------------------------------------------------
32 20 Space | 64 40 @ | 96 60 `
33 21 ! | 65 41 A | 97 61 a
34 22 " | 66 42 B | 98 62 b
35 23 # | 67 43 C | 99 63 c
...
48 30 0 | 80 50 P | 112 70 p
49 31 1 | 81 51 Q | 113 71 q
...
57 39 9 | 90 5A Z | 122 7A z
ASCII Control Characters
# Main control characters
NUL = 0x00 # Null
LF = 0x0A # Line Feed (\n)
CR = 0x0D # Carriage Return (\r)
ESC = 0x1B # Escape
DEL = 0x7F # Delete
# Line break methods
# Unix/Linux: LF (\n)
# Windows: CR+LF (\r\n)
# Mac (Classic): CR (\r)
ASCII Examples
# Character → Code
ord('A') # 65
ord('a') # 97
ord('0') # 48
# Code → Character
chr(65) # 'A'
chr(97) # 'a'
# Check ASCII range
def is_ascii(text):
return all(ord(c) < 128 for c in text)
is_ascii("Hello") # True
is_ascii("안녕") # False
3. ANSI and Code Pages
What is ANSI?
What is commonly called “ANSI” (really a family of Windows code pages) extends ASCII to 8 bits (0-255) to support each country’s language. However, what the 128-255 range means differs per Code Page.
Major Code Pages
| Code Page | Name | Region | Features |
|---|---|---|---|
| CP437 | OEM-US | USA | DOS default |
| CP850 | Latin-1 | Western Europe | DOS multilingual |
| CP949 | Extended Wansung (UHC) | Korea | Windows Korean |
| CP932 | Shift-JIS | Japan | Windows Japanese |
| CP936 | GBK | China | Windows Chinese |
| ISO-8859-1 | Latin-1 | Western Europe | Unix/Web |
| ISO-8859-15 | Latin-9 | Western Europe | Euro (€) added |
Code Page Problems
# Same byte value, different meaning
data = bytes([0xC7])
# CP949 (Korean): 0xC7 is only a lead byte, so one byte alone cannot be decoded
try:
    data.decode('cp949')
except UnicodeDecodeError:
    print("❌ CP949 needs a second byte")
# ISO-8859-1 (Latin-1): 'Ç'
print(data.decode('latin-1')) # 'Ç'
# Reading the same bytes with a different encoding corrupts the text!
4. Unicode: Global Character Integration
What is Unicode?
Unicode is a character set that assigns unique Code Points to all characters worldwide.
Unicode Structure
U+0000 ~ U+10FFFF (1,114,112 code points)
U+0000 ~ U+007F : ASCII (128 chars)
U+0080 ~ U+00FF : Latin-1 Supplement
U+0100 ~ U+017F : Latin Extended-A
U+0370 ~ U+03FF : Greek
U+0400 ~ U+04FF : Cyrillic
U+0600 ~ U+06FF : Arabic
U+0E00 ~ U+0E7F : Thai
U+3040 ~ U+309F : Hiragana (Japanese)
U+30A0 ~ U+30FF : Katakana (Japanese)
U+4E00 ~ U+9FFF : CJK Unified Ideographs (Chinese/Japanese/Korean)
U+AC00 ~ U+D7AF : Hangul Syllables (Korean 11,172 chars)
U+1F600 ~ U+1F64F : Emoticons (Emoji)
Korean Unicode Range
# Korean syllables (가-힣)
print(f"가: U+{ord('가'):04X}") # U+AC00
print(f"힣: U+{ord('힣'):04X}") # U+D7A3
# Korean letters (ㄱ-ㅎ, ㅏ-ㅣ)
print(f"ㄱ: U+{ord('ㄱ'):04X}") # U+3131
print(f"ㅎ: U+{ord('ㅎ'):04X}") # U+314E
print(f"ㅏ: U+{ord('ㅏ'):04X}") # U+314F
print(f"ㅣ: U+{ord('ㅣ'):04X}") # U+3163
# Emoji
print(f"😀: U+{ord('😀'):04X}") # U+1F600
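The Hangul syllable block is arithmetically ordered: each syllable sits at 0xAC00 + (lead × 21 + vowel) × 28 + tail. A minimal sketch that decomposes a syllable into its jamo using this formula (displayed as compatibility jamo for readability):

```python
# Jamo tables in Unicode order (compatibility jamo, for display only)
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 lead consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # no tail + 27 tails

def decompose(syllable):
    """Split a precomposed syllable (U+AC00..U+D7A3) into lead, vowel, tail."""
    offset = ord(syllable) - 0xAC00
    lead, rest = divmod(offset, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return LEADS[lead], VOWELS[vowel], TAILS[tail]

print(decompose('한'))  # ('ㅎ', 'ㅏ', 'ㄴ')
```

Running it the other way (0xAC00 plus the recombined offset) yields the original syllable, which is exactly what NFC normalization does.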
Unicode vs Encoding
Unicode: Character Set
Assigns number (code point) to each character
Example: '한' = U+D55C
UTF-8/UTF-16/UTF-32: Encoding
Method to convert code points to bytes
Example: U+D55C → UTF-8: ED 95 9C (3 bytes)
→ UTF-16: D5 5C (2 bytes)
5. UTF-8: Variable Length Encoding
What is UTF-8?
UTF-8 encodes Unicode with 1-4 byte variable length. It’s the web standard and perfectly compatible with ASCII.
UTF-8 Encoding Rules
Code Point Range | Bytes | Encoding Pattern
U+0000 ~ U+007F | 1 | 0xxxxxxx
U+0080 ~ U+07FF | 2 | 110xxxxx 10xxxxxx
U+0800 ~ U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx
U+10000 ~ U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
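The table above translates directly into code. A minimal sketch of a single-code-point UTF-8 encoder (in practice you would simply call `str.encode`, but writing it out shows where each bit goes):

```python
def utf8_encode(cp):
    """Encode one Unicode code point to UTF-8 bytes, following the table above."""
    if cp < 0x80:        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

print(utf8_encode(ord('한')).hex())  # ed959c — same as '한'.encode('utf-8')
```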
UTF-8 Encoding Examples
English ‘A’ (U+0041)
Code Point: U+0041 (65)
Binary: 0100 0001
UTF-8 Encoding:
0100 0001 = 0x41 (1 byte)
Memory: 41
Korean ‘한’ (U+D55C)
Code Point: U+D55C (54,620)
Binary: 1101 0101 0101 1100
UTF-8 Encoding (3 bytes):
1110xxxx 10xxxxxx 10xxxxxx
1110 1101 10 010101 10 011100
E D 9 5 9 C
Memory: ED 95 9C
Emoji ’😀’ (U+1F600)
Code Point: U+1F600 (128,512)
Binary: 0001 1111 0110 0000 0000
UTF-8 Encoding (4 bytes):
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
11110 000 10 011111 10 011000 10 000000
F 0 9 F 9 8 8 0
Memory: F0 9F 98 80
UTF-8 Advantages
flowchart TB
UTF8[UTF-8]
Adv1["✅ ASCII compatible\nEnglish is 1 byte"]
Adv2["✅ Self-synchronizing\nCan read from middle"]
Adv3["✅ Byte order independent\nNo Endian issues"]
Adv4["✅ Web standard\n98% market share"]
UTF8 --> Adv1
UTF8 --> Adv2
UTF8 --> Adv3
UTF8 --> Adv4
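The self-synchronizing property comes from the fact that continuation bytes always match the pattern 10xxxxxx, so you can always tell mid-character bytes apart from start bytes. A small sketch that backs up from an arbitrary byte offset to the start of the character containing it:

```python
def char_start(data, i):
    """Back up from byte offset i to the first byte of the character it falls in.
    Continuation bytes look like 10xxxxxx (top two bits == 10), so skip them."""
    while i > 0 and data[i] & 0xC0 == 0x80:
        i -= 1
    return i

data = "한글".encode('utf-8')   # ed 95 9c ea b8 80
print(char_start(data, 4))     # 3 — byte 4 (0xb8) belongs to '글', which starts at offset 3
```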
UTF-8 Encoding with Python
# String → Bytes
text = "Hello 한글 😀"
# UTF-8 encoding
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
# b'Hello \xed\x95\x9c\xea\xb8\x80 \xf0\x9f\x98\x80'
# Byte analysis
for i, byte in enumerate(utf8_bytes):
print(f"{i:2d}: 0x{byte:02X} ({byte:3d}) {chr(byte) if byte < 128 else '?'}")
# Output:
# 0: 0x48 ( 72) H
# 1: 0x65 (101) e
# 2: 0x6C (108) l
# 3: 0x6C (108) l
# 4: 0x6F (111) o
# 5: 0x20 ( 32)
# 6: 0xED (237) ? ← '한' start
# 7: 0x95 (149) ?
# 8: 0x9C (156) ?
# 9: 0xEA (234) ? ← '글' start
# 10: 0xB8 (184) ?
# 11: 0x80 (128) ?
# 12: 0x20 ( 32)
# 13: 0xF0 (240) ? ← '😀' start
# 14: 0x9F (159) ?
# 15: 0x98 (152) ?
# 16: 0x80 (128) ?
# Bytes → String
decoded = utf8_bytes.decode('utf-8')
print(decoded) # "Hello 한글 😀"
6. UTF-16 and UTF-32
UTF-16
UTF-16 encodes with 2 or 4 bytes. Used internally in Windows, Java, and JavaScript.
UTF-16 Encoding Rules
Code Point Range | Bytes | Method
U+0000 ~ U+FFFF | 2 | Direct encoding
U+10000 ~ U+10FFFF | 4 | Surrogate pair
Surrogate Pair
# Encode emoji '😀' (U+1F600) to UTF-16
# 1. U+1F600 - 0x10000 = 0xF600
# 2. High 10 bits: 0x3D (61)
# 3. Low 10 bits: 0x200 (512)
# 4. High Surrogate: 0xD800 + 0x3D = 0xD83D
# 5. Low Surrogate: 0xDC00 + 0x200 = 0xDE00
text = "😀"
utf16_bytes = text.encode('utf-16-le')
print(utf16_bytes.hex()) # '3dd800de' (Little-Endian, bytes of D83D DE00 swapped)
# UTF-16 BE (Big-Endian)
utf16_be = text.encode('utf-16-be')
print(utf16_be.hex()) # 'd83dde00'
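The reverse computation, combining a surrogate pair back into a code point, is just the steps above run backwards. A minimal sketch:

```python
def from_surrogates(high, low):
    """Combine a UTF-16 surrogate pair back into a single code point."""
    # Undo the offsets, shift the high 10 bits back, and re-add 0x10000
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(from_surrogates(0xD83D, 0xDE00)))  # 0x1f600 — back to '😀'
```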
UTF-16 Example
text = "Hello 한글"
# UTF-16 LE (Little-Endian)
utf16_le = text.encode('utf-16-le')
print(utf16_le.hex())
# 48 00 65 00 6c 00 6c 00 6f 00 20 00 5c d5 00 ae
# UTF-16 BE (Big-Endian)
utf16_be = text.encode('utf-16-be')
print(utf16_be.hex())
# 00 48 00 65 00 6c 00 6c 00 6f 00 20 d5 5c ae 00
UTF-32
UTF-32 encodes all characters with fixed 4-byte length.
text = "A한😀"
# UTF-32 LE
utf32 = text.encode('utf-32-le')
print(utf32.hex())
# 41 00 00 00 5c d5 00 00 00 f6 01 00
# Each character is exactly 4 bytes
# 'A': 0x00000041
# '한': 0x0000D55C
# '😀': 0x0001F600
Encoding Comparison
text = "Hello 한글 😀"
encodings = ['utf-8', 'utf-16-le', 'utf-16-be', 'utf-32-le']
for enc in encodings:
encoded = text.encode(enc)
print(f"{enc:12s}: {len(encoded):2d} bytes | {encoded.hex()[:40]}...")
# Output:
# utf-8       : 17 bytes | 48656c6c6f20ed959ceab88020f09f9880...
# utf-16-le   : 22 bytes | 480065006c006c006f0020005cd500ae20003dd8...
# utf-16-be   : 22 bytes | 00480065006c006c006f0020d55cae000020d83d...
# utf-32-le   : 40 bytes | 41000000650000006c0000006c0000006f000000...
7. Korean Encoding (EUC-KR, CP949)
Korean Encoding History
timeline
title Korean Encoding Evolution
1987 : KS X 1001\nWansung, 2,350 syllables
1992 : EUC-KR\nWansung-based standard
1996 : CP949 (MS)\nExtended Wansung, 11,172 syllables
2000s : UTF-8\nUnicode based
EUC-KR
EUC-KR represents the 2,350 precomposed Hangul syllables of KS X 1001 with 2 bytes each.
# EUC-KR encoding
text = "한글"
euckr_bytes = text.encode('euc-kr')
print(euckr_bytes.hex()) # c7d1 b1db
# '한': 0xC7D1
# '글': 0xB1DB
# Problem: Characters like '똠', '쀍' cannot be represented
try:
"똠".encode('euc-kr')
except UnicodeEncodeError as e:
print(f"❌ Cannot encode to EUC-KR: {e}")
CP949 (Extended Wansung)
CP949 is Microsoft’s superset of EUC-KR that covers all 11,172 modern Hangul syllables.
# CP949 encoding
text = "똠방각하"
cp949_bytes = text.encode('cp949')
print(cp949_bytes.hex())
# Can represent characters not in EUC-KR
text2 = "쀍똠뙠"
print(text2.encode('cp949').hex())
UTF-8 vs EUC-KR Comparison
text = "Hello 한글"
# UTF-8: English 1 byte, Korean 3 bytes
utf8 = text.encode('utf-8')
print(f"UTF-8: {len(utf8)} bytes | {utf8.hex()}")
# UTF-8: 12 bytes | 48656c6c6f20ed959ceab880
# EUC-KR: English 1 byte, Korean 2 bytes
euckr = text.encode('euc-kr')
print(f"EUC-KR: {len(euckr)} bytes | {euckr.hex()}")
# EUC-KR: 10 bytes | 48656c6c6f20c7d1b1db
8. BOM and Endian
BOM (Byte Order Mark)
BOM is a special byte sequence at the start of a file that indicates the encoding and byte order.
Encoding | BOM (hex) | Size
UTF-8 | EF BB BF | 3 bytes
UTF-16 LE | FF FE | 2 bytes
UTF-16 BE | FE FF | 2 bytes
UTF-32 LE | FF FE 00 00 | 4 bytes
UTF-32 BE | 00 00 FE FF | 4 bytes
BOM Example
# UTF-8 with BOM
text = "Hello"
with open('file_with_bom.txt', 'wb') as f:
f.write(b'\xef\xbb\xbf') # BOM
f.write(text.encode('utf-8'))
# File content (hex):
# EF BB BF 48 65 6C 6C 6F
# ^^^^^^^^ BOM
# ^^^^^^^^^^^^^^ "Hello"
# UTF-8 without BOM (recommended)
with open('file_no_bom.txt', 'wb') as f:
f.write(text.encode('utf-8'))
# File content (hex):
# 48 65 6C 6C 6F
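Python also ships a `utf-8-sig` codec that adds the BOM when encoding and strips it when decoding, so the BOM rarely needs to be handled by hand:

```python
# 'utf-8-sig' prepends the BOM on encode and strips it on decode
data = "Hello".encode('utf-8-sig')
print(data.hex())  # efbbbf48656c6c6f — BOM followed by "Hello"

print(data.decode('utf-8-sig'))    # Hello  (BOM removed; also works if no BOM is present)
print(repr(data.decode('utf-8')))  # '\ufeffHello' (plain utf-8 keeps the BOM as a character)
```

Opening files with `encoding='utf-8-sig'` behaves the same way, which makes it a safe default for reading text that may or may not start with a BOM.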
BOM Detection and Removal
def detect_and_remove_bom(data):
"""Detect and remove BOM"""
bom_signatures = [
(b'\xef\xbb\xbf', 'utf-8-sig'),
(b'\xff\xfe\x00\x00', 'utf-32-le'),
(b'\x00\x00\xfe\xff', 'utf-32-be'),
(b'\xff\xfe', 'utf-16-le'),
(b'\xfe\xff', 'utf-16-be'),
]
for bom, encoding in bom_signatures:
if data.startswith(bom):
return data[len(bom):], encoding
return data, None
# Usage
with open('file.txt', 'rb') as f:
data = f.read()
data, encoding = detect_and_remove_bom(data)
if encoding:
print(f"✅ BOM detected: {encoding}")
text = data.decode(encoding.replace('-sig', ''))
else:
print("ℹ️ No BOM, assuming UTF-8")
text = data.decode('utf-8')
Endian (Byte Order)
# Big-Endian: Large byte first
# Little-Endian: Small byte first
# Example: Store 0x1234 in memory
# Big-Endian: 12 34
# Little-Endian: 34 12
# Important in UTF-16
text = "한" # U+D55C
# UTF-16 BE (Big-Endian)
be = text.encode('utf-16-be')
print(be.hex()) # d5 5c
# UTF-16 LE (Little-Endian)
le = text.encode('utf-16-le')
print(le.hex()) # 5c d5
# UTF-8 is byte-based, so Endian independent
utf8 = text.encode('utf-8')
print(utf8.hex()) # ed 95 9c (always same)
9. Practical Problem Solving
Problem 1: Korean Character Corruption (���)
Cause
# ❌ Saved as UTF-8 but read as EUC-KR
with open('file.txt', 'w', encoding='utf-8') as f:
f.write("한글")
# Incorrect reading
with open('file.txt', 'r', encoding='euc-kr') as f:
text = f.read()
print(text) # '���' (corrupted)
Solution
# ✅ Read with correct encoding
with open('file.txt', 'r', encoding='utf-8') as f:
text = f.read()
print(text) # '한글' (correct)
# ✅ Auto-detect encoding
import chardet
with open('file.txt', 'rb') as f:
raw_data = f.read()
result = chardet.detect(raw_data)
encoding = result['encoding']
confidence = result['confidence']
print(f"Detected: {encoding} ({confidence*100:.1f}% confidence)")
text = raw_data.decode(encoding)
print(text)
Problem 2: UnicodeDecodeError
# ❌ Decode with wrong encoding
utf8_bytes = "한글".encode('utf-8')
try:
text = utf8_bytes.decode('ascii')
except UnicodeDecodeError as e:
print(f"❌ {e}")
# 'ascii' codec can't decode byte 0xed in position 0
# ✅ Error handling options
# 1. Ignore
text = utf8_bytes.decode('ascii', errors='ignore')
print(text) # "" (Korean removed)
# 2. Replace
text = utf8_bytes.decode('ascii', errors='replace')
print(text) # "������" (each byte replaced with U+FFFD '�')
# 3. Backslash escapes ('xmlcharrefreplace' works only when encoding, not decoding)
text = utf8_bytes.decode('ascii', errors='backslashreplace')
print(text) # "\xed\x95\x9c\xea\xb8\x80" (original byte values kept visible)
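A fourth handler worth knowing is `errors='surrogateescape'`, which Python itself uses for OS data such as filenames: undecodable bytes become lone surrogates, so the exact original bytes can be recovered on re-encode:

```python
raw = b'Hello \xc7\xd1'  # ASCII text followed by two non-ASCII bytes

# Undecodable bytes map to lone surrogates U+DC80..U+DCFF instead of raising
text = raw.decode('ascii', errors='surrogateescape')
print(repr(text))  # 'Hello \udcc7\udcd1'

# Re-encoding with the same handler restores the exact original bytes
round_trip = text.encode('ascii', errors='surrogateescape')
print(round_trip == raw)  # True
```

This makes it possible to pass byte data through string-based APIs losslessly, at the cost of producing strings that cannot be printed or encoded to UTF-8 without the same handler.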
Problem 3: Korean Character Corruption on Web
import requests
# ❌ Wrong method
response = requests.get('https://example.com/korean-page')
print(response.text) # May be corrupted
# ✅ Check Content-Type header
response = requests.get('https://example.com/korean-page')
content_type = response.headers.get('Content-Type', '')
print(f"Content-Type: {content_type}")
# Content-Type: text/html; charset=euc-kr
# ✅ Decode with correct encoding
if 'euc-kr' in content_type.lower():
text = response.content.decode('euc-kr')
else:
text = response.text # requests auto-detects
# ✅ Or auto-detect with chardet
import chardet
detected = chardet.detect(response.content)
text = response.content.decode(detected['encoding'])
Problem 4: CSV File Encoding
import csv
# ❌ CSV saved from Windows Excel (CP949)
with open('data.csv', 'r', encoding='utf-8') as f:
reader = csv.reader(f)
for row in reader:
print(row) # UnicodeDecodeError!
# ✅ Correct encoding
with open('data.csv', 'r', encoding='cp949') as f:
reader = csv.reader(f)
for row in reader:
print(row)
# ✅ Auto-detect encoding
import chardet
with open('data.csv', 'rb') as f:
raw_data = f.read()
detected = chardet.detect(raw_data)
encoding = detected['encoding']
with open('data.csv', 'r', encoding=encoding) as f:
reader = csv.reader(f)
for row in reader:
print(row)
10. Programming Language-specific Handling
Python
# Default encoding: UTF-8
text = "Hello 한글 😀"
# Encoding
utf8 = text.encode('utf-8')
utf16 = text.encode('utf-16')
# text.encode('euc-kr')  # raises UnicodeEncodeError (emoji not representable in EUC-KR)
# Decoding
text = utf8.decode('utf-8')
# File I/O
with open('file.txt', 'w', encoding='utf-8') as f:
f.write(text)
with open('file.txt', 'r', encoding='utf-8') as f:
text = f.read()
# Byte string literal
utf8_bytes = b'\xed\x95\x9c\xea\xb8\x80'
text = utf8_bytes.decode('utf-8') # "한글"
JavaScript/Node.js
// JavaScript internal: UTF-16
const text = "Hello 한글 😀";
// String length (caution: surrogate pairs)
console.log(text.length); // 11 (😀 counted as 2)
// Correct length
console.log([...text].length); // 10
// UTF-8 encoding (Node.js)
const buffer = Buffer.from(text, 'utf-8');
console.log(buffer); // <Buffer 48 65 6c 6c 6f 20 ...>
// Decoding
const decoded = buffer.toString('utf-8');
console.log(decoded); // "Hello 한글 😀"
// Supported encodings
// utf-8, utf-16le, latin1, base64, hex, ascii
Java
// Java internal: UTF-16
String text = "Hello 한글 😀";
// UTF-8 encoding
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(utf8Bytes));
// Decoding
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(decoded);
// File I/O
// Write as UTF-8
Files.writeString(
Path.of("file.txt"),
text,
StandardCharsets.UTF_8
);
// Read as UTF-8
String content = Files.readString(
Path.of("file.txt"),
StandardCharsets.UTF_8
);
C++
#include <iostream>
#include <fstream>
#include <string>
#include <codecvt>
#include <locale>
int main() {
// UTF-8 string (C++11)
std::string utf8_str = u8"Hello 한글 😀";
// UTF-16 string
std::u16string utf16_str = u"Hello 한글 😀";
// UTF-32 string
std::u32string utf32_str = U"Hello 한글 😀";
// UTF-8 → UTF-16 conversion (note: std::wstring_convert/codecvt is deprecated since C++17)
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
std::u16string utf16 = converter.from_bytes(utf8_str);
// Write file (UTF-8)
std::ofstream file("file.txt", std::ios::binary);
file << utf8_str;
file.close();
// Read file
std::ifstream input("file.txt", std::ios::binary);
std::string content((std::istreambuf_iterator<char>(input)),
std::istreambuf_iterator<char>());
std::cout << content << std::endl;
return 0;
}
Go
package main
import (
	"fmt"
	"unicode/utf8"

	"golang.org/x/text/encoding/korean"
	"golang.org/x/text/transform"
)
func main() {
// Go internal: UTF-8
text := "Hello 한글 😀"
// Byte length vs character (rune) length
fmt.Println("Bytes:", len(text)) // 17
fmt.Println("Runes:", utf8.RuneCountInString(text)) // 10
	// UTF-8 → EUC-KR conversion ('😀' has no EUC-KR mapping, so convert text without it)
	encoder := korean.EUCKR.NewEncoder()
	euckrBytes, _, err := transform.Bytes(encoder, []byte("Hello 한글"))
	if err != nil {
		fmt.Println("encode error:", err)
	}
fmt.Printf("EUC-KR: %x\n", euckrBytes)
// EUC-KR → UTF-8 conversion
decoder := korean.EUCKR.NewDecoder()
utf8Text, _, _ := transform.String(decoder, string(euckrBytes))
fmt.Println(utf8Text)
}
Advanced Topics
Normalization
import unicodedata
# Two ways to represent Korean '가'
# 1. Composed (NFC): U+AC00
nfc = "가"
print(f"NFC: {len(nfc)} chars, {nfc.encode('utf-8').hex()}")
# NFC: 1 chars, eab080
# 2. Decomposed (NFD): U+1100 + U+1161 (ㄱ + ㅏ)
nfd = unicodedata.normalize('NFD', nfc)
print(f"NFD: {len(nfd)} chars, {nfd.encode('utf-8').hex()}")
# NFD: 2 chars, e18480e185a1
# Comparison
print(nfc == nfd) # False (different byte sequence)
# Compare after normalization
print(unicodedata.normalize('NFC', nfc) ==
unicodedata.normalize('NFC', nfd)) # True
Encoding Detection
import chardet
def detect_encoding(file_path):
"""Auto-detect file encoding"""
with open(file_path, 'rb') as f:
raw_data = f.read()
result = chardet.detect(raw_data)
return {
'encoding': result['encoding'],
'confidence': result['confidence'],
'language': result.get('language', '')
}
# Usage
info = detect_encoding('unknown.txt')
print(f"Encoding: {info['encoding']}")
print(f"Confidence: {info['confidence']*100:.1f}%")
# Read with correct encoding
with open('unknown.txt', 'r', encoding=info['encoding']) as f:
content = f.read()
Encoding Conversion
def convert_file_encoding(input_file, output_file, from_enc, to_enc):
"""Convert file encoding"""
# Read original
with open(input_file, 'r', encoding=from_enc) as f:
content = f.read()
# Save with new encoding
with open(output_file, 'w', encoding=to_enc) as f:
f.write(content)
print(f"✅ Converted: {from_enc} → {to_enc}")
# EUC-KR → UTF-8 conversion
convert_file_encoding('old.txt', 'new.txt', 'euc-kr', 'utf-8')
Encoding in Web Development
HTML
<!DOCTYPE html>
<html>
<head>
<!-- ✅ UTF-8 declaration (required) -->
<meta charset="UTF-8">
<title>Korean Page</title>
</head>
<body>
<h1>안녕하세요</h1>
</body>
</html>
HTTP Headers
from flask import Flask, Response
app = Flask(__name__)
@app.route('/korean')
def korean_page():
content = "<h1>안녕하세요</h1>"
# ✅ Specify charset in Content-Type
return Response(
content,
mimetype='text/html; charset=utf-8'
)
# ❌ Without charset, browser guesses (may corrupt)
JSON
import json
data = {"name": "홍길동", "message": "안녕하세요"}
# JSON is UTF-8 by default
json_str = json.dumps(data, ensure_ascii=False)
print(json_str)
# {"name": "홍길동", "message": "안녕하세요"}
# ensure_ascii=True (default)
json_str_ascii = json.dumps(data, ensure_ascii=True)
print(json_str_ascii)
# {"name": "\ud64d\uae38\ub3d9", "message": "\uc548\ub155\ud558\uc138\uc694"}
URL Encoding
from urllib.parse import quote, unquote
# URL with Korean
text = "한글 검색"
# URL encoding (UTF-8 based)
encoded = quote(text)
print(encoded)
# %ED%95%9C%EA%B8%80%20%EA%B2%80%EC%83%89
# URL decoding
decoded = unquote(encoded)
print(decoded) # "한글 검색"
# Complete URL
url = f"https://example.com/search?q={encoded}"
print(url)
# https://example.com/search?q=%ED%95%9C%EA%B8%80%20%EA%B2%80%EC%83%89
Database Encoding
MySQL
-- Create database (UTF-8)
CREATE DATABASE mydb
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- utf8mb4: 4-byte UTF-8 (emoji support)
-- utf8: 3-byte UTF-8 (no emoji, deprecated)
-- Create table
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(100) CHARACTER SET utf8mb4
);
-- Set encoding on connection
SET NAMES utf8mb4;
PostgreSQL
-- Create database
CREATE DATABASE mydb
ENCODING 'UTF8'
LC_COLLATE 'ko_KR.UTF-8'
LC_CTYPE 'ko_KR.UTF-8';
-- Check client encoding
SHOW client_encoding;
-- Change encoding
SET client_encoding TO 'UTF8';
Python + DB
import psycopg2
# PostgreSQL connection
conn = psycopg2.connect(
host='localhost',
database='mydb',
user='user',
password='pass',
client_encoding='utf8' # ✅ Explicit specification
)
cursor = conn.cursor()
# Insert Korean data
cursor.execute(
"INSERT INTO users (name) VALUES (%s)",
("홍길동",)
)
# Query
cursor.execute("SELECT name FROM users")
name = cursor.fetchone()[0]
print(name) # "홍길동"
Practical Tools
Command Line Tools
# 1. Check encoding with file command
file -i file.txt
# file.txt: text/plain; charset=utf-8
# 2. Convert encoding with iconv
iconv -f EUC-KR -t UTF-8 old.txt > new.txt
# 3. Batch convert multiple files
find . -name "*.txt" -exec iconv -f EUC-KR -t UTF-8 {} -o {}.utf8 \;
# 4. Check bytes with hexdump
echo "한글" | hexdump -C
# 00000000 ed 95 9c ea b8 80 0a
# 5. Remove BOM
tail -c +4 file_with_bom.txt > file_no_bom.txt # UTF-8 BOM (3 bytes)
Python Script
#!/usr/bin/env python3
"""
Batch file encoding conversion tool
"""
import os
import sys
import chardet
from pathlib import Path
def convert_directory(directory, from_enc=None, to_enc='utf-8'):
"""Convert encoding of all text files in directory"""
for file_path in Path(directory).rglob('*.txt'):
try:
# Read original
with open(file_path, 'rb') as f:
raw_data = f.read()
# Detect encoding
if from_enc is None:
detected = chardet.detect(raw_data)
source_enc = detected['encoding']
confidence = detected['confidence']
if confidence < 0.7:
print(f"⚠️ {file_path}: Low confidence ({confidence:.2f})")
continue
else:
source_enc = from_enc
# Skip if already UTF-8
if source_enc.lower().replace('-', '') == 'utf8':
print(f"✓ {file_path}: Already UTF-8")
continue
# Convert
text = raw_data.decode(source_enc)
# Save
with open(file_path, 'w', encoding=to_enc) as f:
f.write(text)
print(f"✅ {file_path}: {source_enc} → {to_enc}")
except Exception as e:
print(f"❌ {file_path}: {e}")
if __name__ == '__main__':
if len(sys.argv) < 2:
print("Usage: python convert_encoding.py <directory>")
sys.exit(1)
convert_directory(sys.argv[1])
Encoding Comparison Table
Storage Space Comparison
text = "Hello 한글 😀"
encodings = {
'ASCII (English only)': 'ascii',
'UTF-8': 'utf-8',
'UTF-16 LE': 'utf-16-le',
'UTF-16 BE': 'utf-16-be',
'UTF-32 LE': 'utf-32-le',
'EUC-KR': 'euc-kr',
'CP949': 'cp949',
}
print(f"Original text: {text}\n")
print(f"{'Encoding':20s} | {'Bytes':6s} | Hex")
print("-" * 60)
for name, enc in encodings.items():
try:
encoded = text.encode(enc)
hex_str = encoded.hex()[:30] + ('...' if len(encoded) > 15 else '')
print(f"{name:20s} | {len(encoded):4d}B | {hex_str}")
except UnicodeEncodeError:
print(f"{name:20s} | {'N/A':6s} | (Cannot encode)")
# Output:
# Original text: Hello 한글 😀
#
# Encoding | Bytes | Hex
# ------------------------------------------------------------
# ASCII (English only) | N/A | (Cannot encode)
# UTF-8 | 17B | 48656c6c6f20ed959ceab88020f09f...
# UTF-16 LE | 22B | 480065006c006c006f0020005cd500...
# UTF-16 BE | 22B | 00480065006c006c006f0020d55cae...
# UTF-32 LE | 40B | 41000000650000006c0000006c0000...
# EUC-KR | N/A | (Cannot encode)
# CP949 | N/A | (Cannot encode)
Feature Comparison
| Encoding | Bytes/Char | ASCII Compatible | Korean Efficiency | Emoji | Main Usage |
|---|---|---|---|---|---|
| ASCII | 1 | ✅ | ❌ | ❌ | English only |
| EUC-KR | 1-2 | ✅ | ✅✅ | ❌ | Korean legacy |
| CP949 | 1-2 | ✅ | ✅✅ | ❌ | Windows Korean |
| UTF-8 | 1-4 | ✅ | ✅ | ✅ | Web, Linux, modern standard |
| UTF-16 | 2-4 | ❌ | ✅✅ | ✅ | Windows, Java internal |
| UTF-32 | 4 | ❌ | ❌ | ✅ | Internal processing |
Real-World Scenarios
Scenario 1: Legacy System Integration
# Problem: Bank API responds with EUC-KR
import requests
response = requests.get('http://legacy-bank-api.com/account')
# ❌ Auto-decode (assumes UTF-8)
# print(response.text) # Corrupted
# ✅ Correct handling
content = response.content # Bytes
text = content.decode('euc-kr')
print(text)
# ✅ Or provide hint to requests
response.encoding = 'euc-kr'
print(response.text)
Scenario 2: Multilingual Application
import locale
import sys
def setup_encoding():
"""Setup system encoding"""
# Check stdout encoding
print(f"stdout encoding: {sys.stdout.encoding}")
# System locale
print(f"System locale: {locale.getpreferredencoding()}")
# Force UTF-8 (Python 3.7+)
if sys.stdout.encoding != 'utf-8':
sys.stdout.reconfigure(encoding='utf-8')
# Handle multilingual text
texts = {
'en': "Hello",
'ko': "안녕하세요",
'ja': "こんにちは",
'zh': "你好",
'ar': "مرحبا",
'ru': "Здравствуйте",
'emoji': "👋🌍"
}
for lang, text in texts.items():
utf8 = text.encode('utf-8')
print(f"{lang:5s}: {text:15s} | {len(utf8):2d} bytes | {utf8.hex()[:30]}")
Scenario 3: File Upload Handling
For uploaded files of unknown origin, read the raw bytes, detect the encoding with chardet, and normalize to UTF-8:
from flask import Flask, request
import chardet
app = Flask(__name__)
@app.route('/upload', methods=['POST'])
def upload_file():
file = request.files['file']
# Read as binary
content = file.read()
# Detect encoding
detected = chardet.detect(content)
encoding = detected['encoding']
confidence = detected['confidence']
print(f"Detected: {encoding} ({confidence*100:.1f}%)")
    # Convert to UTF-8 (chardet may return None if detection fails)
    if encoding and encoding.lower() != 'utf-8':
try:
text = content.decode(encoding)
utf8_content = text.encode('utf-8')
return {
'status': 'converted',
'from': encoding,
'to': 'utf-8',
'content': text
}
except Exception as e:
return {'status': 'error', 'message': str(e)}, 400
return {
'status': 'ok',
'encoding': 'utf-8',
'content': content.decode('utf-8')
}
Best Practices
1. Always Use UTF-8
Declare UTF-8 explicitly at every layer of the stack: file I/O, source files, HTML, HTTP headers, and the database:
# ✅ File I/O
with open('file.txt', 'w', encoding='utf-8') as f:
f.write("한글")
# ✅ Source code encoding declaration (Python 2)
# -*- coding: utf-8 -*-
# ✅ HTML
# <meta charset="UTF-8">
# ✅ HTTP header
# Content-Type: text/html; charset=utf-8
# ✅ Database
# CREATE DATABASE mydb CHARACTER SET utf8mb4;
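The utf8mb4 choice in the last line matters: MySQL's legacy `utf8` charset stores at most 3 bytes per character, while emoji need 4 bytes in UTF-8. A quick check:

```python
# Korean fits in 3 bytes, emoji need 4 -- hence utf8mb4, not MySQL's legacy utf8
print(len("한".encode('utf-8')))   # 3
print(len("😀".encode('utf-8')))  # 4
```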
2. Read in Binary Mode and Decode Explicitly
Reading in binary mode and decoding explicitly protects you from the platform's default encoding:
# ✅ Safe method
with open('file.txt', 'rb') as f:
raw_data = f.read()
# Decode after checking encoding
text = raw_data.decode('utf-8')
# ❌ Risky method (uses system default encoding)
with open('file.txt', 'r') as f: # encoding not specified
text = f.read()
3. Error Handling
When the encoding is unknown, try the likely candidates in order and fall back gracefully:
# ✅ Error handling strategy
def safe_decode(data, encodings=('utf-8', 'cp949', 'euc-kr', 'latin-1')):
    """Try multiple encodings in order and return (text, encoding)"""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Reached only if the candidate list is changed: latin-1 accepts
    # any byte sequence, so it already acts as a last resort above
    return data.decode('utf-8', errors='replace'), 'utf-8'
# Usage
with open('unknown.txt', 'rb') as f:
data = f.read()
text, encoding = safe_decode(data)
print(f"Decoded as {encoding}: {text}")
4. BOM Handling
Python's utf-8-sig codec handles the BOM transparently on read; when writing, prefer plain utf-8 without a BOM:
# ✅ Auto-handle UTF-8 BOM
with open('file.txt', 'r', encoding='utf-8-sig') as f:
text = f.read() # Automatically removes BOM if present
# ✅ Save without BOM (recommended)
with open('file.txt', 'w', encoding='utf-8') as f:
f.write(text)
# ❌ Save with BOM (avoid)
with open('file.txt', 'w', encoding='utf-8-sig') as f:
f.write(text)
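What utf-8-sig actually does can be seen at the byte level (a small sketch):

```python
# 'utf-8-sig' prepends the 3-byte BOM EF BB BF on encode,
# and strips it (when present) on decode
data = "한글".encode('utf-8-sig')
assert data[:3] == b'\xef\xbb\xbf'           # BOM added
assert data[3:] == "한글".encode('utf-8')    # the rest is plain UTF-8
assert data.decode('utf-8-sig') == "한글"    # BOM removed on decode
assert data.decode('utf-8') == '\ufeff한글'  # plain utf-8 keeps it as U+FEFF
```

The last line is exactly the stray invisible character that breaks JSON parsers and shell scripts, which is why saving without a BOM is recommended.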
Problem Solving Checklist
When Korean Characters Are Corrupted
First detect what the file actually contains, then read with that encoding and re-save as UTF-8:
# 1. Check file encoding
import chardet
with open('file.txt', 'rb') as f:
result = chardet.detect(f.read())
print(result)
# 2. Read with correct encoding
with open('file.txt', 'r', encoding='cp949') as f:
text = f.read()
# 3. Re-save as UTF-8
with open('file.txt', 'w', encoding='utf-8') as f:
f.write(text)
When Korean Characters Are Corrupted on Web
On the web, check what requests guessed, compare it against the Content-Type header, and decode explicitly when they disagree:
# 1. Check HTTP header
import requests
response = requests.get('https://example.com')
print(response.encoding) # ISO-8859-1 (wrong guess)
# 2. Set correct encoding
response.encoding = 'utf-8'
print(response.text)
# 3. Check Content-Type header
print(response.headers.get('Content-Type'))
# text/html; charset=euc-kr
# 4. Explicit decoding
text = response.content.decode('euc-kr')
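If the text was already mis-decoded (classic mojibake, e.g. EUC-KR bytes read as Latin-1), it can often still be repaired, because Latin-1 maps every byte value to a code point losslessly:

```python
# Mojibake repair: undo the wrong decode, then decode correctly
garbled = "한글".encode('euc-kr').decode('latin-1')  # renders as 'ÇÑ±Û'
repaired = garbled.encode('latin-1').decode('euc-kr')
assert repaired == "한글"
```

This only works when the wrong codec was lossless; once text has been decoded with errors='replace', the original bytes are gone for good.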
When Korean Characters Are Corrupted in Database
For databases, specify the connection charset explicitly and verify the table's character set matches:
# 1. Check connection encoding
import pymysql
conn = pymysql.connect(
host='localhost',
user='user',
password='pass',
database='mydb',
charset='utf8mb4' # ✅ Explicit specification
)
# 2. Check table encoding
cursor = conn.cursor()
cursor.execute("SHOW CREATE TABLE users")
print(cursor.fetchone())
# 3. Convert encoding
# ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4;
Summary
Encoding Selection Guide
The decision flow below (Mermaid) summarizes which encoding to choose for a new project:
flowchart TD
Start[Start new project] --> Q1{Language?}
Q1 -->|English only| ASCII["ASCII\nor UTF-8"]
Q1 -->|Multilingual| UTF8["✅ UTF-8\nRecommended"]
Q1 -->|Legacy integration| Q2{System?}
Q2 -->|Windows Korean| CP949[CP949]
Q2 -->|Unix Korean| EUCKR[EUC-KR]
Q2 -->|Japanese| SJIS[Shift-JIS]
UTF8 --> Best["✅ Best choice\n- Web standard\n- All characters supported\n- ASCII compatible"]
Core Principles
The core principles boil down to five habits:
# 1. Always use UTF-8
encoding = 'utf-8'
# 2. Specify encoding
with open('file.txt', 'w', encoding='utf-8') as f:
f.write(text)
# 3. Binary mode + explicit decoding
with open('file.txt', 'rb') as f:
data = f.read()
text = data.decode('utf-8')
# 4. Error handling
try:
text = data.decode('utf-8')
except UnicodeDecodeError:
text = data.decode('utf-8', errors='replace')
# 5. Test
assert "한글 😀".encode('utf-8').decode('utf-8') == "한글 😀"
Encoding Summary
| Encoding | Bytes | Advantages | Disadvantages | When to Use |
|---|---|---|---|---|
| UTF-8 | 1-4 | Web standard, ASCII compatible | Korean 3 bytes | All new projects |
| UTF-16 | 2-4 | Korean 2 bytes | ASCII incompatible | Windows/Java internal |
| UTF-32 | 4 | Fixed length | Space waste | Internal processing |
| EUC-KR | 1-2 | Korean 2 bytes | Some Korean unsupported | Legacy systems |
| CP949 | 1-2 | All Korean supported | Windows only | Windows Korean |
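The per-character byte counts behind the table are easy to confirm:

```python
# Byte counts per character, matching the summary table
print(len("한".encode('utf-8')))      # 3: Korean costs 3 bytes in UTF-8
print(len("한".encode('utf-16-le')))  # 2: BMP characters are 2 bytes in UTF-16
print(len("한".encode('euc-kr')))     # 2: legacy Korean encodings use 2 bytes
print(len("😀".encode('utf-16-le')))  # 4: outside the BMP, a surrogate pair
print(len("😀".encode('utf-32-le')))  # 4: UTF-32 is always 4 bytes per char
```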
Debugging Tools
Python Encoding Debugger
This helper inspects a file's BOM, detected encoding, decodability under common codecs, and raw bytes:
import chardet

def analyze_encoding(file_path):
"""Detailed file encoding analysis"""
with open(file_path, 'rb') as f:
raw_data = f.read()
print(f"📄 File: {file_path}")
print(f"📊 Size: {len(raw_data)} bytes\n")
# Check BOM
if raw_data.startswith(b'\xef\xbb\xbf'):
print("🔖 BOM: UTF-8")
elif raw_data.startswith(b'\xff\xfe'):
print("🔖 BOM: UTF-16 LE")
elif raw_data.startswith(b'\xfe\xff'):
print("🔖 BOM: UTF-16 BE")
else:
print("🔖 BOM: None")
# Detect encoding
detected = chardet.detect(raw_data)
print(f"\n🔍 Detected encoding: {detected['encoding']}")
print(f"📈 Confidence: {detected['confidence']*100:.1f}%")
# Try multiple encodings
print("\n🧪 Decoding test:")
encodings = ['utf-8', 'cp949', 'euc-kr', 'utf-16', 'latin-1']
for enc in encodings:
try:
text = raw_data.decode(enc)
preview = text[:50].replace('\n', '\\n')
print(f" ✅ {enc:10s}: {preview}")
except UnicodeDecodeError as e:
print(f" ❌ {enc:10s}: {e}")
# Hex dump (first 100 bytes)
print(f"\n🔢 Hex Dump (first 100 bytes):")
for i in range(0, min(100, len(raw_data)), 16):
hex_str = ' '.join(f'{b:02x}' for b in raw_data[i:i+16])
ascii_str = ''.join(chr(b) if 32 <= b < 127 else '.' for b in raw_data[i:i+16])
print(f" {i:04x}: {hex_str:48s} | {ascii_str}")
# Usage
analyze_encoding('mystery.txt')
References
- Unicode Standard
- UTF-8 Specification (RFC 3629)
- Character Encoding in Python
- The Absolute Minimum Every Software Developer Must Know About Unicode
One-line Summary: Use UTF-8 for all new projects, consider EUC-KR/CP949 only for legacy system integration, and always explicitly specify encoding to prevent Korean character corruption.