Data Preprocessing in Python | Missing Values · Outliers

Key points of this post

Python data preprocessing: handle missing data, IQR/Z-score outliers, Min-Max and StandardScaler, categorical encoding, and a full sklearn-style pipeline with Pandas.

Introduction

“Good data builds good models”

Data preprocessing accounts for a large share of real-world machine learning work.

1. Missing values

Detecting nulls

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'name': ['철수', '영희', '민수', None],
    'age': [25, None, 28, 30],
    'salary': [3000, 4000, None, 5000]
})
print(df.isnull())
print(df.isnull().sum())  # nulls per column
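
Raw counts are hard to compare across datasets of different sizes, so a per-column missing ratio is often more useful. The following is a minimal sketch on the same toy frame; the names `missing_report` and `rows_with_null` are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['철수', '영희', '민수', None],
    'age': [25, None, 28, 30],
    'salary': [3000, 4000, None, 5000],
})

# The mean of a boolean mask is the fraction of True values,
# so isnull().mean() gives the share of nulls per column.
missing_report = pd.DataFrame({
    'n_missing': df.isnull().sum(),
    'pct_missing': df.isnull().mean() * 100,
})
print(missing_report)

# Rows that contain at least one null in any column
rows_with_null = df[df.isnull().any(axis=1)]
print(len(rows_with_null))
```

In this frame every column is missing one of four values, while three of the four rows contain a null somewhere, which is why dropping rows is often costlier than the per-column counts suggest.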

Handling missing values

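# Option 1: drop
df_drop_rows = df.dropna()           # drop rows with any null
df_drop_cols = df.dropna(axis=1)     # drop columns with any null
# Option 2: fill
df_zero = df.fillna(0)                           # constant; note this also puts 0 in 'name'
df_mean = df.fillna(df.mean(numeric_only=True))  # column means; non-numeric columns stay null
df_ffill = df.ffill()                            # forward fill: carry the previous row's value down
# Option 3: interpolate
df['age'] = df['age'].interpolate()  # linear interpolation between neighboring rows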
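
When the data is later split for modeling, the fill values should come from the training rows only. One way to do that, sketched here under the assumption that scikit-learn is available, is `SimpleImputer`; the `train` and `test` frames are made-up data for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test split (illustrative values only)
train = pd.DataFrame({'age': [25, np.nan, 28, 30],
                      'salary': [3000, 4000, np.nan, 5000]})
test = pd.DataFrame({'age': [np.nan, 40],
                     'salary': [3500, np.nan]})

# Learn the per-column medians from the training data only
imputer = SimpleImputer(strategy='median')
train_filled = pd.DataFrame(imputer.fit_transform(train),
                            columns=train.columns, index=train.index)

# Reuse the training medians on the test data (no refitting)
test_filled = pd.DataFrame(imputer.transform(test),
                           columns=test.columns, index=test.index)

print(imputer.statistics_)  # medians learned from train
print(test_filled)
```

The median is used here because it is less sensitive to extreme values than the mean; `strategy='mean'` would mirror the `fillna(df.mean(...))` option above.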

2. Outliers

IQR rule

Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Keep rows inside the fences; rows with a null salary fail both comparisons and are dropped too
df_clean = df[
    (df['salary'] >= lower_bound) &
    (df['salary'] <= upper_bound)
]
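
Dropping rows is not the only option. As a hedged alternative, values outside the IQR fences can be clipped to the fences so that the row count stays the same. The helper name `iqr_bounds` and the sample values are assumptions made for this sketch.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return (lower, upper) fences for a numeric Series using the k * IQR rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative data: five typical salaries and one extreme value
salary = pd.Series([3000, 3200, 3100, 2900, 3300, 20000],
                   name='salary', dtype=float)

lower, upper = iqr_bounds(salary)

# Flag outliers without removing them
is_outlier = (salary < lower) | (salary > upper)

# Clip extreme values to the fences instead of dropping the rows
clipped = salary.clip(lower=lower, upper=upper)

print(lower, upper)
print(clipped.tolist())
```

Clipping keeps every row, which helps when other columns in the same row are still valuable; whether that is appropriate depends on why the extreme value occurred.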

Z-score rule

from scipy import stats
# Rows with non-null salary: flag |z| < 3 as inliers
sub = df.dropna(subset=['salary'])
z = np.abs(stats.zscore(sub['salary']))
df_clean = sub[z < 3]
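
The same rule can be written directly in pandas, which makes the formula explicit and avoids the separate `dropna` step. This sketch uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`; the sample values are illustrative.

```python
import numpy as np
import pandas as pd

salary = pd.Series([3000, 4000, np.nan, 5000], name='salary')

# z = (x - mean) / std, with the population standard deviation (ddof=0).
# pandas skips NaN when computing mean/std, and NaN stays NaN in the result.
z = (salary - salary.mean()) / salary.std(ddof=0)

# NaN compares as False, so rows with a missing salary are excluded here
is_inlier = z.abs() < 3

print(z.round(4).tolist())
print(is_inlier.tolist())
```

One caveat: for n values, |z| computed this way can never exceed sqrt(n - 1), so a threshold of 3 can only flag a point once there are more than ten rows. On tiny samples like this one the rule never fires.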

3. Normalization and standardization

Min–Max scaling (0–1)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['salary_normalized'] = scaler.fit_transform(df[['salary']])
print(df[['salary', 'salary_normalized']])
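
Min–Max scaling maps each value with x' = (x - min) / (max - min). The sketch below checks that formula by hand against `MinMaxScaler` and shows that the scaler can undo the transform; the three sample values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'salary': [3000.0, 4000.0, 5000.0]})

# Manual formula: (x - min) / (max - min)
col = df['salary']
manual = (col - col.min()) / (col.max() - col.min())

# Same result via scikit-learn (expects a 2D input, returns a 2D array)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[['salary']]).ravel()

# The fitted scaler remembers min/max, so the original units can be restored
restored = scaler.inverse_transform(scaled.reshape(-1, 1)).ravel()

print(manual.tolist())
print(scaled.tolist())
print(restored.tolist())
```

Because the result depends only on the minimum and maximum, a single extreme value compresses everything else toward 0, so outliers are usually handled before this step.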

Standardization (mean 0, std 1)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['salary_standardized'] = scaler.fit_transform(df[['salary']])
print(df[['salary', 'salary_standardized']])
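
In a modeling workflow the scaler is usually fitted on the training data and then reused on new data, so that statistics from unseen rows do not leak into training. A minimal sketch, with hypothetical `train` and `test` frames:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical split (illustrative values only)
train = pd.DataFrame({'salary': [3000.0, 4000.0, 5000.0]})
test = pd.DataFrame({'salary': [6000.0]})

scaler = StandardScaler()

# fit_transform learns mean and standard deviation from train only
train_scaled = scaler.fit_transform(train)

# transform reuses the training statistics; it does not refit
test_scaled = scaler.transform(test)

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value lands outside the range seen in training, which is expected: standardized values are not bounded, unlike Min–Max output.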

4. Categorical encoding

Label encoding

from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
    'city': ['서울', '부산', '서울', '대구', '부산']
})
encoder = LabelEncoder()
df['city_encoded'] = encoder.fit_transform(df['city'])
print(df)
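
`LabelEncoder` assigns integer codes in sorted order of the labels, which may not match a meaningful ranking. For ordered categories, one option is `OrdinalEncoder` with an explicit category list. The `size` column below is a hypothetical example, not data from the article.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered category (illustration only)
sizes = pd.DataFrame({'size': ['medium', 'small', 'large', 'small']})

# LabelEncoder: codes follow the sorted label order
le = LabelEncoder()
codes = le.fit_transform(sizes['size'])
print(le.classes_)                      # learned labels, in code order
restored = le.inverse_transform(codes)  # map codes back to labels

# OrdinalEncoder: the order is stated explicitly, small < medium < large
oe = OrdinalEncoder(categories=[['small', 'medium', 'large']])
sizes['size_ordinal'] = oe.fit_transform(sizes[['size']]).ravel()
print(sizes)
```

With `LabelEncoder`, 'large' receives the lowest code simply because it sorts first alphabetically; the explicit list avoids that.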

One-hot encoding

# Pandas
df_encoded = pd.get_dummies(df, columns=['city'])
print(df_encoded)
# scikit-learn
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # dense array; the sparse_output parameter needs scikit-learn >= 1.2
encoded = encoder.fit_transform(df[['city']])

5. Feature engineering

Deriving features

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100),
    'sales': np.random.randint(100, 500, 100)
})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['sales_ma7'] = df['sales'].rolling(window=7).mean()
df['sales_diff'] = df['sales'].diff()
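
Rolling and difference features leave nulls at the start of the series, because the first rows have no full window or previous value. The sketch below uses a short, fixed series to show this, the `min_periods` option, and a lag feature built with `shift`; the column names `sales_ma3`, `sales_ma3_partial`, and `sales_lag1` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'sales': [100, 120, 90, 150, 130, 170, 160, 140, 180, 200],
})

# A full window is required by default, so the first window-1 rows are NaN
df['sales_ma3'] = df['sales'].rolling(window=3).mean()

# min_periods=1 averages over whatever rows are available so far
df['sales_ma3_partial'] = df['sales'].rolling(window=3, min_periods=1).mean()

# Lag feature: the previous day's sales (first row has no predecessor)
df['sales_lag1'] = df['sales'].shift(1)

# Drop the warm-up rows before handing the frame to a model
model_ready = df.dropna().reset_index(drop=True)

print(df.head(4))
print(len(model_ready))
```

Whether to drop the warm-up rows or keep partial windows is a judgment call; partial windows keep more data but mix averages computed over different window lengths.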

6. End-to-end example

Full preprocessing pipeline

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
def preprocess_data(df):
    """Example preprocessing pipeline. Returns a new DataFrame; the input is left unchanged."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    # 1. Missing values: column mean for numerics, 'Unknown' for categoricals
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
    categorical_cols = df.select_dtypes(include=['object']).columns
    df[categorical_cols] = df[categorical_cols].fillna('Unknown')
    # 2. Outliers (IQR per numeric column, applied one column after another)
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        df = df[(df[col] >= lower) & (df[col] <= upper)]
    df = df.copy()  # detach from the filtered frame before adding columns
    # 3. Label encode categoricals (original columns are kept alongside the codes)
    for col in categorical_cols:
        le = LabelEncoder()
        df[f'{col}_encoded'] = le.fit_transform(df[col])
    # 4. Standardize numeric columns
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    return df
raw_data = pd.read_csv('raw_data.csv')
clean_data = preprocess_data(raw_data)
clean_data.to_csv('clean_data.csv', index=False)
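
The function above fits its statistics on whatever frame it receives. For a train/test workflow, a sklearn-style alternative is to bundle the steps into a `ColumnTransformer` that is fitted once and then reused. This is a sketch under stated assumptions: the data is made up, one-hot encoding replaces label encoding, and outlier removal is left out because scikit-learn transformers cannot drop rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical train/test data (illustrative values only)
train = pd.DataFrame({
    'age': [25, np.nan, 28, 30],
    'salary': [3000, 4000, np.nan, 5000],
    'city': ['서울', '부산', np.nan, '서울'],
})
test = pd.DataFrame({'age': [40], 'salary': [np.nan], 'city': ['대구']})

numeric_cols = ['age', 'salary']
categorical_cols = ['city']

# Numeric columns: mean imputation, then standardization
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

# Categorical columns: fill with 'Unknown', then one-hot encode
categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# sparse_threshold=0 makes the combined output a dense array
preprocessor = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
    ('cat', categorical_pipe, categorical_cols),
], sparse_threshold=0)

X_train = preprocessor.fit_transform(train)  # learn statistics from train
X_test = preprocessor.transform(test)        # reuse them on test

print(X_train.shape, X_test.shape)
```

Here the missing test salary is filled with the training mean, and the unseen city becomes an all-zero one-hot row rather than an error.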

Practical tips

Preprocessing checklist

# ✅ 1. Inspect
df.info()
df.describe()
df.isnull().sum()
# ✅ 2. Missing values
# Decide drop vs impute; use domain knowledge
# ✅ 3. Outliers
# Visualize; apply IQR or Z-score with care
# ✅ 4. Encoding
# Ordinal → integer codes in an explicit order (LabelEncoder sorts labels alphabetically)
# Nominal → one-hot
# ✅ 5. Scaling
# Distance-based models: usually required
# Tree models: often optional

Summary

Key takeaways

  1. Missing data: drop, impute, or interpolate
  2. Outliers: IQR and Z-score (with domain checks)
  3. Normalization: Min–Max to [0, 1]
  4. Standardization: zero mean, unit variance
  5. Encoding: label vs one-hot

Next steps

  • [Hands-on data analysis](/en/blog/python-series-20-data-analysis/)
  • Machine learning fundamentals

  • [Pandas basics | Python data analysis](/en/blog/python-series-16-pandas/)
  • [Python environment setup](/en/blog/python-series-01-environment-setup/)

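Frequently asked questions (FAQ)

Q. When is this used in practice?

A. Python data preprocessing: handle missing data, IQR/Z-score outliers, Min-Max and StandardScaler, categorical encoding, … In practice, apply it by following the examples and selection guidance in the article above.

Q. What should I read first?

A. Follow the previous-post or related-post links at the bottom of each article to learn the topics in order. The Python series table of contents shows the overall flow.

Q. How can I study this in more depth?

A. Refer to the official documentation of the libraries used here (pandas, NumPy, SciPy, scikit-learn). The reference links at the end of the article are also worth using.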


Related posts (internal links)

Other posts connected to this topic.

  • [Pandas Basics | Complete Guide to Python Data Analysis](/en/blog/python-series-16-pandas/)
  • [Hands-on Data Analysis with Python | Pandas Workflows](/en/blog/python-series-20-data-analysis/)
  • [Python Comprehensions | List · Dict](/en/blog/python-series-09-comprehensions/)

Keywords covered in this post (related search terms)

This post is relevant to searches such as Python, Data Preprocessing, Pandas, Missing Data, Outliers, Normalization, and Machine Learning.