Data Preprocessing in Python | Missing Values, Outliers, and Scaling
TL;DR
End-to-end preprocessing for ML: imputation, outlier rules, scaling, label and one-hot encoding, feature engineering, and a reusable pipeline pattern.
Introduction
“Good data builds good models”
Data preprocessing accounts for a large share of real-world machine learning work.
1. Missing values
Detecting nulls
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'name': ['철수', '영희', '민수', None],
    'age': [25, None, 28, 30],
    'salary': [3000, 4000, None, 5000]
})
print(df.isnull())
print(df.isnull().sum()) # nulls per column
Handling missing values
# Option 1: drop
df_dropped = df.dropna() # rows with any null
df_dropped = df.dropna(axis=1) # columns with any null
# Option 2: fill
df_filled = df.fillna(0)
df_filled = df.fillna(df.mean(numeric_only=True))
df_filled = df.ffill() # forward fill
# Option 3: interpolate
df['age'] = df['age'].interpolate()
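When the data is later split into train and test sets, the imputation statistics should come from the training data only, to avoid leakage. A minimal sketch with scikit-learn's SimpleImputer (the toy frames here are illustrative):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({'age': [25.0, None, 28.0, 30.0]})
test = pd.DataFrame({'age': [None, 40.0]})

imputer = SimpleImputer(strategy='mean')            # learns the mean from train only
train[['age']] = imputer.fit_transform(train[['age']])
test[['age']] = imputer.transform(test[['age']])    # reuses the train mean
```

Note that `transform` on the test set fills with the training mean, not the test mean.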
2. Outliers
IQR rule
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_clean = df[
    (df['salary'] >= lower_bound) &
    (df['salary'] <= upper_bound)
]
Z-score rule
from scipy import stats
# Rows with non-null salary: flag |z| < 3 as inliers
sub = df.dropna(subset=['salary'])
z = np.abs(stats.zscore(sub['salary']))
df_clean = sub[z < 3]
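Dropping rows is not the only option: clipping (winsorizing) values to the IQR bounds keeps the row count intact, which matters when every row carries other useful columns. A small sketch on standalone toy data:

```python
import pandas as pd

s = pd.Series([3000, 4000, 5000, 50000])  # 50000 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Values beyond the IQR fences are pulled back to the fence, not removed
clipped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```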
3. Normalization and standardization
Min–Max scaling (0–1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['salary_normalized'] = scaler.fit_transform(df[['salary']])  # rows with NaN salary stay NaN
print(df[['salary', 'salary_normalized']])
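Min–Max scaling applies x' = (x − min) / (max − min). A quick manual check against the scaler, on standalone toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[3000.0], [4000.0], [5000.0]])
scaled = MinMaxScaler().fit_transform(x)

# Same formula by hand
manual = (x - x.min()) / (x.max() - x.min())
print(scaled.ravel())  # [0.  0.5 1. ]
```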
Standardization (mean 0, std 1)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['salary_standardized'] = scaler.fit_transform(df[['salary']])
print(df[['salary', 'salary_standardized']])
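Standardization applies z = (x − mean) / std. One detail worth knowing: StandardScaler uses the population standard deviation (ddof=0), not the sample standard deviation, so a manual check must match that:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[3000.0], [4000.0], [5000.0]])
z = StandardScaler().fit_transform(x)

# StandardScaler divides by the population std (ddof=0)
manual = (x - x.mean()) / x.std(ddof=0)
```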
4. Categorical encoding
Label encoding
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
    'city': ['서울', '부산', '서울', '대구', '부산']
})
encoder = LabelEncoder()
df['city_encoded'] = encoder.fit_transform(df['city'])
print(df)
One-hot encoding
# Pandas
df_encoded = pd.get_dummies(df, columns=['city'])
print(df_encoded)
# scikit-learn
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['city']])
5. Feature engineering
Deriving features
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100),
    'sales': np.random.randint(100, 500, 100)
})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['sales_ma7'] = df['sales'].rolling(window=7).mean()  # first 6 rows are NaN (window=7)
df['sales_diff'] = df['sales'].diff()
6. End-to-end example
Full preprocessing pipeline
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
def preprocess_data(df):
    """Example preprocessing pipeline."""
    df = df.copy()  # avoid mutating the caller's DataFrame

    # 1. Missing values
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
    categorical_cols = df.select_dtypes(include=['object']).columns
    df[categorical_cols] = df[categorical_cols].fillna('Unknown')

    # 2. Outliers (IQR per numeric column)
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        df = df[(df[col] >= lower) & (df[col] <= upper)]

    # 3. Label encode categoricals
    for col in categorical_cols:
        le = LabelEncoder()
        df[f'{col}_encoded'] = le.fit_transform(df[col])

    # 4. Standardize numeric columns
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    return df
raw_data = pd.read_csv('raw_data.csv')
clean_data = preprocess_data(raw_data)
clean_data.to_csv('clean_data.csv', index=False)
Practical tips
Preprocessing checklist
# ✅ 1. Inspect
df.info()
df.describe()
df.isnull().sum()
# ✅ 2. Missing values
# Decide drop vs impute; use domain knowledge
# ✅ 3. Outliers
# Visualize; apply IQR or Z-score with care
# ✅ 4. Encoding
# Ordinal → label encoding
# Nominal → one-hot
# ✅ 5. Scaling
# Distance-based models: usually required
# Tree models: often optional
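The ordinal-vs-nominal distinction in the checklist can be made concrete: for ordered categories, scikit-learn's OrdinalEncoder accepts an explicit category order, unlike LabelEncoder, which just sorts values (the size labels here are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['M', 'S', 'L', 'S']})

# Explicit order so S < M < L maps to 0 < 1 < 2
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_encoded'] = encoder.fit_transform(df[['size']])
print(df['size_encoded'].tolist())  # [1.0, 0.0, 2.0, 0.0]
```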
Summary
Key takeaways
- Missing data: drop, impute, or interpolate
- Outliers: IQR and Z-score (with domain checks)
- Normalization: Min–Max to [0, 1]
- Standardization: zero mean, unit variance
- Encoding: label vs one-hot
Next steps
- Hands-on data analysis
- Machine learning fundamentals