Data Preprocessing in Python | Missing Values, Outliers, and Scaling

Key points of this article

End-to-end preprocessing for ML: imputation, outlier rules, scaling, label and one-hot encoding, feature engineering, and a reusable pipeline pattern.

Introduction

“Good data builds good models”

Data preprocessing accounts for a large share of real-world machine learning work.


1. Missing values

Detecting nulls

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['철수', '영희', '민수', None],
    'age': [25, None, 28, 30],
    'salary': [3000, 4000, None, 5000]
})

print(df.isnull())
print(df.isnull().sum())  # nulls per column

Handling missing values

# Option 1: drop
df_dropped = df.dropna()           # rows with any null
df_dropped = df.dropna(axis=1)     # columns with any null

# Option 2: fill
df_filled = df.fillna(0)
df_filled = df.fillna(df.mean(numeric_only=True))
df_filled = df.ffill()  # forward fill

# Option 3: interpolate
df['age'] = df['age'].interpolate()
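The same imputation can be done with scikit-learn's SimpleImputer, which remembers the fitted statistics so they can be reused on new data. A minimal sketch on the example columns:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'age': [25, None, 28, 30],
    'salary': [3000, 4000, None, 5000]
})

imputer = SimpleImputer(strategy='mean')  # also: 'median', 'most_frequent'
imputed = imputer.fit_transform(df)       # returns a NumPy array

df_imputed = pd.DataFrame(imputed, columns=df.columns)
print(df_imputed.isnull().sum().sum())    # no nulls remain
```

Unlike fillna, the fitted imputer can later be applied to a test set with transform, using the training-set means.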

2. Outliers

IQR rule

Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_clean = df[
    (df['salary'] >= lower_bound) &
    (df['salary'] <= upper_bound)
]

Z-score rule

from scipy import stats

# Rows with non-null salary: flag |z| < 3 as inliers
sub = df.dropna(subset=['salary'])
z = np.abs(stats.zscore(sub['salary']))
df_clean = sub[z < 3]
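Dropping rows is not the only option: outliers can instead be capped ("winsorized") at the IQR bounds so no data is lost. A sketch on a made-up salary series with one extreme value:

```python
import pandas as pd

s = pd.Series([3000, 4000, 3500, 5000, 50000])  # 50000 is an extreme value

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

s_capped = s.clip(lower=lower, upper=upper)  # extremes pulled to the bounds
print(s_capped)
```

Capping preserves the row (useful when every record matters), at the cost of distorting the tail of the distribution.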

3. Normalization and standardization

Min–Max scaling (0–1)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['salary_normalized'] = scaler.fit_transform(df[['salary']])  # NaNs are ignored in fit and pass through

print(df[['salary', 'salary_normalized']])

Standardization (mean 0, std 1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['salary_standardized'] = scaler.fit_transform(df[['salary']])

print(df[['salary', 'salary_standardized']])
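In a real workflow the scaler should be fit on the training split only and then applied to both splits; fitting on the full data leaks test-set statistics into training. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(100, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics
```

The test split will generally not have exactly zero mean after scaling, and that is expected: it is standardized with the training statistics, not its own.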

4. Categorical encoding

Label encoding

from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'city': ['서울', '부산', '서울', '대구', '부산']
})

encoder = LabelEncoder()
df['city_encoded'] = encoder.fit_transform(df['city'])

print(df)

One-hot encoding

# Pandas
df_encoded = pd.get_dummies(df, columns=['city'])
print(df_encoded)

# scikit-learn
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['city']])

5. Feature engineering

Deriving features

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100),
    'sales': np.random.randint(100, 500, 100)
})

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

df['sales_ma7'] = df['sales'].rolling(window=7).mean()
df['sales_diff'] = df['sales'].diff()

6. End-to-end example

Full preprocessing pipeline

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

def preprocess_data(df):
    """Example preprocessing pipeline."""
    df = df.copy()  # work on a copy so the caller's DataFrame is untouched

    # 1. Missing values
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

    categorical_cols = df.select_dtypes(include=['object']).columns
    df[categorical_cols] = df[categorical_cols].fillna('Unknown')

    # 2. Outliers (IQR per numeric column)
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        df = df[(df[col] >= lower) & (df[col] <= upper)]

    # 3. Label encode categoricals
    for col in categorical_cols:
        le = LabelEncoder()
        df[f'{col}_encoded'] = le.fit_transform(df[col])

    # 4. Standardize numeric columns
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    return df

raw_data = pd.read_csv('raw_data.csv')
clean_data = preprocess_data(raw_data)
clean_data.to_csv('clean_data.csv', index=False)

Practical tips

Preprocessing checklist

# ✅ 1. Inspect
df.info()
df.describe()
df.isnull().sum()

# ✅ 2. Missing values
# Decide drop vs impute; use domain knowledge

# ✅ 3. Outliers
# Visualize; apply IQR or Z-score with care

# ✅ 4. Encoding
# Ordinal → label encoding
# Nominal → one-hot

# ✅ 5. Scaling
# Distance-based models: usually required
# Tree models: often optional

Summary

Key takeaways

  1. Missing data: drop, impute, or interpolate
  2. Outliers: IQR and Z-score (with domain checks)
  3. Normalization: Min–Max to [0, 1]
  4. Standardization: zero mean, unit variance
  5. Encoding: label vs one-hot

Next steps