Hands-on Data Analysis with Python | Pandas Workflows

At a glance

From CSV to insights: exploratory analysis with Pandas, Matplotlib, and Seaborn—distributions, correlations, cohort-style breakdowns, and time-series patterns.

Introduction

“Find insight in the data”

Hands-on data analysis is the iterative process of understanding a dataset and extracting actionable meaning from it.


1. Load and explore

First pass

import pandas as pd
import numpy as np

df = pd.read_csv('sales_data.csv')

print(f"Shape: {df.shape}")

print("\nColumn info:")
df.info()  # info() prints directly and returns None, so don't wrap it in print()

print("\nSummary statistics:")
print(df.describe())

print("\nFirst 5 rows:")
print(df.head())

print("\nMissing values per column:")
print(df.isnull().sum())
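
Once the missing values are counted, the usual next decision is whether to drop or impute them. A minimal sketch on a toy frame (the column names here are illustrative, not from `sales_data.csv`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 34, 41],
    "salary": [3000, 3500, np.nan, 4200],
})

# Drop rows that are entirely empty, then impute the remaining gaps
df = df.dropna(how="all")
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())

print(df.isnull().sum())
```

Median for `age` is a deliberate choice: it is less sensitive to outliers than the mean.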

2. Exploratory data analysis (EDA)

Distributions

import matplotlib.pyplot as plt

df['age'].hist(bins=20)
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

df.boxplot(column='salary', by='department')
plt.title('Salary distribution by department')
plt.show()

Correlation

import seaborn as sns

corr_matrix = df[['age', 'salary', 'experience']].corr()
print(corr_matrix)

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation heatmap')
plt.show()
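
One caveat worth showing: `.corr()` defaults to Pearson's r, which only measures linear association. Spearman rank correlation is a quick cross-check for monotonic but nonlinear relationships. A small self-contained illustration with synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "y": np.exp(x)})  # monotonic but strongly nonlinear

pearson = df["x"].corr(df["y"])                       # underestimates the link
spearman = df["x"].corr(df["y"], method="spearman")   # rank-based: exactly 1 here
print(f"Pearson={pearson:.2f}, Spearman={spearman:.2f}")
```

If the two disagree sharply, plot the pair before trusting either number.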

3. Group analysis

Aggregations

dept_avg = df.groupby('department')['salary'].mean()
print(dept_avg)

result = df.groupby('department').agg({
    'salary': ['mean', 'min', 'max'],
    'age': 'mean',
    'name': 'count'
})
print(result)

pivot = df.pivot_table(
    values='salary',
    index='department',
    columns='gender',
    aggfunc='mean'
)
print(pivot)
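
The dict-of-lists `agg` above produces a two-level column header, which is awkward to work with downstream. Named aggregation gives flat, readable column names instead; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT"],
    "salary": [3000, 3400, 4200, 4600],
    "age": [28, 35, 31, 44],
})

# output_column_name=(input_column, aggregation)
result = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    headcount=("salary", "count"),
)
print(result)
```

The resulting columns are plain `avg_salary`, `max_salary`, `headcount` rather than a MultiIndex.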

4. Time series

Resampling and rolling windows

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

monthly = df['sales'].resample('M').sum()  # select the numeric column before summing

df['ma7'] = df['sales'].rolling(window=7).mean()
df['ma30'] = df['sales'].rolling(window=30).mean()

plt.figure(figsize=(12, 6))
plt.plot(df.index, df['sales'], label='Daily sales', alpha=0.5)
plt.plot(df.index, df['ma7'], label='7-day moving average')
plt.plot(df.index, df['ma30'], label='30-day moving average')
plt.legend()
plt.title('Sales trend')
plt.show()
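
Note that `rolling(window=7)` leaves the first six rows as `NaN` because a full window is not yet available; `min_periods=1` averages over whatever is available instead. A minimal sketch:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
              index=pd.date_range("2024-01-01", periods=5, freq="D"))

strict = s.rolling(window=3).mean()                   # first two values are NaN
partial = s.rolling(window=3, min_periods=1).mean()   # no leading NaNs

print(strict.tolist())
print(partial.tolist())
```

Whether the leading partial windows are meaningful depends on the metric; for moving averages on charts they are usually acceptable.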

5. Practical example

Customer analytics

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

customers = pd.read_csv('customers.csv')

print("=== Basic statistics ===")
print(f"Total customers: {len(customers)}")
print(f"Average age: {customers['age'].mean():.1f}")
print(f"Average purchase: {customers['purchase_amount'].mean():,.0f} KRW")

customers['age_group'] = pd.cut(
    customers['age'],
    bins=[0, 20, 30, 40, 50, 100],
    labels=['Teens', '20s', '30s', '40s', '50+']
)

age_analysis = customers.groupby('age_group').agg({
    'purchase_amount': ['mean', 'sum', 'count']
})
print("\n=== Analysis by age group ===")
print(age_analysis)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

axes[0, 0].hist(customers['age'], bins=20, edgecolor='black')
axes[0, 0].set_title('Age distribution')

axes[0, 1].hist(customers['purchase_amount'], bins=30, edgecolor='black')
axes[0, 1].set_title('Purchase amount distribution')

age_group_avg = customers.groupby('age_group')['purchase_amount'].mean()
axes[1, 0].bar(age_group_avg.index.astype(str), age_group_avg.values)
axes[1, 0].set_title('Average purchase by age group')
axes[1, 0].tick_params(axis='x', rotation=45)

numeric_cols = customers[['age', 'purchase_amount', 'visit_count']]
sns.heatmap(numeric_cols.corr(), annot=True, ax=axes[1, 1])
axes[1, 1].set_title('Correlation')

plt.tight_layout()
plt.savefig('customer_analysis.png', dpi=300)
plt.show()

print("\n=== Insights ===")
total_by_group = customers.groupby('age_group')['purchase_amount'].sum()
print(f"Age group with the highest total purchases: {total_by_group.idxmax()}")
print(f"Age group with the highest average purchase: {age_group_avg.idxmax()}")

Practical tips

Analysis workflow

# 1. Understand the data
# - Column meanings
# - dtypes
# - Missing and extreme values

# 2. Hypotheses
# - "Do older customers spend more?"
# - "Is revenue higher on weekends?"

# 3. Validate
# - Plots
# - Stats
# - Correlations (with care)

# 4. Insights
# - Business meaning
# - Action items
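
The weekend-revenue hypothesis from step 2 can be checked in a few lines. This sketch uses synthetic data with a built-in weekend boost, so the frame and column names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=90, freq="D")
sales = pd.DataFrame({
    "date": dates,
    # baseline ~100 with a +30 bump on Saturdays/Sundays
    "revenue": rng.normal(100, 10, size=90) + (dates.dayofweek >= 5) * 30,
})

sales["is_weekend"] = sales["date"].dt.dayofweek >= 5
comparison = sales.groupby("is_weekend")["revenue"].mean()
print(comparison)
```

On real data, follow the comparison with a significance test (e.g. a t-test) before drawing conclusions, per step 3.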

Going deeper

Outlier flags and group summaries (Pandas)

Mark IQR-based outliers on numeric columns, then summarize by group.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "region": rng.choice(["A", "B", "C"], size=400),
    "revenue": rng.normal(100, 25, size=400),
})

def iqr_bounds(s: pd.Series, k: float = 1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return lo, hi

lo, hi = iqr_bounds(df["revenue"])
df["outlier"] = (df["revenue"] < lo) | (df["revenue"] > hi)

summary = df.groupby("region")["revenue"].agg(["count", "mean", "std", "median"])
print(summary)
print("Outlier ratio:", df["outlier"].mean())
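
The bounds above are global; when regions have different scales, it is often better to flag each row against the quartiles of its own group. One way to do that with `groupby().transform`, reusing the same synthetic frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "region": rng.choice(["A", "B", "C"], size=400),
    "revenue": rng.normal(100, 25, size=400),
})

# Per-group IQR bounds: transform broadcasts each region's
# quartiles back to that region's rows
q1 = df.groupby("region")["revenue"].transform(lambda s: s.quantile(0.25))
q3 = df.groupby("region")["revenue"].transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
df["outlier"] = (df["revenue"] < q1 - 1.5 * iqr) | (df["revenue"] > q3 + 1.5 * iqr)
print("Per-group outlier ratio:", df["outlier"].mean())
```

Here the groups share one distribution, so global and per-group flags barely differ; the per-group version matters when group scales diverge.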

Common mistakes

  • Computing means and correlations before dropping or imputing missing values.
  • Using rolling / diff on unsorted time indexes.
  • Claiming causality from plots without statistical tests.
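
The second mistake above is easy to demonstrate: the same rolling mean gives different answers on a shuffled versus a sorted index, because windows follow row order, not timestamps.

```python
import pandas as pd

idx = pd.to_datetime(["2024-01-03", "2024-01-01", "2024-01-02"])
s = pd.Series([30.0, 10.0, 20.0], index=idx)  # rows out of chronological order

wrong = s.rolling(2).mean()               # windows follow row order
right = s.sort_index().rolling(2).mean()  # sort by time first, then roll

print(wrong.tolist())   # [nan, 20.0, 15.0]
print(right.tolist())   # [nan, 15.0, 25.0]
```

A cheap guard is to assert `df.index.is_monotonic_increasing` before any rolling or diff work.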

Caveats

  • IQR is a weak distributional rule—combine with domain thresholds.
  • Multiple comparisons may require p-value adjustments.

In production

  • Note random seeds and data snapshot versions in notebooks.
  • Document metric definitions (numerator/denominator) with results.

Alternatives

  • pandas: table transforms and aggregates
  • SQL: push down large aggregations to the database
  • Spark: distributed batch jobs

Summary

Key takeaways

  1. EDA: explore structure and visualize early
  2. Group analysis: groupby, pivot_table
  3. Time series: resample, rolling
  4. Visualization: Matplotlib, Seaborn
  5. Insights: connect data to decisions

Next steps

  • File automation
  • Web scraping