Hands-on Data Analysis with Python | Pandas Workflows
In a nutshell
From CSV to insights: exploratory analysis with Pandas, Matplotlib, and Seaborn—distributions, correlations, cohort-style breakdowns, and time-series patterns.
Introduction
“Find insight in the data”
Hands-on data analysis is the process of getting to know a dataset and extracting meaning you can act on.
1. Load and explore
First pass
import pandas as pd
import numpy as np
df = pd.read_csv('sales_data.csv')
print(f"Shape: {df.shape}")
print("\nColumn info:")
df.info()  # info() prints its report directly and returns None
print("\nSummary statistics:")
print(df.describe())
print("\nFirst 5 rows:")
print(df.head())
print("\nMissing values per column:")
print(df.isnull().sum())
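Once the first pass surfaces missing values and dtype problems, a typical next step is light cleaning: coerce types, impute or drop. A minimal sketch on a toy frame (the `price` and `quantity` columns are illustrative, not from `sales_data.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a freshly loaded CSV (columns are illustrative)
df = pd.DataFrame({
    "price": ["100", "250", None, "80"],   # numbers stored as strings
    "quantity": [1, np.nan, 3, 2],
})

# Coerce string columns to numeric; unparseable values become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Impute a numeric column with its median, then drop rows still missing price
df["quantity"] = df["quantity"].fillna(df["quantity"].median())
df = df.dropna(subset=["price"])

print(df.dtypes)
print(df.isnull().sum())
```

Whether to impute or drop depends on how much data is missing and why; the point is to decide explicitly before computing any statistics.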
2. Exploratory data analysis (EDA)
Distributions
import matplotlib.pyplot as plt
df['age'].hist(bins=20)
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
df.boxplot(column='salary', by='department')
plt.title('Salary distribution by department')
plt.show()
Correlation
import seaborn as sns
corr_matrix = df[['age', 'salary', 'experience']].corr()
print(corr_matrix)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation heatmap')
plt.show()
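One caveat worth a sketch: `.corr()` defaults to Pearson's r, which only measures linear association. For monotonic but nonlinear relationships, Spearman rank correlation is often more faithful. A self-contained comparison on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.exp(x)  # monotonic but strongly nonlinear relationship

s = pd.DataFrame({"x": x, "y": y})
pearson = s.corr(method="pearson").loc["x", "y"]
spearman = s.corr(method="spearman").loc["x", "y"]

# Pearson understates the association; Spearman sees a perfect monotonic link
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```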
3. Group analysis
Aggregations
dept_avg = df.groupby('department')['salary'].mean()
print(dept_avg)
result = df.groupby('department').agg({
    'salary': ['mean', 'min', 'max'],
    'age': 'mean',
    'name': 'count'
})
print(result)
pivot = df.pivot_table(
    values='salary',
    index='department',
    columns='gender',
    aggfunc='mean'
)
print(pivot)
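`pivot_table` also accepts `fill_value` for empty cells and `margins=True` for overall row/column summaries. A self-contained sketch on made-up data (not the employee table above):

```python
import pandas as pd

# Toy data: one department/gender combination (HR, M) has no rows
df = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT", "HR"],
    "gender": ["F", "M", "F", "M", "F"],
    "salary": [50, 60, 70, 80, 55],
})

pivot = df.pivot_table(
    values="salary",
    index="department",
    columns="gender",
    aggfunc="mean",
    fill_value=0,   # show 0 instead of NaN for empty cells
    margins=True,   # add an "All" row/column with overall means
)
print(pivot)
```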
4. Time series
Resampling and rolling windows
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
monthly = df.resample('ME').sum()  # monthly totals; use 'M' on pandas < 2.2
df['ma7'] = df['sales'].rolling(window=7).mean()
df['ma30'] = df['sales'].rolling(window=30).mean()
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['sales'], label='Daily sales', alpha=0.5)
plt.plot(df.index, df['ma7'], label='7-day moving average')
plt.plot(df.index, df['ma30'], label='30-day moving average')
plt.legend()
plt.title('Sales trend')
plt.show()
5. Practical example
Customer analytics
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
customers = pd.read_csv('customers.csv')
print("=== Basic statistics ===")
print(f"Total customers: {len(customers)}")
print(f"Average age: {customers['age'].mean():.1f} years")
print(f"Average purchase amount: {customers['purchase_amount'].mean():,.0f} KRW")
customers['age_group'] = pd.cut(
    customers['age'],
    bins=[0, 20, 30, 40, 50, 100],
    labels=['Teens', '20s', '30s', '40s', '50s+']
)
age_analysis = customers.groupby('age_group').agg({
    'purchase_amount': ['mean', 'sum', 'count']
})
print("\n=== Analysis by age group ===")
print(age_analysis)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].hist(customers['age'], bins=20, edgecolor='black')
axes[0, 0].set_title('Age distribution')
axes[0, 1].hist(customers['purchase_amount'], bins=30, edgecolor='black')
axes[0, 1].set_title('Purchase amount distribution')
age_group_avg = customers.groupby('age_group')['purchase_amount'].mean()
axes[1, 0].bar(age_group_avg.index.astype(str), age_group_avg.values)
axes[1, 0].set_title('Average purchase by age group')
axes[1, 0].tick_params(axis='x', rotation=45)
numeric_cols = customers[['age', 'purchase_amount', 'visit_count']]
sns.heatmap(numeric_cols.corr(), annot=True, ax=axes[1, 1])
axes[1, 1].set_title('Correlations')
plt.tight_layout()
plt.savefig('customer_analysis.png', dpi=300)
plt.show()
print("\n=== Insights ===")
print(f"Age group with the highest total purchases: {customers.groupby('age_group')['purchase_amount'].sum().idxmax()}")
print(f"Age group with the highest average purchase: {age_group_avg.idxmax()}")
Practical tips
Analysis workflow
# 1. Understand the data
# - Column meanings
# - dtypes
# - Missing and extreme values
# 2. Hypotheses
# - "Do older customers spend more?"
# - "Are weekends higher revenue?"
# 3. Validate
# - Plots
# - Stats
# - Correlations (with care)
# 4. Insights
# - Business meaning
# - Action items
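The workflow above can be run end to end on synthetic data; the weekend-revenue hypothesis from step 2 reduces to a single group comparison (the dates, effect size, and column names below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily sales with a built-in weekend bump (assumed schema: date, sales)
rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=90, freq="D")
sales = rng.normal(100, 10, size=90) + np.where(dates.dayofweek >= 5, 20, 0)
df = pd.DataFrame({"date": dates, "sales": sales})

# Step 3: validate "weekends earn more" with a group summary
df["is_weekend"] = df["date"].dt.dayofweek >= 5
summary = df.groupby("is_weekend")["sales"].agg(["mean", "count"])
print(summary)
```

A mean gap this size on synthetic data is obvious; on real data, follow up with a significance test before acting on it.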
Going deeper
Outlier flags and group summaries (Pandas)
Mark IQR-based outliers on numeric columns, then summarize by group.
import numpy as np
import pandas as pd
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "region": rng.choice(["A", "B", "C"], size=400),
    "revenue": rng.normal(100, 25, size=400),
})
def iqr_bounds(s: pd.Series, k: float = 1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return lo, hi
lo, hi = iqr_bounds(df["revenue"])
df["outlier"] = (df["revenue"] < lo) | (df["revenue"] > hi)
summary = df.groupby("region")["revenue"].agg(["count", "mean", "std", "median"])
print(summary)
print("Outlier share:", df["outlier"].mean())
Common mistakes
- Computing means and correlations before dropping or imputing missing values.
- Using rolling/diff on unsorted time indexes.
- Claiming causality from plots without statistical tests.
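The unsorted-index mistake is easy to demonstrate: diff (and rolling) operate in storage order, not time order, so sort the index first.

```python
import pandas as pd

# A time series stored out of chronological order
s = pd.Series(
    [3.0, 1.0, 2.0],
    index=pd.to_datetime(["2024-01-03", "2024-01-01", "2024-01-02"]),
)

wrong = s.diff()               # differences in storage order: misleading
right = s.sort_index().diff()  # sort by date first, then difference

print(right)
```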
Caveats
- The IQR rule is a blunt distributional heuristic; combine it with domain-specific thresholds.
- Multiple comparisons may require p-value adjustments.
In production
- Note random seeds and data snapshot versions in notebooks.
- Document metric definitions (numerator/denominator) with results.
Alternatives
| Tool | Role |
|---|---|
| pandas | Table transforms and aggregates |
| SQL | Push down large aggregations |
| Spark | Distributed batch jobs |
Summary
Key takeaways
- EDA: explore structure and visualize early
- Group analysis: groupby, pivot_table
- Time series: resample, rolling
- Visualization: Matplotlib, Seaborn
- Insights: connect data to decisions
Next steps
- File automation
- Web scraping