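Is web scraping legal?

Collecting publicly available data is generally permitted, but this depends on the site's terms of service and on local law, so check robots.txt and avoid sending an excessive number of requests.

BeautifulSoup vs Selenium?

BeautifulSoup suits static pages; Selenium suits dynamic pages that are rendered with JavaScript.

Why set a User-Agent?

To reduce the chance of being identified as a bot and blocked.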

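Python Web Scraping | A Complete Guide to BeautifulSoup and Selenium

March 28, 2026 · 20 min read · Updated March 28, 2026 · Intermediate tutorial

Key takeaways

This post covers web scraping with Python. Starting from a basic request such as response = requests.get('https://example.com'), it walks through the concepts and example code step by step, and is organized to serve as a reference for both day-to-day work and study. Related keywords: Python, web scraping, crawling, BeautifulSoup, Selenium.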

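Introduction

"Collecting data from the web"

Web scraping is a technique for automatically collecting data from websites. Think of it as a Python script repeating, on your behalf, what a browser does when you enter an address and it fetches the HTML.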

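1. requests basics

Fetching HTML

requests.get asks the server for a page and gets back a response body, a status code, and headers, much like sending a request by mail and receiving a reply. Some sites block clients that send the library's default User-Agent (python-requests/<version>), so sending browser-like headers often reduces blocking.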

import requests

# Send a GET request (requests has no default timeout, so set one explicitly)
response = requests.get('https://example.com', timeout=10)

print(response.status_code)  # 200 on success
print(response.text)  # HTML body
print(response.headers)  # response headers

# Set a browser-like User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get('https://example.com', headers=headers, timeout=10)
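When several pages are fetched from the same site, repeating the headers on every call gets tedious. One option, not used elsewhere in this post, is requests.Session, which stores default headers and reuses the underlying connection. The following is a minimal sketch; the helper name make_session is made up for illustration.

```python
import requests

def make_session(user_agent):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session

session = make_session(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)

# Build the request without sending it, to inspect the headers that would go out
prepared = session.prepare_request(requests.Request("GET", "https://example.com"))
print(prepared.headers["User-Agent"])

# A real fetch would then look like this (a timeout still has to be passed per call):
# response = session.get("https://example.com", timeout=10)
```

Because prepare_request only builds the request, this sketch runs without touching the network.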

2. BeautifulSoup

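Parsing HTML

The HTML string a server returns is one large block of text, so picking out just the titles or links from it directly is difficult. BeautifulSoup parses that text into a tree of tags, and find/select then let you pull out only the pieces you want. It is a bit like reading only the table of contents and the body of a book.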

from bs4 import BeautifulSoup
import requests

# Fetch the HTML
url = 'https://example.com'
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Find a single tag (find returns None when nothing matches)
title = soup.find('title')
if title is not None:
    print(title.text)

# Find every matching tag
links = soup.find_all('a')
for link in links:
    print(link.get('href'))  # None when the <a> has no href

# CSS selectors
articles = soup.select('.article-title')
for article in articles:
    print(article.text)
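To see what find, find_all, and select return without depending on a live site, it can help to parse a small inline document. The sketch below uses made-up HTML (SAMPLE_HTML and its class names are assumptions for illustration) and also shows that select_one returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; real pages will use different tags and classes
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h2 class="article-title">First post</h2>
    <h2 class="article-title">Second post</h2>
    <a href="/about">About</a>
    <a>No href here</a>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.find("title").get_text(strip=True)
article_titles = [tag.get_text(strip=True) for tag in soup.select(".article-title")]
hrefs = [a["href"] for a in soup.find_all("a", href=True)]  # skip <a> without href
missing = soup.select_one(".does-not-exist")

print(page_title)      # Sample page
print(article_titles)  # ['First post', 'Second post']
print(hrefs)           # ['/about']
print(missing)         # None
```

Checking for None before reading .text avoids an AttributeError when a page's layout differs from what the selector expects.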

Practical example: scraping news

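import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_news(url):
    """Collect news titles, links, and dates."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }

    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')

    articles = []

    # The class names below are placeholders; adjust them to the target site
    for item in soup.select('.news-item'):
        title = item.select_one('.title')
        link = item.select_one('a[href]')
        date = item.select_one('.date')
        if title is None or link is None:
            continue  # skip items that lack the required fields

        articles.append({
            'title': title.text.strip(),
            'link': link['href'],
            'date': date.text.strip() if date is not None else None
        })

    return pd.DataFrame(articles)

# Usage
df = scrape_news('https://news.example.com')
# utf-8-sig writes a BOM so that Excel detects the file as UTF-8
df.to_csv('news.csv', index=False, encoding='utf-8-sig')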
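The href values collected this way are often relative paths such as /article/1, which are not usable outside the page they came from. A minimal sketch of resolving them against the page URL with the standard library's urljoin follows; the helper name absolutize_links and the sample URLs are assumptions for illustration.

```python
from urllib.parse import urljoin

def absolutize_links(base_url, hrefs):
    """Resolve each href against base_url, skipping empty or missing values."""
    return [urljoin(base_url, href) for href in hrefs if href]

links = absolutize_links(
    "https://news.example.com/politics/",
    ["/article/1", "article/2", "https://other.example.com/x", None],
)
for link in links:
    print(link)
# https://news.example.com/article/1
# https://news.example.com/politics/article/2
# https://other.example.com/x
```

With the DataFrame from the example above, the same idea could be applied as df['link'] = df['link'].map(lambda href: urljoin(url, href)), where url is the page that was scraped.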

3. Selenium (dynamic pages)

Installation

pip install selenium

Basic usage

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Start the Chrome driver
driver = webdriver.Chrome()

try:
    # Open the page
    driver.get('https://example.com')

    # Wait up to 10 seconds for the element to appear in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'content'))
    )

    # Find an element
    title = driver.find_element(By.TAG_NAME, 'h1')
    print(title.text)

    # Click
    button = driver.find_element(By.ID, 'load-more')
    button.click()

    # Scroll to the bottom of the page
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

finally:
    # Always close the browser, even if an error occurred
    driver.quit()

4. Practical example

Product price monitoring

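import requests
from bs4 import BeautifulSoup
import time
from datetime import datetime

def check_price(url, target_price):
    """Check the product price; return True once it is at or below target_price."""
    headers = {
        'User-Agent': 'Mozilla/5.0'
    }

    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the price (the selector differs from site to site)
    price_tag = soup.select_one('.price')
    if price_tag is None:
        print("Price element not found")
        return False
    # e.g. '50,000원' -> 50000 ('원' is the Korean won suffix)
    price = int(price_tag.text.strip().replace(',', '').replace('원', ''))

    print(f"[{datetime.now()}] Current price: {price:,} KRW")

    if price <= target_price:
        print(f"🎉 Target price reached! ({target_price:,} KRW or less)")
        return True

    return False

# Usage (check once an hour)
url = 'https://shopping.example.com/product/123'
target = 50000

while True:
    try:
        if check_price(url, target):
            break
    except requests.exceptions.RequestException as e:
        # A temporary network failure should not end the monitoring loop
        print(f"Request failed: {e}")
    time.sleep(3600)  # wait 1 hour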
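Since price markup differs from site to site, chaining replace calls breaks as soon as the text contains a currency symbol or a label. One possible alternative is a small regex-based helper, sketched below; the name parse_price and the sample strings are assumptions for illustration, and it only handles whole-number amounts with comma separators.

```python
import re

def parse_price(text):
    """Return the first integer amount found in text, or None if it has no digits."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))

print(parse_price("50,000원"))        # 50000
print(parse_price(" ₩1,234,500 "))    # 1234500
print(parse_price("Sold out"))        # None
```

Returning None instead of raising lets the caller decide whether a missing price means "skip this check" or "the page layout changed".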

5. Saving data

Saving to CSV

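import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_data(url):
    """Placeholder scraper: return every link on the page as a list of dicts."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return [{'text': a.get_text(strip=True), 'href': a.get('href')}
            for a in soup.find_all('a')]

def scrape_and_save(url, output_file):
    """Scrape a page and save the result as CSV."""
    # Scrape (swap scrape_data for a site-specific scraper)
    data = scrape_data(url)

    # Build a DataFrame
    df = pd.DataFrame(data)

    # Save as CSV (utf-8-sig writes a BOM so that Excel detects UTF-8)
    df.to_csv(output_file, index=False, encoding='utf-8-sig')
    print(f"Saved: {output_file}")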
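CSV is not the only option for storing scraped records. As a minimal sketch using only the standard library, the same list of dicts could be written to a JSON file or to a SQLite database; the function names, the articles table, and the sample records below are all assumptions for illustration.

```python
import json
import sqlite3

# Made-up records in the same shape as the news example (title, link, date)
SAMPLE_RECORDS = [
    {"title": "First post", "link": "https://news.example.com/1", "date": "2026-03-28"},
    {"title": "두 번째 글", "link": "https://news.example.com/2", "date": "2026-03-28"},
]

def save_json(records, path):
    """Write a list of dicts to a UTF-8 JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

def save_sqlite(records, path):
    """Insert records into an `articles` table, ignoring links already stored."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS articles "
                "(title TEXT, link TEXT UNIQUE, date TEXT)"
            )
            conn.executemany(
                "INSERT OR IGNORE INTO articles (title, link, date) "
                "VALUES (:title, :link, :date)",
                records,
            )
    finally:
        conn.close()

# Usage:
# save_json(SAMPLE_RECORDS, "news.json")
# save_sqlite(SAMPLE_RECORDS, "news.db")
```

The UNIQUE constraint on link together with INSERT OR IGNORE means that re-running the scraper does not store the same article twice.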

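Request intervals, robots.txt, and exception handling (etiquette)

Sending hundreds of requests to the same server in a short time can overload it and get you blocked. It is good practice to check robots.txt to see what is allowed, space out requests with time.sleep, and wrap network calls in try/except so that a single failure does not bring down the whole run.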

import time
import requests

url = 'https://example.com'

# ✅ Check robots.txt
# https://example.com/robots.txt

# ✅ Space out requests
time.sleep(1)  # wait 1 second

# ✅ Set a User-Agent
headers = {'User-Agent': '...'}

# ✅ Handle errors
try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
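The robots.txt check can also be done in code rather than by eye. The sketch below uses the standard library's urllib.robotparser on a made-up rules file and combines it with a delay and per-URL error handling; USER_AGENT, SAMPLE_ROBOTS_TXT, and the helper names are assumptions for illustration, and fetch stands in for whatever function actually downloads a page.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-scraper"  # hypothetical bot name

# Sample rules for illustration; a real run would load the site's own file with
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def polite_fetch_all(urls, parser, fetch, default_delay=1.0):
    """Call fetch(url) for each allowed URL, pausing between requests."""
    delay = parser.crawl_delay(USER_AGENT)
    if delay is None:
        delay = default_delay
    pages = {}
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            print(f"Skipped (disallowed by robots.txt): {url}")
            continue
        try:
            pages[url] = fetch(url)
        except OSError as exc:  # requests' RequestException subclasses OSError
            print(f"Request failed, continuing: {url} ({exc})")
        time.sleep(delay)
    return pages

rules = load_rules(SAMPLE_ROBOTS_TXT)
print(rules.can_fetch(USER_AGENT, "https://example.com/news/1"))     # True
print(rules.can_fetch(USER_AGENT, "https://example.com/private/x"))  # False
print(rules.crawl_delay(USER_AGENT))                                 # 2
```

With requests, fetch could be as simple as lambda u: requests.get(u, timeout=10).text; a failed URL is then logged and skipped while the rest of the list is still processed.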

Wrap-up

Key points

requests: HTTP requests
BeautifulSoup: HTML parsing
Selenium: dynamic pages
Etiquette: robots.txt, request intervals
Storage: CSV, JSON, databases

Next steps

Task scheduling
File automation

Python Environment Setup | Installing Python on Windows/Mac and Getting Started