Python Web Scraping | BeautifulSoup and Selenium Explained
Key points
Python web scraping tutorial: requests, BeautifulSoup for static HTML, Selenium for dynamic pages, ethics (robots.txt, rate limits), and CSV export—SEO-friendly patterns.
Introduction
“Collect data from the web” is a deceptively simple goal. Web scraping is the practice of automatically extracting data from websites by fetching HTML (or other representations), parsing it, and turning it into structured records you can store or analyze.
This article expands on a practical stack—requests, Beautiful Soup, and Selenium—and adds the topics that matter in real projects: ethics and compliance, tool selection, HTTP mechanics, selectors, JavaScript rendering, reliability, storage, and how sites push back with rate limits and anti-bot measures. The emphasis throughout is ethical, respectful scraping: get only what you need, only when you are allowed to, and never at the expense of someone else’s infrastructure.
Core concepts and legal considerations
Scraping sits at the intersection of automation and policy. Technically, you are a client: you send HTTP requests and interpret responses. Socially and legally, you are responsible for what you access, how often you access it, and what you do with the result.
- robots.txt: A voluntary convention that tells well-behaved crawlers which URL paths are off-limits. It is not legally binding in every jurisdiction, but ignoring it is a strong signal of bad faith and can violate a site’s terms. Always read https://example.com/robots.txt before a large crawl.
- Terms of service (ToS): Many sites contractually prohibit automated access or bulk collection even when data is public. Violating ToS can lead to account bans or legal exposure depending on jurisdiction and facts; consult qualified counsel for high-stakes or commercial use.
- Copyright and database rights: Facts (e.g., a product’s public price) may be fine to aggregate in some contexts; creative content, images, and proprietary databases may not. Re-publishing large chunks of text or media is riskier than storing structured numbers you derived.
- Personal data (GDPR, CCPA, etc.): If you scrape information that can identify people, you may need a lawful basis, notice, and retention limits. Treat PII as toxic by default: minimize collection and secure storage.
A practical rule: if you would not manually reload the page hundreds of times per minute, your bot should not either. Pair technical limits (delays, caching) with policy checks (ToS, robots.txt).
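The robots.txt check described above can be automated with the standard library. The following is a minimal sketch, assuming a hypothetical robots.txt body inlined as a string so it runs offline; the agent name `ResearchBot` and the paths are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


AGENT = "ResearchBot"  # illustrative user-agent token
parser = build_parser(ROBOTS_TXT)

allowed_public = parser.can_fetch(AGENT, "https://example.com/products/1")
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/report")
delay = parser.crawl_delay(AGENT)  # None when the file sets no Crawl-delay

print(allowed_public, allowed_private, delay)
```

Against a live site you would call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parse()`, then skip any URL for which `can_fetch` returns False.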
Beautiful Soup vs Scrapy vs Selenium
| Tool | Best for | Why choose it |
|---|---|---|
| Beautiful Soup + requests | One-off scripts, small batches, static HTML | Minimal setup; easy to read; great for learning and quick parsers. |
| Scrapy | Large crawls, many URLs, politeness and pipelines | Framework with spiders, queues, middleware, and export pipelines; built for scale. |
| Selenium | Heavy client-side rendering, form flows, “click to load more” | Drives a real browser; you see what a user sees after JavaScript runs. |
Heuristic: start with requests + Beautiful Soup. If the HTML you need is not in the first response (empty shell, data loaded via XHR), move to browser automation (Selenium, or Playwright for modern apps). If you are crawling a whole site with rules and scheduled jobs, Scrapy often pays for itself.
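One way to apply this heuristic is to check whether the element you care about already exists in the raw response. The sketch below uses only the standard-library HTML parser so it stays self-contained; `ClassFinder`, `needs_browser`, and the sample markup are hypothetical. With Beautiful Soup installed, `soup.select_one(".product")` would serve the same purpose.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record whether any start tag carries the wanted CSS class."""

    def __init__(self, wanted_class: str) -> None:
        super().__init__()
        self.wanted_class = wanted_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.found = True


def needs_browser(html: str, wanted_class: str) -> bool:
    """Return True when the target element is absent from the raw HTML."""
    finder = ClassFinder(wanted_class)
    finder.feed(html)
    return not finder.found


# Illustrative responses: one server-rendered, one empty JavaScript shell.
static_page = '<ul><li class="product card">Widget</li></ul>'
js_shell = '<div id="root"></div><script src="/app.js"></script>'

print(needs_browser(static_page, "product"))
print(needs_browser(js_shell, "product"))
```

If the check reports that the element is missing, look for a JSON endpoint in the browser's Network tab before reaching for a full browser.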
HTTP requests, headers, sessions, and cookies
Servers infer client type from headers. A missing or generic User-Agent can get you 403s or CAPTCHAs. A session reuses cookies (e.g., after login) across requests—essential for many authenticated flows (only when permitted by the site’s rules).
import requests

# Placeholder URL and contact address; use your own.
headers = {
    "User-Agent": (
        "ResearchBot/1.0 (+https://yoursite.example/about; contact@yoursite.example) "
        "Python-requests/2.x"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# A session keeps headers and cookies across requests.
session = requests.Session()
session.headers.update(headers)

r = session.get("https://example.com/api/items", params={"page": 1}, timeout=15)
r.raise_for_status()
print(r.status_code, r.headers.get("content-type", ""))
Cookies: use session.post(login_url, data={...}) once, then session.get(protected_url) so authentication sticks. Never log real credentials in shared repositories.
CSS selectors and XPath
CSS selectors are concise and widely used in front-end stacks. XPath is powerful for positional or text-contains queries that are awkward in pure CSS.
from bs4 import BeautifulSoup
import requests

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# CSS: class and attribute
for a in soup.select('article h2 a[href^="/p/"]'):
    print(a.get_text(strip=True), a["href"])

# lxml can expose XPath (install: pip install lxml)
from lxml import etree

tree = etree.HTML(html)

# Example: all third-column cells in a table
for cell in tree.xpath('//table[@id="prices"]//tr/td[3]'):
    print((cell.text or "").strip())
Tip: prefer stable hooks such as data-* attributes, ids, rel="canonical", or semantic tags over brittle div > div > div chains that break on every deploy.
Dynamic content (JavaScript rendering)
If requests.get returns a skeleton and the data appears only after JS runs, use a headless browser. Selenium drives Chrome/Firefox; for new projects, Playwright is a strong alternative with solid waiting APIs.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/dynamic")
    # Wait until the page signals that its data has loaded.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "[data-loaded='true']"))
    )
    print(driver.find_element(By.TAG_NAME, "h1").text)
finally:
    # Always release the browser, even when the wait times out.
    driver.quit()
Performance: browser automation is orders of magnitude heavier than raw HTTP. Cache pages, cap concurrency, and where possible fall back to the API endpoints the site already calls: inspect the browser’s Network tab, which often reveals JSON responses you can fetch with requests.get (with permission).
Rate limiting, retries, and error handling
Production scrapers must assume transient failures: timeouts, 5xx responses, connection resets, and throttling (HTTP 429). Use bounded retries with exponential backoff and jitter, cap total wait time, and respect Retry-After when present.
import random
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry


def make_session() -> requests.Session:
    """Build a session whose adapter retries transient failures."""
    s = requests.Session()
    retry = Retry(
        total=5,
        connect=3,
        read=3,
        status=3,
        status_forcelist=(429, 500, 502, 503, 504),
        # Idempotent methods only: replaying a POST can repeat side effects.
        allowed_methods=frozenset(["GET", "HEAD"]),
        backoff_factor=0.5,  # sleeps of roughly 0, 1, 2, 4, ... seconds between retries
        respect_retry_after_header=True,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_maxsize=10)
    s.mount("https://", adapter)
    s.mount("http://", adapter)
    return s


def fetch_politely(url: str) -> str:
    """Fetch a URL with adapter-level retries plus a bounded outer retry loop."""
    s = make_session()
    for attempt in range(1, 4):
        try:
            r = s.get(url, timeout=20)
            r.raise_for_status()
            time.sleep(0.5 + random.random() * 0.5)  # gentle pacing; tune per site
            return r.text
        except requests.RequestException:
            if attempt == 3:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(2**attempt + random.random())
    raise RuntimeError("unreachable")
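The capped backoff and Retry-After handling mentioned above can also be computed by hand when you manage the retry loop yourself. This is a sketch under stated assumptions: `backoff_delay`, `retry_after_seconds`, and `next_wait` are hypothetical helpers, the 60-second cap is an arbitrary default, and "full jitter" (a uniform draw up to the exponential bound) is one common jitter strategy among several.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header given as delta-seconds or as an HTTP-date."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: fall back to our own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_wait(attempt: int, retry_after: Optional[str], cap: float = 60.0) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to give up."""
    hinted = retry_after_seconds(retry_after)
    if hinted is not None:
        # Never retry sooner than the server asked; give up if that exceeds our budget.
        return hinted if hinted <= cap else None
    return backoff_delay(attempt, cap=cap)
```

A caller would pass `response.headers.get("Retry-After")` as `retry_after` and stop retrying when `next_wait` returns None.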
Logging: include URL, status code, and exception type. Do not log full HTML or tokens.
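A small helper can enforce that logging rule by redacting sensitive query parameters before a URL reaches the log. This is a sketch, not a complete scrubber: `SENSITIVE_PARAMS`, `redact_url`, and `failure_message` are hypothetical names, and the parameter list is only an example.

```python
import logging
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

logger = logging.getLogger("scraper")

# Hypothetical set of query parameters that should never appear in logs.
SENSITIVE_PARAMS = {"token", "api_key", "session"}


def redact_url(url: str) -> str:
    """Replace the values of sensitive query parameters with a marker."""
    parts = urlsplit(url)
    query = [
        (key, "REDACTED" if key.lower() in SENSITIVE_PARAMS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def failure_message(url: str, status, exc) -> str:
    """Build a log line with URL, status code, and exception type only."""
    error = type(exc).__name__ if exc is not None else "-"
    return f"fetch failed url={redact_url(url)} status={status} error={error}"


message = failure_message(
    "https://example.com/items?page=2&token=abc123", 503, TimeoutError("read timed out")
)
logger.warning(message)
```

The message carries the exception type rather than its full text, since exception messages can themselves contain URLs or credentials.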
Data storage strategies
Choose storage by volume, query needs, and downstream tools.
- CSV (pandas.to_csv): simplest for ad-hoc analysis and Excel handoff; use utf-8-sig for Excel on Windows.
- JSON lines (one JSON object per line): good for streaming large datasets and append-only pipelines.
- SQLite: zero-ops local DB; use transactions and deduplicate with unique keys.
- Parquet / cloud warehouses: for analytics at scale (often via pandas, DuckDB, or ETL to PostgreSQL).
import json
import sqlite3
from pathlib import Path
from datetime import datetime, timezone


def upsert_item(conn: sqlite3.Connection, url: str, title: str) -> None:
    """Insert a page row, or refresh title and timestamp if the URL exists."""
    conn.execute(
        """
        INSERT INTO pages (url, title, fetched_at)
        VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            title=excluded.title,
            fetched_at=excluded.fetched_at
        """,
        (url, title, datetime.now(timezone.utc).isoformat()),
    )


def save_jsonl(path: Path, records: list[dict]) -> None:
    """Append records to a JSON lines file, one object per line."""
    with path.open("a", encoding="utf-8") as f:
        for row in records:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
Schema tip: store fetched_at, source URL, and a content hash to detect changes without re-parsing everything.
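The schema tip can be sketched with SQLite and a SHA-256 digest. The table name `page_hashes`, its columns, and the helper names are assumptions made for illustration; they are a separate, simplified variant of the `pages` table used earlier.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def content_hash(text: str) -> str:
    """Stable fingerprint of a page body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_hashes (
            url TEXT PRIMARY KEY,
            content_hash TEXT NOT NULL,
            fetched_at TEXT NOT NULL
        )
        """
    )


def record_fetch(conn: sqlite3.Connection, url: str, body: str) -> bool:
    """Store the latest hash; return True when the content is new or changed."""
    new_hash = content_hash(body)
    row = conn.execute(
        "SELECT content_hash FROM page_hashes WHERE url = ?", (url,)
    ).fetchone()
    changed = row is None or row[0] != new_hash
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO page_hashes (url, content_hash, fetched_at)
            VALUES (?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
                content_hash=excluded.content_hash,
                fetched_at=excluded.fetched_at
            """,
            (url, new_hash, datetime.now(timezone.utc).isoformat()),
        )
    return changed


conn = sqlite3.connect(":memory:")
init_db(conn)
first = record_fetch(conn, "https://example.com/p/1", "<h1>Widget</h1>")
second = record_fetch(conn, "https://example.com/p/1", "<h1>Widget</h1>")
third = record_fetch(conn, "https://example.com/p/1", "<h1>Widget v2</h1>")
print(first, second, third)
```

Only fetches where `record_fetch` returns True need to go through the parser, which keeps re-parsing proportional to actual change.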
Anti-scraping, proxies, and rotation (ethics first)
Sites protect themselves with IP rate limits, fingerprinting, CAPTCHAs, and session-bound or per-request tokens (for example, CSRF tokens). Commercial proxy pools and header rotation can technically bypass some controls, but bypassing access restrictions may violate law, ToS, and ethics.
Ethical use of proxies is limited: geo routing for permitted testing, corporate egress, or reducing load on a single IP when a site’s operator agrees. If you are blocked, first reduce your request rate, look for an official API, and ask for access. Rotating IPs or headers to evade blocks is gray-hat behavior that this article does not endorse for production work.
If you do run at scale in a legitimate context (e.g., internal index of pages you own), combine caching, ETags, If-Modified-Since, and single-flight fetches to avoid duplicate work.
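Conditional requests are the core of that advice: send the validators you stored last time, and reuse the cached body when the server answers 304 Not Modified. The sketch below covers only the header and status logic; `CacheEntry`, `conditional_headers`, and `resolve_body` are hypothetical names, and single-flight deduplication is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    """What we remembered from the previous successful fetch."""
    body: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[CacheEntry]) -> dict:
    """Build If-None-Match / If-Modified-Since headers from a cache entry."""
    headers = {}
    if entry is None:
        return headers
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def resolve_body(status_code: int, new_body: str, entry: Optional[CacheEntry]) -> str:
    """Return the cached body on 304 Not Modified, otherwise the fresh body."""
    if status_code == 304 and entry is not None:
        return entry.body
    return new_body


cached = CacheEntry(
    body="<h1>Widget</h1>",
    etag='"abc"',
    last_modified="Wed, 21 Oct 2015 07:28:00 GMT",
)
print(conditional_headers(cached))
print(resolve_body(304, "", cached))
```

With requests, these headers would be passed as `session.get(url, headers=conditional_headers(entry))`, and the next entry would be built from the response's `ETag` and `Last-Modified` headers.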
Real-world project: internal catalog sync (anonymized)
In one project, the goal was a nightly sync of a subset of public product fields from a partner’s HTML catalog (no public API) into our DB for read-only display. Lessons learned:
- Negotiated a crawl window and max RPS in writing. That removed guesswork and legal risk.
- Cached by SKU; skipped unchanged rows using ETag + content hash, cutting traffic by ~70%.
- Parsers broke twice when the partner re-skinned the site. We added smoke tests on golden HTML fixtures and CI checks that fail when selectors stop matching.
- Monitoring: alert on error rate and zero rows parsed—both indicate silent breakage.
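The monitoring rule in the last item can be expressed as a small post-run check. This is an illustrative sketch, not the project's actual code: `RunStats`, `health_problems`, and the 5% error threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Counters collected during one scraper run."""
    requests_made: int
    requests_failed: int
    rows_parsed: int


def health_problems(stats: RunStats, max_error_rate: float = 0.05) -> list:
    """Return human-readable problems; an empty list means the run looks healthy."""
    problems = []
    if stats.requests_made == 0:
        problems.append("no requests were made")
        return problems
    error_rate = stats.requests_failed / stats.requests_made
    if error_rate > max_error_rate:
        problems.append(
            f"error rate {error_rate:.0%} exceeds threshold {max_error_rate:.0%}"
        )
    if stats.rows_parsed == 0:
        # Requests can succeed while selectors silently stop matching.
        problems.append("zero rows parsed (selectors may have broken)")
    return problems


healthy = health_problems(RunStats(requests_made=100, requests_failed=1, rows_parsed=250))
broken = health_problems(RunStats(requests_made=100, requests_failed=20, rows_parsed=0))
print(healthy)
print(broken)
```

A scheduler or CI job could fail the run, or page someone, whenever the returned list is non-empty.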
The stack was requests + Beautiful Soup for static HTML; a separate job used Selenium for one wizard-style page the partner could not API-ify in time. We decommissioned Selenium once a JSON feed appeared—the best scrape is the one you delete when a proper contract exists.
1. requests basics (quick reference)
Fetching HTML
import requests
response = requests.get("https://example.com", timeout=15)
print(response.status_code) # 200
print(response.text[:500]) # HTML body (truncated)
print(response.headers.get("server"))
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
response = requests.get("https://example.com", headers=headers, timeout=15)
2. Beautiful Soup
Parsing HTML
from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url, timeout=15)
soup = BeautifulSoup(response.text, "html.parser")

# First matching tag, or None
title = soup.find("title")
if title:
    print(title.get_text(strip=True))

# All links that carry an href attribute
for link in soup.find_all("a", href=True):
    print(link["href"])

# CSS selector
articles = soup.select(".article-title")
for article in articles:
    print(article.get_text(strip=True))
Example: news headlines
import requests
from bs4 import BeautifulSoup
import pandas as pd


def scrape_news(url: str) -> pd.DataFrame:
    """Collect news titles and links (selectors are hypothetical)."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    response = requests.get(url, headers=headers, timeout=20)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    articles = []
    for item in soup.select(".news-item"):
        title_el = item.select_one(".title")
        link_el = item.select_one("a[href]")  # only anchors that have an href
        date_el = item.select_one(".date")
        # Skip incomplete items instead of crashing on a missing element.
        if not (title_el and link_el and date_el):
            continue
        articles.append(
            {
                "title": title_el.get_text(strip=True),
                "link": link_el["href"],
                "date": date_el.get_text(strip=True),
            }
        )
    return pd.DataFrame(articles)


# df = scrape_news("https://news.example.com")
# df.to_csv("news.csv", index=False, encoding="utf-8-sig")
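The `link` values collected above are whatever the page put in `href`, which is often a relative path. A short sketch of normalizing them with the standard library follows; `absolutize` and `same_site` are hypothetical helper names, and the URLs are placeholders.

```python
from urllib.parse import urljoin, urlsplit


def absolutize(base_url: str, href: str) -> str:
    """Resolve a possibly relative href against the page it came from."""
    return urljoin(base_url, href)


def same_site(base_url: str, url: str) -> bool:
    """True when both URLs share the same host (and port, if present)."""
    return urlsplit(base_url).netloc == urlsplit(url).netloc


page = "https://news.example.com/world/"
links = ["/p/123", "story-1", "https://other.example.org/x"]

absolute = [absolutize(page, href) for href in links]
internal = [url for url in absolute if same_site(page, url)]
print(absolute)
print(internal)
```

Filtering to same-site links before following them also keeps a crawl from wandering onto hosts whose robots.txt and terms you have not checked.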
3. Selenium (dynamic pages)
Install
pip install selenium
Basic usage
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")

    # Wait up to 10 seconds for the content container to appear.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "content"))
    )

    title = driver.find_element(By.TAG_NAME, "h1")
    print(title.text)

    # Click a "load more" button, then scroll to the bottom of the page.
    button = driver.find_element(By.ID, "load-more")
    button.click()
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
finally:
    driver.quit()
4. Real-world example: price monitoring (fixed parsing)
import re
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup


def parse_krw(s: str) -> int:
    """Extract the integer amount from a price string such as "49,900 KRW"."""
    digits = re.sub(r"[^\d]", "", s)
    if not digits:
        # Returning 0 here would look like a price below any target.
        raise ValueError(f"No digits found in price text: {s!r}")
    return int(digits)


def check_price(url: str, target_price: int) -> bool:
    """Read product price from a page (selectors must match the real site)."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; PriceCheck/1.0)"}
    response = requests.get(url, headers=headers, timeout=20)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    node = soup.select_one(".price")
    if not node:
        raise ValueError("Price selector did not match any element")

    price = parse_krw(node.get_text())
    print(f"[{datetime.now().isoformat()}] Current price: {price:,} KRW")
    if price <= target_price:
        print(f"Target reached (<= {target_price:,} KRW)")
        return True
    return False


# Example: check hourly on a permitted page only
# url = "https://shopping.example.com/product/123"
# target = 50_000
# while True:
#     if check_price(url, target):
#         break
#     time.sleep(3600)
5. Saving data (CSV and helpers)
import pandas as pd


def scrape_data(url: str) -> list[dict]:
    """Replace with your own parser: requests + Beautiful Soup → list[dict]."""
    return []


def scrape_and_save(url: str, output_file: str) -> None:
    """Scrape and write CSV."""
    data = scrape_data(url)
    df = pd.DataFrame(data)
    df.to_csv(output_file, index=False, encoding="utf-8-sig")
    print(f"Saved: {output_file}")
Practical tips: scraping etiquette
# Check robots.txt: https://example.com/robots.txt

# Space out requests
import time

time.sleep(1)

# Descriptive User-Agent
headers = {"User-Agent": "MyScraper/1.0 (+https://example.com/bot)"}

# Handle errors
import requests

try:
    response = requests.get("https://example.com", headers=headers, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
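A fixed `time.sleep(1)` works for a single loop, but a reusable throttle makes the spacing explicit and testable. The following is a minimal sketch; the `Throttle` class and its injectable `clock` and `sleep` parameters are hypothetical conveniences, not part of any library.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the last request was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Usage sketch (no network calls are made here):
# throttle = Throttle(min_interval=1.0)
# for url in urls:
#     throttle.wait()
#     response = requests.get(url, headers=headers, timeout=10)
```

Because the throttle measures elapsed time, slow responses count toward the interval, so the scraper never waits longer than it needs to while still staying under the chosen rate.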
Summary
Key takeaways
- requests: HTTP calls, sessions, headers, timeouts.
- Beautiful Soup: fast parsing for static HTML with CSS selectors; use lxml when you need XPath.
- Selenium (or Playwright): when JavaScript owns the DOM.
- Scrapy: when the problem is a crawl at scale with pipelines and politeness.
- Ethics and law: robots.txt, ToS, data minimization, and human-scale rates.
- Reliability: retries, backoff, 429 handling, and monitoring.
- Storage: CSV, JSONL, or SQLite for structure and deduplication.
Next steps
- [Task scheduling](/en/blog/python-series-23-task-scheduling/)
- [File automation](/en/blog/python-series-21-file-automation/)
Related posts
- [Python environment setup | Install Python on Windows and Mac](/en/blog/python-series-01-environment-setup/)
FAQ
When would I use this in production work?
When you have permission to automate access and a clear data need—monitoring, internal aggregation, or research. Always align with policy, APIs first, and document rate limits and failure modes.
What should I read first in the series?
Follow Previous / Related links at the bottom of each article, or use the Python series index for the full path.
Where can I go deeper?
Read the official docs for Requests, Beautiful Soup, Scrapy, and Selenium, and for HTML semantics the MDN documentation. For legal questions, use counsel—not blog posts.
See also (internal links)
- [Python File Automation | Organize, Rename, and Back Up Files](/en/blog/python-series-21-file-automation/)
- [Pandas Basics | Complete Guide to Python Data Analysis](/en/blog/python-series-16-pandas/)
- [JavaScript DOM Manipulation | Control Web Pages Dynamically](/en/blog/javascript-series-06-dom/)
Keywords
Python, Web Scraping, Crawling, Beautiful Soup, Selenium, requests