Is web scraping legal?

Collecting publicly available data is often allowed, but check robots.txt, terms of use, and avoid hammering servers with excessive requests.

BeautifulSoup vs Selenium?

BeautifulSoup fits static HTML; Selenium fits pages that rely heavily on JavaScript.

Why set a User-Agent?

To reduce the chance of being blocked as an unidentified bot.

Python Web Scraping | BeautifulSoup and Selenium Explained

March 28, 2026 · 20 min read · Updated March 28, 2026 · Intermediate Tutorial

Key points of this article

Python web scraping tutorial: requests, BeautifulSoup for static HTML, Selenium for dynamic pages, ethics (robots.txt, rate limits), and CSV export—SEO-friendly patterns.

Introduction

“Collect data from the web” is a deceptively simple goal. Web scraping is the practice of automatically extracting data from websites by fetching HTML (or other representations), parsing it, and turning it into structured records you can store or analyze.

This article expands on a practical stack—requests, Beautiful Soup, and Selenium—and adds the topics that matter in real projects: ethics and compliance, tool selection, HTTP mechanics, selectors, JavaScript rendering, reliability, storage, and how sites push back with rate limits and anti-bot measures. The emphasis throughout is ethical, respectful scraping: get only what you need, only when you are allowed to, and never at the expense of someone else’s infrastructure.

Core concepts and legal considerations

Scraping sits at the intersection of automation and policy. Technically, you are a client: you send HTTP requests and interpret responses. Socially and legally, you are responsible for what you access, how often you access it, and what you do with the result.

robots.txt: A voluntary convention that tells well-behaved crawlers which URL paths are off-limits. It is not a law everywhere, but ignoring it is a strong signal of bad faith and can violate a site’s terms. Always read https://example.com/robots.txt before a large crawl.
Terms of service (ToS): Many sites contractually prohibit automated access or bulk collection even when data is public. Violating ToS can lead to account bans or legal exposure depending on jurisdiction and facts—consult qualified counsel for high-stakes or commercial use.
Copyright and database rights: Facts (e.g., a product’s public price) may be fine to aggregate in some contexts; creative content, images, and proprietary databases may not. Re-publishing large chunks of text or media is riskier than storing structured numbers you derived.
Personal data (GDPR, CCPA, etc.): If you scrape information that can identify people, you may need a lawful basis, notice, and retention limits. Treat PII as toxic by default: minimize collection and secure storage.

A practical rule: if you would not manually reload the page hundreds of times per minute, your bot should not either. Pair technical limits (delays, caching) with policy checks (ToS, robots.txt).
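
The robots.txt check described above can be automated with the standard library. The sketch below is a minimal illustration, not a compliance tool: the robots.txt text, the bot name `ResearchBot`, and the example URLs are all assumptions, and the rules are parsed from a string so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download it from
# https://example.com/robots.txt, e.g. with rp.set_url(...) followed by rp.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rp = build_parser(ROBOTS_TXT)
agent = "ResearchBot"  # assumed bot name, matched against User-agent lines

allowed_public = rp.can_fetch(agent, "https://example.com/products/1")
allowed_private = rp.can_fetch(agent, "https://example.com/private/report")
delay = rp.crawl_delay(agent)  # None when no Crawl-delay applies

print(allowed_public)   # True
print(allowed_private)  # False
print(delay)            # 5
```

A crawler built on this would skip any URL where `can_fetch` returns False and use the crawl delay, when present, as a floor for its request spacing. The ToS check still has to be done by a person.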

Beautiful Soup vs Scrapy vs Selenium

Tool	Best for	Why choose it
Beautiful Soup + `requests`	One-off scripts, small batches, static HTML	Minimal setup; easy to read; great for learning and quick parsers.
Scrapy	Large crawls, many URLs, politeness and pipelines	Framework with spiders, queues, middleware, and export pipelines; built for scale.
Selenium	Heavy client-side rendering, form flows, “click to load more”	Drives a real browser; you see what a user sees after JavaScript runs.

Heuristic: start with requests + Beautiful Soup. If the HTML you need is not in the first response (empty shell, data loaded via XHR), move to browser automation (Selenium, or Playwright for modern apps). If you are crawling a whole site with rules and scheduled jobs, Scrapy often pays for itself.

HTTP requests, headers, sessions, and cookies

Servers infer client type from headers. A missing or generic User-Agent can get you 403s or CAPTCHAs. A session reuses cookies (e.g., after login) across requests—essential for many authenticated flows (only when permitted by the site’s rules).

import requests

headers = {
    "User-Agent": (
        # Placeholder contact address; use a mailbox you actually monitor
        "ResearchBot/1.0 (+https://yoursite.example/about; contact@yoursite.example) "
        "Python-requests/2.x"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(headers)

r = session.get("https://example.com/api/items", params={"page": 1}, timeout=15)
r.raise_for_status()
print(r.status_code, r.headers.get("content-type", ""))

Cookies: use session.post(login_url, data={...}) once, then session.get(protected_url) so authentication sticks. Never log real credentials in shared repositories.

CSS selectors and XPath

CSS selectors are concise and widely used in front-end stacks. XPath is powerful for positional or text-contains queries that are awkward in pure CSS.

from bs4 import BeautifulSoup
import requests

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# CSS: class and attribute
for a in soup.select('article h2 a[href^="/p/"]'):
    print(a.get_text(strip=True), a["href"])

# lxml can expose XPath (install: pip install lxml)
from lxml import etree
tree = etree.HTML(html)
# Example: all third-column cells in a table
for cell in tree.xpath('//table[@id="prices"]//tr/td[3]'):
    print((cell.text or "").strip())

Tip: prefer stable hooks such as data-* attributes, ids, rel="canonical", or semantic tags over brittle div > div > div chains that break on every deploy.

Dynamic content (JavaScript rendering)

If requests.get returns a skeleton and the data appears only after JS runs, use a headless browser. Selenium drives Chrome/Firefox; for new projects, Playwright is a strong alternative with solid waiting APIs.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/dynamic")
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "[data-loaded='true']"))
    )
    print(driver.find_element(By.TAG_NAME, "h1").text)
finally:
    driver.quit()

Performance: browser automation is orders of magnitude heavier than raw HTTP. Cache pages, cap concurrency, and fall back to API endpoints the site already calls (inspect Network tab—often JSON you can requests.get with permission).
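
One way to act on the "cache pages" advice is a small on-disk cache in front of whatever fetcher you use. The sketch below is an illustration under assumptions: `PageCache`, its one-hour default lifetime, and the `fake_fetch` stand-in are hypothetical names, and the fetch function is injected so the same cache can wrap either requests or a browser session.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Callable


class PageCache:
    """Tiny on-disk cache keyed by a hash of the URL (hypothetical helper)."""

    def __init__(self, directory: Path, max_age_seconds: float = 3600.0) -> None:
        self.directory = directory
        self.max_age_seconds = max_age_seconds
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url: str) -> Path:
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.html"

    def get_or_fetch(self, url: str, fetch: Callable[[str], str]) -> str:
        """Return a fresh cached copy, or call fetch(url) and store the result."""
        path = self._path_for(url)
        if path.exists():
            age = time.time() - path.stat().st_mtime
            if age < self.max_age_seconds:
                return path.read_text(encoding="utf-8")
        html = fetch(url)
        path.write_text(html, encoding="utf-8")
        return html


# Demo with a fake fetcher so nothing touches the network.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html>rendered page</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache = PageCache(Path(tmp))
    first = cache.get_or_fetch("https://example.com/dynamic", fake_fetch)
    second = cache.get_or_fetch("https://example.com/dynamic", fake_fetch)

print(len(calls))  # 1: the second call was served from disk
```

In a real job, `fake_fetch` would be replaced by a function that drives the browser and returns `driver.page_source`, so the expensive render happens at most once per cache lifetime.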

Rate limiting, retries, and error handling

Production scrapers must assume transient failures: timeouts, 5xx responses, connection resets, and throttling (HTTP 429). Use bounded retries with exponential backoff and jitter, cap total wait time, and respect Retry-After when present.

import random
import time
from typing import Optional

import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

def make_session() -> requests.Session:
    s = requests.Session()
    retry = Retry(
        total=5,
        connect=3,
        read=3,
        status=3,
        status_forcelist=(429, 500, 502, 503, 504),
        # Idempotent methods only: retrying POST can duplicate side effects
        allowed_methods=frozenset(["GET", "HEAD"]),
        backoff_factor=0.5,  # exponential backoff; exact schedule depends on the urllib3 version
        respect_retry_after_header=True,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_maxsize=10)
    s.mount("https://", adapter)
    s.mount("http://", adapter)
    return s

def fetch_politely(url: str, session: Optional[requests.Session] = None) -> str:
    # Pass a shared session to reuse connections across many URLs
    s = session or make_session()
    for attempt in range(1, 4):
        try:
            r = s.get(url, timeout=20)
            r.raise_for_status()
            time.sleep(0.5 + random.random() * 0.5)  # gentle pacing; tune per site
            return r.text
        except requests.RequestException:
            if attempt == 3:
                raise
            # Outer backoff with jitter, on top of the adapter-level retries
            time.sleep(2**attempt + random.random())
    raise RuntimeError("unreachable")

Logging: include URL, status code, and exception type. Do not log full HTML or tokens.
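
The backoff policy described in this section (exponential growth, jitter, a cap, and deference to Retry-After) can also be written as a pure function, which makes it easy to unit test. This is a sketch under assumptions: `backoff_delay`, its default `base` and `cap`, and the "full jitter" strategy are illustrative choices rather than anything requests or urllib3 prescribes, and only the delta-seconds form of Retry-After is handled.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 0.5,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    A numeric Retry-After value from the server wins over the computed delay.
    The HTTP-date form of Retry-After is not parsed in this sketch, so it
    falls through to the computed backoff.
    """
    if retry_after is not None:
        try:
            return float(max(0, int(retry_after)))
        except ValueError:
            pass  # not delta-seconds; use the computed delay below

    # Exponential growth, capped, then "full jitter": a random point in [0, ceiling).
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return rng() * ceiling


# With the random factor pinned to 1.0 the ceilings are visible directly.
ceilings = [backoff_delay(n, rng=lambda: 1.0) for n in range(1, 6)]
print(ceilings)                             # [0.5, 1.0, 2.0, 4.0, 8.0]
print(backoff_delay(3, retry_after="120"))  # 120.0
```

A caller would typically compare the returned delay against its remaining time budget and give up, rather than retry early, when the server asks for a longer wait than the job can afford.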

Data storage strategies

Choose storage by volume, query needs, and downstream tools.

CSV (pandas DataFrame.to_csv): simplest for ad-hoc analysis and Excel handoff; use encoding="utf-8-sig" so Excel on Windows detects UTF-8.
JSON lines (one JSON object per line): good for streaming large datasets and append-only pipelines.
SQLite: zero-ops local DB; use transactions and deduplicate with unique keys.
Parquet / cloud warehouses: for analytics at scale (often via pandas, DuckDB, or ETL to PostgreSQL).

import json
import sqlite3
from pathlib import Path
from datetime import datetime, timezone

def upsert_item(conn: sqlite3.Connection, url: str, title: str) -> None:
    conn.execute(
        """
        INSERT INTO pages (url, title, fetched_at)
        VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
          title=excluded.title,
          fetched_at=excluded.fetched_at
        """,
        (url, title, datetime.now(timezone.utc).isoformat()),
    )

def save_jsonl(path: Path, records: list[dict]) -> None:
    with path.open("a", encoding="utf-8") as f:
        for row in records:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

Schema tip: store fetched_at, source URL, and a content hash to detect changes without re-parsing everything.
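
The schema tip can be made concrete with a table definition and a hash comparison. The sketch below assumes a `pages` table keyed by `url`, as the upsert above implies; the `content_hash` column, the `store_if_changed` helper, and the in-memory database are illustrative additions. The upsert syntax needs SQLite 3.24 or newer.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

# Assumed schema: url is the unique key that ON CONFLICT(url) relies on.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    url          TEXT PRIMARY KEY,
    title        TEXT NOT NULL,
    content_hash TEXT,
    fetched_at   TEXT NOT NULL
)
"""


def content_hash(html: str) -> str:
    """Stable fingerprint of the fetched body."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def store_if_changed(conn: sqlite3.Connection, url: str, title: str, html: str) -> bool:
    """Upsert the row; return True only when the content is new or changed."""
    new_hash = content_hash(html)
    row = conn.execute(
        "SELECT content_hash FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # unchanged: skip re-parsing and downstream work
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO pages (url, title, content_hash, fetched_at)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
              title=excluded.title,
              content_hash=excluded.content_hash,
              fetched_at=excluded.fetched_at
            """,
            (url, title, new_hash, datetime.now(timezone.utc).isoformat()),
        )
    return True


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
results = [
    store_if_changed(conn, "https://example.com/a", "A", "<p>v1</p>"),
    store_if_changed(conn, "https://example.com/a", "A", "<p>v1</p>"),
    store_if_changed(conn, "https://example.com/a", "A", "<p>v2</p>"),
]
print(results)  # [True, False, True]
```

Hashing the raw body is the simplest option; if pages embed timestamps or rotating tokens, hashing only the extracted fields avoids false "changed" signals.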

Anti-scraping, proxies, and rotation (ethics first)

Sites protect themselves with IP rate limits, fingerprinting, CAPTCHAs, and short-lived or per-request tokens. Commercial proxy pools and header rotation can technically bypass some controls, but bypassing access restrictions may violate law, ToS, and ethics.

Ethical use of proxies is limited: geo routing for permitted testing, corporate egress, or reducing load on a single IP when a site’s operator agrees. If you are blocked, first reduce your rate, look for official APIs, and ask for access. Rotating identities to evade blocks is gray-hat behavior that this article does not endorse for production work.

If you do run at scale in a legitimate context (e.g., internal index of pages you own), combine caching, ETags, If-Modified-Since, and single-flight fetches to avoid duplicate work.
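
ETags and If-Modified-Since work through conditional requests: the client sends back the validators it saved from the previous response, and the server answers 304 Not Modified with an empty body when nothing changed. The helpers below are a minimal sketch of that bookkeeping; `validators_from`, `conditional_headers`, and `resolve_body` are hypothetical names, and the HTTP call itself is left to whichever client you use.

```python
from typing import Mapping, Optional


def validators_from(response_headers: Mapping[str, str]) -> dict:
    """Pick out the cache validators worth saving from a response."""
    return {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Request headers that ask the server to reply 304 if nothing changed."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def resolve_body(status_code: int, body: str, cached_body: Optional[str]) -> str:
    """Use the cached copy on 304, otherwise the fresh body."""
    if status_code == 304:
        if cached_body is None:
            raise ValueError("Got 304 but there is no cached copy to reuse")
        return cached_body
    return body


saved = validators_from({"ETag": '"abc123"', "Last-Modified": "Sat, 28 Mar 2026 10:00:00 GMT"})
request_headers = conditional_headers(saved["etag"], saved["last_modified"])
print(request_headers)
print(resolve_body(304, "", cached_body="<html>cached</html>"))

# With requests, the call would look roughly like:
#   r = session.get(url, headers=request_headers, timeout=15)
#   html = resolve_body(r.status_code, r.text, cached_body)
```

Checking for 304 before calling `raise_for_status()` keeps the intent explicit, since a 304 is a success for a caching client.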

Real-world project: internal catalog sync (anonymized)

In one project, the goal was to nightly sync a subset of public product fields from a partner’s HTML catalog (no public API) into our DB for read-only display. Lessons learned:

Negotiated a crawl window and max RPS in writing. That removed guesswork and legal risk.
Cached by SKU; skipped unchanged rows using ETag + content hash, cutting traffic by ~70%.
Parsers broke twice when the partner re-skinned the site. We added smoke tests on golden HTML fixtures, plus CI checks that fail when selectors stop matching.
Monitoring: alert on error rate and zero rows parsed—both indicate silent breakage.
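
The monitoring rule in the last item can be expressed as a small check that runs at the end of every sync. The names (`RunStats`, `health_alerts`) and the 5% error-rate threshold are assumptions for illustration; the project described here does not specify its actual thresholds.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Counters collected during one scraping run (hypothetical shape)."""
    pages_fetched: int
    pages_failed: int
    rows_parsed: int


def health_alerts(stats: RunStats, max_error_rate: float = 0.05) -> list[str]:
    """Return human-readable alerts; an empty list means the run looks healthy."""
    alerts: list[str] = []
    total = stats.pages_fetched + stats.pages_failed
    if total == 0:
        return ["no pages were attempted"]

    error_rate = stats.pages_failed / total
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.0%} exceeds {max_error_rate:.0%}")

    # Every request can succeed while the parser silently matches nothing.
    if stats.rows_parsed == 0:
        alerts.append("zero rows parsed: selectors may no longer match")
    return alerts


print(health_alerts(RunStats(pages_fetched=100, pages_failed=0, rows_parsed=0)))
print(health_alerts(RunStats(pages_fetched=90, pages_failed=10, rows_parsed=500)))
print(health_alerts(RunStats(pages_fetched=100, pages_failed=1, rows_parsed=480)))
```

The zero-rows check matters because a re-skinned page usually still returns HTTP 200, so status codes alone will not reveal the breakage.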

The stack was requests + Beautiful Soup for static HTML; a separate job used Selenium for one wizard-style page the partner could not API-ify in time. We decommissioned Selenium once a JSON feed appeared—the best scrape is the one you delete when a proper contract exists.

1. `requests` basics (quick reference)

Fetching HTML

import requests

response = requests.get("https://example.com", timeout=15)
print(response.status_code)  # 200
print(response.text[:500])  # HTML body (truncated)
print(response.headers.get("server"))

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
response = requests.get("https://example.com", headers=headers, timeout=15)

2. Beautiful Soup

Parsing HTML

from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url, timeout=15)
soup = BeautifulSoup(response.text, "html.parser")

title = soup.find("title")
if title:
    print(title.get_text(strip=True))

for link in soup.find_all("a", href=True):
    print(link["href"])

articles = soup.select(".article-title")
for article in articles:
    print(article.get_text(strip=True))

Example: news headlines

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_news(url: str) -> pd.DataFrame:
    """Collect news titles and links (selectors are hypothetical)."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    response = requests.get(url, headers=headers, timeout=20)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    articles = []
    for item in soup.select(".news-item"):
        title_el = item.select_one(".title")
        link_el = item.select_one("a[href]")  # require href so link_el["href"] cannot raise KeyError
        date_el = item.select_one(".date")
        if not (title_el and link_el and date_el):
            continue
        articles.append(
            {
                "title": title_el.get_text(strip=True),
                "link": link_el["href"],
                "date": date_el.get_text(strip=True),
            }
        )
    return pd.DataFrame(articles)

# df = scrape_news("https://news.example.com")
# df.to_csv("news.csv", index=False, encoding="utf-8-sig")

3. Selenium (dynamic pages)

Install

pip install selenium

Basic usage

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "content"))
    )
    title = driver.find_element(By.TAG_NAME, "h1")
    print(title.text)

    button = driver.find_element(By.ID, "load-more")
    button.click()
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
finally:
    driver.quit()

4. Real-world example: price monitoring (fixed parsing)

import re
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup

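def parse_krw(s: str) -> int:
    digits = re.sub(r"[^\d]", "", s)
    if not digits:
        # Returning 0 here would falsely satisfy "price <= target_price"
        raise ValueError(f"No digits found in price text: {s!r}")
    return int(digits)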

def check_price(url: str, target_price: int) -> bool:
    """Read product price from a page (selectors must match the real site)."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; PriceCheck/1.0)"}
    response = requests.get(url, headers=headers, timeout=20)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    node = soup.select_one(".price")
    if not node:
        raise ValueError("Price selector did not match any element")
    price = parse_krw(node.get_text())
    print(f"[{datetime.now().isoformat()}] Current price: {price:,} KRW")
    if price <= target_price:
        print(f"Target reached (<= {target_price:,} KRW)")
        return True
    return False

# Example: check hourly on a permitted page only
# url = "https://shopping.example.com/product/123"
# target = 50_000
# while True:
#     if check_price(url, target):
#         break
#     time.sleep(3600)

5. Saving data (CSV and helpers)

import pandas as pd

def scrape_data(url: str) -> list[dict]:
    """Replace with your own parser: requests + Beautiful Soup → list[dict]."""
    return []

def scrape_and_save(url: str, output_file: str) -> None:
    """Scrape and write CSV."""
    data = scrape_data(url)
    df = pd.DataFrame(data)
    df.to_csv(output_file, index=False, encoding="utf-8-sig")
    print(f"Saved: {output_file}")

Practical tips: scraping etiquette

# Check robots.txt: https://example.com/robots.txt
# Space out requests
import time

time.sleep(1)

# Descriptive User-Agent
headers = {"User-Agent": "MyScraper/1.0 (+https://example.com/bot)"}

# Handle errors
import requests

try:
    response = requests.get("https://example.com", timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
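
A fixed `time.sleep(1)` between requests works, but it also waits when parsing has already taken longer than the interval. A minimal alternative is a throttle that only sleeps for the remaining time. The `Throttle` class below is a hypothetical helper; its clock and sleep functions are injectable so the behavior can be tested without real delays.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds actually slept.
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Typical use (the fetch itself is omitted here):
#   throttle = Throttle(min_interval=1.0)
#   for url in urls:
#       throttle.wait()
#       response = requests.get(url, headers=headers, timeout=10)
```

If robots.txt declares a Crawl-delay, that value is a sensible lower bound for `min_interval`.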

Summary

Key takeaways

requests: HTTP calls, sessions, headers, timeouts.
Beautiful Soup: fast parsing for static HTML with CSS selectors; use lxml when you need XPath.
Selenium (or Playwright): when JavaScript owns the DOM.
Scrapy: when the problem is a crawl at scale with pipelines and politeness.
Ethics and law: robots.txt, ToS, data minimization, and human-scale rates.
Reliability: retries, backoff, 429 handling, and monitoring.
Storage: CSV, JSONL, or SQLite for structure and deduplication.

Next steps

[Task scheduling](/en/blog/python-series-23-task-scheduling/)
[File automation](/en/blog/python-series-21-file-automation/)

[Python environment setup | Install Python on Windows and Mac](/en/blog/python-series-01-environment-setup/)

FAQ

When would I use this in production work?

When you have permission to automate access and a clear data need—monitoring, internal aggregation, or research. Always align with policy, APIs first, and document rate limits and failure modes.

What should I read first in the series?

Follow Previous / Related links at the bottom of each article, or use the Python series index for the full path.

Where can I go deeper?

Read the official docs for Requests, Beautiful Soup, Scrapy, and Selenium, and for HTML semantics the MDN documentation. For legal questions, use counsel—not blog posts.
