1 Hour to Web Scraping with Python
Build a real web scraper in 60 minutes. Extract data from websites, handle pagination, and save to CSV. No prior scraping experience needed.
By the end of this tutorial, you'll have a working web scraper that extracts product data from a real website and saves it to CSV.
🎯 What You'll Build
A Python script that:
- Fetches HTML from a website
- Extracts structured data (titles, prices, ratings)
- Handles pagination (multiple pages)
- Saves results to CSV
⏱️ Time Breakdown
- 0–10 min: Install tools
- 10–15 min: Understand HTML structure
- 15–25 min: Extract one item
- 25–40 min: Extract all items
- 40–55 min: Handle pagination
- 55–60 min: Save to CSV
📋 Prerequisites
- Python 3.8+ (see 1 Hour to Python Basics if you're new to Python)
- Basic HTML knowledge (tags like <div>, <a>, <span>)
Step 1: Install Tools (0–10 min)
Install requests (fetch HTML) and beautifulsoup4 (parse HTML):
pip install requests beautifulsoup4
Test it:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.string)
Checkpoint
You should see Example Domain printed. If you get ModuleNotFoundError, re-run the pip install command and make sure it targets the same Python environment you use to run the script.
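Before moving on, it's worth knowing how to confirm a request actually succeeded. A minimal sketch (raise_for_status() raises an exception for 4xx/5xx responses):

import requests

response = requests.get("https://example.com", timeout=10)
print(response.status_code)  # 200 means the request succeeded
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx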
Step 2: Understand HTML Structure (10–15 min)
We'll scrape books.toscrape.com (a practice site).
Open it in your browser → right-click a book → Inspect.
You'll see:
<article class="product_pod">
  <h3><a href="..." title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="star-rating Three">...</p>
</article>
Key selectors:
- Book title: article.product_pod h3 a
- Price: p.price_color
- Rating: p.star-rating (the element's class list contains the rating word)
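BeautifulSoup can use these CSS selectors directly through its select() method, as an alternative to find()/find_all(). A quick sketch, assuming a soup parsed from the books.toscrape.com homepage (you'll build one in Step 3):

# The same queries, expressed as CSS selectors
titles = [a['title'] for a in soup.select('article.product_pod h3 a')]
prices = [p.text for p in soup.select('p.price_color')]
print(titles[0], prices[0])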
Step 3: Extract One Item (15–25 min)
Create scraper.py:
import requests
from bs4 import BeautifulSoup
url = "http://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')  # .content (raw bytes) lets BeautifulSoup detect the page's encoding

# Find the first book
book = soup.find('article', class_='product_pod')

title = book.h3.a['title']  # the title attribute holds the full, untruncated title
price = book.find('p', class_='price_color').text
rating_class = book.find('p', class_='star-rating')['class'][1]  # class list is ['star-rating', 'Three']
print(f"Title: {title}")
print(f"Price: {price}")
print(f"Rating: {rating_class}")
Run:
python scraper.py
Checkpoint
You should see one book's title, price, and rating (e.g., "Three").
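The rating arrives as a word ("Three"). If you'd rather have a number, a small lookup dictionary does the job. A sketch, where RATING_WORDS is just an illustrative name, not part of any library:

# Map the rating class word to an integer (illustrative helper)
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
rating_num = RATING_WORDS.get(rating_class, 0)  # 0 if the word is unrecognized
print(f"Rating as number: {rating_num}")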
Step 4: Extract All Items (25–40 min)
Loop through all books on the page:
books = soup.find_all('article', class_='product_pod')
for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    rating = book.find('p', class_='star-rating')['class'][1]
    print(f"{title} | {price} | {rating}")
Checkpoint
You should see 20 books printed (one page has 20 items).
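With all 20 books in hand, you can already do quick analysis. A sketch that converts prices to floats and finds the cheapest book on the page, assuming the books list from above:

parsed = []
for book in books:
    price_text = book.find('p', class_='price_color').text  # e.g. "£51.77"
    parsed.append({
        'title': book.h3.a['title'],
        'price': float(price_text.lstrip('£')),  # drop the currency symbol
    })

cheapest = min(parsed, key=lambda b: b['price'])
print(f"Cheapest on this page: {cheapest['title']} at £{cheapest['price']}")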
Step 5: Handle Pagination (40–55 min)
The site has a "next" button. Let's scrape multiple pages:
import requests
from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/catalogue/"
page_url = "page-{}.html"

all_books = []

for page_num in range(1, 4):  # Scrape 3 pages
    if page_num == 1:
        url = "http://books.toscrape.com/"
    else:
        url = base_url + page_url.format(page_num)

    print(f"Scraping {url}...")
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')  # .content lets BeautifulSoup detect the encoding

    books = soup.find_all('article', class_='product_pod')
    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text.strip('£')
        rating = book.find('p', class_='star-rating')['class'][1]
        all_books.append({
            'title': title,
            'price': price,
            'rating': rating
        })
print(f"Total books scraped: {len(all_books)}")
Checkpoint
You should see Total books scraped: 60 (3 pages × 20 books).
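Hard-coding page numbers works for a fixed range, but you can also follow the site's "next" link until it disappears. A sketch (note this visits all 50 pages, so add a delay between requests; see the Bonus section):

from urllib.parse import urljoin

url = "http://books.toscrape.com/"
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # ... extract books exactly as above ...
    next_link = soup.find('li', class_='next')
    # The href is relative, so resolve it against the current page's URL
    url = urljoin(url, next_link.a['href']) if next_link else None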
Step 6: Save to CSV (55–60 min)
import csv
# ... (previous scraping code) ...
# Save to CSV
with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating'])
    writer.writeheader()
    writer.writerows(all_books)
print("Saved to books.csv")
Run:
python scraper.py
Open books.csv in Excel or any text editor. You should see 60 books!
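To sanity-check the file from Python instead of Excel, read it back with csv.DictReader:

import csv

with open('books.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(len(rows))  # should print 60
print(rows[0])    # the first book as a dict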
🎉 You just built a real web scraper in 60 minutes!
💡 Bonus
Add delays (be polite):
import time

for page_num in range(1, 4):
    # ... scraping code ...
    time.sleep(1)  # Wait 1 second between pages
Handle errors:
# Inside the page loop:
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Error: {e}")
    continue  # skip to the next page
Use headers (avoid blocks):
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
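If you're making many requests, a requests.Session lets you set headers once and reuses the underlying connection, which is both tidier and faster. A minimal sketch:

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
response = session.get("http://books.toscrape.com/", timeout=10)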
🚀 Next Steps
📚 Resources
- Beautiful Soup Docs
- Requests Docs
- Scrapy (advanced framework)
⚠️ Legal Note
Always check a website's robots.txt and Terms of Service before scraping. Respect rate limits and don't overload servers.
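Python's standard library can check robots.txt for you. A sketch using urllib.robotparser (the "*" user agent means "any bot"):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://books.toscrape.com/robots.txt")
rp.read()  # fetch and parse robots.txt
print(rp.can_fetch("*", "http://books.toscrape.com/catalogue/page-2.html"))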