I’ve scraped websites for data, for price monitoring, for lead generation, and honestly just because I was curious what a page looked like parsed into JSON. Python is still the best language for it, and I’ve settled on a handful of patterns that I reuse everywhere.
Here are the three tricks I actually keep coming back to.
1. Session reuse beats fresh requests every time
Most tutorials show you requests.get() on every call. That’s fine for one-off scripts, but if you’re scraping multiple pages from the same site (which is basically always), you want a session. It reuses the TCP connection, keeps cookies, and applies the same headers to every request.
import requests
session = requests.Session()
session.headers.update({
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
})
# First request establishes the connection
page = session.get("https://example.com/products")
# Subsequent requests reuse it — faster, cleaner
# product_ids is a placeholder here; in practice it comes from the listing page
product_ids = [101, 102, 103]
for product_id in product_ids:
    resp = session.get(f"https://example.com/products/{product_id}")
    # parse resp.text ...

One session, one connection pool, one set of cookies. If you’re hitting the same domain more than once, there’s no reason not to do this.
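If you want to see the header sharing without touching the network, one option is to prepare a request and inspect it before sending. This is a minimal sketch; the User-Agent string and the extra Accept header are placeholders I made up for illustration.

```python
import requests

session = requests.Session()
# Placeholder User-Agent, chosen for illustration only
session.headers.update({"User-Agent": "example-scraper/0.1"})

# Build a request with one per-request header, but do not send it
request = requests.Request(
    "GET",
    "https://example.com/products/42",
    headers={"Accept": "text/html"},
)
prepared = session.prepare_request(request)

# Session-level and per-request headers end up merged on the prepared request
print(prepared.headers["User-Agent"])
print(prepared.headers["Accept"])
```

The per-request header wins when both levels set the same key, so the session works well for defaults that individual calls can still override.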
2. Use CSS selectors with BeautifulSoup, not chained find() calls
I used to chain .find() and .find_all() calls five levels deep. It worked, until the page structure changed slightly. Then it all broke. select() with CSS selectors is more resilient and way more readable.
from bs4 import BeautifulSoup
html = session.get("https://example.com/products").text
soup = BeautifulSoup(html, "html.parser")
# Instead of this nightmare:
# soup.find("div", {"class": "products"}).find_all("article")...
# Just use CSS selectors:
products = soup.select("div.products article.product-card")
for product in products:
    name = product.select_one("h2.title").text.strip()
    price = product.select_one("span.price").text.strip()
    link = product.select_one("a[href]")["href"]
    print(f"{name}: {price} - {link}")

The CSS selector approach is a single string. If the site redesigns and moves things around, you update one selector, not a chain of method calls. It’s also much easier to debug: paste the selector into your browser’s dev tools and see if it matches.
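One caveat worth sketching: select_one() returns None when nothing matches, so calling .text on a missing element raises AttributeError. A small helper can make the missing case explicit. The helper name text_or_default and the sample HTML below are hypothetical, made up for this example.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a product listing page
SAMPLE_HTML = """
<div class="products">
  <article class="product-card">
    <h2 class="title"> Widget </h2>
    <span class="price">$9.99</span>
    <a href="/products/1">View</a>
  </article>
  <article class="product-card">
    <h2 class="title">Gadget</h2>
    <a href="/products/2">View</a>
  </article>
</div>
"""


def text_or_default(node, selector, default=None):
    """Return the stripped text of the first match, or default if nothing matches."""
    match = node.select_one(selector)
    return match.get_text(strip=True) if match is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [
    (
        text_or_default(card, "h2.title"),
        text_or_default(card, "span.price", default="n/a"),
        card.select_one("a[href]")["href"],
    )
    for card in soup.select("div.products article.product-card")
]
print(rows)
```

The second card has no price element, so it falls back to the default instead of crashing the whole run.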
3. Exponential backoff with tenacity, not hand-rolled retries
Every scraper eventually hits a rate limit. Every scraper eventually gets a 503. I used to write my own retry loops with time.sleep() and a counter. Then I found tenacity and never went back.
from tenacity import retry, stop_after_attempt, wait_exponential
import requests
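# reraise=True re-raises the last exception instead of tenacity's RetryError
@retry(stop=stop_after_attempt(5), wait=wait_exponential(min=2, max=30), reraise=True)
def fetch_page(url, session):
    # requests has no default timeout, so set one or a hung request never retries
    resp = session.get(url, timeout=10)
    resp.raise_for_status()  # raises on 4xx/5xx, triggers retry
    return resp.text

# That’s it. No try/except, no sleep, no counter.
html = fetch_page("https://example.com/data", session)

Tenacity handles the waiting, the retry count, and the exponential backoff. You decorate your function and move on. If the site is flaky, it’ll try 5 times with increasing delays. If it’s truly down, reraise=True makes it raise the last exception (by default you would get a tenacity RetryError wrapping it) and you deal with it in your error handling (which you do have, right?). One caveat: as written, this also retries permanent errors like a 404.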
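To get a feel for the delays involved, the schedule can be sketched in plain Python. This is not tenacity's code. It assumes the wait after failed attempt n is 2**(n-1) seconds clamped to the range [min, max], which is my understanding of recent tenacity releases with the default multiplier of 1; exact values may differ between versions. The function name backoff_schedule is hypothetical.

```python
def backoff_schedule(attempts, minimum=2, maximum=30, base=2):
    """Delays (in seconds) between attempts for a clamped exponential backoff.

    Assumption: the wait after failed attempt n is base**(n - 1),
    clamped to the range [minimum, maximum].
    """
    delays = []
    # No wait follows the final attempt, so there are attempts - 1 delays
    for attempt in range(1, attempts):
        raw = base ** (attempt - 1)
        delays.append(max(minimum, min(raw, maximum)))
    return delays


print(backoff_schedule(5))  # delays between 5 attempts
print(backoff_schedule(8))  # a longer run, where the cap of 30 kicks in
```

Under that assumption, five attempts spend about 16 seconds waiting in total, which is a reasonable budget for a flaky page and short enough that a dead site fails fast.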
Bonus: Always cache your responses locally
This isn’t a scraping trick per se, but it’s the thing that saves me the most time. Every time I write a scraper, I save the raw HTML to disk before parsing. That way when my parser breaks (and it will), I’m not re-requesting pages I already fetched.
import hashlib
from pathlib import Path

def cached_get(session, url, cache_dir=".cache"):
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    # MD5 is only used to turn the URL into a safe filename
    filename = hashlib.md5(url.encode()).hexdigest() + ".html"
    filepath = Path(cache_dir) / filename
    if filepath.exists():
        return filepath.read_text(encoding="utf-8")
    html = fetch_page(url, session)
    filepath.write_text(html, encoding="utf-8")
    return html

# Develop your parser at full speed without hammering the server
html = cached_get(session, "https://example.com/products")
soup = BeautifulSoup(html, "html.parser")

This pattern has saved me more time than any other scraping tip. Parse, debug, re-parse, all without hitting the original site again. The site owner doesn’t get annoyed, you don’t get rate-limited, and your iteration loop is instant.
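A quick way to convince yourself the cache works is to swap the network call for a counter. The sketch below is a variation on cached_get that takes the fetch function as a parameter. The names cached_fetch and fake_fetch are hypothetical, and a temporary directory stands in for .cache.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_fetch(url, fetch, cache_dir):
    """Return the cached body for url, calling fetch(url) only on a cache miss."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    # MD5 is used only to derive a filesystem-safe name, not for security
    name = hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
    filepath = cache_path / name
    if filepath.exists():
        return filepath.read_text(encoding="utf-8")
    html = fetch(url)
    filepath.write_text(html, encoding="utf-8")
    return html


calls = []


def fake_fetch(url):
    """Stand-in for a real HTTP request; records each call it receives."""
    calls.append(url)
    return "<html><body>hello</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/products", fake_fetch, tmp)
    second = cached_fetch("https://example.com/products", fake_fetch, tmp)

print(len(calls))  # how many times the "network" was actually hit
```

Passing the fetcher in also makes it easy to test a parser against saved pages without any scraping code in the loop.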
Conclusion
Session objects, CSS selectors, tenacity for retries, and local caching. That’s 90% of what I need for any scraping job. The other 10% is site-specific hacks that I deal with case by case.
Stop writing fresh requests.get() calls without a session. Stop chaining find(). Stop hand-rolling retry logic. These three tricks are simple, battle-tested, and they’ll save you hours.
Thank you :)