Crawl budget is the number of URLs Googlebot can and wants to crawl on a site in a given time period. Wasting it on duplicate URLs or redirect chains can delay or prevent important pages from being crawled and indexed.
It consists of two factors: crawl rate limit (how fast Googlebot can crawl without overloading the server; Google's current documentation calls this the crawl capacity limit) and crawl demand (how much Google wants to crawl, based on popularity and freshness). In practice this matters mostly for large websites, roughly 10,000 pages or more; smaller sites are usually crawled in full.
Common sources of crawl budget waste:
- duplicate content (URL parameters, www vs non-www)
- soft 404s (empty pages returning a 200 status)
- redirect chains (A→B→C→D)
- infinite URL spaces (calendars, filters creating endless URL combinations)
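Two of the items above, duplicate URLs and redirect chains, can be spotted offline from a list of crawled URLs and a redirect map. The following is a minimal sketch, not a definitive audit tool. The `IGNORED_PARAMS` set, the function names, and the idea of treating `www` and non-`www` hosts as the same page are assumptions made for illustration; a real site needs its own rules.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize_url(url):
    """Collapse common duplicate variants of a URL into one form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Assumption: www and non-www serve the same content.
    if host.startswith("www."):
        host = host[4:]
    # Drop ignored parameters and sort the rest so order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))


def redirect_chain(start, redirects, max_hops=10):
    """Follow a source -> target mapping and return every URL visited."""
    chain = [start]
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if target in chain[:-1]:  # redirect loop, stop here
            break
    return chain
```

Two URLs that normalize to the same string are candidates for a canonical tag, and any chain longer than two entries (more than one hop) is a candidate for pointing the first URL straight at the final destination.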
Optimization:
- canonical tags to consolidate duplicate URLs
- robots.txt rules blocking unnecessary paths
- clean URL structure
- sitemaps listing only important pages
- server response time under 500ms
- 410 instead of 404 for permanently deleted pages
Note: rel=next/prev for pagination is no longer used by Google. Google announced in March 2019 that it had stopped using the markup as an indexing signal, so it has no effect on how Googlebot handles paginated pages.
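Before deploying robots.txt rules that block unnecessary paths, it can help to check which URLs they would actually block. The sketch below uses Python's standard `urllib.robotparser`. The rules, the `example.com` URLs, and the `is_crawlable` helper are hypothetical. Note also that, as far as I know, the standard-library parser does plain prefix matching and does not implement Google's `*` and `$` wildcard extensions, so the example sticks to prefix rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules blocking two infinite URL spaces.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /search
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if the given rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines,
    # so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/calendar/2031/01",
        "https://example.com/search?q=shoes",
        "https://example.com/products/widget",
    ):
        print(url, "allowed" if is_crawlable(url) else "blocked")
```

Running the same check over the URLs listed in the sitemap is a quick way to catch a sitemap entry that robots.txt blocks, which would send crawlers conflicting signals.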