Crawl budget is a distribution problem, not a quota
Most teams think about crawl budget as a number to increase. On a large site, that framing causes more harm than the problem it is trying to solve.
The quota mental model breaks down
The instinct is to treat crawling like a monthly data cap: spend it carefully, ask Google for more. But Googlebot does not hand out a fixed allowance. It crawls in proportion to two things: how much it wants your pages (crawl demand), and how cheaply it can fetch them (crawl capacity). The lever is not the size of the budget. It is distribution: where that crawl actually lands.
When traffic is flat at scale, the cause is almost never "not enough crawl". It is crawl pouring into URLs that should not exist while the pages you care about wait in line.
Find where crawl is leaking
Server logs answer this directly. Normalize URLs into patterns, count Googlebot hits per pattern, and the waste shows itself.
$ zcat access.log.gz | grep Googlebot \
| awk '{print $7}' | sed -E 's#/[0-9]+#/:id#g' \
| sort | uniq -c | sort -rn | head
812043 /quote/:id # transactional, ranks -> keep
430112 /news/:id # fresh demand -> keep
58221 /tag/:id # thin, cannibalizing -> noindex
12090 /search?q= # infinite space -> block in robots
Two patterns here are quietly eating the budget: tag pages that duplicate intent, and an internal search space that is effectively infinite. Neither earns rankings. Both compete for the same crawl.
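The `sed` expression in the pipeline above collapses numeric path segments only, so query-driven spaces such as internal search would still show up as thousands of distinct URLs. A minimal sketch of a normalizer that also blanks query-string values might look like the following. The function names `normalize_paths` and `count_patterns` are hypothetical, and the `:id` placeholder simply follows the convention used above.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not part of any tool.

# normalize_paths: read request paths on stdin, emit one pattern per line.
#   1) collapse numeric path segments:  /quote/123      -> /quote/:id
#   2) blank out query-string values:   /search?q=shoes -> /search?q=
normalize_paths() {
  sed -E \
    -e 's#/[0-9]+#/:id#g' \
    -e 's#([?&][^=&]+)=[^&]*#\1=#g'
}

# count_patterns: count hits per pattern, most-crawled first,
# printed as "count pattern" with uniq's leading padding removed.
count_patterns() {
  normalize_paths | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Usage, assuming a combined-format access log where field 7 is the path:
#   zcat access.log.gz | grep Googlebot | awk '{print $7}' | count_patterns | head
```

Keeping parameter names while dropping their values means `/search?q=` and `/search?q=&page=` stay distinguishable, which helps when only some parameter combinations are the problem.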
Redistribute, do not request more
The fixes are unglamorous and they work:
- Block infinite spaces (faceted search, session URLs) in robots.txt so they never enter the queue.
- Noindex and prune thin patterns that cannibalize stronger pages.
- Strengthen internal links to the templates that convert, so demand signals point where you want crawl to go.
- Segment sitemaps by demand and keep them clean, so discovery tracks value.
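As an illustration of the first fix, a robots.txt fragment for the search and session cases could look roughly like this. The paths and the `sessionid` parameter name are assumptions made for the sketch; the `*` wildcard is supported by Googlebot, but other crawlers may treat it differently.

```text
# Hypothetical paths and parameter names; adapt to your own URL scheme.
User-agent: *
# Internal search results: an effectively infinite URL space.
Disallow: /search?
# Session identifiers appended as a query parameter.
Disallow: /*?sessionid=
Disallow: /*&sessionid=
```

A noindex directive is only seen if the URL stays crawlable, so each pattern gets one treatment or the other: block it in robots.txt, or leave it crawlable and noindex it.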
Crawl budget is not a number you raise. It is a flow you direct. The work is deciding what does not deserve to be crawled.
What to measure
Watch the ratio of crawl hits on revenue templates versus everything else, and median time-to-index for new high-value pages. When distribution improves, both move before total crawl does.
| Signal | Vanity reading | Useful reading |
|---|---|---|
| Total crawl | "Crawl went up" | Crawl on value templates went up |
| Pages indexed | Count of indexed URLs | Share of valuable URLs indexed |
| Time-to-index | Site average | Median for new transactional pages |
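Both measurements can be derived from data most teams already have. The sketch below assumes two hypothetical inputs: a list of `count pattern` lines (such as the output of a per-pattern log count) and a list of time-to-index values, one number per line, from your own publish and indexing tracking. The function names `crawl_share` and `median` are illustrative.

```bash
#!/usr/bin/env bash
# Sketch only: input formats and function names are assumptions.

# crawl_share REGEX: stdin has "count pattern" lines.
# Prints the percentage of hits whose pattern matches REGEX,
# i.e. the share of crawl landing on the templates you care about.
crawl_share() {
  awk -v value_re="$1" '
    { total += $1 }
    $2 ~ value_re { value += $1 }
    END { if (total > 0) printf "%.1f\n", 100 * value / total }
  '
}

# median: stdin has one number per line (e.g. hours from publish to index).
# Prints the middle value, or the mean of the two middle values.
median() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }
  '
}

# Usage (file names are hypothetical):
#   crawl_share '^/(quote|news)/' < pattern_counts.txt
#   median < hours_to_index_new_quotes.txt
```

The median is used rather than the mean because a handful of pages that take weeks to index would otherwise dominate the figure.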
Fix distribution and the quota takes care of itself.