AI Crawlers Are Hitting Small Business Websites. Here Is What to Do

Your website is not just serving customers anymore.

It is also serving Googlebot, Bingbot, GPTBot, ClaudeBot, Meta crawlers, product fetchers, uptime monitors, spam bots, security scanners, and tools you have never heard of. Some are useful. Some are waste. Some are trying to understand your business so they can show you in AI answers. Others are just pulling content without sending a single lead back.

For a small business, this sounds like a big-company problem until the site gets slow, hosting costs jump, or your server logs show thousands of requests from bots while real customers are trying to book appointments.

This is not a reason to panic and block everything. That can hurt search visibility. It is a reason to put crawler control on the same checklist as page speed, backups, and form testing.

Cloudflare reported that AI and search crawler traffic grew 18% from May 2024 to May 2025. The same analysis found GPTBot grew from 5% to 30% share of AI crawler traffic over that period. In a later Cloudflare analysis, training-related crawling drove nearly 80% of AI bot activity.

That is the practical problem. Bots are using real bandwidth, real CPU, and real crawl access. Your site needs rules.

What AI crawlers actually do

A crawler is software that visits web pages automatically. Search engines have used crawlers for decades to find pages and build indexes. Google explains that its crawlers and fetchers perform actions for Google products, either automatically or when triggered by a user request, and that common crawlers like Googlebot respect robots.txt rules for automatic crawls.

AI crawlers are similar from the server’s point of view. They request pages. They read content. They move on. The difference is the purpose.

OpenAI separates its bots by use case. Its documentation says OAI-SearchBot is used to surface websites in ChatGPT search features, while GPTBot is used to crawl content that may be used in training generative AI foundation models. OpenAI also says a site can allow OAI-SearchBot while disallowing GPTBot, because those settings are independent.

That distinction matters for small businesses.

If you run a roofing company, dental practice, machine shop, law firm, or HVAC business, you probably want to show up when someone asks an AI assistant for local options, service explanations, or buying advice. You may not want every training crawler hitting every PDF, blog post, gallery, and service page forever.

The right answer is not “block AI.” The right answer is “decide which access helps the business.”

Why this matters for a small business website

Most small business sites are not built like large media sites. They are usually on WordPress, Webflow, Squarespace, Wix, Shopify, or a low-cost host. The owner assumes traffic means people.

Server logs tell a messier story.

Cloudflare says around 30% of global web traffic comes from bots. That includes useful bots and harmful bots, but the business impact is simple: not every request is a prospect. If your site is slow during a bot spike, a real customer may feel the delay.

Google’s crawler documentation makes the load issue plain. Google says its goal is to crawl as many pages as it can without overwhelming the server, and site owners can reduce Google’s crawl rate if the site has trouble keeping up. That is Google. Not every crawler is as careful, and not every crawler gives you clean controls.

There is also a visibility tradeoff.

Cloudflare found that crawler growth does not always translate into referral traffic. Its crawl-to-click analysis reported that in July 2025 Anthropic made about 38,000 crawls for every visitor it referred back to a site, while Perplexity made 194 crawls per referred visitor in the same month. That does not mean every business should block those bots. It does mean you should stop assuming that every bot request has marketing value.

A small business needs three things: enough access to be discovered, enough protection to keep the site fast, and enough logging to know what is happening.

Start with a simple crawler audit

Do not edit robots.txt first. Look at the traffic first.

Ask your developer, host, or web team for a one-month crawler report. You want the top user agents, top requested URLs, status codes, bandwidth, and request volume by day. If you use Cloudflare, the bot and security reports are a good starting point. If you are on managed WordPress hosting, ask support for access logs or bot traffic summaries.

Focus on five questions:

  • Which bots are hitting the site most often?
  • Are they requesting important pages, junk URLs, search result pages, or old files?
  • Are they causing 404 errors, 500 errors, or CPU spikes?
  • Are useful bots, like Googlebot and Bingbot, getting blocked by accident?
  • Are AI training crawlers hitting pages that have no reason to be copied, such as internal search pages, PDFs, staging URLs, or duplicate archives?
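If your host can only give you raw access logs, the first three questions can be roughed out with a short script. This is a minimal sketch, not a finished tool. It assumes the logs use the common "combined" format that Apache and nginx write by default, and it guesses at bots with a simple keyword match on the user agent (the `BOT_HINTS` list and the `summarize` function are illustrative names, not part of any product). Bots that hide their identity will slip past it.

```python
import re
from collections import Counter

# Assumption: logs are in the "combined" access log format
# (IP, identity, user, [time], "request", status, bytes, "referer", "user agent").
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Heuristic only: matches user agents that describe themselves as bots.
BOT_HINTS = ("bot", "crawler", "spider", "fetch")


def summarize(lines):
    """Count bot requests by user agent, path, and status code."""
    agents, paths, statuses, bandwidth = Counter(), Counter(), Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        agent = match["agent"]
        if not any(hint in agent.lower() for hint in BOT_HINTS):
            continue  # probably a person, leave it out of the bot report
        agents[agent] += 1
        paths[match["path"]] += 1
        statuses[match["status"]] += 1
        size = match["size"]
        bandwidth[agent] += int(size) if size.isdigit() else 0  # "-" means no body
    return {"agents": agents, "paths": paths, "statuses": statuses, "bytes": bandwidth}
```

Feed it the lines of a log file and print `report["agents"].most_common(10)` and `report["paths"].most_common(20)`. If the top paths are tag archives or search URLs rather than service pages, you have found the cleanup list.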

This is boring work. It is also where the money is.

If a local service company has 80 pages, but bots are hammering 12,000 tag URLs, the fix is not an AI policy. The fix is technical cleanup. Block or noindex junk areas, clean up internal links, fix broken routes, and stop generating thin pages that crawlers can waste time on. Pick one treatment per URL, though: a crawler that robots.txt blocks from a page never gets to read that page's noindex tag.

Decide what bots should see

Think of crawler access in four buckets.

First, keep search discovery open. Googlebot and Bingbot should usually reach your public pages, images, and structured data. Google’s common crawler documentation says Googlebot affects Google Search, Discover, Google Images, Google Video, Google News, and other Search features. If you block it casually, you can hurt visibility.

Second, allow AI search bots when they support discovery. OpenAI says sites that opt out of OAI-SearchBot will not be shown in ChatGPT search answers, though they may still appear as navigational links. If ChatGPT search matters for your category, allow the search bot while keeping a tighter rule for training bots.

Third, limit training access where it does not help. OpenAI says disallowing GPTBot indicates content should not be used in training its generative AI foundation models. That is different from blocking its search bot. Other AI companies have their own user agents and rules, so you need a current list rather than one copied from a random blog post two years ago.

Fourth, block obvious waste. Internal search results, cart pages, login pages, admin URLs, duplicate parameter URLs, staging domains, and generated thin pages should not be crawler playgrounds. Some of these should also be protected by authentication, not only robots.txt.

Use robots.txt, but do not treat it like a lock

Robots.txt is a public instruction file. It tells crawlers what they are allowed to request, but it is not a security system.

Cloudflare notes that honoring robots.txt is voluntary for crawlers. Google says its common crawlers always obey robots.txt rules when crawling automatically. Those two facts can both be true. Good crawlers may obey the file. Bad or careless crawlers may not.

A practical robots.txt file for a small business should do three jobs.

It should allow the crawlers that support search visibility. It should disallow low-value areas that waste crawl budget or expose junk URLs. It should separate AI search access from AI training access when the crawler documentation supports that split.

For example, OpenAI’s docs make a clear distinction between OAI-SearchBot and GPTBot. That means a business can choose visibility in ChatGPT search without giving the same permission to GPTBot for training use. Do not copy this blindly into production without checking the current docs and your own business goals, but the pattern is useful:

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

That is not a universal recommendation. A content-heavy business may make a different call. A local contractor that wants leads from AI search but does not care about model training may prefer the split.
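Before a rule set goes live, it is worth checking that it says what you think it says. The sketch below uses Python's standard `urllib.robotparser` module to test the split above against a few user agents. The `check_access` helper and the `example.com` URL are made up for illustration. Python's parser is not the parser any real crawler uses, so treat the result as a sanity check, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The same split shown above: search bot allowed, training bot disallowed.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def check_access(robots_txt, user_agents, url):
    """Return {user_agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}


result = check_access(
    ROBOTS_TXT,
    ["OAI-SearchBot", "GPTBot", "Googlebot"],
    "https://example.com/services/",  # hypothetical page
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Googlebot comes back as allowed because no group in the file names it and there is no catch-all group, so nothing restricts it. That is the behavior you want to confirm before and after every robots.txt change.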

Protect performance with caching and rate limits

Crawler control is not only about permission. It is also about capacity.

Google says its crawling infrastructure supports HTTP caching through ETag and Last-Modified headers. That matters because caching can reduce repeat work when crawlers revisit pages. If your site serves the same page from scratch every time, bots can create unnecessary load.
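Most CDNs and caching plugins handle this for you, so you rarely write it yourself. It still helps to see the logic. A returning crawler sends back the ETag or date it saw last time, and if nothing changed the server answers 304 Not Modified with an empty body instead of rebuilding the page. Here is a minimal sketch of that decision. The `respond` and `make_validators` functions are illustrative, the ETag is just a hash of the page body, and the header lookup is case-sensitive for brevity.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime


def make_validators(body, modified):
    """Build ETag and Last-Modified values. `modified` must be timezone-aware."""
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    last_modified = formatdate(modified.timestamp(), usegmt=True)
    return etag, last_modified


def respond(request_headers, body, modified):
    """Return (status, headers, payload) for a possibly conditional GET."""
    etag, last_modified = make_validators(body, modified)
    headers = {"ETag": etag, "Last-Modified": last_modified}

    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")

    if if_none_match is not None:
        # When If-None-Match is present it wins and the date header is ignored.
        tags = [tag.strip() for tag in if_none_match.split(",")]
        tags = [tag[2:] if tag.startswith("W/") else tag for tag in tags]
        if "*" in tags or etag in tags:
            return 304, headers, b""
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unreadable date: fall through to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates only have one-second resolution.
            if modified.replace(microsecond=0) <= since:
                return 304, headers, b""

    return 200, headers, body
```

The first visit gets a 200 with both headers. A repeat visit that echoes the ETag gets a 304 and no body, which is the cheap path you want crawlers on.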

For small business sites, the basic setup is usually enough:

  • Put the site behind a CDN or host-level cache.
  • Cache static assets like images, CSS, JavaScript, and fonts.
  • Use page caching for public pages.
  • Keep your sitemap clean so crawlers find the right URLs.
  • Rate-limit aggressive bots at the firewall when they ignore normal behavior.
  • Return proper 404 or 410 responses for removed pages instead of redirecting everything to the homepage.
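Rate limiting is normally a setting in your CDN or firewall, not code you write. To make the idea concrete, here is a small sketch of one common approach, a sliding window: each client gets at most a set number of requests in any rolling time span. The class name, the limits, and the choice of client key (IP address, user agent, or both) are all assumptions for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client in any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> timestamps of allowed hits

    def allow(self, client_key, now):
        """Return True if the request at time `now` (in seconds) may proceed."""
        hits = self._hits[client_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: a firewall would answer 429 here
        hits.append(now)
        return True
```

With `SlidingWindowLimiter(max_requests=3, window_seconds=10)`, a bot that sends four requests within three seconds has the fourth refused, then gets through again once the oldest request is ten seconds old. Passing `now` in explicitly keeps the sketch easy to test. A real deployment would use the clock.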

Do not rate-limit Googlebot or Bingbot without first verifying whether the request is genuine. Fake bots often pretend to be Googlebot. Google publishes crawler IP ranges and reverse DNS verification guidance in its crawler documentation, and its common crawlers generally use those published ranges. Your firewall or CDN should verify known bots before applying special rules.
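The reverse DNS check works in two steps: look up the hostname for the requesting IP, confirm it belongs to one of Google's crawler domains, then look that hostname up again and confirm it points back to the same IP. The sketch below follows that pattern for Googlebot. The function name and the injectable lookup parameters are my own, added so the logic can be tested without touching the network. Check Google's current documentation for the domain list before relying on it, and note that other crawlers publish their own domains or IP lists.

```python
import socket

# Domains Google documents for its crawlers' reverse DNS names.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def _reverse(ip):
    """Reverse DNS: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    """Forward DNS: hostname -> set of IPv4/IPv6 addresses."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(ip, suffixes=GOOGLE_SUFFIXES,
                        reverse_lookup=_reverse, forward_lookup=_forward):
    """Return True only if reverse and forward DNS both agree for this IP."""
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
        if not hostname.endswith(suffixes):
            return False  # claims to be Googlebot, but DNS says otherwise
        return ip in forward_lookup(hostname)
    except OSError:
        return False  # lookup failed: treat as unverified
```

Calling `is_verified_crawler("66.249.66.1")` performs live DNS lookups, so cache the result per IP rather than checking on every request. An unverified result is a reason to apply your normal bot rules, not proof of an attack.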

This is one of those places where cheap fixes get expensive. Blocking a fake Googlebot is good. Blocking real Googlebot because a plugin guessed wrong is bad.

Clean up the pages bots should not waste time on

AI crawlers make old technical debt more visible.

If your website has five versions of every service page, hundreds of empty tag archives, broken image URLs, and old PDF proposals in public folders, more crawlers will find more mess. That can waste server resources and send confusing signals about what your business actually does.

Start with the pages that customers and crawlers both care about: homepage, service pages, location pages, case studies, pricing or process pages, contact page, and key blog posts. Make those fast, accurate, internally linked, and included in the XML sitemap.

Then remove or restrict the junk. Noindex thin archives. Delete old media that should not be public. Block internal search results. Fix redirect chains. If a staging site is public, put it behind a password.

This is not glamorous SEO work. It is shop cleanup. You are clearing the floor so the useful work can move faster.

What to monitor every month

You do not need a giant dashboard. You need a short monthly check.

Look at bot requests by user agent, top bot-hit URLs, server errors, crawl spikes, Google Search Console crawl stats, page speed on key landing pages, and whether important pages are still indexed. If you use Cloudflare or a similar service, compare verified bots, likely bots, and blocked requests.

Tie the technical data back to business outcomes. Did the site slow down during a campaign? Did forms drop during a bot spike? Did Google crawl fewer important pages after a firewall change? Did ChatGPT or Perplexity referrals show up in analytics after allowing search-specific bots?

The answer may be small at first. That is fine. The habit matters.

The small business policy I recommend

For most small business websites, the best crawler policy is balanced.

Keep Googlebot and Bingbot open. Keep the sitemap clean. Allow AI search bots that can send visibility. Limit AI training crawlers if you have no business reason to feed them. Block wasteful crawl paths. Use CDN caching and verified-bot rules. Review logs monthly.

Do not let fear make the decision. Do not let hype make it either.

AI crawler traffic is now part of owning a website. The businesses that handle it well will protect site speed, reduce hosting waste, and still show up where buyers are searching.

If you want help auditing your website’s crawler traffic, robots.txt rules, and technical SEO setup, start here. We’ll help you tighten the site without cutting off the visibility that brings in leads.

Richard Kastl

Founder & Lead Engineer

Richard Kastl has spent 14 years engineering websites that generate revenue. He combines expertise in web development, SEO, digital marketing, and conversion optimization to build sites that make the phone ring. His work has helped generate over $30M in pipeline for clients ranging from industrial manufacturers to SaaS companies.
