Debugging robots.txt and noindex issues

Figure out why Google/Bing say “blocked by robots.txt” or “noindex”, and how to fix it.

LovableHTML never sets noindex or X-Robots-Tag: noindex headers

We never touch your robots.txt or your headers. We are an SEO tool, after all: making your site un-indexable would work against the very reason you use our service. LovableHTML does not set your pages to noindex and does not add an X-Robots-Tag: noindex header to your origin pages. If you’re seeing noindex or “blocked by robots.txt”, it’s coming from your origin site, your CDN (Cloudflare, etc.), or your website builder/host configuration.

Step 0: Identify what the crawler is complaining about

These two problems are often confused, but the fixes are different:

  • “Blocked by robots.txt”: The crawler does not fetch the page at all, because your robots.txt rules disallow crawling it. (Google can still index the bare URL, without its content, if other pages link to it.)
  • “Excluded by ‘noindex’” (or “noindex detected”): The crawler can crawl the page, but it is instructed not to index it (via <meta name="robots" ...> or X-Robots-Tag).

If you’re not sure which one you’re dealing with, use our free tools to check.

Branch A — “Blocked by robots.txt”

A1) Fetch the exact robots.txt Google/Bing will fetch

robots.txt is per hostname and per protocol. This matters:

  • https://example.com/robots.txt can differ from https://www.example.com/robots.txt
  • http:// and https:// can differ if you’re redirecting in odd ways

In a terminal, run:

curl -i https://example.com/robots.txt
curl -i https://www.example.com/robots.txt

Things to look for:

  • HTTP status: 200 is expected. Google treats a 4xx as if there were no robots.txt at all (nothing is blocked), while a 5xx is treated as a temporary full block and can stop crawling until the error clears.
  • Accidental global block: User-agent: * followed by Disallow: /
  • Bot-specific blocks: User-agent: Googlebot / User-agent: Bingbot followed by a Disallow that matches your important pages
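
To skim just the rules that matter, you can filter what each hostname serves. Both hostnames are placeholders; use your own:

# Show only the directive lines (case-insensitive) from the robots.txt each hostname serves.
curl -s https://example.com/robots.txt | grep -iE '^(user-agent|disallow|allow|sitemap):'
curl -s https://www.example.com/robots.txt | grep -iE '^(user-agent|disallow|allow|sitemap):'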

A2) Confirm you’re not blocking the bots you care about

At minimum, ensure you are not unintentionally blocking:

  • Google: Googlebot
  • Bing: Bingbot

Gotcha: A rule like Disallow: / under User-agent: * blocks Googlebot and Bingbot unless they have their own, more specific User-agent group; crawlers obey only the most specific group that matches them, so a bot-specific group overrides the global one.
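
If you intentionally keep a broad block but still want Google and Bing to crawl, a minimal robots.txt sketch looks like the following. This illustrates the group behavior, not a recommendation for your site; adjust it to your own needs:

# Sketch only: crawlers follow the most specific matching User-agent group,
# so the Googlebot/Bingbot groups below override the "*" rule for those bots.
cat <<'EOF' > robots.txt
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
EOF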

A3) If your robots.txt contains Cloudflare-managed content

If your robots.txt has Cloudflare “managed” content or Cloudflare documentation language in it, you may have enabled Cloudflare settings that alter or prepend directives.

Read and verify these:

  • https://developers.cloudflare.com/bots/additional-configurations/block-ai-bots/
  • https://developers.cloudflare.com/bots/additional-configurations/managed-robots-txt/#implementation

Common failure mode: enabling a managed robots setting and assuming it “only affects AI bots”, but ending up with an overly broad User-agent: * rule or accidentally serving the wrong file on the hostname Google actually crawls.
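
One way to confirm whether Cloudflare is changing the file is to compare what Cloudflare serves with what your origin serves directly. This assumes you know your origin’s IP address (ORIGIN_IP below is a placeholder) and that the origin accepts direct HTTPS connections:

# What Cloudflare serves to the world:
curl -s https://example.com/robots.txt

# What your origin serves when you bypass Cloudflare's proxy
# (add -k only if the origin's certificate isn't valid for this hostname):
curl -s --resolve example.com:443:ORIGIN_IP https://example.com/robots.txt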

A4) Check redirect chains (the final hostname is what matters)

Sometimes robots.txt is fine on example.com, but Google is actually crawling www.example.com (or the reverse) due to redirects/canonicals.

curl -IL https://example.com/

If your homepage redirects, repeat the robots.txt fetch for the final hostname in that redirect chain.
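
A rough way to script that: resolve the final URL, pull out its hostname, and fetch robots.txt from there. example.com is a placeholder:

# Follow redirects without keeping the body, then print the final URL.
final=$(curl -sL -o /dev/null -w '%{url_effective}' https://example.com/)
# Extract the hostname from the final URL and fetch its robots.txt.
host=$(printf '%s\n' "$final" | awk -F/ '{print $3}')
curl -i "https://$host/robots.txt"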

Branch B — “Excluded by ‘noindex’” / “noindex detected”

B1) Check for X-Robots-Tag headers

Run:

curl -I https://example.com/some-page

If you see X-Robots-Tag: noindex (or similar), it’s typically set by:

  • your host/CDN configuration
  • a framework default for non-production environments
  • a security plugin / SEO plugin
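
To surface just that header, filter the response case-insensitively (the URL is a placeholder). Some setups only add the header for bot user agents, so it’s worth repeating the request with a bot-like User-Agent; no output means the header wasn’t present:

curl -sI https://example.com/some-page | grep -i '^x-robots-tag'
# Repeat while identifying as Googlebot, in case the header is added per user agent.
curl -sI -A "Googlebot" https://example.com/some-page | grep -i '^x-robots-tag'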

B2) Check for a meta robots tag in the HTML

Fetch the page HTML and look for a robots meta tag:

curl -sL https://example.com/some-page | head -n 60

Then look for something like:

<meta name="robots" content="noindex">

or:

<meta name="robots" content="noindex, nofollow">
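
Instead of reading the whole head, you can filter for the tag directly. Attribute order and quoting vary between sites, so treat this as a rough filter, not a real HTML parser:

# Print any meta tag that mentions robots (covers name="robots" and similar variants).
curl -sL https://example.com/some-page | grep -io '<meta[^>]*robots[^>]*>'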

Gotcha: Many website builders add noindex automatically for staging, password-protected, “unpublished”, or “trial” sites.

B3) Check that “noindex” isn’t coming from a different page than the one you think

This is very common when there are redirects:

  • https://example.com/page redirects to https://www.example.com/page/
  • the final page has noindex (or different HTML/head content)

Use:

curl -IL https://example.com/page

Then re-run the curl -I and HTML checks against the final URL.
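
A small sketch that combines these steps: resolve the final URL first, then repeat both noindex checks against it (https://example.com/page is a placeholder):

# Follow redirects and capture the final URL.
final=$(curl -sL -o /dev/null -w '%{url_effective}' https://example.com/page)
echo "Final URL: $final"
# Re-check the header and the meta tag on the final URL.
curl -sI "$final" | grep -i '^x-robots-tag'
curl -sL "$final" | grep -io '<meta[^>]*robots[^>]*>'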

Website builder / hosting gotchas (very common)

Some builders/hosts manage robots.txt for you and/or expose UI toggles like “discourage search engines”, “hide from indexing”, or “site is in staging”.

If you’re using a website builder, check their docs for:

  • robots.txt management (where the file comes from, and whether you can override it)
  • environment/staging flags that add noindex
  • password protection / “coming soon” mode (often triggers noindex)

Extra checks that save time

  • Cache effects: CDNs can cache robots.txt. If you just changed it, purge cache for /robots.txt (and for both www/apex hostnames).
  • Wrong hostname: Google might crawl www while you’re checking apex (or vice versa). Always check both.
  • Different protocols: If anything still serves http://, confirm it redirects cleanly to https:// and that bots aren’t seeing a different robots.txt temporarily.
  • Verify with a bot-like fetch: Use the crawler simulator to see the final HTML/head the crawler sees, including meta robots and key headers.
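
To cover the wrong-hostname, wrong-protocol, and cache questions in one pass, a small loop over the variants helps. example.com is a placeholder; the grep keeps the status line plus any cache-related headers:

for u in https://example.com http://example.com https://www.example.com http://www.example.com; do
  echo "== $u/robots.txt"
  curl -sI "$u/robots.txt" | grep -iE '^(http/|age:|cache-control:|cf-cache-status:)'
done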