Debugging robots.txt and noindex issues

Figure out why Google/Bing say “blocked by robots.txt” or “noindex”, and how to fix it.

LovableHTML never sets noindex or X-Robots-Tag: noindex headers

We never touch your robots.txt or your headers. We are an SEO tool, after all: making your site un-indexable would work against the very reason you use our service. LovableHTML does not set your pages to noindex and does not add an X-Robots-Tag: noindex header to your origin pages. If you’re seeing noindex or “blocked by robots.txt”, it’s coming from your origin site, your CDN (Cloudflare, etc.), or your website builder/host configuration.

Step 0: Identify what the crawler is complaining about

These two problems are often confused, but the fixes are different:

  • “Blocked by robots.txt”: The crawler does not fetch the page at all, because your robots.txt rules disallow crawling it. (Google can still index the bare URL, without its content, if other pages link to it.)
  • “Excluded by ‘noindex’” (or “noindex detected”): The crawler can crawl the page, but it is instructed not to index it (via <meta name="robots" ...> or X-Robots-Tag).

If you’re not sure which one you’re dealing with, use our free tools to check.

Branch A — “Blocked by robots.txt”

A1) Fetch the exact robots.txt Google/Bing will fetch

robots.txt is per hostname and per protocol. This matters:

  • https://example.com/robots.txt can differ from https://www.example.com/robots.txt
  • http:// and https:// can differ if you’re redirecting in odd ways

In a terminal, run:

curl -i https://example.com/robots.txt
curl -i https://www.example.com/robots.txt

Things to look for:

  • HTTP status: 200 is expected. Google treats a 4xx as if there were no robots.txt at all (nothing is blocked), while a 5xx is treated as a temporary full block and can stop crawling until the error clears.
  • Accidental global block: User-agent: * followed by Disallow: /
  • Bot-specific blocks: User-agent: Googlebot / User-agent: Bingbot followed by a Disallow that matches your important pages
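
To skim just the rules that matter, you can filter what each hostname serves. Both hostnames are placeholders; use your own:

# Show only the directive lines (case-insensitive) from the robots.txt each hostname serves.
curl -s https://example.com/robots.txt | grep -iE '^(user-agent|disallow|allow|sitemap):'
curl -s https://www.example.com/robots.txt | grep -iE '^(user-agent|disallow|allow|sitemap):'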

A2) Confirm you’re not blocking the bots you care about

At minimum, ensure you are not unintentionally blocking:

  • Google: Googlebot
  • Bing: Bingbot

Gotcha: A rule like Disallow: / under User-agent: * blocks Googlebot and Bingbot unless they have their own, more specific User-agent group; crawlers obey only the most specific group that matches them, so a bot-specific group overrides the global one.
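
If you intentionally keep a broad block but still want Google and Bing to crawl, a minimal robots.txt sketch looks like the following. This illustrates the group behavior, not a recommendation for your site; adjust it to your own needs:

# Sketch only: crawlers follow the most specific matching User-agent group,
# so the Googlebot/Bingbot groups below override the "*" rule for those bots.
cat <<'EOF' > robots.txt
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
EOF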

A3) If your robots.txt contains Cloudflare-managed content

If your robots.txt has Cloudflare “managed” content or Cloudflare documentation language in it, you may have enabled Cloudflare settings that alter or prepend directives.

Read and verify these:

  • https://developers.cloudflare.com/bots/additional-configurations/block-ai-bots/
  • https://developers.cloudflare.com/bots/additional-configurations/managed-robots-txt/#implementation

Common failure mode: enabling a managed robots setting and assuming it “only affects AI bots”, but ending up with an overly broad User-agent: * rule or accidentally serving the wrong file on the hostname Google actually crawls.
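
One way to confirm whether Cloudflare is changing the file is to compare what Cloudflare serves with what your origin serves directly. This assumes you know your origin’s IP address (ORIGIN_IP below is a placeholder) and that the origin accepts direct HTTPS connections:

# What Cloudflare serves to the world:
curl -s https://example.com/robots.txt

# What your origin serves when you bypass Cloudflare's proxy
# (add -k only if the origin's certificate isn't valid for this hostname):
curl -s --resolve example.com:443:ORIGIN_IP https://example.com/robots.txt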

A4) Check redirect chains (the final hostname is what matters)

Sometimes robots.txt is fine on example.com, but Google is actually crawling www.example.com (or the reverse) due to redirects/canonicals.

curl -IL https://example.com/

If your homepage redirects, repeat the robots.txt fetch for the final hostname in that redirect chain.
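
A rough way to script that: resolve the final URL, pull out its hostname, and fetch robots.txt from there. example.com is a placeholder:

# Follow redirects without keeping the body, then print the final URL.
final=$(curl -sL -o /dev/null -w '%{url_effective}' https://example.com/)
# Extract the hostname from the final URL and fetch its robots.txt.
host=$(printf '%s\n' "$final" | awk -F/ '{print $3}')
curl -i "https://$host/robots.txt"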

Branch B — “Excluded by ‘noindex’” / “noindex detected”

B1) Check for X-Robots-Tag headers

Run:

curl -I https://example.com/some-page

If you see X-Robots-Tag: noindex (or similar), it’s typically set by:

  • your host/CDN configuration
  • a framework default for non-production environments
  • a security plugin / SEO plugin
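
To surface just that header, filter the response case-insensitively (the URL is a placeholder). Some setups only add the header for bot user agents, so it’s worth repeating the request with a bot-like User-Agent; no output means the header wasn’t present:

curl -sI https://example.com/some-page | grep -i '^x-robots-tag'
# Repeat while identifying as Googlebot, in case the header is added per user agent.
curl -sI -A "Googlebot" https://example.com/some-page | grep -i '^x-robots-tag'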

B2) Check for a meta robots tag in the HTML

Fetch the page HTML and look for a robots meta tag:

curl -sL https://example.com/some-page | head -n 60

Then look for something like:

<meta name="robots" content="noindex">

or:

<meta name="robots" content="noindex, nofollow">
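
Instead of reading the whole head, you can filter for the tag directly. Attribute order and quoting vary between sites, so treat this as a rough filter, not a real HTML parser:

# Print any meta tag that mentions robots (covers name="robots" and similar variants).
curl -sL https://example.com/some-page | grep -io '<meta[^>]*robots[^>]*>'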

Gotcha: Many website builders add noindex automatically for staging, password-protected, “unpublished”, or “trial” sites.

B3) Check that “noindex” isn’t coming from a different page than the one you think

This is very common when there are redirects:

  • https://example.com/page redirects to https://www.example.com/page/
  • the final page has noindex (or different HTML/head content)

Use:

curl -IL https://example.com/page

Then re-run the curl -I and HTML checks against the final URL.
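
A small sketch that combines these steps: resolve the final URL first, then repeat both noindex checks against it (https://example.com/page is a placeholder):

# Follow redirects and capture the final URL.
final=$(curl -sL -o /dev/null -w '%{url_effective}' https://example.com/page)
echo "Final URL: $final"
# Re-check the header and the meta tag on the final URL.
curl -sI "$final" | grep -i '^x-robots-tag'
curl -sL "$final" | grep -io '<meta[^>]*robots[^>]*>'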

Website builder / hosting gotchas (very common)

Some builders/hosts manage robots.txt for you and/or expose UI toggles like “discourage search engines”, “hide from indexing”, or “site is in staging”.

If you’re using a website builder, check their docs for:

  • robots.txt management (where the file comes from, and whether you can override it)
  • environment/staging flags that add noindex
  • password protection / “coming soon” mode (often triggers noindex)

Extra checks that save time

  • Cache effects: CDNs can cache robots.txt. If you just changed it, purge cache for /robots.txt (and for both www/apex hostnames).
  • Wrong hostname: Google might crawl www while you’re checking apex (or vice versa). Always check both.
  • Different protocols: If anything still serves http://, confirm it redirects cleanly to https:// and that bots aren’t seeing a different robots.txt temporarily.
  • Verify with a bot-like fetch: Use the crawler simulator to see the final HTML/head the crawler sees, including meta robots and key headers.
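
To cover the wrong-hostname, wrong-protocol, and cache questions in one pass, a small loop over the variants helps. example.com is a placeholder; the grep keeps the status line plus any cache-related headers:

for u in https://example.com http://example.com https://www.example.com http://www.example.com; do
  echo "== $u/robots.txt"
  curl -sI "$u/robots.txt" | grep -iE '^(http/|age:|cache-control:|cf-cache-status:)'
done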