Excluding Your Webflow Pages from Search Engine Indexing
There are several ways to disable search engine indexing for your Webflow website. Below are the key methods:
- Disable Indexing of the Webflow Subdomain: This can be done in the site settings with a simple toggle, which is part of our pre-launch checklist.
- Disable Indexing of Static Site Pages: Use the Sitemap indexing toggle found in the Page settings. This toggle adds <meta content="noindex" name="robots"> to your page, preventing it from being crawled and indexed by search engines.
- Disable Indexing of an Entire Folder: To do this, create a rule in the robots.txt file (see the example after this list).
- Disable Publishing of Empty Template Collection Pages: This is managed through a toggle in the Collection template page settings.
- Disable Indexing of Certain Pages in a Collection: This requires a one-time custom setup but can be easily managed afterward using the Option field in the Collection.
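For the folder method, a rule along these lines would block crawlers from everything under that folder; the folder name /private-folder/ is only a placeholder for illustration:

User-agent: *
Disallow: /private-folder/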
Understanding the Difference Between robots.txt and the <meta name="robots" content="noindex"> Tag
Both the robots.txt file and the <meta name="robots" content="noindex"> tag control how search engines interact with your website, but they function differently.
- robots.txt File
  - Purpose: Used to instruct web crawlers on which parts of your website they should or should not access. It prevents crawlers from reaching specific pages or directories.
  - Disallow Directive: Using "Disallow" in robots.txt prevents search engines from crawling specific URLs. However, if these URLs are linked from other websites, they can still be indexed based on the content found in those links, even though crawlers won't access the page content.
- <meta name="robots" content="noindex"> Tag
  - Purpose: This tag is placed directly in the HTML of a webpage and instructs search engines not to index that page. This ensures the page won't appear in search results, even if crawlers have access to it.
  - Usage: Useful when you want search engines to access a page but not include it in the search index. It can be combined with "nofollow" to prevent the passing of link equity (see the example after this list).
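To make the placement concrete, the robots meta tag sits in the page's <head>; a minimal example combining "noindex" with "nofollow" looks like this:

<head>
  <meta name="robots" content="noindex, nofollow">
</head>

Here "noindex" keeps the page out of search results, while "nofollow" tells crawlers not to follow the page's links or pass link equity through them.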
The key difference between the two lies in their visibility and application. The robots.txt file controls crawling by instructing search engines on which parts of the site they should avoid. However, it doesn't guarantee that a page won't be indexed if other pages link to it. The noindex meta tag, on the other hand, specifically prevents a page from being indexed, even if it is accessible to crawlers. In terms of application, robots.txt is particularly useful for managing access to large sections of your site, such as entire directories, while the noindex meta tag is better suited to controlling the indexation of individual pages.
In Summary
Use robots.txt when you want to block crawlers from accessing parts of your site, and use the noindex meta tag when you want specific pages not to appear in search engine results. Remember, anyone can read your site's robots.txt file, so the paths you list there may still allow people to identify and access your private content.
Blocking Query Parameters
To block specific query parameters using robots.txt, use the "Disallow" directive combined with the wildcard character (*). This helps prevent search engine crawlers from accessing URLs with those parameters, which is useful for managing duplicate content or preventing crawlers from indexing filtered or sorted versions of the same page.
Example: To block URLs that include the ?filter and ?sort query parameters:
User-agent: *
Disallow: /*?filter=
Disallow: /*?sort=
Explanation:
- /*?filter= blocks all URLs that contain the filter query parameter. For example:
  https://www.example.com/products?filter=color
  https://www.example.com/products?filter=size
- /*?sort= blocks all URLs that contain the sort query parameter. For example:
  https://www.example.com/products?sort=price
  https://www.example.com/products?sort=popularity
Important Notes:
- Wildcard Usage: The * wildcard matches any sequence of characters, allowing you to block URLs that contain the specified query parameter regardless of where it appears in the URL.
- Order Matters: If you're combining "Allow" and "Disallow" directives, place "Allow" rules before "Disallow" rules when they need to override them (see the sketch after this list).
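As a sketch of that ordering, the rules below would let crawlers reach one specific filtered URL while still blocking every other filter parameter; the /products?filter=featured path is a hypothetical example, not a Webflow default:

User-agent: *
Allow: /products?filter=featured
Disallow: /*?filter=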
By setting up your robots.txt file with these rules, you can effectively block search engines from crawling and indexing URLs with specific query parameters, helping to keep your search index focused on your site's main content.
Use Password Protection
To prevent the discovery of specific pages on your website, protect them with a password. Note that files uploaded to Webflow are publicly available and discoverable, though they won't necessarily be indexed by search engines if the file isn't on a publicly viewable webpage or linked elsewhere. Password-protecting the pages that reference those files therefore helps keep those assets from being discovered or indexed, because crawlers can't reach the pages that link to them.