# SEO

## Crawling vs. Indexing

Crawling is the automated discovery of pages and related content, whereas indexing is the inclusion of a given URL in the search index, where it may be displayed among the search results.

A URL can be indexed even if crawling it is not allowed (via a `Disallow:` rule), because search engines can infer what the URL contains from the links pointing to it.

## Affecting crawling and indexing behaviour

You can tell crawlers which pages to skip during crawling (`Disallow:`) and which to exclude from the search index altogether (`Noindex:`). There are three places to put these rules:

1. The `robots.txt` file (⚠️ Google does not support `Noindex:` here)
2. HTML document's `<head />` as `<meta />` tags
3. HTTP Response Headers

### Hide a single page from search with `Noindex:`

[Docs - Robots Meta Tag](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag)

```html
<meta name="robots" content="noindex" />
```

### Best Practices

- Don't `Disallow:` pages that contain a `Noindex:` rule, as the page needs to be crawled for the search engine to see the rule and respect it.
- Don't `Disallow:` CSS or images, as blocking them can affect how your layout is rendered and hinder successful crawling. To affect Google Image Search's behaviour, consider setting a `Disallow:` rule for `User-agent: Googlebot-Image` instead ([List of Google crawlers](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers)).
- Avoid `Allow:` rules, as all URLs are implicitly allowed to be crawled.

## Example robots.txt

```
# THIS IS A COMMENT
User-agent: *
Disallow: /*/archive
Disallow: /*/eula

User-agent: Googlebot-Image
Disallow: /*/images/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://sub.example.com/sitemap.xml
```

## Sitemaps

A sitemap file can contain up to 50,000 URLs (and be up to 50MB in size) and gives search engines the structure of your site.
While the crawler might still find your pages because other (crawled) pages link to them, including a page in the sitemap ensures that the crawler learns about it.

### Use `sitemapindex` to reference other sitemaps

```xml
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap.xml</loc>
    <lastmod>2023-09-22</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://sub.example.com/sitemap.xml</loc>
    <lastmod>2023-09-22</lastmod>
  </sitemap>
</sitemapindex>
```

## Sitemap.xml Example:

```xml
<urlset
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
  <url>
    <loc>https://example.com/en-us/</loc>
    <lastmod>2023-09-28</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
    <xhtml:link rel="alternate" hreflang="de-de" href="https://example.com/de-de/" />
    <xhtml:link rel="alternate" hreflang="en-us" href="https://example.com/en-us/" />
  </url>
</urlset>
```

**Note:** All child-elements except for `<loc />` are optional.

### Alternate Hreflang

This tag allows you to register alternative language versions of the same page. Search engines try to serve the most relevant content to users, and this is a further hint about the structure of your pages and the language they are served in.

**Note:** The list of alternates needs to be self-referential, so the URL in `<loc />` is repeated as a `rel="alternate"` link within its own `<url />` entry.

The `hreflang` property is mostly used by Google ([Docs](https://developers.google.com/search/docs/specialty/international/localized-versions)), whereas Bing ([Docs](https://blogs.bing.com/webmaster/2011/03/01/how-to-tell-bing-your-websites-country-and-language/)) pays more attention to:

1. `<meta http-equiv="content-language" content="en-us" />`
2. `<html lang="en-us" />`
3. HTTP Headers
4. Top-level domain
5. IP Address
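Coming back to the self-referential requirement for sitemap alternates: it is easy to get wrong when entries are written by hand, so generating them can help. The following is a minimal Python sketch, not a definitive implementation. The function name `build_sitemap` and the shape of its input (a mapping from `hreflang` code to URL) are assumptions made for illustration, and the URLs are the placeholder `example.com` addresses used above.

```python
import xml.etree.ElementTree as ET
from typing import Mapping

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Register prefixes so the output uses a default namespace and `xhtml:`.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_sitemap(alternates: Mapping[str, str]) -> str:
    """Return a sitemap with one <url> entry per language version.

    `alternates` (hypothetical input shape) maps an hreflang code
    to the URL of that language version of the same page.
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url in alternates.values():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        # List every language version, including page_url itself,
        # so each entry is self-referential.
        for hreflang, href in alternates.items():
            ET.SubElement(
                url,
                f"{{{XHTML_NS}}}link",
                {"rel": "alternate", "hreflang": hreflang, "href": href},
            )
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(
    {
        "en-us": "https://example.com/en-us/",
        "de-de": "https://example.com/de-de/",
    }
)
print(sitemap_xml)
```

Because every `<url />` entry is built from the same mapping, each language version lists all alternates, itself included. Optional elements such as `<lastmod />` are left out to keep the sketch short.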