
The topics in this section describe how you can control Google’s ability to find and parse your content in order to show it in Search and other Google properties, as well as how to prevent Google from crawling specific content on your site.

Here’s a brief description of each page. To get an overview of crawling and indexing, read our How Search works guide.

Topics
  • File types indexable by Google: Google can index the content of most types of pages and files. Explore a list of the most common file types that Google Search can index.
  • URL structure: Consider organizing your content so that URLs are constructed logically and in a manner that is most intelligible to humans.
  • Sitemaps: Tell Google about pages on your site that are new or updated.
  • Crawler management:
    • Make sure Googlebot isn’t blocked
    • Ask Google to recrawl your URLs
    • Reduce the Googlebot crawl rate
    • Verifying Googlebot and other crawlers
    • Large site owner’s guide to managing your crawl budget
    • How HTTP status codes, and network and DNS errors affect Google Search
    • Google crawlers
  • robots.txt: A robots.txt file tells search engine crawlers which pages or files the crawler can or can’t request from your site.
  • Canonical URLs: Tell Google about any duplicate pages on your site in order to avoid excessive crawling. Learn how Google auto-detects duplicate content, how it treats duplicate content, and how it assigns a canonical page to any duplicate page groups found.
  • Mobile sites: Learn how you can optimize your site for mobile devices and ensure that it’s crawled and indexed properly.
  • AMP: If you have AMP pages, learn how AMP works in Google Search.
  • JavaScript: There are some differences and limitations that you need to account for when designing your pages and applications to accommodate how crawlers access and render your content.
  • Page and content metadata:
    • Use valid HTML to specify page metadata
    • All meta tags that Google understands
    • Robots meta tag, data-nosnippet, and X-Robots-Tag specifications
    • Block indexing with the noindex meta tag
    • SafeSearch and your website
    • Make your links crawlable
    • Qualify your outbound links to Google with rel attributes
  • Removals:
    • Control what you share with Google
    • Remove a page hosted on your site from Google
    • Remove images hosted on your page from appearing in search results
    • Keep redacted information out of Google Search
  • Site moves and changes:
    • Redirects and Google Search
    • Site moves
    • Minimize A/B testing impact in Google Search
    • Temporarily pause or disable a website
  • International and multilingual sites: If your site contains content in different languages, or with different content for different locations, here’s how to help Google understand your site.

File types indexable by Google

Google can index the content of most types of pages and files. The most common file types we index include:

  • Adobe Portable Document Format (.pdf)
  • Adobe PostScript (.ps)
  • Google Earth (.kml, .kmz)
  • GPS eXchange Format (.gpx)
  • Hancom Hanword (.hwp)
  • HTML (.htm, .html, other file extensions)
  • Microsoft Excel (.xls, .xlsx)
  • Microsoft PowerPoint (.ppt, .pptx)
  • Microsoft Word (.doc, .docx)
  • OpenOffice presentation (.odp)
  • OpenOffice spreadsheet (.ods)
  • OpenOffice text (.odt)
  • Rich Text Format (.rtf)
  • Scalable Vector Graphics (.svg)
  • TeX/LaTeX (.tex)
  • Text (.txt, .text, other file extensions), including source code in common programming languages:
    • Basic source code (.bas)
    • C/C++ source code (.c, .cc, .cpp, .cxx, .h, .hpp)
    • C# source code (.cs)
    • Java source code (.java)
    • Perl source code (.pl)
    • Python source code (.py)
  • Wireless Markup Language (.wml, .wap)
  • XML (.xml)

Search by file type

You can use the filetype: operator in Google Search to limit results to a specific file type. For example, filetype:rtf galway will search for RTF files with the term “galway” in them.
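
For example, the queries below (the search terms and site are illustrative) restrict results to PDF files, and combine filetype: with the site: operator to find spreadsheets on a single site:

    filetype:pdf annual report
    filetype:xls site:example.com budget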

Keep a simple URL structure

A site’s URL structure should be as simple as possible. Consider organizing your content so that URLs are constructed logically and in a manner that is most intelligible to humans.

When possible, use readable words rather than long ID numbers in your URLs.

Recommended: Simple, descriptive words in the URL:

http://en.wikipedia.org/wiki/Aviation

Recommended: Localized words in the URL, if applicable.

https://www.example.com/lebensmittel/pfefferminz

Recommended: Use UTF-8 encoding as necessary. For example, the following URL uses UTF-8 encoding for Arabic characters:

https://www.example.com/%D9%86%D8%B9%D9%86%D8%A7%D8%B9/%D8%A8%D9%82%D8%A7%D9%84%D8%A9

The following example uses UTF-8 encoding for Chinese characters in the URL:

example.com/%E6%9D%82%E8%B4%A7/%E8%96%84%E8%8D%B7

The following example uses UTF-8 encoding for the umlaut in the URL:

https://www.example.com/gem%C3%BCse

The following example uses UTF-8 encoding for emojis in the URL:

example.com/%F0%9F%A6%99%E2%9C%A8

Not recommended: Using non-ASCII characters in the URL:

https://www.example.com/نعناع
https://www.example.com/杂货/薄荷
https://www.example.com/gemüse
https://www.example.com/?✨
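
You rarely need to construct the percent-encoded forms by hand; most languages provide a library routine. A minimal Python sketch (the path is a placeholder):

    import urllib.parse

    # quote() percent-encodes using UTF-8 by default and leaves "/"
    # unescaped, so the path structure is preserved
    path = "/gemüse"
    print(urllib.parse.quote(path))  # /gem%C3%BCse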

Not recommended: Unreadable, long ID numbers in the URL:

https://www.example.com/index.php?id_sezione=360&sid=3a5ebc944f41daa6f849f730f1

If your site is multi-regional, consider using a URL structure that makes it easy to geotarget your site. For more examples of how you can structure your URLs, refer to the guide to using locale-specific URLs.

Recommended: Country-specific domain:

example.de

Recommended: Country-specific subdirectory with gTLD:

example.com/de/

Consider using hyphens to separate words in your URLs, as this helps users and search engines identify concepts in the URL more easily. We recommend hyphens (-) rather than underscores (_) in your URLs.

Recommended: Hyphens (-):

https://www.example.com/summer-clothing/filter?color-profile=dark-grey

Not recommended: Underscores (_):

https://www.example.com/summer_clothing/filter?color_profile=dark_grey

Not recommended: Keywords in the URL joined together:

https://www.example.com/greendress
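
If your URLs are generated from titles or product names, a small helper can enforce the hyphenated style. A minimal Python sketch (the slugify helper and its rules are illustrative and ASCII-only; localized words would instead be percent-encoded as shown earlier):

    import re

    def slugify(text: str) -> str:
        # Lowercase, then collapse every run of characters outside
        # [a-z0-9] into a single hyphen and trim stray hyphens
        text = text.lower()
        return re.sub(r"[^a-z0-9]+", "-", text).strip("-")

    print(slugify("Summer Clothing"))  # summer-clothing
    print(slugify("green  dress!"))    # green-dress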

Overly complex URLs, especially those containing multiple parameters, can cause problems for crawlers by creating unnecessarily high numbers of URLs that point to identical or similar content on your site. As a result, Googlebot may consume much more bandwidth than necessary, or may be unable to completely index all the content on your site.

Common causes of this problem

Unnecessarily high numbers of URLs can be caused by a number of issues. These include:

  • Additive filtering of a set of items. Many sites provide different views of the same set of items or search results, often allowing the user to filter this set using defined criteria (for example: show me hotels on the beach). When filters can be combined in an additive manner (for example: hotels on the beach and with a fitness center), the number of URLs (views of data) on the site explodes; see the sketch after this list. Creating a large number of slightly different lists of hotels is redundant, because Googlebot needs to see only a small number of lists from which it can reach the page for each hotel. For example:
    • Hotel properties at “value rates”: https://www.example.com/hotel-search-results.jsp?Ne=292&N=461
    • Hotel properties at “value rates” on the beach: https://www.example.com/hotel-search-results.jsp?Ne=292&N=461+4294967240
    • Hotel properties at “value rates” on the beach and with a fitness center: https://www.example.com/hotel-search-results.jsp?Ne=292&N=461+4294967240+4294967270
  • Dynamic generation of documents. This can result in small changes because of counters, timestamps, or advertisements.
  • Problematic parameters in the URL. Session IDs, for example, can create massive amounts of duplication and a greater number of URLs.
  • Sorting parameters. Some large shopping sites provide multiple ways to sort the same items, resulting in a much greater number of URLs. For example:
    https://www.example.com/results?search_type=search_videos&search_query=tpb&search_sort=relevance&search_category=25
  • Irrelevant parameters in the URL, such as referral parameters. For example:
    https://www.example.com/search/noheaders?click=6EE2BF1AF6A3D705D5561B7C3564D9C2&clickPage=OPD+Product+Page&cat=79
    https://www.example.com/discuss/showthread.php?referrerid=249406&threadid=535913
    https://www.example.com/products/products.asp?N=200063&Ne=500955&ref=foo%2Cbar&Cn=Accessories
  • Calendar issues. A dynamically generated calendar might generate links to future and previous dates with no restrictions on start or end dates. For example:
    https://www.example.com/calendar.php?d=13&m=8&y=2011
  • Broken relative links. Broken relative links can often cause infinite spaces. Frequently, this problem arises because of repeated path elements. For example:
    https://www.example.com/index.shtml/discuss/category/school/061121/html/interview/category/health/070223/html/category/business/070302/html/category/community/070413/html/FAQ.htm
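
To see how quickly additive filtering multiplies URLs: n filters that can be combined freely produce 2^n distinct views of the same items. A small Python illustration (the filter names and URL format are hypothetical):

    from itertools import combinations

    filters = ["beach", "fitness-center", "value-rates", "pet-friendly"]
    # Every subset of filters is a distinct crawlable URL, so 4 filters
    # already yield 2**4 = 16 views of the same set of hotels
    urls = ["https://www.example.com/hotels?f=" + "+".join(subset)
            for r in range(len(filters) + 1)
            for subset in combinations(filters, r)]
    print(len(urls))  # 16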

Resolve this problem

To avoid potential problems with URL structure, we recommend the following:

  • Consider using a robots.txt file to block Googlebot’s access to problematic URLs. Typically, consider blocking dynamic URLs, such as URLs that generate search results, or URLs that can create infinite spaces, such as calendars. robots.txt rules don’t support full regular expressions, but the * and $ wildcards let you block large numbers of URLs with a few rules (see the sketch after this list).
  • Wherever possible, avoid the use of session IDs in URLs. Consider using cookies instead.
  • Whenever possible, shorten URLs by trimming unnecessary parameters.
  • If your site has an infinite calendar, add a nofollow attribute to links to dynamically created future calendar pages.
  • Check your site for broken relative links.
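
For instance, here is a hypothetical robots.txt that blocks the search-result and calendar URLs from the examples above using the * wildcard (the paths and the session-ID parameter name are placeholders for your own problematic URL patterns):

    User-agent: Googlebot
    # Block dynamic search-result pages
    Disallow: /hotel-search-results.jsp
    # Block any URL carrying a session-ID parameter
    Disallow: /*?*sessionid=
    # Block the infinitely paginating calendar
    Disallow: /calendar.php

Alternatively, to keep an infinite calendar crawlable while containing it, add rel="nofollow" to the links that point at dynamically created future dates, for example <a rel="nofollow" href="https://www.example.com/calendar.php?d=13&m=8&y=2011">.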
