Search Engine Crawler Working Functionality

This post explains how a search engine works and how it uses bots and other core programs to validate and rank a site in the SERP (Search Engine Results Page).

Spider or Crawler

A spider, or crawler, is a bot program written specifically to crawl a complete web page or post. "Spider" is simply another name for the crawler. Generally, every search engine has its own crawling software (bots) to crawl sites on the internet. The crawler fetches the data on each page or post.
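To make the "fetches the data" step a little more concrete, here is a minimal sketch of one thing a crawler does with a page it has fetched: pulling out the links so it knows where to go next. It works on an HTML string rather than a live download, and the names LinkExtractor, extract_links and SAMPLE_HTML, as well as the example.com URLs, are made up for illustration. Real search engine crawlers are far more involved.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (illustrative helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link found in the given HTML, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content standing in for a fetched page.
SAMPLE_HTML = '<p><a href="/about">About</a> <a href="http://example.com/post">Post</a></p>'

print(extract_links("http://example.com/", SAMPLE_HTML))
```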

Robots.txt file

Robots (crawlers) first check the website's robots.txt file, where the commands and rules for crawling the site are defined in plain text, one directive per line.
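The robots.txt file lives at the root of the site, so a crawler can work out its address from any page URL. A small sketch of that step, where the function name robots_txt_url and the sample page path are assumptions for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.esearchadvisors.com/blog/post?id=1"))
```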

Example: robots.txt file on www.esearchadvisors.com

User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: http://www.esearchadvisors.com/sitemap.xml.gz

Robot commands description

User-agent: * means the rules that follow apply to all search bots. A Disallow: line with no value blocks nothing.

Disallow: /cgi-bin/ blocks the /cgi-bin/ folder from being crawled. It contains files that should not be indexed, so bots don't crawl the files inside the /cgi-bin/ folder.

Sitemap: http://www.esearchadvisors.com/sitemap.xml.gz

This line gives the crawler the link to the sitemap. Bots fetch the links listed in the sitemap and can use them to discover pages and re-index the site more often.
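The directives described above can be checked with Python's standard urllib.robotparser module. This is only a sketch: the rules are pasted in as a string instead of being downloaded, the empty Disallow: line is left out to keep the example simple, and "ExampleBot" and the two test paths are made up. Note that site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Rules from the example above, minus the empty "Disallow:" line.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Sitemap: http://www.esearchadvisors.com/sitemap.xml.gz
"""

SITE = "http://www.esearchadvisors.com"


def load_rules(text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = load_rules(RULES)

# "ExampleBot" is a hypothetical crawler name; it falls under User-agent: *.
print(parser.can_fetch("ExampleBot", SITE + "/cgi-bin/form.cgi"))  # blocked folder
print(parser.can_fetch("ExampleBot", SITE + "/blog/"))             # not blocked
print(parser.site_maps())                                          # sitemap links
```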

Code to block all search engines except Google

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

The above code allows only Googlebot to fetch details from the site. Other search engines such as Yahoo, Bing, Ask, Yandex and Baidu that honor robots.txt won't fetch data from the site.
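One way to sanity-check a Google-only rule set like this is to ask Python's urllib.robotparser which bots may fetch a page. The helper name allowed_bots, the bot list and the www.example.com URL below are assumptions for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

GOOGLE_ONLY = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""


def allowed_bots(rules_text, bots, url):
    """Map each bot name to True/False depending on whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


result = allowed_bots(
    GOOGLE_ONLY,
    ["Googlebot", "Bingbot", "YandexBot"],
    "http://www.example.com/page.html",  # hypothetical page
)
for bot, allowed in result.items():
    print(bot, "allowed" if allowed else "blocked")
```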

Code to block specific file types

User-agent: *
Disallow: /*.gif$

This rule blocks .gif images from being crawled.

User-agent: *
Disallow: /*.jpg$

This rule blocks .jpg images from being crawled.

User-agent: *
Disallow: /*.png$

This rule blocks .png images from being crawled.

User-agent: *
Disallow: /*.mp4$

This rule blocks .mp4 videos from being crawled.
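In these patterns, * stands for any run of characters and $ marks the end of the URL. Python's built-in urllib.robotparser only does plain prefix matching, so the sketch below translates such patterns into regular expressions by hand to show how the matching works. The function names and sample paths are made up, and this is a simplified reading of the wildcard rules, not a full robots.txt matcher.

```python
import re

PATTERNS = ["/*.gif$", "/*.jpg$", "/*.png$", "/*.mp4$"]


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)


def is_blocked(path, patterns):
    """Return True if the URL path matches any of the Disallow patterns."""
    # re.match anchors at the start of the path, like a robots.txt rule.
    return any(robots_pattern_to_regex(p).match(path) for p in patterns)


for path in ["/images/logo.gif", "/images/logo.gif?size=2", "/videos/intro.mp4", "/index.html"]:
    print(path, "blocked" if is_blocked(path, PATTERNS) else "allowed")
```

Because of the trailing $, a URL such as /images/logo.gif?size=2 does not match the .gif rule, since it does not end in .gif.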
