Robots
Posted in /Software on 29 June 2012. This blog was written by Christiaan Schaake.
Preface
Nowadays the Internet consists of billions of Web pages all over the world. There are
no road signs that direct visitors to your site. So if you do not have enough money to
start a huge advertising campaign, you depend on search engines.
Search engines are used to find specific information on the Internet. Search engines
are constantly crawling (looking) over the Internet and indexing millions of pages per day.
You can easily add your page to a search engine's work list. Within a couple of weeks
the search engine will then crawl your page.
But the search engine uses a robot or spider, which is actually a program that
looks at your page and tries to get information about your site. The chance that your site
is indexed correctly is therefore slim. There are, however, ways to help and direct a search
spider so that the chances that Internet users find your site in search engines will
dramatically increase.
There are two methods of directing a search engine on your site.
- Robots.txt
- META tags
The META tags help the spider to get the correct information about a specific web page.
Robots.txt
The robots.txt file is used by search engine spiders to see what they may or may not
include in their search. The robots.txt file must always be located at the root
of the website. You cannot make a robots.txt for a specific part of the website.
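Because the file always lives at the root, the robots.txt location can be derived from any page URL on the site. The following is a minimal Python sketch of that idea. The function name robots_url and the page path /some/dir/page.html are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: robots.txt always sits at the root path,
    # no matter how deep the page itself is.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page somewhere below the root of the site.
print(robots_url("http://www.schaake.nu/some/dir/page.html"))
```

Whatever directory the page is in, the result points at the same file at the root of the host.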
An example of a robots.txt would be my own at:
http://www.schaake.nu/robots.txt
The robots.txt file consists of two commands. The first command, User-Agent, names the
specific user agent to which the directives that follow apply. So only that specific user
agent will look at those directives.
The second command, Disallow, restricts access to a specific directory or file on the website.
Let's explain this with a sample:
# Some stuff we don't want google to see
User-Agent: Googlebot
Disallow: /googlesecrets.html
Disallow: /cgi-bin
# All the other agents may also not index the cgi-bin
User-Agent: *
Disallow: /cgi-bin
With this example, all search engines will index the whole site except /cgi-bin.
Only Googlebot will additionally skip the /googlesecrets.html page.
Now what if we want the complete site to be indexed, because we don't have any secrets at all?
# Allow complete access
User-Agent: *
Disallow:
Or we could forbid one agent from indexing our site at all.
# Disallow the googlebot completely
User-Agent: googlebot
Disallow: /
Note that not every search engine looks at the robots.txt file. Most of the
big commercial search engines will respect your robots.txt file. But the robots of
spammers (who are looking for email addresses) will not be stopped by the robots.txt file.
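The first sample above can be checked with the urllib.robotparser module from the Python standard library. The sketch below parses those rules from a string instead of fetching them from a site. The agent name OtherBot and the paths /index.html and /cgi-bin/search.cgi are assumptions made up for the test, and a real spider may interpret the rules slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The rules from the first sample, with a blank line between the two records.
RULES = """\
# Some stuff we don't want google to see
User-Agent: Googlebot
Disallow: /googlesecrets.html
Disallow: /cgi-bin

# All the other agents may also not index the cgi-bin
User-Agent: *
Disallow: /cgi-bin
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Tell whether the given user agent may fetch the given path."""
    return parser.can_fetch(agent, path)


# "OtherBot" is a hypothetical agent that only matches the "*" record.
for agent in ("Googlebot", "OtherBot"):
    for path in ("/index.html", "/googlesecrets.html", "/cgi-bin/search.cgi"):
        print(agent, path, allowed(agent, path))
```

Only the Googlebot record blocks /googlesecrets.html, while /cgi-bin is blocked for every agent.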
META tags
The META tags contain information about a specific web page. Most spiders will look at the META tags and use this information instead of trying to collect information about the page themselves. The drawback of this is that when you have incorrect or outdated information in your META tags, the spider will use this information instead of looking at the page itself.
The following META tags can be used to help a spider.
<META NAME="description" CONTENT="Description of the webpage"/>
The description tag holds a short summary of the webpage. Keep it consistent with the <TITLE> tag
in the page header. Some search engines will still look at the title tag instead of the
description META tag!
<META NAME="keywords" CONTENT="keyword1, keyword2, keyword3"/>
To help a spider collect keywords for your site, you can include the keywords META tag.
This tag contains some useful keywords Internet users can use to find your Web page.
Keywords are separated with commas.
<META NAME="robots" CONTENT="index,follow"/>
The robots.txt file can forbid spiders from looking at specific pages or complete directories. But
sometimes you want some more control over the spider. The robots META tag gives you
that control for each individual page.
The first part of the tag tells the spider whether the current page may be indexed; the second part tells the spider whether it may follow hyperlinks on the current page. Possible options are:
- index,follow - Spiders may index the page and follow all links on the page.
- noindex,nofollow - Spiders may not index the page and may not follow any links.
- index,nofollow - The page may be indexed, but no links may be followed. This is very useful for pages that link to forms.
- noindex,follow - The page may not be indexed, but the spider may follow all links. A good example would be a dynamic weblog.
The shorthand "all" is equivalent to index,follow (e.g. <META NAME="robots" CONTENT="all"/>).
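How a spider might interpret these options can be sketched in a few lines of Python. This is an illustration only and not the code of any real search engine. The function name parse_robots_meta is made up, and the sketch assumes that a page is indexed and followed unless the tag says otherwise.

```python
def parse_robots_meta(content: str) -> tuple[bool, bool]:
    """Return (may_index, may_follow) for the CONTENT value of a robots META tag."""
    # Split on commas and normalise case and whitespace.
    tokens = {token.strip().lower() for token in content.split(",") if token.strip()}
    # Assumption: without an explicit restriction, indexing and following are allowed.
    # This also covers "index", "follow" and the shorthand "all".
    may_index = "noindex" not in tokens
    may_follow = "nofollow" not in tokens
    return may_index, may_follow


for value in ("index,follow", "noindex,nofollow", "index,nofollow", "noindex,follow", "all"):
    print(value, parse_robots_meta(value))
```

Treating the two parts independently keeps the function small: each of the four combinations in the list maps to one pair of booleans.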
<META HTTP-EQUIV="refresh" CONTENT="3600"/>
The refresh META tag tells the visiting program (normally a browser) to reload the page after
the given number of seconds, here 3600. Note that it is written with HTTP-EQUIV instead of NAME.
This directive could be used for internal search engines, but I see no reason
why a public search engine would refresh its indexed content for your specific page that often.
It will take weeks before a search engine visits your site again.
<META NAME="revisit-after" CONTENT="30 days"/>
This directive makes more sense, but I'm not sure whether any search engines look
at it. The above example asks the search engine to revisit the site after 30
days. So if a search engine would normally plan a revisit after 14 days, it can wait another
16 days before revisiting your site. This saves bandwidth.
<META NAME="generator" CONTENT="Microsoft Frontpage"/>
This META tag tells the spider which web design tool was used to generate or design this Web page.
A search engine could use this to build statistics on the usage of design tools.
<META NAME="language" CONTENT="nl, en"/>
This META tag defines the language used on the Web page. Normally a spider will try to
detect the language itself, but with this tag you can state it explicitly.
<META NAME="copyright" CONTENT="Copyright 2003 Christiaan Schaake."/>
<META NAME="author" CONTENT="Christiaan Schaake"/>
These two META tags tell the spider who wrote the page and who holds its copyright.
A search engine could include this information in the search results.
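To give an impression of how a spider could read the META tags described above, here is a minimal sketch using html.parser from the Python standard library. The class name MetaCollector and the small sample page are made up for illustration. A real spider does far more than this.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the NAME/CONTENT pairs of all META tags on a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical sample page using some of the tags discussed above.
PAGE = """<HTML><HEAD>
<TITLE>Robots</TITLE>
<META NAME="description" CONTENT="Description of the webpage"/>
<META NAME="author" CONTENT="Christiaan Schaake"/>
<META NAME="language" CONTENT="nl, en"/>
</HEAD><BODY>Some plain text.</BODY></HTML>"""

collector = MetaCollector()
collector.feed(PAGE)
for name, content in collector.meta.items():
    print(name, "=", content)
```

The collector only sees what is written in the tags, which is why outdated META information ends up in the index unchanged.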
Not all search engines will look at the META tags, so always use plain text for the important parts of your site. Do not make a welcome page that only contains a big image or a Shockwave animation. And make use of the title and alt attributes!