htaccess Elite

.htaccess tutorial


All times are UTC [ DST ]





 Post subject: Robots meta tag and googlebot
PostPosted: 07 Nov 2006 09:09 
Joined: 28 Oct 2006 07:37
Posts: 44
How do I change my site's title and description?

Google's creation of sites' titles and descriptions (or "snippets") is completely automated and takes into account both the content of a page and references to it that appear on the web. While we're unable to manually change titles or snippets for individual sites, please be assured that we're always working to make them as relevant as possible.

One source we use to generate snippets is the Open Directory Project. You can direct us not to use this as a source by adding a meta tag to your pages.

To prevent all search engines (that support the meta tag) from using this information for the page's description, use the following:

Code:
<META NAME="ROBOTS" CONTENT="NOODP">


To specifically prevent Google from using this information for a page's description, use the following:

Code:
<META NAME="GOOGLEBOT" CONTENT="NOODP">


If you use the robots meta tag for other directives, you can combine those. For instance:

Code:
<META NAME="GOOGLEBOT" CONTENT="NOODP, NOFOLLOW">


Note that once you add this meta tag to your pages, it may take some time for changes to your snippets to appear in the index.

If you're concerned about content in your title or snippet, you may want to double-check that this content doesn't appear on your site. If it does, changing it may affect your Google snippet after we next crawl your site. If it doesn't, try searching Google.com for the title or snippet enclosed in quotation marks. This will display pages on the web that refer to your site using this text. If you contact these webmasters to request that they change their information about your site, any changes to their sites will be recognized by our crawler after we next crawl their pages.


The "robots" tag is obeyed by many different web robots. To specify indexing restrictions only for googlebot, use "googlebot" in place of "robots". Example:
Code:
<meta name="googlebot" content="robots-terms" />


Googlebot obeys the noindex, nofollow, noarchive, and nosnippet terms of the robots META tag. If you place the tag in the head of your document, Google will not index the page, not follow its links, and/or not archive it, according to the terms you list.

The content value, shown above as "robots-terms", is a comma-separated list that may contain one or more of the following keywords: noindex, nofollow, noarchive, and nosnippet.

noindex: Document will not be indexed by Googlebot.

nofollow: Internal and external links in the document will not be followed by Googlebot.

noarchive: Google will not archive a copy of the document (Google's Cached Page).

nosnippet: Google will not display snippets and will not archive a copy of the document. A snippet is a text excerpt from the returned result page that has all query terms in bold.

If the robots META tag is missing, has no content, or specifies no robot terms, the terms are assumed to be "index, follow" (i.e. "all"), which is the default for most major search engine spiders anyway.
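
The rules above can be sketched in a few lines of Python. This is only an illustration, not how Googlebot is implemented: the class and function names are made up for the example, and it assumes that terms from a generic "robots" tag and a "googlebot" tag are simply combined. An empty result means the default "index, follow" applies.

```python
from html.parser import HTMLParser

# Restricting terms named in this post; anything else is ignored by the sketch.
KNOWN_TERMS = {"noindex", "nofollow", "noarchive", "nosnippet"}


class RobotsMetaParser(HTMLParser):
    """Collects the terms of <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.terms = {"robots": set(), "googlebot": set()}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in self.terms:
            content = attr.get("content") or ""
            self.terms[name] |= {
                term.strip().lower() for term in content.split(",") if term.strip()
            }


def googlebot_restrictions(html):
    """Return the restricting terms that would apply to Googlebot.

    Assumption: "robots" and "googlebot" terms are merged into one set.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.terms["robots"] | parser.terms["googlebot"]
    return found & KNOWN_TERMS
```

Calling googlebot_restrictions() on a page whose head contains the "noarchive, nofollow" tag returns those two terms; a page without any robots META tag returns an empty set.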



Examples of the Googlebot robots META tag
The tags and their effects are:
  1. noindex: Googlebot will retrieve the document, but it will not index it.
    Code:
    <meta name="googlebot" content="noindex" />
  2. nofollow: Googlebot will not follow any links on the page to other documents.
    Code:
    <meta name="googlebot" content="nofollow" />
  3. noarchive: Google maintains a cache of the documents it fetches, so that users can still reach the indexed content if the original host is inaccessible or the content has changed. If you do not want Google to archive a document from your site, place this tag in the head of the document, and Google will not provide an archived copy of it.
    Code:
    <meta name="googlebot" content="noarchive" />
  4. You can also combine any or all of the robots terms into a single META robots tag for the Googlebot.
    Code:
    <meta name="googlebot" content="noarchive, nofollow" />

Additional information on specific Googlebot Robots META Tags can be found at Google's Web Crawler page and also at Remove Content from Google's Index page.



Misinterpretation of the Standards

Googlebot's default behavior is "index, follow" (equivalent to "all"). The robots META tag below is therefore not required, and the Google guidelines do not suggest it: they describe the robots META tag as a way of restricting the indexing of content.
Code:
<meta name="googlebot" content="index, follow" />


Using the above robots META tag adds page weight that is not required: the extra markup shifts the text-to-HTML ratio of your documents.


Last edited by ti89 on 07 Nov 2006 09:46, edited 1 time in total.

PostPosted: 07 Nov 2006 09:39 
The Robots META tag is a simple mechanism to indicate to visiting Web Robots if a page should be indexed, or links on the page should be followed.

It differs from the Protocol for Robots Exclusion in that you need no effort or permission from your Web Server Administrator.

Note: Currently only a few robots support this tag!

Where to put the Robots META tag?

Like any META tag it should be placed in the HEAD section of an HTML page:

Code:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
...
</body>
</html>



What to put into the Robots META tag?

The content of the Robots META tag contains directives separated by commas. The currently defined directives are [NO]INDEX and [NO]FOLLOW.

The INDEX directive specifies if an indexing robot should index the page. The FOLLOW directive specifies if a robot is to follow links on the page. The defaults are INDEX and FOLLOW.

The values ALL and NONE set all directives on or off:

ALL = INDEX,FOLLOW
NONE = NOINDEX,NOFOLLOW

Some examples:

Code:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

Note that the "robots" name of the tag and its content are case-insensitive.

You obviously should not specify conflicting or repeating directives such as:

Code:
<meta name="robots" content="INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW">
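
As a rough illustration of the defaults, the ALL/NONE shorthands, and the conflict rule, here is a small Python sketch. The function name is made up for the example, and raising an error on a conflict is a choice of this sketch: the guide does not say how a robot should react to conflicting directives. Repeated but consistent directives (FOLLOW,FOLLOW) are tolerated here.

```python
def resolve_robots_content(content):
    """Turn a robots META content string into (index, follow) booleans.

    Missing directives fall back to the defaults INDEX and FOLLOW.
    Raises ValueError on unknown or conflicting directives.
    """
    seen = {}
    for raw in content.split(","):
        term = raw.strip().upper()  # names and content are case-insensitive
        if not term:
            continue
        if term == "ALL":
            pairs = [("index", True), ("follow", True)]
        elif term == "NONE":
            pairs = [("index", False), ("follow", False)]
        elif term in ("INDEX", "NOINDEX"):
            pairs = [("index", term == "INDEX")]
        elif term in ("FOLLOW", "NOFOLLOW"):
            pairs = [("follow", term == "FOLLOW")]
        else:
            raise ValueError(f"unknown directive: {term}")
        for key, value in pairs:
            if key in seen and seen[key] != value:
                raise ValueError(f"conflicting directives for {key}")
            seen[key] = value
    return seen.get("index", True), seen.get("follow", True)
```

For example, "noindex,follow" resolves to (False, True), an empty string resolves to the defaults (True, True), and the conflicting tag shown above is rejected with a ValueError.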


A formal syntax for the Robots META tag content is:
Code:
content    = all | none | directives
all        = "ALL"
none       = "NONE"
directives = directive ["," directives]
directive  = index | follow
index      = "INDEX" | "NOINDEX"
follow     = "FOLLOW" | "NOFOLLOW"
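
The grammar above is small enough to check with a regular expression. The following Python sketch is only one way to do it; the function name is made up, and allowing whitespace around the commas is an assumption of the sketch (the grammar itself does not mention whitespace). It checks syntax only, so a conflicting list such as INDEX,NOINDEX still matches.

```python
import re

# directive = index | follow, where each may carry a "NO" prefix
_DIRECTIVE = r"(?:NO)?(?:INDEX|FOLLOW)"

# content = all | none | directives
# Assumption: optional whitespace around commas and at both ends is accepted.
_CONTENT = re.compile(
    rf"\s*(?:ALL|NONE|{_DIRECTIVE}(?:\s*,\s*{_DIRECTIVE})*)\s*",
    re.IGNORECASE,
)


def matches_grammar(content):
    """Return True if content is syntactically valid under the grammar."""
    return _CONTENT.fullmatch(content) is not None
```

With this check, "index,follow" and "NONE" are accepted, while "ALL,NOINDEX" is rejected because ALL and NONE may only appear on their own.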


[url=http://www.robotstxt.org/wc/meta-user.html]HTML Author's Guide to the Robots META tag[/url]


PostPosted: 07 Nov 2006 09:41 
[url=http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt]SPIDERING BOF REPORT[/url]
Report by Michael Mauldin (Lycos)
(later edited by Michael Schwartz)

While the overall workshop goal was to determine areas where standards
could be pursued, the Spidering BOF attempted to reach actual standards
agreements about some immediate term issues facing robot-based search
services, at least among spider-based search service representatives who
were in attendance at the workshop (Excite, InfoSeek, and Lycos). The
agreements fell into four areas, but we report only three of them here
because the fourth area concerned a KEYWORDS tag that many workshop
participants felt was not appropriate for specification by this BOF
without the participation of other groups that have been working on that
issue.

The remaining three areas were:
  1. ROBOTS meta-tag

    <META NAME="ROBOTS"
    CONTENT="ALL | NONE | NOINDEX | NOFOLLOW">

    default = empty = "ALL"
    "NONE" = "NOINDEX, NOFOLLOW"

    The filler is a comma separated list of terms:
    ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.

    Discussion: This tag is meant to serve users who cannot control
    the robots.txt file at their sites. It provides a last chance to
    keep their content out of search services. It was decided not to
    add syntax to allow robot specific permissions within the meta-tag.

    INDEX means that robots are welcome to include this page in
    search services.

    FOLLOW means that robots are welcome to follow links from this
    page to find other pages.

    So a value of "NOINDEX" allows the subsidiary links to be explored,
    even though the page is not indexed. A value of "NOFOLLOW" allows the
    page to be indexed, but no links from the page are explored (this may
    be useful if the page is a free entry point into pay-per-view content,
    for example). A value of "NONE" tells the robot to ignore the page.
  2. DESCRIPTION meta-tag

    <META NAME="DESCRIPTION" CONTENT="...text...">

    The intent is that the text can be used by a search service when
    printing a summary of the document. The text should not contain
    any formatting information.
  3. Other issues with ROBOTS.TXT

    These are issues recommended for future standards discussion that
    could not be resolved within the scope of this workshop.

    - Ambiguities in the current specification
    http://www.kollar.com/robots.html

    - A means of canonicalizing sites, using:
    HTTP-EQUIV HOST
    ROBOTS.TXT ALIAS

    - ways of supporting multiple robots.txt files per site ("robotsN.txt")

    - ways of advertising content that should be indexed (rather than
    just restricting content that should not be indexed)

    - Flow control information: retrieval interval or maximum
    connections open to server


PostPosted: 30 Jan 2007 12:51 
Code:
#
#  Please, we do NOT allow nonauthorized robots.
#
#  http://www.webmasterworld.com/robots
#  Actual robots can always be found here for: http://www.webmasterworld.com/robots2
#  Old full robots.txt can be found here: http://www.webmasterworld.com/robots3
#
#  Any unauthorized bot running will result in IP's being banned.
#  Agent spoofing is considered a bot.
#
#  Fair warning to the clueless - honey pots are - and have been - running.
#  If you have been banned for bot running - please sticky an admin for a reinclusion request.
#
#  http://www.searchengineworld.com/robots/
#  This code found here: http://www.webmasterworld.com/robots.txt?view=rawcode

User-agent: *
Crawl-delay: 17

User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /QuickSand/
Disallow: /pda/
Disallow: /zForumFFFFFF/
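
One thing to watch in the file above: it has two separate "User-agent: *" groups, and parsers differ in how they treat that (some merge them, some only use one of them). A quick way to see what a given set of rules does is Python's standard urllib.robotparser. The sketch below is not the original file: it uses a merged single group as an assumption, and "ExampleBot" is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical merged form of the rules above: one "User-agent: *" group
# carrying both the crawl delay and the Disallow lines.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 17
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /QuickSand/
Disallow: /pda/
Disallow: /zForumFFFFFF/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the given agent may fetch the path under ROBOTS_TXT."""
    return parser.can_fetch(agent, path)
```

With the merged group, allowed("/gfx/logo.png") is False, allowed("/forum/index.html") is True, and parser.crawl_delay("ExampleBot") reports the delay of 17 seconds.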


PostPosted: 30 Jan 2007 12:51 
Code:
#
# WebmasterWorld.com: robots.txt
# GNU Robots.txt Feel free to use with credit
# given to WebmasterWorld.
#
# Please, we do NOT allow nonauthorized robots any longer.
# http://www.searchengineworld.com/robots/
# Yes, feel free to copy and use the following.


User-agent: OmniExplorer_Bot
Disallow: /

User-agent: FreeFind
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: Nutch
Disallow: /

User-agent: Jetbot/1.0
Disallow: /

User-agent: Jetbot
Disallow: /

User-agent: WebVac
Disallow: /

User-agent: Stanford
Disallow: /

User-agent: naver
Disallow: /

User-agent: dumbot
Disallow: /

User-agent: Hatena Antenna
Disallow: /

User-agent: grub-client
Disallow: /

User-agent: grub
Disallow: /

User-agent: looksmart
Disallow: /

User-agent: WebZip
Disallow: /

User-agent: larbin
Disallow: /

User-agent: b2w/0.1
Disallow: /

User-agent: Copernic
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Python-urllib
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: NetMechanic
Disallow: /

User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: CopyRightCheck
Disallow: /

User-agent: Crescent
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: ProWebWalker
Disallow: /

User-agent: CheeseBot
Disallow: /

User-agent: LNSpiderguy
Disallow: /

User-agent: Mozilla
Disallow: /

User-agent: mozilla
Disallow: /

User-agent: mozilla/3
Disallow: /

User-agent: mozilla/4
Disallow: /

User-agent: mozilla/5
Disallow: /

User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
Disallow: /

User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
Disallow: /

User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)
Disallow: /

User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP)
Disallow: /

User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000)
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: ia_archiver/1.6
Disallow: /

User-agent: Alexibot
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: Stanford Comp Sci
Disallow: /

User-agent: MIIxpc
Disallow: /

User-agent: Telesoft
Disallow: /

User-agent: Website Quester
Disallow: /

User-agent: moget/2.1
Disallow: /

User-agent: WebZip/4.0
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebSauger
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: NetAnts
Disallow: /

User-agent: Mister PiX
Disallow: /

User-agent: WebAuto
Disallow: /

User-agent: TheNomad
Disallow: /

User-agent: WWW-Collector-E
Disallow: /

User-agent: RMA
Disallow: /

User-agent: libWeb/clsHTTP
Disallow: /

User-agent: asterias
Disallow: /

User-agent: httplib
Disallow: /

User-agent: turingos
Disallow: /

User-agent: spanner
Disallow: /

User-agent: InfoNaviRobot
Disallow: /

User-agent: Harvest/1.5
Disallow: /

User-agent: Bullseye/1.0
Disallow: /

User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /

User-agent: CherryPickerSE/1.0
Disallow: /

User-agent: CherryPickerElite/1.0
Disallow: /

User-agent: WebBandit/3.50
Disallow: /

User-agent: NICErsPRO
Disallow: /

User-agent: Microsoft URL Control - 5.01.4511
Disallow: /

User-agent: DittoSpyder
Disallow: /

User-agent: Foobot
Disallow: /

User-agent: WebmasterWorldForumBot
Disallow: /

User-agent: SpankBot
Disallow: /

User-agent: BotALot
Disallow: /

User-agent: lwp-trivial/1.34
Disallow: /

User-agent: lwp-trivial
Disallow: /

User-agent: http://www.WebmasterWorld.com bot
Disallow: /

User-agent: BunnySlippers
Disallow: /

User-agent: Microsoft URL Control - 6.00.8169
Disallow: /

User-agent: URLy Warning
Disallow: /

User-agent: Wget/1.6
Disallow: /

User-agent: Wget/1.5.3
Disallow: /

User-agent: Wget
Disallow: /

User-agent: LinkWalker
Disallow: /

User-agent: cosmos
Disallow: /

User-agent: moget
Disallow: /

User-agent: hloader
Disallow: /

User-agent: humanlinks
Disallow: /

User-agent: LinkextractorPro
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Mata Hari
Disallow: /

User-agent: LexiBot
Disallow: /

User-agent: Web Image Collector
Disallow: /

User-agent: The Intraformant
Disallow: /

User-agent: True_Robot/1.0
Disallow: /

User-agent: True_Robot
Disallow: /

User-agent: BlowFish/1.0
Disallow: /

User-agent: http://www.SearchEngineWorld.com bot
Disallow: /

User-agent: http://www.WebmasterWorld.com bot
Disallow: /

User-agent: JennyBot
Disallow: /

User-agent: MIIxpc/4.2
Disallow: /

User-agent: BuiltBotTough
Disallow: /

User-agent: ProPowerBot/2.14
Disallow: /

User-agent: BackDoorBot/1.0
Disallow: /

User-agent: toCrawl/UrlDispatcher
Disallow: /

User-agent: WebEnhancer
Disallow: /

User-agent: suzuran
Disallow: /

User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /

User-agent: VCI
Disallow: /

User-agent: Szukacz/1.4
Disallow: /

User-agent: QueryN Metasearch
Disallow: /

User-agent: Openfind data gathere
Disallow: /

User-agent: Openfind
Disallow: /

User-agent: Xenu's Link Sleuth 1.1c
Disallow: /

User-agent: Xenu's
Disallow: /

User-agent: Zeus
Disallow: /

User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /

User-agent: RepoMonkey
Disallow: /

User-agent: Microsoft URL Control
Disallow: /

User-agent: Openbot
Disallow: /

User-agent: URL Control
Disallow: /

User-agent: Zeus Link Scout
Disallow: /

User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /

User-agent: Webster Pro
Disallow: /

User-agent: EroCrawler
Disallow: /

User-agent: LinkScan/8.1a Unix
Disallow: /

User-agent: Keyword Density/0.9
Disallow: /

User-agent: Kenjin Spider
Disallow: /

User-agent: Iron33/1.0.2
Disallow: /

User-agent: Bookmark search tool
Disallow: /

User-agent: GetRight/4.2
Disallow: /

User-agent: FairAd Client
Disallow: /

User-agent: Gaisbot
Disallow: /

User-agent: Aqua_Products
Disallow: /

User-agent: Radiation Retriever 1.1
Disallow: /

User-agent: WebmasterWorld Extractor
Disallow: /

User-agent: Flaming AttackBot
Disallow: /

User-agent: Oracle Ultra Search
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: PerMan
Disallow: /

User-agent: searchpreview
Disallow: /

User-agent: sootle
Disallow: /

User-agent: es
Disallow: /

User-agent: Enterprise_Search/1.0
Disallow: /

User-agent: Enterprise_Search
Disallow: /


User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /QuickSand/
Disallow: /pda/
Disallow: /zForumFFFFFF/
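
A list this long is easier to maintain if it is generated rather than typed by hand. The Python sketch below builds a file in the same shape (one "Disallow: /" group per blocked agent, then a catch-all group) and checks it with the standard urllib.robotparser. The helper name, the three-agent subset, and the "ExampleBot" agent are all just for illustration. Keep in mind that robots.txt is advisory: a bot that ignores it has to be blocked on the server side.

```python
from urllib.robotparser import RobotFileParser

# A small subset of the agents listed above, chosen only for the example.
BLOCKED_AGENTS = ["Wget", "WebZip", "Teleport"]
RESTRICTED_PATHS = ["/gfx/", "/cgi-bin/"]


def build_robots_txt(blocked, restricted):
    """Build robots.txt text: a full ban per blocked agent, then a '*' group."""
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in restricted]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(BLOCKED_AGENTS, RESTRICTED_PATHS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
```

Here parser.can_fetch("Wget/1.6", "/index.html") is False because Wget is banned outright, while a bot that is not on the list is only kept out of the restricted paths.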

