04-11-2007, 09:42 PM
#1
Inactive
Join Date: 01-21-04
Posts: 779
New Robots.txt Standard
From the V7N Search Blog:
Quote:
Google, Yahoo, MSN, and Ask have jointly announced a new robots.txt feature: sitemap auto-discovery.
“The new open-format autodiscovery allows webmasters to specify the location of their sitemaps within their robots.txt file, eliminating the need to submit sitemaps to each search engine separately.”
What are sitemaps?
A sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site. More information here. Formatting guidelines are here.
What is the robots.txt specification for a sitemap?
Sitemap: <sitemap_location>
http://blog.v7n.com/2007/04/11/auto-...via-robotstxt/
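For anyone curious how a crawler might pick up that line, here is a minimal PHP sketch. The function name find_sitemaps and the sample domain are made up for illustration, and matching the field name case-insensitively is an assumption, not something the announcement spells out.

```php
<?php
// Hypothetical helper (not from the announcement): collect every
// "Sitemap: <sitemap_location>" value found in the text of a robots.txt file.
function find_sitemaps($robots_txt) {
    $sitemaps = array();
    // Split on any common line ending.
    foreach (preg_split('/\r\n|\r|\n/', $robots_txt) as $line) {
        // Assumption: the field name is matched case-insensitively.
        // Commented-out lines start with "#" and therefore do not match.
        if (preg_match('/^\s*sitemap\s*:\s*(\S+)/i', $line, $match)) {
            $sitemaps[] = $match[1];
        }
    }
    return $sitemaps;
}

// Example robots.txt content; the domain is a placeholder.
$robots_txt = "User-agent: *\nDisallow: /private/\nSitemap: http://www.yourdomain.com/sitemap.xml\n";
print_r(find_sitemaps($robots_txt));
```

The Sitemap line stands apart from the User-agent groups, so the sketch scans every line of the file and does not track which group it is in.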

04-12-2007, 06:13 AM
#2
Inactive
Join Date: 03-29-07
Posts: 60
Thanks, very interesting.

04-12-2007, 07:27 PM
#4
Inactive
Join Date: 09-22-06
Location: Los Angeles
Posts: 678
See also this announcement from Yahoo!.
Quote:
Since working with Google and Microsoft [JB: and Ask and IBM] to support a single format for submission with Sitemaps, we have continued to discuss further enhancements to make it easy for webmasters to get their content to all search engines quickly.
Below, please find the PHP code for generating these types of sitemaps semi-automatically. It's FREE! Enjoy! (Please post modifications.)
Code:
<?php
/*########################################################
# Generates a sitemap per the specification found at:
#   http://www.sitemaps.org/protocol.html
# DOES NOT traverse directories
# 20070412 James Butler james at musicforhumans dot com
# Based on opendir() code by mike at mihalism dot com
#   http://us.php.net/manual/en/function.readdir.php#72793
# Free for all: http://www.gnu.org/licenses/lgpl.html
#
# Usage:
# 1) Save this as file name: sitemap_gen.php
# 2) Change variables noted below for your site
# 3) Place this file in your site's root directory
# 4) Run from http://www.yourdomain.com/sitemap_gen.php
#
# <lastmod>    -OPTIONAL
#   YYYY-MM-DD
# <changefreq> -OPTIONAL
#   always
#   hourly
#   daily
#   weekly
#   monthly
#   yearly
#   never
# <priority>   -OPTIONAL
#   0.0-1.0 [default 0.5]
#
# Add completed sitemap file to robots.txt:
#   Sitemap: http://www.yourdomain.com/sitemap.xml
########################################################*/
######## CHANGE THESE FOR YOUR SITE #########
# IMPORTANT: Trailing slashes are REQUIRED!
$my_domain = "http://www.yourdomain.com/";
$root_path_to_site = "/root/path/to/site/";
$file_types_to_include = array('html','htm');
############## END CHANGES ##################
$xml ="<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$xml.="<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
$xml.=" <url>\n";
$xml.=" <loc>".$my_domain."</loc>\n";
$xml.=" <priority>1.0</priority>\n";
$xml.=" </url>\n";
## START Modified mike at mihalism dot com Code ######
function file_type($file){
    // With PATHINFO_EXTENSION, pathinfo() returns an empty string for names
    // that have no dot, so a file called "xhtml" is no longer read as "html".
    return strtolower(pathinfo($file, PATHINFO_EXTENSION));
}
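$file_count = 0;
$files = array();
$path = opendir($root_path_to_site);
if ($path === false) {
    die("Cannot open directory: ".$root_path_to_site);
}
while (false !== ($filename = readdir($path))) {
    $files[] = $filename;
}
closedir($path);
sort($files);
foreach ($files as $file) {
    $extension = file_type($file);
    // is_file() skips '.', '..' and any other directory entries
    if(is_file($root_path_to_site.$file) && in_array($extension, $file_types_to_include)) {
        $file_count++;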
### END Modified mike at mihalism dot com Code ######
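        $xml.=" <url>\n";
        // URL-encode the file name and entity-escape the result so the entry stays valid XML
        $xml.=" <loc>".htmlspecialchars($my_domain.rawurlencode($file))."</loc>\n";
        // Use the full path so filemtime() works from any working directory
        $xml.=" <lastmod>".date("Y-m-d",filemtime($root_path_to_site.$file))."</lastmod>\n";
        $xml.=" <changefreq>monthly</changefreq>\n";
        $xml.=" <priority>0.5</priority>\n";
        $xml.=" </url>\n";
    }
}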
$xml.="</urlset>\n";
if($file_count == 0){
    echo "No files to add to the Sitemap\n";
}
else {
    // Write into the site root so the sitemap is served from $my_domain
    $sitemap_file = $root_path_to_site."sitemap.xml";
    // file_put_contents() creates or truncates the file and returns false on failure;
    // @ hides the warning because failure is handled by the fallback below
    $written = @file_put_contents($sitemap_file, $xml);
    if ($written === false) {
        // Fallback: try to create the file and loosen its permissions, then retry
        exec("touch ".escapeshellarg($sitemap_file));
        exec("chmod 666 ".escapeshellarg($sitemap_file));
        $written = @file_put_contents($sitemap_file, $xml);
        if ($written !== false) {
            exec("chmod 644 ".escapeshellarg($sitemap_file));
        }
    }
    if ($written !== false) {
        echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
        echo "Remove items you do not want included in the search engines.<br>\n";
        echo "Modify &lt;changefreq&gt; and &lt;priority&gt; to taste.<br>\n";
        echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
    }
    else {
        echo "File is not writable.<br>\n";
    }
}
?>
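The last step in the header comment (adding the Sitemap line to robots.txt) could be scripted too. Below is a rough sketch, not part of the generator above; the function name with_sitemap_line is made up, and it only recognizes an existing line that matches the new one exactly.

```php
<?php
// Hypothetical helper: return robots.txt text that contains the Sitemap
// line, leaving the existing rules untouched.
function with_sitemap_line($robots_txt, $sitemap_url) {
    $directive = "Sitemap: ".$sitemap_url;
    foreach (preg_split('/\r\n|\r|\n/', $robots_txt) as $line) {
        if (trim($line) === $directive) {
            return $robots_txt; // already present, nothing to do
        }
    }
    // Make sure the existing content ends with a newline before appending.
    if ($robots_txt !== "" && substr($robots_txt, -1) !== "\n") {
        $robots_txt .= "\n";
    }
    return $robots_txt.$directive."\n";
}

// Placeholder domain, same as in the generator settings.
$my_domain = "http://www.yourdomain.com/";
echo with_sitemap_line("User-agent: *\nDisallow:\n", $my_domain."sitemap.xml");
```

To apply it to a real site, read robots.txt with file_get_contents(), pass the text through the function, and write the result back with file_put_contents().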
Last edited by StupidScript : 04-12-2007 at 07:28 PM.
Reason: PROPER ATTRIBUTION FOR CONTRIBUTED CODE

04-13-2007, 03:24 AM
#5
Inactive
Join Date: 11-09-06
Posts: 446
This is a very good feature. I always find it amazing to have to submit my sitemaps to each search engine. This way I'll be able to focus on something else. Thank you.

04-15-2007, 08:25 PM
#6
Contributing Member
Join Date: 04-02-07
Location: San Francisco
Posts: 255
This is good news, and from the press releases I've been seeing, it seems like more and more search engines are jumping on the bandwagon.

04-15-2007, 09:22 PM
#7
Contributing Member
Join Date: 08-26-06
Posts: 241
I was wondering, how often do search engine bots read the robots.txt file? Every time?

04-16-2007, 03:10 PM
#8
Inactive
Join Date: 09-22-06
Location: Los Angeles
Posts: 678
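Not quite every time. Crawlers generally keep a cached copy of robots.txt and re-fetch it periodically (Google, for example, typically refreshes its copy about once a day), so a newly added Sitemap line may take a little while to be noticed.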

04-16-2007, 04:24 PM
#9
Inactive
Join Date: 11-09-06
Posts: 88
It's about time they implemented this. It's a great new standard to have. I know that Google has implemented it. Any word on Yahoo and the others?

04-16-2007, 05:06 PM
#10
Contributing Member
Join Date: 04-02-07
Location: San Francisco
Posts: 255
Yahoo, MSN, Ask, Google and others all came out in support of it.
I just added the line to my sites. No clue if it's working, but figured what the heck. 
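One thing that can be checked locally is whether the sitemap itself is well-formed. Here is a small sketch using PHP's DOM extension; the function name sitemap_locations and the sample XML are made up for illustration. It says nothing about whether a search engine has actually fetched the file.

```php
<?php
// Hypothetical check: parse sitemap XML and return its <loc> values,
// or false if the text is not well-formed XML.
function sitemap_locations($xml_text) {
    $doc = new DOMDocument();
    // @ hides parser warnings; a false return value signals malformed XML.
    if (!@$doc->loadXML($xml_text)) {
        return false;
    }
    $locations = array();
    foreach ($doc->getElementsByTagName('loc') as $node) {
        $locations[] = trim($node->textContent);
    }
    return $locations;
}

// Sample sitemap text with a placeholder domain.
$sample  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$sample .= "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
$sample .= " <url>\n <loc>http://www.yourdomain.com/</loc>\n </url>\n";
$sample .= "</urlset>\n";
print_r(sitemap_locations($sample));
```

For a real site, the text passed in would be the contents of the sitemap.xml file named in the robots.txt line.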

All times are GMT -7.
© Copyright 2008 V7 Inc