Sitemaps.org Data Driven Sitemap XML Generator (sitemap.xml for Google, Yahoo!, MSN and others)
Update: Get SitemapCFC at Google Code or at RIAForge.
If you're not familiar with Sitemaps, visit the sitemaps.org home page, which provides an overview of the Sitemaps protocol (adopted by Google, Yahoo! and Microsoft, among others).
I wanted to generate a sitemap.xml file to submit to Google, Yahoo!, etc. based on data from a simple CMS application's database. I ran some quick searches and was surprised not to find a CFC that did exactly what I was looking for. There are a few sitemap generators out there that crawl site links to produce a sitemap XML file, but I didn't find any that generated an XML file based on data. I know there are a number of applications that have built-in sitemap generator support (like BlogCFC and probably most modern blog and CMS apps), but I didn't find anything generic and flexible enough for my needs. Ray Camden did share a UDF that handles the basics very nicely, but I wanted to be able to pass in different URL collection types with flexible key/column names. I'd already cooked up two different (albeit simple) application-specific sitemap generators for apps that I maintain, so it was time to genericize and reuse!
I'll outline the Sitemap.cfc I created, its features and some examples. I will update this post with a link to RIAForge once I have the project approved for upload there. If you're not looking for a data-driven sitemap generator, but rather a crawler or spider style sitemap generator, then check out this "Google Sitemap XML Generator". For a data driven sitemap generator (or, if you want to use your own crawler and just need to model and generate a valid sitemap.xml file), read on...
I quickly put together a relatively simple Sitemap.cfc to suit my needs, but then I found myself adding more and more little enhancements. Since the sitemaps.org protocol is relatively simple, it wasn't too difficult to create a CFC to model it. I tried to keep it simple, but flexible enough to take a collection of URLs (and relevant metadata) in just about any form and spit out a valid sitemap.xml file.
SitemapCFC Feature Overview
- Use a list, query or array (of structs) to initialize a sitemap object.
- Query column or struct key names used to initialize a Sitemap.cfc object are not important; an optional init() argument can be used to map to standard sitemaps.org protocol tag names.
- Write your sitemap.xml file to disk or dynamically send a sitemap XML document to the browser as binary page output (cfcontent type text/xml).
- Debugging methods available to access a Sitemap.cfc object's URL collection in the form of an array, an XML object or the raw XML string.
- XML document is schema validation ready.
- All initialization data values are cleaned (entity escaping, date/time format, valid string values, etc.) and validated.
- Date(/time) values for the <lastmod> tags can be passed in as any valid date/time string or object; they will be automatically converted to UTC in proper W3C Datetime format (again, per sitemaps.org protocol).
Read on for a few examples...
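To illustrate the <lastmod> conversion listed above, here is a minimal sketch of how a local date/time value can be turned into a UTC W3C Datetime string in ColdFusion. This is not the code inside Sitemap.cfc; the function name toW3CDatetimeUTC is made up for this example, and it assumes the input is a string ColdFusion can parse as a date.

```cfml
<cfscript>
// Hypothetical helper (not part of Sitemap.cfc): converts a date/time
// string in the server's local time zone to a UTC W3C Datetime string.
function toW3CDatetimeUTC(dateValue) {
    // parseDateTime() turns a string such as '2/27/2008 12:05 AM' into a date object
    var localTime = parseDateTime(arguments.dateValue);
    // dateConvert() shifts the value from the server's local time zone to UTC
    var utcTime = dateConvert('local2utc', localTime);
    // Assemble the yyyy-mm-ddTHH:mm:ssZ form used by the sitemaps.org protocol
    return dateFormat(utcTime, 'yyyy-mm-dd') & 'T' & timeFormat(utcTime, 'HH:mm:ss') & 'Z';
}

writeOutput( toW3CDatetimeUTC('2/27/2008 12:05 AM') );
</cfscript>
```

The offset applied depends on the server's time zone, which is why the lastmod values in the sample output below are shifted by a few hours from the dates passed in.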
Code Usage Examples
Usage is quite simple. Generally, a Sitemap.cfc object will be initialized with a collection of URLs, and then a sitemap.xml file is written to disk.
List of URLs as Sitemap Source Collection
The most basic sitemap object is initialized simply with a list of URLs as its collection:
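<cfscript>
// Initialize the sitemap from a comma-delimited list of URLs
urlList = 'http://site.com/,http://site.com/page2.html';
sitemap = createObject('component', 'Sitemap').init(urlList);
// Write the generated XML to sitemap.xml in the current directory
sitemap.toFile( expandPath('sitemap.xml') );
</cfscript>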
Source written to sitemap.xml file:
<urlset
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://site.com/</loc>
</url>
<url>
<loc>http://site.com/page2.html</loc>
</url>
</urlset>
Array as Sitemap Source Collection
The following example shows initialization from an array of structs, with standard key names.
<cfscript>
// Each struct uses the standard sitemaps.org tag names as keys
page1 = { loc='http://site.com/',
          lastmod='2/27/2008',
          changefreq='weekly',
          priority='1.0'
        };
// Keys that are not part of the protocol are ignored
page2 = { loc='http://site.com/page2.html',
          lastmod='2006-5-1',
          extraKeysIgnored='anything'
        };
collection = [page1, page2];
sitemap = createObject('component', 'Sitemap').init(collection);
sitemap.toFile( expandPath('sitemap.xml') );
</cfscript>
Source written to sitemap.xml file:
<urlset
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://site.com/</loc>
<lastmod>2008-02-27T05:00:00Z</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>http://site.com/page2.html</loc>
<lastmod>2006-05-01T04:00:00Z</lastmod>
</url>
</urlset>
Query as Sitemap Source Collection
The following example shows initialization from a query with non-standard column names, which are mapped to the standard tag names with a collectionKeyMap argument:
<cfscript>
// Build a query whose column names do not match the protocol tag names
collection = queryNew('pageUrl,lastUpdated,other,columns,ignored');
// Map protocol tag names (keys) to the query's column names (values)
collectionKeyMap = { loc='pageUrl', lastmod='lastUpdated' };
queryAddRow(collection);
querySetCell(collection, 'pageUrl', 'http://site.com/');
querySetCell(collection, 'lastUpdated', '2/27/2008 12:05 AM');
queryAddRow(collection);
querySetCell(collection, 'pageUrl', 'http://site.com/page2.html');
querySetCell(collection, 'lastUpdated', '2006-5-1 04:20 PM');
sitemap = createObject('component', 'Sitemap').init(collection, collectionKeyMap);
sitemap.toFile( expandPath('sitemap.xml') );
</cfscript>
Source written to sitemap.xml file:
<urlset
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://site.com/</loc>
<lastmod>2008-02-27T05:05:00Z</lastmod>
</url>
<url>
<loc>http://site.com/page2.html</loc>
<lastmod>2006-05-01T20:20:00Z</lastmod>
</url>
</urlset>
Summary
So, if you have any sort of URL collection data that you can match up with the sitemaps.org protocol, then you should be able to use this Sitemap.cfc to quickly and easily generate your sitemap.xml file(s).
Again, I hope to have this code available via RIAForge.org very soon, but feel free to contact me if you don't see it there yet but would like to use it. I'll update this post as soon as it's up. Update: Get SitemapCFC at Google Code or at RIAForge.
I'd love to hear from you if you find this useful, find it lacking features, or have suggestions.
thanks.
Yes, Google has limits on the Sitemap XML files submitted. It's no more than 50,000 URLs in one file and no more than 10MB when uncompressed, as indicated here:
http://www.google.com/support/webmasters/bin/answe...
My SitemapCFC code does not currently do anything to automatically limit or break up the data/files, but I was thinking of placing a limit or warning. I also think it would be a great idea to add support (probably a couple more CFCs to the project) to be able to create multiple files and also generate a Sitemap Index File:
http://sitemaps.org/protocol.php#index
For now, you could probably get away with writing a very small amount of code to query 50,000 record chunks of URL data and create multiple Sitemap XML files. Assuming you would only have a handful of files, it will be quite easy to manually create your Sitemap Index XML file, as indicated in the link above.
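As a rough illustration of the chunking idea above, here is a minimal sketch that splits a URL collection into files of at most 50,000 entries and writes a Sitemap Index file by hand. It only uses the init() and toFile() calls shown in the post; the variable names, the file names (sitemap1.xml, sitemap_index.xml) and the baseUrl value are assumptions made for the example, and allUrls stands in for whatever your own query returns.

```cfml
<cfscript>
// Assumptions for this sketch: allUrls is an array of URL strings built
// elsewhere, and baseUrl is the public location of the generated files.
allUrls = ['http://site.com/', 'http://site.com/page2.html'];
baseUrl = 'http://site.com/';
maxPerFile = 50000;
fileCount = ceiling(arrayLen(allUrls) / maxPerFile);
newline = chr(10);

// Start the Sitemap Index document (format per sitemaps.org)
indexXml = '<?xml version="1.0" encoding="UTF-8"?>' & newline
    & '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' & newline;

for (fileNum = 1; fileNum lte fileCount; fileNum = fileNum + 1) {
    // Work out which slice of allUrls belongs in this file
    firstRow = (fileNum - 1) * maxPerFile + 1;
    lastRow = min(fileNum * maxPerFile, arrayLen(allUrls));

    // Build an array of structs, one of the collection types init() accepts
    chunk = arrayNew(1);
    for (i = firstRow; i lte lastRow; i = i + 1) {
        page = structNew();
        page.loc = allUrls[i];
        arrayAppend(chunk, page);
    }

    // Write this chunk to its own sitemap file
    fileName = 'sitemap' & fileNum & '.xml';
    sitemap = createObject('component', 'Sitemap').init(chunk);
    sitemap.toFile( expandPath(fileName) );

    // Reference the file from the index, escaping XML entities in the URL
    indexXml = indexXml & '<sitemap><loc>' & xmlFormat(baseUrl & fileName) & '</loc></sitemap>' & newline;
}

indexXml = indexXml & '</sitemapindex>';
fileWrite( expandPath('sitemap_index.xml'), indexXml );
</cfscript>
```

With a real data set you would probably query each chunk separately rather than hold every URL in memory at once, and you would submit sitemap_index.xml instead of the individual files.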
I really do like the idea of adding this functionality to my project, so I'll do my best to find time. In the meantime, I'd be happy to send you a quick code sample to demonstrate what you could do now. Feel free to use this site's contact form or e-mail me at jamie at this domain name.
Best,
Jamie
I have a site with user-generated content. I have a script to generate the site map I want and have this cron'ed to happen nightly. My question is, when a user adds content and brand new URLs are added to the sitemap, will Googlebot and others know about the new content when they next crawl the site? Do they read my sitemap.xml, or just rely on the one submitted?
Thanks for any help.
-Levi
I honestly don't know much about Google's indexing frequency, but I believe they will revisit your sitemap.xml on a regular basis, and I believe the frequency varies from site to site. I suggest checking out the Webmaster Tools Help articles at http://www.google.com/support/webmasters/.
Best,
Jamie