Sitemap: from Perl to Python

The Beginning

I had just downloaded Python's HTMLgen from Olivier Andrich's Python RPM collection. I saw a post on comp.lang.python from Matej Cepl. He was trying to figure out how to use HTMLgen, and he mentioned translating Eric S. Raymond's sitemap from Perl to Python using HTMLgen. I thought that it sounded like a fun little project for a weekend.

Sitemap is small and simple enough that I thought it would make a good first example of translating Perl to Python. In fact, I first thought of the idea for a Perl to Python journal while I was discussing the translation with Matej. I have tried to follow the four steps presented above so that I can discuss some general Python programming concepts in juxtaposition with the original Perl program.

The First Step: Understanding

Eric Raymond's original sitemap is a fairly straightforward program. It traverses a directory hierarchy, parsing .htm, .html, and .shtml files. It stores each file's name, title, and description. The description this script uses is embedded in a <META> tag with a "name=description" attribute. Then it prints a single HTML document with all of the files grouped first by their depth in the directory hierarchy, then by directory, and finally alphabetically.
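
To make that description concrete, here is a minimal sketch of the same traversal in modern Python 3. It is neither Eric Raymond's code nor my translation of it. The names PageScanner and scan_tree are hypothetical, and I am assuming that sorting on (depth, directory, file name) is close enough to the grouping described above.

```python
import os
from html.parser import HTMLParser

HTML_EXTENSIONS = (".htm", ".html", ".shtml")


class PageScanner(HTMLParser):
    """Collect the <TITLE> text and the content of the description <META> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan_tree(root):
    """Return sorted (depth, directory, name, title, description) tuples."""
    pages = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(HTML_EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            scanner = PageScanner()
            with open(path, encoding="utf-8", errors="replace") as handle:
                scanner.feed(handle.read())
            relative = os.path.relpath(path, root)
            depth = relative.count(os.sep)  # 0 for files directly under root
            pages.append((depth, os.path.dirname(relative), name,
                          scanner.title.strip(), scanner.description))
    # Tuple ordering sorts by depth, then directory, then file name.
    pages.sort()
    return pages
```

Calling scan_tree(".") returns a list that a separate function could turn into the single HTML document.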

The Second Step: Translating

For this one program, I will first give a nearly literal translation into Python. See the code for comments on the changes from the Perl version. (As in Perl, comments are indicated by a # character; everything following the # on a line is a comment. In the HTML version, comments are blue.) In particular, note that I replaced the fake nested list from the Perl version with a true nested list. I also used Python's string module to do much of the work that regular expressions did in the Perl version. Notice that Python refuses to handle all of our errors for us: it won't make many assumptions about what we meant to write or what it should do with an error. (I also have a copy of the literal sitemap translation without comments, if you'd just like to see the code without the comments to distract you.)
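
As a small illustration of trading a regular expression for plain string operations, here is a sketch that pulls the title out of a page. It is not taken from the translation. The function name extract_title is hypothetical, and I have written it with Python 3 string methods, which cover the same ground as the string module functions.

```python
def extract_title(html):
    """Return the text between <title> and </title>, or '' if either is missing."""
    # Search a lowercased copy so that <TITLE> and <title> both match.
    # Assumption: lowercasing does not change the length of the text
    # (true for ASCII), so indexes into the copy are valid for the original.
    lowered = html.lower()
    start = lowered.find("<title>")
    if start == -1:
        return ""
    start += len("<title>")
    end = lowered.find("</title>", start)
    if end == -1:
        return ""
    # split() followed by join() collapses runs of whitespace and newlines.
    return " ".join(html[start:end].split())
```

Note that find() returns -1 instead of guessing when the tag is missing, so the code has to decide what to do in that case.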

The Third Step: Pythonizing

While the version presented above made some modifications to the original sitemap, it wasn't a typical sample of Python code. In the first Pythonized version of sitemap, I replace the ugly code that parses the configuration file with a simple call to execfile(). Doing so also permits much greater customization in the user's .sitemaprc file. For example, the user can redefine functions such as indsort to obtain a different sorting of the indexed entries. The generation of the header and footer has been moved into functions so that they, too, can be redefined in the user's .sitemaprc file. Most of the sitemap.py file simply defines module globals: the configuration dictionary, the functions, and the PageInfo class.
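
Here is a minimal sketch of the configuration idea, not the actual sitemap.py code. execfile() exists only in Python 2, so this sketch uses the Python 3 equivalent of reading the file and passing it to exec(). The names load_settings and default_indsort and the "title" key are hypothetical. Only config, indsort, and .sitemaprc come from the discussion above.

```python
import os


def default_indsort(entries):
    """Default ordering of index entries: plain alphabetical."""
    return sorted(entries)


def load_settings(rc_path):
    """Run the user's rc file as Python code on top of the defaults."""
    settings = {
        "config": {"title": "Site Map"},   # hypothetical default value
        "indsort": default_indsort,
    }
    if os.path.exists(rc_path):
        with open(rc_path) as handle:
            source = handle.read()
        # The rc file sees 'config' and 'indsort' as globals, so it can
        # change a setting or replace a function outright.
        exec(compile(source, rc_path, "exec"), settings)
    return settings
```

A .sitemaprc file for this sketch might contain nothing more than config["title"] = "My Site" and a new def indsort(entries). Because the file is executed as code, this approach is only suitable for files the user controls.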

The "main" work of the program is done at the very end, guarded by an if __name__ == '__main__': conditional. Thus, the file can be imported without having side effects outside of its own namespace. The if __name__ == '__main__': conditional is a standard Python idiom. Using it helps you to think about code reuse. Instead of writing a flat script, move common operations into functions or classes, then call those functions from the protected main part of your program. Later, if you need to perform a similar task, your script is ready to export its functions and classes to other programs. They simply import the script or names from the script. Of course, I should be thinking about reusability in every language, but Python makes it easy for me to focus on reusability and not simply on how to use the language. And, as always, there really are times when you just need a quick hack... in which case a flat script will do. Just remember to throw it away before it becomes a 1000-line monstrosity that someone else has to debug. ;-) If you download a version, this one is probably the one you should get.
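
The idiom described above can be sketched as follows. This is an illustrative skeleton in Python 3, not the real sitemap.py. The function names make_header, make_footer, build_page, and main are hypothetical, as is the default title.

```python
import sys


def make_header(title):
    """Return the opening HTML for the generated page."""
    return "<html><head><title>%s</title></head><body>" % title


def make_footer():
    """Return the closing HTML for the generated page."""
    return "</body></html>"


def build_page(title, body=""):
    """Assemble a complete page from the header, body, and footer."""
    return make_header(title) + body + make_footer()


def main(argv):
    # Use the first command-line argument as the title, if there is one.
    title = argv[1] if len(argv) > 1 else "Site Map"
    print(build_page(title))


# This block runs only when the file is executed as a script.
# When the file is imported, nothing is printed.
if __name__ == "__main__":
    main(sys.argv)
```

Another program can import this file and call build_page() directly, without triggering the output.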

The Fourth Step: Expanding

Of course, the original idea was to learn HTMLgen. (I haven't finished testing this program. I'll put it up later.)