Backing up your Google Docs

I've gradually become a big fan of Google Docs over the last few months. My process for writing most of these posts has been to write the first drafts on Google Docs, then copy things over to WordPress and clean up any lingering formatting issues. I still wouldn't consider it to be a full-fledged replacement for Microsoft Office (which is also how I feel about Open Office, a discussion for another day) but it meets my basic needs.

Being able to access everything from any computer is also a nice improvement over my old process. I used to compose posts in gVim, keep the revisions in Subversion, and then copy them over to WordPress. This ended up being pretty cumbersome when switching from machine to machine.

The biggest problem I have with Google Docs is that I don't entirely trust it yet. It's a black box and the inner workings are only visible to a third party. What would happen if Google Docs blew up, or they decided to start charging for it, or there was just some random glitch and all of my data disappeared? I don't ever want to be in a situation where the only copy of my data is in a place that I don't have full control over.

The easiest way to take care of that is to always have a copy of my data somewhere else, ideally back at my house where it can be included in my existing offsite backup strategy. Google Docs does have the ability to save a document as a file in a variety of formats (just right-click on it in the items list) but, as we all should know by now, manual backups don't work. I want something that will just sit in the background and download all of my Google Docs stuff automatically, without my ever having to think about it.

And it turns out that it's pretty easy to write something to do just that, thanks to the Google Documents List Data API. All of the code snippets are in Python (because I need some Python practice) but it all boils down to calling different URLs so a PHP version should be pretty easy to come up with.

Step 1: Authentication

I've talked about authenticating against a Google account using PHP. Here's a simple little snippet to get the Auth code using Python instead:

~~~~ {.python name="code"}
import urllib
import urllib2


def getAuthInfo(email, password, source, service='writely', accountType='GOOGLE'):
    loginUrl = 'https://www.google.com/accounts/ClientLogin'
    loginData = {
        'accountType': accountType,
        'Email': email,
        'Passwd': password,
        'service': service,
        'source': source,
        'session': 1
    }

    # POST the credentials; the response body is one key=value pair per line
    req = urllib2.Request(loginUrl, urllib.urlencode(loginData))
    res = urllib2.urlopen(req)
    data = res.read()

    authInfo = {}
    for item in data.split():
        # Split on the first '=' only, in case a token value contains one
        fields = item.split('=', 1)
        authInfo[fields[0]] = fields[1]

    return authInfo
~~~~

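The Auth value from that response has to be sent along with every later request as an HTTP header. In Python the HTTP header is just a dictionary that we pass along with the URL when building the request.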
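A minimal sketch of what a `getDocList` helper could look like. The feed URL is an assumption, inferred from the `<id>` value in the sample entry, and `GoogleLogin auth=` is the header scheme that ClientLogin tokens are normally sent with. `buildDocListRequest` is a hypothetical helper name. The import falls back so the sketch runs on either Python 2 or Python 3.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

# Assumed feed URL, inferred from the <id> values in the document list feed
DOC_LIST_URL = 'http://docs.google.com/feeds/documents/private/full'


def buildDocListRequest(authToken, url=DOC_LIST_URL):
    # The HTTP headers are a plain dictionary handed to Request with the URL
    headers = {'Authorization': 'GoogleLogin auth=' + authToken}
    return Request(url, headers=headers)


def getDocList(authToken):
    # Send the request and return the raw XML feed
    res = urlopen(buildDocListRequest(authToken))
    return res.read()
```

Keeping the request construction separate from the network call makes it possible to inspect the headers without actually contacting Google.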

The response to this API call is a big block of XML containing one <entry> element for each document. An <entry> element looks like this:

~~~~ {.xml name="code"}
<entry>
  … type="text/html" />
  <author>
    <name>test.user</name>
    <email>test.user@gmail.com</email>
  </author>
  <category … scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/docs/2007#document" />
  <category … scheme="http://schemas.google.com/g/2005/labels" term="http://schemas.google.com/g/2005/labels#starred" />
  <id>http://docs.google.com/feeds/documents/private/full/document%3Adocument_id</id>
  … type="text/html" />
  <link … rel="self" type="application/atom+xml" />
  <title>Test Document</title>
  <updated>2007-07-03T18:02:50.338Z</updated>
</entry>
~~~~

The code snippet below shows authenticating, getting the document list, and looping over each entry in the list. This version extracts the document id from the URL in the <id> element. I still need to check whether the gd:etag attribute of the entry is also the document id, because that would be a much cleaner way of getting the id.

This code also grabs the document type (document, presentation, or spreadsheet) and takes the document title and cleans it up so it's suitable to use as an output filename.

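~~~~ {.python name="code"}
import re
from xml.dom import minidom

# getAuthInfo is the function from Step 1; getDocList and downloadDoc
# are the other helper functions described in this post
authInfo = getAuthInfo('username@gmail.com', 'password', 'My Backup Script', 'writely')
docListXML = getDocList(authInfo['Auth'])
docList = minidom.parseString(docListXML)

for entry in docList.getElementsByTagName('entry'):
    ids = entry.getElementsByTagName('id')
    categories = entry.getElementsByTagName('category')
    titles = entry.getElementsByTagName('title')

    # The document id is the part of the <id> URL after the encoded colon
    docIDLink = ids[0].firstChild.nodeValue
    fields = docIDLink.split('%3A')
    docID = fields[-1]

    # Strip everything except letters, digits, and spaces from the title
    title = titles[0].firstChild.nodeValue
    cleanTitle = re.sub('[^a-zA-Z0-9 ]', '', title)

    # The first category's label is the document type
    categoryLabel = categories[0].attributes['label'].value

    downloadDoc(docID, categoryLabel, cleanTitle)
~~~~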
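To try the parsing logic without a Google account, here is a small offline sketch that runs the same steps against a hand-written feed. `SAMPLE_FEED` is made up for illustration: the `label` attribute value is an assumption based on the loop reading `attributes['label']`, and `parseEntry` is a hypothetical helper name. The character class is spelled out as `a-zA-Z` because a range like `A-z` also matches a few punctuation characters.

```python
import re
from xml.dom import minidom

# Hypothetical one-entry feed; the label value and the title are made up
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <category scheme="http://schemas.google.com/g/2005#kind"
              term="http://schemas.google.com/docs/2007#document"
              label="document" />
    <id>http://docs.google.com/feeds/documents/private/full/document%3Adocument_id</id>
    <title>Test Document: Draft #2</title>
  </entry>
</feed>"""


def parseEntry(entry):
    # The document id is whatever follows the URL-encoded colon in <id>
    docIDLink = entry.getElementsByTagName('id')[0].firstChild.nodeValue
    docID = docIDLink.split('%3A')[-1]

    # Keep only letters, digits, and spaces so the title is a safe filename
    title = entry.getElementsByTagName('title')[0].firstChild.nodeValue
    cleanTitle = re.sub('[^a-zA-Z0-9 ]', '', title)

    # The first <category> carries the document type in its label attribute
    category = entry.getElementsByTagName('category')[0]
    categoryLabel = category.attributes['label'].value

    return docID, categoryLabel, cleanTitle


docList = minidom.parseString(SAMPLE_FEED)
results = [parseEntry(e) for e in docList.getElementsByTagName('entry')]
print(results)
```

Each tuple in `results` holds the three values that get handed to `downloadDoc`.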

You just pass downloadDoc a document id, a categoryLabel (document, presentation, or spreadsheet), and an output filename. This version defaults to Microsoft Word format for documents, PowerPoint for presentations, and Excel for spreadsheets.
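The format defaults could be expressed as a small lookup table. This is only a sketch of the naming side of the download step: `EXPORT_EXTENSIONS` and `outputFilename` are hypothetical names, the file extensions are my assumption for the three formats mentioned, and the export URLs themselves are left out.

```python
# Assumed file extensions for the default export formats
EXPORT_EXTENSIONS = {
    'document': 'doc',       # Microsoft Word
    'presentation': 'ppt',   # PowerPoint
    'spreadsheet': 'xls',    # Excel
}


def outputFilename(categoryLabel, cleanTitle, docID):
    ext = EXPORT_EXTENSIONS.get(categoryLabel)
    if ext is None:
        raise ValueError('Unknown document type: %s' % categoryLabel)
    # Fall back to the document id when cleaning left an empty title
    base = cleanTitle.strip() or docID
    return '%s.%s' % (base, ext)
```

Falling back to the document id covers titles made up entirely of characters that the cleanup step strips out.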

This gives us the basic pieces for writing a very simple script to download all of the files from a Google Documents account. I've got it set up to run as a nightly cron job. It's not what I would consider "production-quality" code, but it's more than enough to give me some peace of mind.

Wishlist for the next version

  • Download spreadsheets
  • Get folder list from Google Docs and put downloaded files into appropriate folders
  • Don't download files that haven't changed since the last download
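One possible way to approach the last wishlist item, as a rough sketch rather than a finished design: remember each document's `<updated>` timestamp from the feed in a local JSON file, and skip the download when it hasn't changed. The function names and the state file layout are hypothetical, and this assumes the `<updated>` value changes whenever the document does.

```python
import json
import os


def loadState(path):
    # Map of docID -> last seen <updated> timestamp; empty on the first run
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def needsDownload(state, docID, updated):
    # Download when the document is new or its timestamp has changed
    return state.get(docID) != updated


def saveState(path, state):
    # Persist the timestamps so the next run can compare against them
    with open(path, 'w') as f:
        json.dump(state, f)
```

In the main loop this would mean reading the `<updated>` element for each entry, calling `needsDownload` before `downloadDoc`, and recording the new timestamp only after the download succeeds.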