Migrating from Google Sites to WordPress

I recently migrated from Google Sites to WordPress. There is a lot to like about Google Sites (high uptime, zero maintenance), and I was quite happy there.

The primary reasons for my move are: I have a server anyway, and I would like to allow comments on some of my posts. I could not find any tools that allowed me to export and import Google Sites. I tried importing from the RSS feed, but eventually I found a much better option: Python (I don’t do much programming in Python, but it is really a pleasure to work in for these types of tasks).

First pass

  • Parse and traverse my Google Sites pages (using BeautifulSoup).
  • Clean the HTML of the main content using regexps.
  • Find all images in the content. Download each one to a local file.
  • Upload the images to WordPress using python-wordpress-xmlrpc, and replace the image URLs in the content HTML.
  • Create new posts using python-wordpress-xmlrpc. In this pass, aim to set the content, the title, and the date (I used dateutil.parser to parse the date of each post).
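
The image steps above can be sketched with the standard library alone. This is not my original code: it swaps BeautifulSoup for `html.parser`, and the names `ImageCollector`, `find_image_urls`, `rewrite_image_urls`, and the `upload` callback are hypothetical. In a real run, `upload` would download the file, push it to WordPress (for example via python-wordpress-xmlrpc), and return the new URL.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here, because
        # the default handle_startendtag delegates to handle_starttag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src not in self.sources:
                self.sources.append(src)


def find_image_urls(html):
    """Return the distinct image URLs found in an HTML fragment."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


def rewrite_image_urls(html, upload):
    """Replace every image URL with whatever upload(old_url) returns.

    `upload` is an assumed callback standing in for the download and
    upload steps. Plain string replacement is a simplification: it also
    rewrites <a href> links to the same image (usually desirable), but it
    will miss URLs whose raw HTML contains entities such as &amp;.
    """
    for old_url in find_image_urls(html):
        html = html.replace(old_url, upload(old_url))
    return html
```

Passing the upload step in as a function keeps the HTML handling testable without a network connection; a stub such as `lambda url: "/wp-content/uploads/" + url.rsplit("/", 1)[-1]` is enough for a dry run.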

Second pass

All the internal links on the new site are still pointing to the old site. I fixed this in a second pass.

  • Get a list of all posts over XML-RPC. From that, build a list of all the post titles.
  • Loop over each post, and for each URL pointing to the old site, check whether its basename has a close match in the list of WordPress titles. For that I used difflib.get_close_matches.
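
The matching step can be tried in isolation before touching any posts. The sketch below is a variation, not the code I ran: `normalise` and `match_title` are hypothetical helpers, and lower-casing plus turning hyphens into spaces is an added assumption meant to make URL slugs look more like titles. The `cutoff` of 0.6 is the default that `difflib.get_close_matches` uses.

```python
from difflib import get_close_matches
from posixpath import basename
from urllib.parse import unquote, urlparse


def normalise(text):
    """Lower-case and treat hyphens/underscores as spaces."""
    return " ".join(text.lower().replace("-", " ").replace("_", " ").split())


def match_title(old_url, titles, cutoff=0.6):
    """Return the title closest to the last path segment of old_url, or None."""
    # urlparse drops the query string; basename keeps only the final segment.
    slug = basename(urlparse(old_url).path.rstrip("/"))
    # Map normalised titles back to the originals so the real title is returned.
    by_key = {normalise(title): title for title in titles}
    hits = get_close_matches(normalise(unquote(slug)), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None
```

Raising `cutoff` trades missed links for fewer wrong matches, so it is worth printing every old link next to its chosen title and reviewing the list before writing anything back.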

I did not find a way to automatically download non-image Google Sites attachments (such as PDFs). Google appears to have deliberately made that difficult, and I did not investigate further.

I’d love to share my code. Unfortunately, I had to go through a learning process, and my approach was not nearly as clean as outlined above. So you’ll have to make do with snippets to get you started. I am not a Python programmer, so please excuse the style:

# Ported to Python 3. Several of these imports are only needed by the
# first pass (scraping, image download/upload), which is not shown here.
from bs4 import BeautifulSoup
import requests
import re
import shutil
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.compat import xmlrpc_client
from wordpress_xmlrpc.methods import media, posts
from urllib.parse import urlparse
from os.path import splitext, basename
from glob import glob
from difflib import get_close_matches
from dateutil import parser
import mimetypes

mimetypes.init()
wpUserName = 'username'
wpPassword = 'password'
wpUrl = 'http://blog.example.com/xmlrpc.php'
client = Client(wpUrl, wpUserName, wpPassword)

# Second pass: fetch up to 1000 posts and keep their titles for fuzzy matching.
allposts = client.call(posts.GetPosts({'number': 1000}))
posttitles = [post.title for post in allposts]
for post in allposts:
    print("-------------------------")
    print(repr(post.title))
    txt = str(post.content)
    # Capture the last path segment (basename) of every link to the old site.
    links = re.findall(r'href="[^>"]*google-sites\.example\.com[^>"]*/([^>"\?]*)', txt)
    for link in links:
        titlematches = get_close_matches(link, posttitles)
        if titlematches:
            # Best match comes first; look up the post that carries that title.
            matchingpost = next((i for i in allposts if i.title == titlematches[0]), None)
            # Swap the whole quoted URL containing this basename for the new permalink.
            txt = re.sub(r'"[^>\n"]*' + re.escape(link) + r'[^>\n"]*"', '"' + matchingpost.link + '"', txt)
            print(repr(link) + " -> " + repr(matchingpost.link))
        else:
            print("NO MATCH FOR LINK:" + repr(link))
    # Make links to the new site root-relative, then save the post.
    post.content = re.sub(r"https?://blog\.example\.com/", "/", txt)
    client.call(posts.EditPost(post.id, post))

Update: I have since decided to migrate to a static website generator: Hugo.
