Here’s a little Python script that searches through a given file for <pre> tags and encodes anything in between. This is useful when you are moving away from syntax highlighting plugins and replacing every occurrence of the code shortcode with a <pre> tag.
    import re, sys

    # First argument is the filename, output is filename.encoded
    filename = sys.argv[1]
    f = open(filename)
    output = open('%s.encoded' % filename, 'w+')

    # Read the whole file, fire the regular expressions
    contents = f.read()
    expr = re.compile(r'<pre>(.*?)</pre>', re.MULTILINE|re.DOTALL)
    matches = expr.findall(contents)

    # Loop through each match and replace < > with &lt; and &gt;
    for match in matches:
        contents = contents.replace(match, match.replace('<', '&lt;').replace('>', '&gt;'))

    # Write output file and close both files
    output.write(contents)
    output.close()
    f.close()
Most syntax highlighting plugins will encode all entities on the fly for you so when you stop using them your code might break. Also, most highlighting plugins will render your TinyMCE visual editor useless when working with code, and I think it’s quite common to work with code using the visual editor in WordPress. At least Twenty Ten and Twenty Eleven understand that ;)
However, as seen from the replacement part, I don’t really encode all entities but only replace the greater-than and less-than symbols. That’s enough for most cases, but if you need real entity encoding you should use the cgi.escape function (html.escape in Python 3, since cgi.escape was removed in Python 3.8), which is similar to htmlspecialchars in PHP.
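If you do want fuller escaping, here is a minimal sketch of how it could look. It is not the script above: it swaps the find-and-replace loop for re.sub with a callback, and it uses html.escape, the Python 3 replacement for cgi.escape. The names PRE_BLOCK and encode_pre_blocks are made up for this example.

```python
import html
import re

# Same pattern as the script, but with the tags captured so they can be put back.
PRE_BLOCK = re.compile(r'(<pre>)(.*?)(</pre>)', re.DOTALL)

def encode_pre_blocks(text):
    """Escape &, < and > inside every <pre>...</pre> block of text."""
    def _escape(match):
        opening, body, closing = match.groups()
        # quote=False leaves " and ' untouched, which is what cgi.escape did by default.
        return opening + html.escape(body, quote=False) + closing
    return PRE_BLOCK.sub(_escape, text)

print(encode_pre_blocks('<p>x</p><pre>if a < b: <b>hi</b></pre>'))
# prints: <p>x</p><pre>if a &lt; b: &lt;b&gt;hi&lt;/b&gt;</pre>
```

One thing to watch: because this also escapes ampersands, running it over content that is already encoded will double-encode it (&lt; becomes &amp;lt;), so it is probably only safe to run once per dump.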
Feed this script with your database dump and it’ll create a new file with an .encoded suffix which you can feed back to MySQL. Please note though that this script reads the entire input file into memory, which may lead to slow execution, high memory usage and swapping when working with large files. It worked fine on my 30 megabyte database though.
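If memory does become a problem, one option is to process the dump line by line instead of reading it whole. The sketch below is only a rough idea and rests on two assumptions: that every <pre>...</pre> pair sits on a single line of the file (mysqldump normally escapes newlines inside values as \n, so this tends to hold for dumps, but check yours), and that the file is UTF-8. The function names encode_line and encode_file are hypothetical.

```python
import re

PRE_BLOCK = re.compile(r'<pre>(.*?)</pre>', re.DOTALL)

def encode_line(line):
    """Replace < and > inside each <pre>...</pre> pair found on one line."""
    def _encode(match):
        body = match.group(1).replace('<', '&lt;').replace('>', '&gt;')
        return '<pre>' + body + '</pre>'
    return PRE_BLOCK.sub(_encode, line)

def encode_file(src_path, dst_path):
    """Stream src_path to dst_path one line at a time.

    Assumes UTF-8 input; newline='' keeps the original line endings intact.
    """
    with open(src_path, encoding='utf-8', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(encode_line(line))
```

You would call it as encode_file('dump.sql', 'dump.sql.encoded'), where dump.sql stands in for your own file. Keep in mind that a single line of a dump can still be very large, so this lowers peak memory rather than bounding it.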
And how did this script handle <pre> blabla <pre> bla </pre> bla </pre>?
Sergey, good question. Before running it I was pretty sure there were no nested <pre> elements, but if you think you might have some then you should check for them right before replacement if you want to leave them as they are. If you'd like to encode them too I think it'll handle it just fine though, you should give it a test ;) Thanks for your comment!
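For anyone who wants to run that check, here is a small sketch. The helper name find_nested_pre is made up for this example. It relies on the fact that the non-greedy pattern stops at the first closing tag, so a nested opening tag ends up inside the matched body while the outer closing tag is left outside any match.

```python
import re

PRE_BLOCK = re.compile(r'<pre>(.*?)</pre>', re.DOTALL)

def find_nested_pre(text):
    """Return the bodies of matched blocks that contain another <pre> tag."""
    return [body for body in PRE_BLOCK.findall(text) if '<pre>' in body]

sample = '<pre> blabla <pre> bla </pre> bla </pre>'
print(find_nested_pre(sample))
# prints: [' blabla <pre> bla ']
```

An empty list means no nested blocks were found and the replacement should behave as described in the post. Anything else is worth looking at by hand before encoding.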