Beautiful Soup tiny bug
Beautiful Soup is great for parsing random bits of crummy HTML. However, I think I’ve found a small bug, and I’m putting it up here in case anyone else runs into the same thing. If the HTML declares a charset of “windows-1252” in its meta tag, the charset in that tag isn’t rewritten to utf-8, even though the content itself is converted. If you change the case of the encoding name (for example “Windows-1252”), or pass the same encoding explicitly via fromEncoding, it’s fine. The transcript further down shows the problem. To fix the bug, apply the following patch to BeautifulSoup.py (currently version 3.0.5):
@@ -1505,25 +1505,26 @@
         if httpEquiv and contentType: # It's an interesting meta tag.
             match = self.CHARSET_RE.search(contentType)
             if match:
+                newCharset = match.group(3)
                 if getattr(self, 'declaredHTMLEncoding') or \
-                       (self.originalEncoding == self.fromEncoding):
+                       self.originalEncoding == self.fromEncoding or \
+                       self.originalEncoding.lower() == newCharset.lower():
                     # This is our second pass through the document, or
                     # else an encoding was specified explicitly and it
-                    # worked. Rewrite the meta tag.
+                    # worked, or we're already the encoding the meta tag
+                    # specifies. Rewrite the meta tag.
                     newAttr = self.CHARSET_RE.sub\
                               (lambda(match):match.group(1) +
                                "%SOUP-ENCODING%", value)
                     attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                                newAttr)
                     tagNeedsEncodingSubstitution = True
-                else:
+                elif newCharset:
                     # This is our first pass through the document.
                     # Go through it again with the new information.
-                    newCharset = match.group(3)
-                    if newCharset and newCharset != self.originalEncoding:
-                        self.declaredHTMLEncoding = newCharset
-                        self._feed(self.declaredHTMLEncoding)
-                        raise StopParsing
+                    self.declaredHTMLEncoding = newCharset
+                    self._feed(self.declaredHTMLEncoding)
+                    raise StopParsing
         tag = self.unknown_starttag("meta", attrs)
         if tag and tagNeedsEncodingSubstitution:
             tag.containsSubstitutions = True
Transcript showing problem
$ python
Python 2.4.3 (#1, May 18 2006, 07:40:45)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> doc = """<html>
... <meta http-equiv="Content-type" content="text/html; charset=Windows-1252">
... Sacr\xe9 bleu!
... </html>"""
>>> print BeautifulSoup(doc).prettify()
<html>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
Sacré bleu!
</html>
>>> doc = """<html>
... <meta http-equiv="Content-type" content="text/html; charset=windows-1252">
... Sacr\xe9 bleu!
... </html>"""
>>> print BeautifulSoup(doc).prettify()
<html>
<meta http-equiv="Content-type" content="text/html; charset=windows-1252" />
Sacré bleu!
</html>
>>> doc = """<html>
... <meta http-equiv="Content-type" content="text/html; charset=windows-1252">
... Sacr\xe9 bleu!
... </html>"""
>>> print BeautifulSoup(doc, fromEncoding='windows-1252').prettify()
<html>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
Sacré bleu!
</html>
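
If you want to poke at the condition the patch changes without loading Beautiful Soup, here is a rough stand-alone sketch. The two function names are made up for illustration and are not part of the library; their arguments stand in for self.declaredHTMLEncoding, self.originalEncoding, self.fromEncoding and the charset matched in the meta tag. The first-pass values used at the bottom are my assumption about what the parser holds for the failing document, not something I have traced in a debugger.

```python
def old_rewrites_meta(declared, original, explicit):
    # Condition before the patch: this is the second pass, or an
    # explicitly requested encoding turned out to be the one used.
    return bool(declared) or original == explicit


def patched_rewrites_meta(declared, original, explicit, meta_charset):
    # Condition after the patch: also rewrite the tag when the meta
    # charset names the encoding already in use, ignoring case.
    # The None guard is an addition of this sketch, not part of the patch.
    same_name = (original is not None
                 and original.lower() == meta_charset.lower())
    return bool(declared) or original == explicit or same_name


# Assumed first-pass state for the failing document: nothing declared
# yet, no fromEncoding given, "windows-1252" already detected.
print(old_rewrites_meta(None, "windows-1252", None))
print(patched_rewrites_meta(None, "windows-1252", None, "windows-1252"))
```

Under those assumptions the old condition comes out False and the patched one True. In the unpatched code the else branch then does nothing either, because newCharset is not different from self.originalEncoding, so the meta tag is left untouched, which matches the second case in the transcript.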