Saturday, July 26, 2008

Strip HTML tags using Python

We often need to strip HTML tags from a string (or HTML source). I usually do it with a simple regular expression in Python. Here is my function to strip HTML tags:
def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

Here is another function that collapses each run of consecutive whitespace into a single space:
def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)

Note that the re module needs to be imported in order to use regular expressions.
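
A minimal, self-contained sketch that puts the two functions above together with the import they need. The sample string and the variable names sample and text are only illustrations, not part of the original post.

```python
import re

def remove_html_tags(data):
    # Drop anything that looks like <...>, matched non-greedily.
    p = re.compile(r'<.*?>')
    return p.sub('', data)

def remove_extra_spaces(data):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    p = re.compile(r'\s+')
    return p.sub(' ', data)

# Hypothetical sample input, used only for illustration.
sample = '<p>Hello,   <b>world</b>!</p>\n<p>Bye.</p>'
text = remove_extra_spaces(remove_html_tags(sample))
print(text)  # Hello, world! Bye.
```

Stripping the tags first and collapsing whitespace second matters here, because removing tags can leave newlines and repeated spaces behind.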

Here you can find updated code that gets the text from HTML: http://love-python.blogspot.com/2011/04/html-to-text-in-python.html

20 comments:

Graham said...

The regex will kill most of the string, if it contains "well, if a < b, then blah <em>blah</em>"

Don't count on people using &lt;, best to check for known tag names, and perhaps limit tag length to 10 characters.

how about

>>> re.sub(r"</?[^\W].{0,10}?>", "", "<a>what if 3 < 5 </a>")

what if 3 < 5

subeen said...

Yes, your regexp is better. Thanks.

Graham said...

Oops... spotted a problem: it won't match if there are attributes.

The regex should probably check for a list of valid HTML tags, either closing the tag immediately after the tag name, or following it with a space and other characters (for attributes) until closing the tag.

Would still be one regex, but guess a good solution isn't one-line simple.
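
A small sketch of the attribute problem described above, assuming the length-limited pattern from the earlier comment (written here as a raw string). The variable names and the example.com link are hypothetical.

```python
import re

# Pattern from the earlier comment: a word character after "<" or "</",
# then at most 10 more characters before the closing ">".
short_tag = re.compile(r'</?[^\W].{0,10}?>')

plain = short_tag.sub('', '<a>what if 3 < 5 </a>')
with_attrs = short_tag.sub('', '<a href="http://example.com">link</a>')

print(repr(plain))       # 'what if 3 < 5 '
print(repr(with_attrs))  # '<a href="http://example.com">link'
```

The opening tag with the long href attribute survives because more than 10 characters sit between the tag name and the ">".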

subeen said...

What do you think about this one:
p = re.compile(r'<[^<]*?>')

Graham said...

that's a good one - telling it to avoid a match if there's another "<" anywhere in the potential tag should let it parse "3 < 5" safely
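
For comparison, a short sketch that runs the pattern from the post and the stricter pattern from this thread on the example string from the first comment. The names naive and stricter are labels chosen here, not names used by the commenters.

```python
import re

naive = re.compile(r'<.*?>')        # pattern from the post
stricter = re.compile(r'<[^<]*?>')  # pattern suggested in this thread

s = 'well, if a < b, then blah <em>blah</em>'
naive_result = naive.sub('', s)
stricter_result = stricter.sub('', s)

print(naive_result)     # well, if a blah
print(stricter_result)  # well, if a < b, then blah blah

# The stricter pattern is still a heuristic: a stray "<" followed later
# by a stray ">" is treated as one tag.
still_wrong = stricter.sub('', 'if a < b and c > d')
print(repr(still_wrong))  # 'if a  d'
```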

Anpanman said...

I'm a n00b to programming in general and was just trying to write text from this website to a .txt file. Seems like whatever I do I keep getting all of the tags. I'm sure it's obvious but I don't get what's wrong. Any ideas?

import re
import os, sys, glob
from os import system
from urllib.request import urlopen


page = urlopen("http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html").read()
myfile = open('testfile.txt', 'w')
fileencoding = "iso-8859-1"
txt = page.decode(fileencoding)

def remove_html_tags(txt):
    p = re.compile(r'<[^<]*?/>')
    return p.sub('', txt)

myfile.write(txt)

Graham said...

Ok anpanman, there's a bug in that one too (didn't we test it?).

Fixed version:

x = re.compile(r'<[^<]*?/?>')

x.sub('', 'a <b style="blah">gsts</b>')

-> 'a gsts'



.sub('', 'a t')

Anpanman said...

Hmm. I don't really get it but I suspect it's more because of my lack of fundamental skill and knowledge than your reply. Thanks for your time Graham.

Graham said...

anpanman, the broken regex re.compile(r'<[^<]*?/>') requires a "/" right before the ">" and so only matches self-closing tags such as <br/>

the fixed one, re.compile(r'<[^<]*?/?>') makes that "/" optional, so it matches open-tags and close-tags as well
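
A rough sketch of the difference between the two patterns, assuming a made-up input string. It also calls the function and keeps its return value, which the script posted above never does (it writes the unmodified txt to the file). The names self_closing_only, any_tag, raw, and cleaned are hypothetical.

```python
import re

self_closing_only = re.compile(r'<[^<]*?/>')  # pattern from the posted script
any_tag = re.compile(r'<[^<]*?/?>')           # corrected pattern

def remove_html_tags(txt, pattern=any_tag):
    # Return a new string; the input string itself is not modified.
    return pattern.sub('', txt)

raw = 'a <b style="blah">gsts</b><br/>'

# Only the self-closing <br/> is removed by the first pattern.
partly_cleaned = remove_html_tags(raw, self_closing_only)
print(partly_cleaned)  # a <b style="blah">gsts</b>

# The corrected pattern removes opening, closing, and self-closing tags.
cleaned = remove_html_tags(raw)
print(cleaned)  # a gsts
```

In the posted script, the equivalent change would be to write the value returned by remove_html_tags(txt) to the file instead of txt.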

Graham said...

Also, the open-source app Kodos makes it much easier to test Python regexes.

http://kodos.sourceforge.net/

Karthik said...

I don't believe it is necessary to take "a < b" into account. HTML strings should always have < replaced with (ampersand)lt; and > replaced with (ampersand)gt; if they aren't used for a tag.

Then again, it's still a great way to prevent an error against the few who don't change them. :)

Note: (ampersand) = &. I had to write it differently so that it wouldn't show up as < or >.

staff said...

thanks!!

David said...

Google Buzz Export to Twitter[...]I've written a python script to grab your Google Buzz feed (as detailed in the Buzz API), and automatically post your Buzz-es to Twitter. It includes a link back to the original Buzz URL[...]

NIket said...

In the following example, tag =
the regular expression fails, because this is not a valid HTML tag.

Igor Partola said...

Please don't! Using regular expressions to parse HTML makes kittens cry: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Graham said...

Igor Partola: the regex's purpose is to strip parts that vaguely resemble a pointy-bracket markup, not for parsing HTML.

How would an HTML parser work with "3 < 5 " ? By throwing an error? Not much use then.

If OP just wanted to display strings I suppose it would also work to encode the pointies as &lt; and &gt; , at the expense of messy output.
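
The encoding approach mentioned above can be sketched with html.escape from the Python 3 standard library, which replaces the angle brackets (and &, plus quotes by default) with character references instead of removing anything. The sample string is made up for illustration.

```python
import html

original = 'if 3 < 5 then <em>blah</em>'

# Nothing is stripped; every "<" and ">" becomes a character reference.
escaped = html.escape(original)
print(escaped)  # if 3 &lt; 5 then &lt;em&gt;blah&lt;/em&gt;

# The encoding is reversible.
restored = html.unescape(escaped)
print(restored == original)  # True
```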

musicaonrails said...

It must be a joke that you are trying to use a Regex to parse HTML. Let's start with some Language Theory from the basic Computer Science classes.

Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regex on HTML is heuristics but that will not work on every condition. It is possible to find a problem in all regex that are trying to parse HTML. Please, go read "Chomsky hierarchy", and then, you are going to know that the context-free language set is bigger than the regular language set, and that is the WHY this discussion here makes no sense based on the principles of Computer Science!

Tamim Shahriar (Subeen) said...

@musicaonrails, yes, you are theoretically correct, but in practice, regex works very well, especially when you are trying to parse some specific websites. :)

Graham Poulter said...

Remember, the regex does not in fact claim to parse HTML. What it does is strip any and all HTML tags with no regard for the grammar of HTML as a whole. It would thus work just as well on invalid tag soup as on actual HTML.

rene jo said...

i need to extract the data between html tags. how do i do that?
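
One possible way to get at the text between tags without a regex is the standard library's html.parser module. This is only a sketch under stated assumptions: the class name TextExtractor and the helper html_to_text are hypothetical, and the code collects every text node, including the contents of script and style elements if the page has any.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)

def html_to_text(source):
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return ''.join(parser.parts)

result = html_to_text('<p>if 3 < 5 then <em>blah</em></p>')
print(result)  # if 3 < 5 then blah
```

The parser is lenient: a "<" that does not start a tag, as in "3 < 5", is passed through as ordinary text rather than raising an error.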