HTML to TEXT in Python
I just wrote a small Python program. One part of the script needed to take the body of a web page and strip out all the HTML tags, JavaScript, CSS styles, HTML comments and so on. So I searched Google, read several threads on Stack Overflow, and then found this: http://www.aaronsw.com/2002/html2text/

It looked cool, but when I tested it against the 'about me' page of my blog, it didn't work because of some broken tags! So I started writing the HTML-to-text function myself, to get the plain text only. With the help of regular expressions I solved my problem (but maybe I created more problems!).

Here is my Python code:

import re

def html_to_text(data):
    # replace newlines and carriage returns with spaces
    data = data.replace("\n", " ")
    data = data.replace("\r", " ")

    # collapse consecutive whitespace into a single space
    data = " ".join(data.split())

    # get only the body content
    bodyPat = re.compile(r'< body[