Regular Expression not working in scraper?

This is a very common problem for beginners writing a web crawler, spider, or scraper: the content is fetched, but the regex does not match as expected. :(

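Usually the problem is not the regular expression itself. By default, "." does not match a newline character, so a pattern fails whenever the text it targets spans more than one line of the fetched HTML. You just need to add the following two lines after you fetch the content of a web page: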

content = content.replace("\n", "")
content = content.replace("\r", "")



Now the regex should work, provided everything else is OK!
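
Here is a minimal sketch of the failure and the fix. The page content and the pattern are made up for illustration; in a real scraper, content would be whatever you fetched. It also shows re.DOTALL, a standard flag of Python's re module that lets "." match newlines, as an alternative to stripping the line breaks.

```python
import re

# Hypothetical page content: the <title> element is split across lines.
content = "<title>\r\nExample\r\nPage\r\n</title>"

# Illustrative pattern: capture everything between the title tags.
pattern = r"<title>(.*?)</title>"

# Without cleanup, "." stops at "\n", so the pattern finds nothing.
before = re.search(pattern, content)

# The fix from this post: remove the line breaks, then search again.
flattened = content.replace("\n", "").replace("\r", "")
after = re.search(pattern, flattened)

# Alternative: leave the content alone and let "." match newlines too.
dotall = re.search(pattern, content, re.DOTALL)

print(before)                   # no match object
print(after.group(1))           # captured text with line breaks removed
print(repr(dotall.group(1)))    # captured text with line breaks kept
```

Note that removing line breaks also joins the words that were on separate lines, so if the whitespace matters to you, the re.DOTALL version may be the better choice.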

Comments

Shiplu said…
Well, I always use multiline regular expressions, and they work.
Besides this, it's good to use domxml to parse the content. It won't fail.
Shiplu said…
Another good technique would be to use tidy and XSLT for proper scraping...
Tamim Shahriar said…
I use urllib2 for fetching content from a website.
