Regular Expression not working in scraper?

July 05, 2008

This is a very common problem for beginners writing a web crawler, spider, or scraper: the content is fetched, but the regex does not match. :(

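But the problem is usually not with the regular expression itself. By default, "." does not match a newline, so a pattern can fail wherever the fetched HTML contains line breaks. You just need to add the following two lines after you fetch the content of a web page: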


content = content.replace("\n", "")
content = content.replace("\r", "")

Now the regex should work, provided everything else is OK!
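Here is a minimal sketch of why stripping line breaks helps. The sample markup and the pattern are made up for illustration and are not from any real page. It also shows re.DOTALL, a standard flag in Python's re module that lets "." match newlines. That is an alternative to the two lines above, not part of the original fix.

```python
import re

# Hypothetical fetched content; the markup is illustrative only.
content = "<title>\r\nlife is short\r\n</title>"

# Illustrative pattern that relies on "." spanning the whole element.
pattern = r"<title>(.*)</title>"

# By default "." does not match "\n", so this search finds nothing.
before = re.search(pattern, content)

# The approach from the post: remove line breaks, then match.
flat = content.replace("\n", "").replace("\r", "")
after = re.search(pattern, flat)

# Alternative: keep the content as-is and let "." match newlines.
dotall = re.search(pattern, content, re.DOTALL)

print(before)                    # no match on the raw content
print(after.group(1))            # captured text after stripping line breaks
print(dotall.group(1).strip())   # captured text using re.DOTALL
```

Stripping line breaks changes the content itself, so any text that was separated only by a newline ends up joined together. The flag leaves the content untouched, which may be preferable if you need the original layout later.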

Comments

Shiplu said…

Well, I always use multiline regular expressions. They work.
Besides this, it's good to use domxml to parse the content. It won't fail.

July 13, 2008 at 12:10 AM

Shiplu said…

Another good technique would be using tidy and XSLT for proper scraping...

July 13, 2008 at 12:12 AM

Tamim Shahriar said…

I use urllib2 for fetching content from a website.

July 13, 2008 at 10:09 AM
