Get the HTML source of a URL

February 12, 2008

I have been using Python to write web crawlers, spiders, and scrapers for a long time, and it has been an interesting experience. The good news is that I have decided to share that experience with you. I shall use the terms crawler, spider, and scraper interchangeably.

The most basic thing you need to write a web spider is the HTML source (i.e. the content) of a URL. There are many ways to get it. Here is a simple piece of code that fetches the HTML source from a URL.


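import urllib2  # part of the Python 2 standard library

url = 'http://abc.com'  # write the url here

usock = urllib2.urlopen(url)  # open the URL; returns a file-like object
data = usock.read()           # read the whole response body as a string
usock.close()                 # close the connection when done

print data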

urllib2 is a very useful module for the spiderman ;) so take a look at the documentation: http://www.python.org/doc/current/lib/module-urllib2.html
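If you are on Python 3, note that urllib2 no longer exists under that name; its functionality lives in urllib.request. Here is a rough sketch of the same idea for Python 3. The helper name fetch_html, the 10-second timeout, and the UTF-8 fallback are my own assumptions for illustration, not part of the original code.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper name)."""
    # The response object works as a context manager, so the connection
    # is closed even if read() raises an exception.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3, not str
        # Use the charset from the Content-Type header when there is one.
        # Falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example usage (this performs a real network request):
# print(fetch_html("http://abc.com"))
```

The main difference from the Python 2 version is that read() returns bytes, so you have to decode them before treating the result as text.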

Comments

Akdeniz said…

thanks man.it really helped.

January 25, 2009 at 8:24 PM

joepadz said…

Thanks! Great

March 30, 2012 at 10:11 AM

Unknown said…

Exactly what I was looking for! Thank you.

January 6, 2013 at 7:58 AM

shiva said…

You mentioned above that "There are many ways to do it", right? Can you please tell us which other modules, packages, or approaches besides urllib2 can be used to read the HTML content of a URL in Python?

January 7, 2013 at 1:36 PM

Tamim Shahriar said…

Yes, there are many ways. For example, you can use the requests module. For details, please check this post: http://love-python.blogspot.com/2012/12/python-requests-simple-crawler-scraper.html

January 7, 2013 at 2:24 PM
