Get the HTML source of a URL

I have been using Python to write web crawlers, spiders, and scrapers for a long time, and it has been an interesting experience. The good news is that I have decided to share that experience with you. I shall use the terms crawler, spider, and scraper interchangeably.

The most basic step in writing a web spider is to get the HTML source (i.e. the content) of a URL. There are many ways to do it. Here is a simple piece of code, written for Python 2, that fetches the HTML source of a URL with the urllib2 module.


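import urllib2

url = 'http://abc.com'  # write the url here

# Open the URL and read the whole response body into a string
usock = urllib2.urlopen(url)
try:
    data = usock.read()
finally:
    usock.close()  # close the connection even if read() fails

print data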


urllib2 is a very useful module for any spiderman ;) so take a look at its documentation: http://www.python.org/doc/current/lib/module-urllib2.html
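If you are on Python 3, urllib2 no longer exists; its functionality moved into urllib.request. Here is a minimal sketch of the same idea under that assumption. The function name fetch_html, the timeout value, and the UTF-8 fallback are my own choices for illustration, not part of the code above.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper, Python 3)."""
    # The with-block closes the connection for us, like usock.close() above.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3, not str
        # Use the charset the server declared; fall back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of raising on a badly encoded page.
    return raw.decode(charset, errors="replace")
```

You would call it as print(fetch_html('http://abc.com')). Note that read() returns bytes in Python 3, which is why the sketch decodes the data before returning it.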

Comments

Akdeniz said…
Thanks, man. It really helped.
Unknown said…
Exactly what I was looking for! Thank you.
shiva said…
You mentioned above that "there are many ways" to get the HTML source of a URL. Can you please tell us which other modules or packages, besides urllib2, can be used to read the HTML content of a URL in Python?
Tamim Shahriar said…
Yes, there are many ways. For example, you can use the requests module. For details, please check this post: http://love-python.blogspot.com/2012/12/python-requests-simple-crawler-scraper.html
