Simple web scraper / crawler in Python

This is a basic coding example of a simple web scraper / crawler in Python. It demonstrates how to write a program that grabs information from web pages. The collected data can then be used for further calculations.

Reminder: If you need to brush up on your Python syntax, check out the Quick start with Python post.

In this simple example we want to read apartment ads from a portal called Willhaben and list some basic information about each of them.

Essentially, this simple web scraper in Python:

  1. Fetches the HTML of a Willhaben search results page.
  2. Collects all links found on the fetched page.
  3. Opens and crawls the ad pages behind these links (visiting each of them) and takes only the data that we expect (post_code, price and number of rooms).
  4. At the end, the program gives us a list of all apartments, with their post codes, number of rooms and prices.

Web scraper in Python source code

from bs4 import BeautifulSoup
import urllib.request
import re

def get_apartment_details(url):
    fp = urllib.request.urlopen(url)
    # Read the raw bytes of the ad page and decode them to text
    mybytes = fp.read()
    mystr = mybytes.decode("ISO-8859-1")
    fp.close()
    # Parse the HTML, then search its text form for the embedded "key":"value" pairs
    soup = BeautifulSoup(mystr, 'html.parser')
    soupString = str(soup)

    post_code = re.search('post_code":"(.*?)"', soupString)
    price = re.search('price":"(.*?)"', soupString)
    rooms = re.search('rooms":"(.*?)"', soupString)
    # re.search returns None when a field is missing, so skip incomplete ads
    if post_code and price and rooms:
        print("This is apartment in " + post_code.group(1) + ", with " + rooms.group(1) + " rooms, price is " + price.group(1) + ".")

graz_apartment_links = []
fp = urllib.request.urlopen("https://www.willhaben.at/iad/immobilien/eigentumswohnung/eigentumswohnung-angebote?&rows=100&areaId=601&parent_areaid=6")

mybytes = fp.read()
mystr = mybytes.decode("ISO-8859-1")
fp.close()

soup = BeautifulSoup(mystr, 'html.parser')
# Keep only the links that point to apartment ads in Graz
for link in soup.find_all('a'):
    url = link.get('href')

    if url is not None and url.startswith("/iad/immobilien/d/eigentumswohnung/steiermark/graz/"):
        graz_apartment_links.append("https://www.willhaben.at" + url)

# Visit every ad page and print its details
for url in graz_apartment_links:
    get_apartment_details(url)

How does this work?

Firstly, we use urllib.request at line 23 to download the HTML of the Willhaben search results page.
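The fetch step can be wrapped in a small function. The following is only a minimal sketch: fetch_html is a hypothetical helper that does not appear in the script above, and it assumes we prefer the charset announced by the server, falling back to the ISO-8859-1 encoding that the script hard-codes. To keep the example runnable without network access, it is demonstrated on a data: URL instead of a real Willhaben address.

```python
import urllib.request
from urllib.parse import quote


def fetch_html(url, fallback_encoding="ISO-8859-1"):
    """Download a page and decode it to text.

    fetch_html is a hypothetical helper, not part of the original script.
    """
    # The context manager closes the connection even if read() raises.
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or fallback_encoding
    # errors="replace" avoids a crash on bytes that do not fit the charset.
    return raw.decode(charset, errors="replace")


# Offline demonstration: urlopen also understands data: URLs,
# so no network access is needed to try the helper.
sample_url = "data:text/html;charset=utf-8," + quote("<p>Grüße aus Graz</p>")
page = fetch_html(sample_url)
print(page)
```

With a real address, the call would simply be fetch_html("https://...") and the returned string could be passed to BeautifulSoup.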

Secondly, we turn the result into a soup object with BeautifulSoup at line 29. BeautifulSoup lets us search the HTML DOM with simple, ready-made commands. For example, at line 31 we fetch all of the ‘a’ elements (HTML links) and loop through them.
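The link filtering can be tried in isolation on a hand-written snippet. This is a sketch under stated assumptions: SAMPLE_HTML and the ad paths in it are made up for illustration and are not real Willhaben markup, and AD_PREFIX is just a name given here to the path prefix used in the script.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written markup standing in for a search results page.
SAMPLE_HTML = """
<a href="#">Top</a>
<a>Anchor without href</a>
<a href="/iad/immobilien/d/eigentumswohnung/steiermark/graz/sample-ad-1/">Ad 1</a>
<a href="/iad/immobilien/d/eigentumswohnung/wien/sample-ad-2/">Ad 2</a>
"""

# Same path prefix as in the script; the constant name is an assumption.
AD_PREFIX = "/iad/immobilien/d/eigentumswohnung/steiermark/graz/"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True skips <a> tags that have no href attribute at all,
# so the startswith() call never sees None.
ad_links = [
    "https://www.willhaben.at" + a["href"]
    for a in soup.find_all("a", href=True)
    if a["href"].startswith(AD_PREFIX)
]
print(ad_links)
```

Only the first ad link survives the filter: the "#" link, the anchor without an href and the link outside Graz are all dropped.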

Thirdly, we call the local function get_apartment_details at line 39, once for each apartment link collected from the Willhaben page we fetched. Inside get_apartment_details we open each of these URLs at line 6, turn the response into a soup object at line 12, and find the data that we want with regular expressions at lines 15, 16 and 17.
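The regular expressions look for "key":"value" pairs embedded in the page text. A minimal sketch of that extraction follows. SAMPLE_PAGE is a made-up fragment, extract_field is a hypothetical helper, and "balcony" is a field name invented here to show what happens when a key is missing: re.search returns None in that case, so calling .group(1) on it directly would raise an AttributeError.

```python
import re

# Hypothetical fragment imitating the embedded data the script searches for.
SAMPLE_PAGE = '{"post_code":"8010","price":"250000","rooms":"3"}'


def extract_field(name, text):
    """Return the value that follows name":" in text, or None if absent."""
    # (.*?) is non-greedy, so it stops at the first closing quote.
    match = re.search(re.escape(name) + r'":"(.*?)"', text)
    return match.group(1) if match else None


# "balcony" is not present in the sample, so its value becomes None.
details = {
    name: extract_field(name, SAMPLE_PAGE)
    for name in ("post_code", "price", "rooms", "balcony")
}
print(details)
```

Returning None instead of raising lets the caller decide whether to skip an incomplete ad or to report it.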

Finally, we simply print this information to the console (we could also store it in a database or forward it to some other service).
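As one possible way to store the results instead of printing them, here is a minimal sketch using the sqlite3 module from the standard library. The table layout and the sample rows are assumptions made for illustration; they are not data scraped from Willhaben.

```python
import sqlite3

# Hypothetical rows in the shape the scraper prints: (post_code, rooms, price).
apartments = [
    ("8010", 3, 250000.0),
    ("8020", 2, 180000.0),
    ("8010", 4, 320000.0),
]

# ":memory:" keeps the example self-contained; a file path would persist the data.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE apartments (post_code TEXT, rooms INTEGER, price REAL)"
)
# The ? placeholders let sqlite3 handle quoting of the scraped values.
connection.executemany("INSERT INTO apartments VALUES (?, ?, ?)", apartments)
connection.commit()

# One example of a follow-up calculation: average price per post code.
averages = connection.execute(
    "SELECT post_code, AVG(price) FROM apartments "
    "GROUP BY post_code ORDER BY post_code"
).fetchall()
print(averages)
connection.close()
```

Once the rows are in a table, further calculations such as averages per post code become a single query.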

References

  • More information about BeautifulSoup here.
  • Documentation about urllib.request can be found here.