Simple web scraper / crawler in Python

This post walks through a basic coding example: a simple web scraper / crawler in Python. It demonstrates how to write a program that grabs information from web pages. The collected data can then be used for further calculations.

Reminder: If you need to brush up on your Python syntax, check out the Quick start with Python post.

In this simple example we want to read apartment ads from a portal called Willhaben and list some basic information about each of them.

Introduction

Essentially, this simple web scraper in Python:

  1. Fetches the HTML of a Willhaben search results page
  2. Collects all links from the fetched page
  3. Opens each linked ad page and extracts only the data we expect (post_code, price and number of rooms)
  4. Prints a list of all apartments with their postal codes, number of rooms and prices
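
Steps 2 and 3 depend on separating ad links from everything else on the page. The sketch below isolates that filtering step so it can be tried on plain strings. The function name select_detail_links and the sample hrefs are made up for illustration, and skipping duplicates is an extra assumption that the original listing does not make.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.willhaben.at"
DETAIL_PREFIX = "/iad/immobilien/d/eigentumswohnung/steiermark/graz/"


def select_detail_links(hrefs, base_url=BASE_URL, prefix=DETAIL_PREFIX):
    """Keep only hrefs that point at apartment ad pages, as absolute URLs."""
    links = []
    for href in hrefs:
        # Anchors without an href yield None; "#" and other paths are ignored.
        if not href or not href.startswith(prefix):
            continue
        absolute = urljoin(base_url, href)
        # Assumption: the same ad may be linked twice, so keep it only once.
        if absolute not in links:
            links.append(absolute)
    return links


# Made-up hrefs, for illustration only.
sample_hrefs = [
    None,
    "#",
    "/iad/immobilien/d/eigentumswohnung/steiermark/graz/sample-ad-1/",
    "/iad/immobilien/d/eigentumswohnung/steiermark/graz/sample-ad-1/",
    "/iad/kaufen-und-verkaufen/",
]
detail_links = select_detail_links(sample_hrefs)
print(detail_links)
```

Using urljoin instead of string concatenation is a small design choice: it produces the same result for these paths and also copes with hrefs that are already absolute.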

Web scraper in Python source code

from bs4 import BeautifulSoup
import urllib.request
import re

def get_apartment_details(url):
    fp = urllib.request.urlopen(url)  # open the ad page

    mybytes = fp.read()
    mystr = mybytes.decode("ISO-8859-1")  # turn the raw bytes into text
    fp.close()

    soup = BeautifulSoup(mystr, 'html.parser')
    soupString = str(soup)

    post_code = re.search('post_code":"(.*?)"', soupString)
    price = re.search('price":"(.*?)"', soupString)
    rooms = re.search('rooms":"(.*?)"', soupString)
    if post_code and price and rooms:  # re.search returns None when a field is missing
        print("This is an apartment in " + post_code.group(1) + ", with " + rooms.group(1) + " rooms, price is " + price.group(1) + ".")
   
graz_apartment_links = []

fp = urllib.request.urlopen("https://www.willhaben.at/iad/immobilien/eigentumswohnung/eigentumswohnung-angebote?&rows=100&areaId=601&parent_areaid=6")

mybytes = fp.read()
mystr = mybytes.decode("ISO-8859-1")
fp.close()

soup = BeautifulSoup(mystr, 'html.parser')

for link in soup.find_all('a'):
    url = link.get('href')
    
    if url is not None and url.startswith("/iad/immobilien/d/eigentumswohnung/steiermark/graz/"):
        graz_apartment_links.append("https://www.willhaben.at" + url)
        

for url in graz_apartment_links:
    get_apartment_details(url)

How does this work?

First, at line 23 we use urllib.request to download the Willhaben search results page. (Line numbers count from the first line of the listing above.)
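
The listing hard-codes ISO-8859-1 and closes the connection by hand. As a hedged alternative, the sketch below wraps the download in a helper that closes the connection automatically and prefers the charset declared by the server. The name fetch_html and the fallback parameter are assumptions for illustration, and a data: URL stands in for a real page so that the example runs offline.

```python
import urllib.request


def fetch_html(url, fallback_encoding="ISO-8859-1"):
    """Download a page and decode it with the charset the server declares."""
    # The context manager closes the connection even if read() raises.
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # get_content_charset() returns None when no charset is declared.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")


# A data: URL with UTF-8 encoded umlauts, used instead of a live web page.
sample_text = fetch_html("data:text/plain;charset=utf-8,Gr%C3%BC%C3%9Fe%20aus%20Graz")
print(sample_text)
```

Decoding the same bytes as ISO-8859-1 would garble the umlauts, which is why reading the declared charset can matter for German-language pages.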

Second, at line 29 we turn the downloaded HTML into a soup object with BeautifulSoup. BeautifulSoup lets us search the HTML DOM with simple, ready-made commands. For example, at line 31 we fetch all of the 'a' elements (HTML links) and loop through them, keeping only the links whose href starts with the Graz apartment path (line 34).
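
To see what collecting every 'a' element and reading its href yields, without installing anything, the sketch below reproduces the idea with the standard library's html.parser. It is a stand-in for illustration and not how BeautifulSoup works internally. The class name LinkCollector and the sample HTML are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be absent.
            self.hrefs.append(dict(attrs).get("href"))


# Made-up page fragment, for illustration only.
sample_html = """
<ul>
  <li><a href="#">Top</a></li>
  <li><a href="/iad/immobilien/d/eigentumswohnung/steiermark/graz/sample-ad/">Ad</a></li>
  <li><a>No target</a></li>
</ul>
"""
collector = LinkCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

The third anchor has no href, so its entry is None. That is the same case the listing guards against before calling startswith.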

Third, at line 39 we call the local function get_apartment_details once for every ad link collected from the Willhaben page. Inside get_apartment_details we open the URL at line 6, turn the page into a soup object at line 12, and search for the data we want at lines 15, 16 and 17.
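
The three searches use the same pattern with a different key each time. A minimal sketch of that extraction on a plain string is shown below. The helper name extract_fields and the sample snippets are assumptions for illustration, and real ad pages may embed these values differently.

```python
import re

FIELDS = ("post_code", "price", "rooms")


def extract_fields(page_source):
    """Return a dict of the fields found; missing ones map to None."""
    details = {}
    for field in FIELDS:
        # Same pattern as in the listing: key, then the quoted value after it.
        match = re.search(re.escape(field) + r'":"(.*?)"', page_source)
        details[field] = match.group(1) if match else None
    return details


# Made-up snippets imitating key/value pairs embedded in a page.
sample_source = '{"post_code":"8010","price":"249000","rooms":"3"}'
complete = extract_fields(sample_source)
partial = extract_fields('{"price":"99000"}')
print(complete)
print(partial)
```

Because the pattern only anchors on the end of the key, a key such as old_price would also match the price search. Checking for None before calling group(1) avoids an AttributeError on pages where a field is absent.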

Finally, we simply print this information to the console (we could also store it in a database or forward it to another service).
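
As one possible way to store the results instead of printing them, the sketch below writes rows into an SQLite database using Python's built-in sqlite3 module. The table layout, the function name save_apartments and the sample rows are assumptions for illustration, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3


def save_apartments(connection, apartments):
    """Insert (post_code, rooms, price) tuples into an 'apartments' table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS apartments (post_code TEXT, rooms TEXT, price TEXT)"
    )
    # Placeholders let sqlite3 handle quoting of the scraped strings.
    connection.executemany("INSERT INTO apartments VALUES (?, ?, ?)", apartments)
    connection.commit()


# In-memory database and made-up rows, for illustration only.
connection = sqlite3.connect(":memory:")
save_apartments(connection, [("8010", "3", "249000"), ("8020", "2", "179000")])
stored_rows = connection.execute(
    "SELECT post_code, rooms, price FROM apartments ORDER BY post_code"
).fetchall()
print(stored_rows)
```

The values are kept as text because the scraper extracts them as strings. Converting price and rooms to numbers would be a separate step.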

References

  • More information about BeautifulSoup can be found in the official BeautifulSoup documentation.
  • Documentation for urllib.request is part of the official Python standard library documentation.
milan.latinovic

Senior PHP Engineer and Enterprise Architect at apilayer GmbH. Topics of interest: Software development, PHP, Java, Python, REST API, OpenApi, MySQL, Microservices, Integrations, Interfaces, Interoperability, Processes, Solution Architecture, LDAP, Azure
