A simple web scraper (crawler) in Python is a good basic coding example. This example demonstrates how to write a program that grabs information from web pages. The collected data can then be used for further calculations.
Reminder: if you need to brush up on your Python syntax, check out the Quick start with Python post.
In this simple example, we want to read apartment ads from a portal called Willhaben and list some basic information about each of them.
Essentially, this is the simple web scraper in Python:
from bs4 import BeautifulSoup
import urllib.request
import re


def get_apartment_details(url):
    # Download the ad page and decode the raw bytes into a string.
    fp = urllib.request.urlopen(url)
    mybytes = fp.read()
    mystr = mybytes.decode("ISO-8859-1")
    fp.close()

    soup = BeautifulSoup(mystr, 'html.parser')
    soupString = str(soup)

    # Look for "key":"value" pairs in the page source.
    post_code = re.search('post_code":"(.*?)"', soupString)
    price = re.search('price":"(.*?)"', soupString)
    rooms = re.search('rooms":"(.*?)"', soupString)

    # re.search returns None when nothing matches; skip such ads instead of crashing.
    if not (post_code and price and rooms):
        return

    print(f"This is apartment in {post_code.group(1)}, with {rooms.group(1)} rooms, "
          f"price is {price.group(1)}.")


graz_apartment_links = []

# Download the search results page.
fp = urllib.request.urlopen("https://www.willhaben.at/iad/immobilien/eigentumswohnung/eigentumswohnung-angebote?&rows=100&areaId=601&parent_areaid=6")
mybytes = fp.read()
mystr = mybytes.decode("ISO-8859-1")
fp.close()

soup = BeautifulSoup(mystr, 'html.parser')

# Keep only links that point to apartment ads in Graz.
for link in soup.find_all('a'):
    url = link.get('href')
    if url is not None and url.startswith("/iad/immobilien/d/eigentumswohnung/steiermark/graz/"):
        graz_apartment_links.append("https://www.willhaben.at" + url)

# Print the details of every ad that was found.
for url in graz_apartment_links:
    get_apartment_details(url)
How does this work?
Firstly, we use urllib.request to fetch the Willhaben search results page, read the response bytes and decode them into a string.
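The fetch step is repeated for the search page and for every ad, so it could be wrapped in one helper. The following is only a sketch under a few assumptions: the function name fetch_html, the timeout value and the User-Agent string are made up for illustration, and the ISO-8859-1 fallback simply mirrors the encoding used in the script.

```python
import urllib.request


def fetch_html(url, timeout=10.0, fallback_encoding="ISO-8859-1"):
    """Download url and return the response body as text."""
    # Hypothetical User-Agent value, chosen only for this example.
    request = urllib.request.Request(
        url, headers={"User-Agent": "simple-scraper-example/0.1"}
    )
    # The with block closes the response even if read() raises.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header, if there is one.
        encoding = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(encoding, errors="replace")
```

With such a helper, both places in the script that open a URL could be reduced to a single call like fetch_html(url).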
Secondly, we turn this string into a soup object with BeautifulSoup. BeautifulSoup lets us search the HTML DOM with simple, ready-made commands. For example, soup.find_all('a') fetches all of the 'a' elements (HTML links), and we loop through them, keeping only the links that point to apartment ads in Graz.
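The check applied to each link can be isolated into a small function that works on plain href values, which makes it easy to try out without downloading anything. This is a minimal sketch: the name select_ad_links and the two constants are introduced here for illustration, and dropping duplicate links is an addition that the script above does not do.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.willhaben.at"
AD_PREFIX = "/iad/immobilien/d/eigentumswohnung/steiermark/graz/"


def select_ad_links(hrefs, base_url=BASE_URL, prefix=AD_PREFIX):
    """Return absolute ad URLs, in their original order and without duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        # link.get('href') gives None for <a> tags that have no href attribute.
        if not href or not href.startswith(prefix):
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Made-up href values in the shape the search page is expected to contain.
sample_hrefs = [None, "#", AD_PREFIX + "example-ad-1", AD_PREFIX + "example-ad-1", "/iad/other"]
sample_links = select_ad_links(sample_hrefs)
```

In the scraper it would be fed with the values from the soup, for example select_ad_links(link.get('href') for link in soup.find_all('a')).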
Thirdly, we call the local function get_apartment_details once for each link collected from the Willhaben page. Inside get_apartment_details, we open the URL, turn the page into a soup object, and pick out the data we want (post code, price and number of rooms) with three regular expressions.
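The three regular expressions share the same shape, so the lookup can be tried on its own with a small helper. The sketch below assumes the page source contains "key":"value" pairs as the patterns in the script suggest; the helper name extract_field and the sample string are made up for illustration.

```python
import re


def extract_field(page_source, key):
    """Return the value of the first key":"value" pair, or None if it is absent."""
    # Same pattern shape as in the script: key, then ":" and a lazily matched value.
    match = re.search(re.escape(key) + r'":"(.*?)"', page_source)
    return match.group(1) if match else None


# Made-up fragment in the shape the regular expressions expect.
sample = '{"post_code":"8010","price":"250000","rooms":"3"}'
post_code = extract_field(sample, "post_code")
missing = extract_field(sample, "area")
```

Returning None for a missing field lets the caller decide whether to skip the ad or print a placeholder, instead of failing on match.group(1).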
Finally, we simply print this information to the console (we could also store it in a database or forward it to some other service).
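As one possible way to store the results instead of printing them, here is a rough sketch using SQLite from the Python standard library. The table name, its columns and the function save_apartments are assumptions made for this example, as is the sample row.

```python
import sqlite3


def save_apartments(connection, apartments):
    """Store (url, post_code, rooms, price) tuples, replacing rows with the same url."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS apartments ("
        "url TEXT PRIMARY KEY, post_code TEXT, rooms TEXT, price TEXT)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO apartments VALUES (?, ?, ?, ?)", apartments
    )
    connection.commit()


# In-memory database for demonstration; pass a file name to keep the data.
connection = sqlite3.connect(":memory:")
save_apartments(connection, [("https://example.invalid/ad/1", "8010", "3", "250000")])
rows = connection.execute("SELECT post_code, rooms, price FROM apartments").fetchall()
```

Using the ad URL as the primary key means that running the scraper again updates existing ads rather than inserting them twice.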