Specifically, with a Python module called Beautiful Soup.
First, what exactly is web scraping?
Web scraping is an automatic method to obtain large amounts of data from websites. Most of this data is unstructured data in an HTML format which is then converted into structured data in a spreadsheet or a database so that one can use it in various applications.
That is where Python comes in: it has several modules built for exactly this purpose, converting unstructured HTML into structured data (plain Python variables) that you can then do with as you please.
Many large websites, like Google, Twitter, Facebook, StackOverflow, etc., have APIs that allow you to access their data in a structured format. This is the best option when it exists, but other sites either don't allow users to access large amounts of data in a structured form or are simply not that technologically advanced. In those situations, it's best to use web scraping to extract the data.
Uses
I’ll just list out some of the uses in a concise manner:
- Price Monitoring
- Market Research
- News Monitoring
- Sentiment Analysis
- Email Marketing
In general, as long as the data you're after is on the open Internet, scraping can reach it.
Today we won't build a full project (maybe in later blogs); instead, I'll walk through the basic functions and syntax so you can build one on your own.
Choices
The following are some popular Python modules for accessing data on websites:
- Requests
"Requests" is very simple: it fetches the raw HTML of a page but does nothing to parse it.
For web scraping on its own, it can only be used in a limited capacity. One place where it is useful, though, is downloading images from a link, but even that can be done with other modules.
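For instance, here is a minimal sketch of grabbing an image with Requests (the URL is a placeholder, not a real endpoint):

# downloading an image with the requests module
import requests

response = requests.get("https://example.com/picture.png")
response.raise_for_status()  # raise an error on a 4xx/5xx response

# .content holds the raw bytes of the body, which suits binary files
with open("picture.png", "wb") as f:
    f.write(response.content)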
- Selenium
Selenium is good, but its main purpose is web automation, not scraping, so many things that should be easy end up convoluted.
Selenium really shines when we need to work with JavaScript (DOM) concepts or script a set of actions online, and it handles AJAX-heavy pages very well.
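To make the comparison concrete, here is a minimal Selenium sketch (assuming the selenium package and a Chrome driver are installed; the URL is a placeholder):

# loading a page in a real browser and grabbing the rendered HTML
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")
html = driver.page_source  # the HTML after JavaScript has run
driver.quit()

Notice that you get the page after its JavaScript has run, which is exactly what the simpler modules can't do.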
- Scrapy
Scrapy is built for scraping huge amounts of data, so it is a more professional tool than what we need for casual projects. It makes its requests asynchronously, and few modules out there can beat Scrapy in raw performance.
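For a taste of what that looks like, here is a minimal Scrapy spider sketch (quotes.toscrape.com is a public practice site, used purely as an illustration):

# a tiny spider; run it with: scrapy runspider quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}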
- Beautiful Soup
It is easy to learn and master. For example, if we want to extract all the links from the webpage, or values in a table, there are convenient functions built into it, which we’ll look into in the sections to come. Its documentation is pretty detailed with a lot of community support due to its popularity.
Before that, let’s see what such a module entails in the next section.
Beautiful Soup
Web scraping requires two parts, namely the crawler and the scraper. The crawler is a program that browses the web, following links from page to page to discover the data required. Instead of a crawler, we will give the links to our program directly.
On the other hand, the scraper is the tool that actually extracts the data from a page. So, Beautiful Soup will be our scraper.
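In practice, the pattern looks like this: we fetch a page ourselves (here with Requests, and example.com as a stand-in URL) and hand the HTML to Beautiful Soup:

# no crawler: we supply the link, Beautiful Soup does the scraping
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")
print(soup.title)  # <title>Example Domain</title>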
Installation
For Windows:
`pip install beautifulsoup4`
For Linux:
`apt install python3-bs4`
Or
`pip3 install beautifulsoup4`
For Mac:
`easy_install beautifulsoup4`
Or
`pip3 install beautifulsoup4`
Importing
# importing the useful class from the module
from bs4 import BeautifulSoup
NOTE: We also need an HTML parser to pull the content out of the raw HTML of a website.
The good thing is that Python has its own inbuilt HTML parser, which is decent.
However, if you want to install a third-party parser you are free to do so; following are some suggestions:
Parser | Advantages | Disadvantages
---|---|---
html.parser | Batteries included, decent speed, lenient | Not as fast as lxml, less lenient than html5lib
lxml | Very fast, lenient | External C dependency
html5lib | Extremely lenient, parses pages the same way a web browser does, creates valid HTML5 | Very slow, external Python dependency
I would personally recommend html5lib for its leniency.
Go with html.parser if you don't want the hassle of installing a third-party parser, and with lxml if you are dealing with a huge amount of data and need the speed.
For lxml parser:
Linux : $ apt-get install python3-lxml
Mac : $ easy_install lxml
Windows: $ pip install lxml
For html5lib parser:
Linux : $ apt-get install python3-html5lib
Mac : $ easy_install html5lib
Windows: $ pip install html5lib
Now, finally, let's start!
Choosing a Parser
# converting the text into a Soup object
soup = BeautifulSoup(story, 'html.parser')
soup = BeautifulSoup(story, 'html5lib')
soup = BeautifulSoup(story, 'lxml')
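If you are wondering whether the choice actually matters: the parsers disagree on malformed HTML. A quick way to see it (the outputs below assume all three parsers are installed):

# the same broken snippet, three different trees
from bs4 import BeautifulSoup

broken = "<a></p>"
print(BeautifulSoup(broken, "html.parser"))
# <a></a>
print(BeautifulSoup(broken, "lxml"))
# <html><body><a></a></body></html>
print(BeautifulSoup(broken, "html5lib"))
# <html><head></head><body><a></a></body></html>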
Sample
This is the sample HTML we will use for all the examples to come:
story.html :
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="heading"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
Code
Opening and Seeing the HTML
# opening the html file and reading the content
with open("story.html", "r") as f:
    story = f.read()
# just using the inbuilt html parser
soup = BeautifulSoup(story, 'html.parser')
print(soup.prettify())
Output:
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" id="heading"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
Tag
Just write the tag name after the soup object, and you get the first tag of that name that it finds:
print(soup.b)
# <b>The Dormouse's story</b>
Attributes
tag = soup.p
# to access all the attributes of a tag
print(tag.attrs)
# {'class': ['title'], 'id': 'heading'}
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary.
tag['id'] = 'header'
tag['what'] = 'ok'
print(tag)
# <p class="title" id="header" what="ok"><b>The Dormouse's story</b></p>
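Deleting an attribute works like deleting a dictionary key. A quick sketch (the extra attribute below is added just so we have something to delete):

tag["extra"] = "temporary"
del tag["extra"]  # gone again
print(tag)
# <p class="title" id="header" what="ok"><b>The Dormouse's story</b></p>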
Text
print(tag.string)
# The Dormouse's story
You can't edit a string in place, but you can replace one string with another, using replace_with():
tag.string.replace_with("A Story")
print(tag.string)
# A Story
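Note that .string only works when a tag has exactly one text child; for a tag with more inside it, it returns None. In that case, get_text() joins together all the text in the tag. A small example, using find() (covered properly in the Searching section) to grab the second paragraph:

p = soup.find("p", class_="story")
print(p.get_text())
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.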
Navigating
These come in handy when there are several tags of the same name and you want a specific one, not just the first one that matches.
Using Names
We can get all the tags with a given name, in list form, using find_all():
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
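This gives us the convenient one-liner promised earlier for extracting all the links from a page: loop over the result and read each tag's href attribute.

# collecting every link address on the page
for a in soup.find_all('a'):
    print(a.get('href'))  # .get() returns None instead of raising a KeyError
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie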
Using Relationships
Using .contents or .children
head_tag = soup.head
print(head_tag.contents)
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
print(title_tag.contents)
# ["The Dormouse's story"]
for child in title_tag.children:
    print(child)
# The Dormouse's story
Using .parent
print(title_tag.parent)
# <head><title>The Dormouse's story</title></head>
The parent of a top-level tag like <html> is the BeautifulSoup object itself, and the parent of that object is None.
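There is also .parents, which walks all the way up the tree; for instance, starting from the first link:

for parent in soup.a.parents:
    print(parent.name)
# p
# body
# html
# [document]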
Using .siblings
print(soup.a.previous_sibling)
# Once upon a time there were three little sisters; and their names were
print(soup.a.next_sibling.next_sibling)
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
The siblings need not be tags of the same type; a sibling can even be plain text. That is in fact why we call .next_sibling twice above: the immediate next sibling of the first link is the text ",", and only the sibling after that is the <a> tag for Lacie.
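To see all of this at once, .next_siblings iterates over everything that follows a tag at the same level, text pieces included:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'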
Searching
We can do this using two functions: find() and select(). Both have their own quirks.
With find(), we first pass in the tag name and then a dictionary of attributes.
Using Find
print(soup.find('a', {"id": "link2"}))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
We can also repurpose find_all() with a parameter called limit to return only so many tags (note that the text parameter used below has been renamed string in newer versions of Beautiful Soup):
print(soup.find_all('a', text="Tillie", limit=1))
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The output in this case is a list, so index it by 0 to access the data.
Using CSS Selectors
print(soup.select("p > #link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
We can even get more than one tag.
print(soup.select("#link1,#link2"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
We can also find tags by attribute value, just like with the find function, but with a different syntax and handy pattern-matching operators: ^= matches a prefix, $= a suffix, and *= a substring.
print(soup.select('a[href^="http://example.com/"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('a[href$="tillie"]'))
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('a[href*=".com/el"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select('p[what="ok"]'))
# [<p class="title" id="header" what="ok"><b>A Story</b></p>]
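If you need full regular expressions rather than these CSS operators, the find() side supports them through the re module:

import re

# matching href against a regular expression
print(soup.find_all('a', href=re.compile(r'\.com/el')))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]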
Conclusion
As you can see above, there are many ways to reach the same destination (the data in a tag), and web scraping is an open field in terms of what is possible; it is also the basis of web automation.
It is an excellent skill and mixes well with other Python modules.
For example, you can scrape some statistics from a site and make graphs to represent the data.
Resources
https://github.com/AshuAhlawat/Python/tree/main/WebScrapping