Specifically, with a Python module called Beautiful Soup.
First, what exactly is web scraping?
Web scraping is an automatic method to obtain large amounts of data from websites. Most of this data is unstructured data in an HTML format which is then converted into structured data in a spreadsheet or a database so that one can use it in various applications.
That is where Python comes in: it has several modules built for exactly this purpose, converting unstructured HTML into structured data (plain Python variables) that you can then do with as you please.
Many large websites, like Google, Twitter, Facebook, StackOverflow, etc., have APIs that allow you to access their data in a structured format. This is the best option when it exists, but other sites either don't allow users to access large amounts of data in a structured form or are simply not that technologically advanced. In those situations, it's best to use web scraping to extract the data.
Uses
I’ll just list out some of the uses in a concise manner:
- Price Monitoring
- Market Research
- News Monitoring
- Sentiment Analysis
- Email Marketing
In general, as long as the data you're after is on the open Internet, scraping can reach it.
Today we won't build a full project (maybe in later blogs); instead, I'll walk through the basic functions and syntax so you can build one on your own.
Choices
The following are some popular Python modules for accessing data on websites:
- Requests
"Requests" is very simple: it fetches the raw HTML of a page but does nothing to parse it.
For web scraping on its own, it can only be used in a limited capacity. One place where it is useful, though, is downloading images from a link, but even that can be done with other modules.
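For instance, here is a minimal sketch of grabbing an image with Requests (the URL is a placeholder, not a real endpoint):

# downloading an image with the requests module
import requests

response = requests.get("https://example.com/picture.png")
response.raise_for_status()  # raise an error on a 4xx/5xx response

# .content holds the raw bytes of the body, which suits binary files
with open("picture.png", "wb") as f:
    f.write(response.content)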
- Selenium
Selenium is good, but its main purpose is web automation, not scraping, so many things that should be easy end up convoluted.
Selenium really shines when we need to work with JavaScript (DOM) concepts or script a set of actions online, and it handles AJAX-heavy pages very well.
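To make the comparison concrete, here is a minimal Selenium sketch (assuming the selenium package and a Chrome driver are installed; the URL is a placeholder):

# loading a page in a real browser and grabbing the rendered HTML
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")
html = driver.page_source  # the HTML after JavaScript has run
driver.quit()

Notice that you get the page after its JavaScript has run, which is exactly what the simpler modules can't do.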
- Scrapy
Scrapy is built for scraping huge amounts of data, so it is a more professional tool than what we need for casual projects. It makes its requests asynchronously, and few modules out there can beat Scrapy in raw performance.
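For a taste of what that looks like, here is a minimal Scrapy spider sketch (quotes.toscrape.com is a public practice site, used purely as an illustration):

# a tiny spider; run it with: scrapy runspider quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}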
- Beautiful Soup
It is easy to learn and master. For example, if we want to extract all the links from the webpage, or values in a table, there are convenient functions built into it, which we’ll look into in the sections to come. Its documentation is pretty detailed with a lot of community support due to its popularity.
Before that, let’s see what such a module entails in the next section.
Beautiful Soup
Web scraping requires two parts, namely the crawler and the scraper. The crawler is a program that browses the web, following links from page to page to discover the data required. Instead of a crawler, we will give the links to our program directly.
On the other hand, the scraper is the tool that actually extracts the data from a page. So, Beautiful Soup will be our scraper.
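In practice, the pattern looks like this: we fetch a page ourselves (here with Requests, and example.com as a stand-in URL) and hand the HTML to Beautiful Soup:

# no crawler: we supply the link, Beautiful Soup does the scraping
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")
print(soup.title)  # <title>Example Domain</title>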
Installation
For Windows:
`pip install beautifulsoup4`
For Linux:
`apt install python3-bs4`
Or
`pip3 install beautifulsoup4`
For Mac:
`easy_install beautifulsoup4`
Or
`pip3 install beautifulsoup4`
Importing
# importing the useful class from the module
from bs4 import BeautifulSoup
NOTE: We also need an HTML parser to pull the content out of the raw HTML of a website.
The good thing is that Python has its own inbuilt HTML parser, which is decent.
However, if you want to install a third-party parser you are free to do so; following are some suggestions:
Parser | Advantages | Disadvantages
---|---|---
html.parser | Batteries included, decent speed, lenient | Not as fast as lxml, less lenient than html5lib
lxml | Very fast, lenient | External C dependency
html5lib | Extremely lenient, parses pages the same way a web browser does, creates valid HTML5 | Very slow, external Python dependency
I would personally recommend html5lib for its leniency.
Go with html.parser if you don't want the hassle of installing a third-party parser, and with lxml if you are dealing with a huge amount of data and need the speed.
For lxml parser:
Linux : $ apt-get install python3-lxml
Mac : $ easy_install lxml
Windows: $ pip install lxml
For html5lib parser:
Linux : $ apt-get install python3-html5lib
Mac : $ easy_install html5lib
Windows: $ pip install html5lib
Now, finally, let's start!
Choosing a Parser
# converting the text into a Soup object
soup = BeautifulSoup(story, 'html.parser')
soup = BeautifulSoup(story, 'html5lib')
soup = BeautifulSoup(story, 'lxml')
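If you are wondering whether the choice actually matters: the parsers disagree on malformed HTML. A quick way to see it (the outputs below assume all three parsers are installed):

# the same broken snippet, three different trees
from bs4 import BeautifulSoup

broken = "<a></p>"
print(BeautifulSoup(broken, "html.parser"))
# <a></a>
print(BeautifulSoup(broken, "lxml"))
# <html><body><a></a></body></html>
print(BeautifulSoup(broken, "html5lib"))
# <html><head></head><body><a></a></body></html>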
Sample
This is the sample HTML we will use for all the examples to come:
story.html :
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="heading"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
Code
Opening and Seeing the HTML
# opening the html file and reading the content
with open("story.html", "r") as f:
    story = f.read()
# just using the inbuilt html parser
soup = BeautifulSoup(story, 'html.parser')
print(soup.prettify())
Output:
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" id="heading"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
Tag
Just write the tag name after the soup object, and you get the first tag of that name that it finds:
print(soup.b)
# <b>The Dormouse's story</b>
Attributes
tag = soup.p
# to access all the attributes of a tag
print(tag.attrs)
# {'class': ['title'], 'id': 'heading'}
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary.
tag['id'] = 'header'
tag['what'] = 'ok'
print(tag)
# <p class="title" id="header" what="ok"><b>The Dormouse's story</b></p>
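Deleting an attribute works like deleting a dictionary key. A quick sketch (the extra attribute below is added just so we have something to delete):

tag["extra"] = "temporary"
del tag["extra"]  # gone again
print(tag)
# <p class="title" id="header" what="ok"><b>The Dormouse's story</b></p>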
Text
print(tag.string)
# The Dormouse's story
You can't edit a string in place, but you can replace one string with another, using replace_with():
tag.string.replace_with("A Story")
print(tag.string)
# A Story
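Note that .string only works when a tag has exactly one text child; for a tag with more inside it, it returns None. In that case, get_text() joins together all the text in the tag. A small example, using find() (covered properly in the Searching section) to grab the second paragraph:

p = soup.find("p", class_="story")
print(p.get_text())
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.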
Navigating
These come in handy when there are several tags of the same name and you want a specific one, not just the first one that matches.
Using Names
We can get all the tags with a given name, in list form, using find_all():
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
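This gives us the convenient one-liner promised earlier for extracting all the links from a page: loop over the result and read each tag's href attribute.

# collecting every link address on the page
for a in soup.find_all('a'):
    print(a.get('href'))  # .get() returns None instead of raising a KeyError
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie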
Using Relationships
Using .contents or .children
head_tag = soup.head
print(head_tag.contents)
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
print(title_tag.contents)
# ["The Dormouse's story"]
for child in title_tag.children:
    print(child)
# The Dormouse's story
Using .parent
print(title_tag.parent)
# <head><title>The Dormouse's story</title></head>
The parent of a top-level tag like <html> is the BeautifulSoup object itself, and the parent of that object is None.
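There is also .parents, which walks all the way up the tree; for instance, starting from the first link:

for parent in soup.a.parents:
    print(parent.name)
# p
# body
# html
# [document]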
Using .siblings
print(soup.a.previous_sibling)
# Once upon a time there were three little sisters; and their names were
print(soup.a.next_sibling.next_sibling)
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
The siblings need not be tags of the same type; a sibling can even be plain text. That is in fact why we call .next_sibling twice above: the immediate next sibling of the first link is the text ",", and only the sibling after that is the <a> tag for Lacie.
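To see all of this at once, .next_siblings iterates over everything that follows a tag at the same level, text pieces included:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'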
Searching
We can do this using two functions: find() and select(). Both have their own quirks.
With find(), we first pass in the tag name and then a dictionary of attributes.
Using Find
print(soup.find('a', {"id": "link2"}))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
We can also repurpose find_all() with a parameter called limit to return only so many tags (note that the text parameter used below has been renamed string in newer versions of Beautiful Soup):
print(soup.find_all('a', text="Tillie", limit=1))
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The output in this case is a list, so index it by 0 to access the data.
Using CSS Selectors
print(soup.select("p > #link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
We can even get more than one tag.
print(soup.select("#link1,#link2"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
We can also find tags by attribute value, just like with the find function, but with a different syntax and handy pattern-matching operators: ^= matches a prefix, $= a suffix, and *= a substring.
print(soup.select('a[href^="http://example.com/"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('a[href$="tillie"]'))
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('a[href*=".com/el"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select('p[what="ok"]'))
# [<p class="title" id="header" what="ok"><b>A Story</b></p>]
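If you need full regular expressions rather than these CSS operators, the find() side supports them through the re module:

import re

# matching href against a regular expression
print(soup.find_all('a', href=re.compile(r'\.com/el')))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]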
Conclusion
As you can see above, there are many ways to reach the same destination (the data in a tag), and web scraping is an open field in terms of what is possible; it is also the basis of web automation.
It is an excellent skill and mixes well with other Python modules.
For example, you can scrape some statistics from a site and make graphs to represent the data.
Resources
https://github.com/AshuAhlawat/Python/tree/main/WebScrapping