Python Web Scraping - Beautiful Soup (Part 1)

Beautiful Soup is a Python library that helps in parsing HTML documents.

If we use this library, we do not need to write regular expressions to process/parse HTML.

To install Beautiful Soup

pip install beautifulsoup4

Note: Notice the 4 at the end of beautifulsoup4 — the package name includes it.

 

To parse remote HTML content

from bs4 import BeautifulSoup
import requests

html_doc = requests.get("https://www.firstbridgedata.com/").text

soup = BeautifulSoup(html_doc, 'html.parser')

As you can see in the code above, we first downloaded a remote HTML page using the requests library and then converted it into a soup object using the BeautifulSoup library.
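
In practice it is worth checking that the download actually succeeded before parsing. A minimal sketch of the same step with an HTTP status check (using the same URL as above):

from bs4 import BeautifulSoup
import requests

# Download the page and raise an error if the HTTP status is not successful
response = requests.get("https://www.firstbridgedata.com/")
response.raise_for_status()

# Parse the downloaded HTML with Python's built-in html.parser
soup = BeautifulSoup(response.text, 'html.parser')

# Quick sanity check: print the <title> tag of the page
print(soup.title)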

The two most common methods in the BeautifulSoup library are find and find_all.

The main difference between these two methods is:

The find method returns only the first element that matches the criteria.

If no match is found, the find method returns None.

 

The find_all method returns all elements that match the criteria, and the result is a list.

If no element matches the criteria, you will get an empty list.
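
A quick sketch of how the two return values differ, using a small inline HTML string instead of a downloaded page (the tags here are just for illustration):

from bs4 import BeautifulSoup

html_doc = "<html><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find('p'))          # <p>first</p>  -> only the first match
print(soup.find('table'))      # None          -> no matching element
print(soup.find_all('p'))      # [<p>first</p>, <p>second</p>]
print(soup.find_all('table'))  # []            -> empty list when nothing matches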

Now let's see this in action

Let's find the first <h2> tag (find method)

from bs4 import BeautifulSoup
import requests

html_doc = requests.get("https://www.firstbridgedata.com/").text

soup = BeautifulSoup(html_doc, 'html.parser')

# We will find the first h2 tag
print(soup.find('h2'))

Output:

<h2 class="col-md-10 col-xs-12">Recent ETF Launches Globally</h2>

 

 

Now let's find all h2 tags (find_all method)

from bs4 import BeautifulSoup
import requests

html_doc = requests.get("https://www.firstbridgedata.com/").text

soup = BeautifulSoup(html_doc, 'html.parser')

# Now let's find all h2 tags
print(soup.find_all('h2'))

Output:

[<h2 class="col-md-10 col-xs-12">Recent ETF Launches Globally</h2>, <h2 class="col-md-10 col-xs-12">'Smart Beta' ETF Returns and Flows</h2>]

You can see in the above example that we found two <h2> tags in this document.

 

Extracting all links from a given page

This is a very important task during web crawling, since you can reach other pages of the same website (or other websites) only by following the links found on the page you are currently crawling.

from bs4 import BeautifulSoup
import requests

html_doc = requests.get("https://www.firstbridgedata.com/").text

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract all anchor tags that have an href attribute
links = soup.find_all("a", href=True)

for i in links:
    print(i['href'])

Output:

http://etf.firstbridgedata.com/etf-daily-holdings-and-product-list/
http://etf.firstbridgedata.com/etf-classifications-and-reference-data/
http://etf.firstbridgedata.com/flows/
https://firstbridge.atlassian.net/wiki/spaces/FBDUG/pages/36652520/FirstBridge+APIs+v1
#
http://tradeablebeta.com/
http://etf.firstbridgedata.com/rebalcal/
#
http://tradeablebeta.com/table/
/etflists
/compare
/deepdive/index/SPY
/page/about-us
/etflists
/compare
/deepdive/index/SPY
/docs/First_Bridge_Overview.pdf
https://etf.firstbridgedata.com/etf-lists-and-summary-reports/
http://etf.firstbridgedata.com/etf-daily-holdings-and-product-list/
http://etf.firstbridgedata.com/etf-classifications-and-reference-data/
http://etf.firstbridgedata.com/flows/
http://tradeablebeta.com/
http://etf.firstbridgedata.com/rebalcal/
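
Notice that some of the extracted links are relative paths (such as /etflists) or just "#". Before crawling further you usually want to turn them into absolute URLs; a minimal sketch using urljoin from the standard library, building on the links list from the code above:

from urllib.parse import urljoin

base_url = "https://www.firstbridgedata.com/"

for i in links:
    href = i['href']
    if href == "#":
        # Skip placeholder links that do not point anywhere
        continue
    # Relative paths become absolute URLs; absolute URLs are left unchanged
    print(urljoin(base_url, href))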

 

Extracting all images from a given page

from bs4 import BeautifulSoup
import requests

html_doc = requests.get("https://www.firstpost.com/world/new-zealand-mosque-terror-attack-extremist-gunman-flashes-grin-in-courtroom-after-being-charged-with-murder-remanded-till-5-april-6271591.html").text

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract all img tags that have a src attribute
images = soup.find_all("img", src=True)

for i in images:
    print(i["src"])

Output:

https://images.firstpost.com/wp-content/uploads/whatsapp_firstpost.png
https://images.firstpost.com/fpimages/940x355/fixed/jpg/2019/03/BRKING940_201903161259_940x355.jpeg
https://images.firstpost.com/fpimages/80x60/fixed/jpg/scalex16x9/2019/02/rahul-gandhi-INCIndia-150x150.jpg
https://images.firstpost.com/fpimages/80x60/fixed/jpg/scalex16x9/2019/03/Chandrasshekhar-Azad_PTI_380-150x150.jpeg
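
Once you have the src values, you can download the image files themselves with requests; a minimal sketch that saves the first image from the images list above to disk (the local filename is just an example):

import requests

first_src = images[0]["src"]

response = requests.get(first_src)
response.raise_for_status()

# Write the raw bytes of the image to a local file
with open("downloaded_image.png", "wb") as f:
    f.write(response.content)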

 

Finding tags using attribute values

Finding an HTML element using its class name

from bs4 import BeautifulSoup
import requests

html_doc = requests.get("https://listexam.com").text

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract all div tags with class 'panel-heading'
divs = soup.find_all("div", class_="panel-heading")

for i in divs:
    print(i)
    print(i.text)

Output:

<div class="panel-heading">Programming</div>
Programming
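
Beautiful Soup also supports CSS selectors through the select method, which is often a shorter way to write the same class-based search; the sketch below should match the same divs as the find_all call above:

# CSS selector: 'div.panel-heading' means a div with class 'panel-heading'
for div in soup.select("div.panel-heading"):
    print(div.text)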

 

Finding an element using its id

from bs4 import BeautifulSoup
import requests

html_doc = requests.get("https://listexam.com").text

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the div tag with id 'myNavbar'
divs = soup.find_all("div", id="myNavbar")

for i in divs:
    print(i)

Output:

<div class="collapse navbar-collapse" id="myNavbar">
<ul class="nav navbar-nav">
<li><a href="/articlelist">Articles</a></li>
<li><a href="/quizlist">Quiz</a></li>
</ul>
<form action="/article/id" class="navbar-form navbar-right" method="post"><input name="csrfmiddlewaretoken" type="hidden" value="iq3H1T1NMWJAeEbit00hFOMxfRwdadaaHCpOOPGAg6w39mCi5gCQdagadagbbaLaR0GqhbSr6dsuf5X"/>
<div class="form-group">
<input class="form-control" min="1" name="article" placeholder="Search Lesson ID" required="" type="integer"/>
</div>
<button class="btn btn-default" type="submit">Submit</button>
</form>
</div>
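
find_all also accepts an attrs dictionary, which is handy when you want to match on several attributes at once. A sketch that combines the id and one of the class values seen in the output above:

# Match a div by both its id and one of its classes using the attrs dictionary
divs = soup.find_all("div", attrs={"id": "myNavbar", "class": "navbar-collapse"})

for div in divs:
    # Print just the link targets inside the matched div
    for a in div.find_all("a", href=True):
        print(a["href"])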

 

Finding all Images with a given file extension

from bs4 import BeautifulSoup
import requests
import re

html_doc = requests.get("https://www.firstpost.com/world/new-zealand-mosque-terror-attack-extremist-gunman-flashes-grin-in-courtroom-after-being-charged-with-murder-remanded-till-5-april-6271591.html").text

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract all images with a given file extension
images = soup.find_all("img", src=re.compile(r'\.png$'))

for i in images:
    print(i["src"])

Output:

https://static.firstpost.com/assets/images/fp-logo_new.png
https://images.firstpost.com/wp-content/uploads/f-logo-v1.png
https://images.firstpost.com/wp-content/uploads/fp-print.png
https://xmlns.cricketnext.com/cktnxt/scorecard/crk_player_images/flags/90x50/13.png
https://xmlns.cricketnext.com/cktnxt/scorecard/crk_player_images/flags/90x50/21.png
https://xmlns.cricketnext.com/cktnxt/scorecard/crk_player_images/flags/90x50/8.png
https://xmlns.cricketnext.com/cktnxt/scorecard/crk_player_images/flags/90x50/1133.png
https://images.firstpost.com/wp-content/uploads/whatsapp_firstpost.png
https://static.firstpost.com/assets/images/fp-logo-footer.png
https://images.firstpost.com/wp-content/uploads/eighteen-nw.png

Finding all paragraphs that contain a given text

from bs4 import BeautifulSoup
import requests
import re

html_doc = requests.get("https://www.firstpost.com/world/new-zealand-mosque-terror-attack-extremist-gunman-flashes-grin-in-courtroom-after-being-charged-with-murder-remanded-till-5-april-6271591.html").text

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract all paragraphs that contain the given text
para = soup.find_all("p", text=re.compile('terror'))

for i in para:
    print(i.text)
    print("----------------")

Output:

New Zealand mosque terror attack shooter Brenton Tarrant. AP
----------------
For many, the road to recovery will require multiple surgical procedures and many survivors said the mental scars may never fully heal. The attack on the Al Noor and Linwood mosques has been labelled terrorism by Prime Minister Jacinda Ardern and is thought to be the deadliest attack directed against Muslims in the West in modern times.
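
Note that newer versions of Beautiful Soup (4.4 and later) use the keyword string instead of text for this kind of search; the older text keyword is still accepted for backwards compatibility. The equivalent call looks like this:

# Same search using the newer 'string' keyword
para = soup.find_all("p", string=re.compile('terror'))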

 

To search for tags using regular expressions

In the example below, we are going to find all tags whose names start with h.

from bs4 import BeautifulSoup
import requests
import re

html_doc = requests.get("https://listexam.com").text

soup = BeautifulSoup(html_doc, 'html.parser')

# To get all tags whose names start with h
tags = soup.find_all(re.compile('^h'))

for tag in tags:
    print(tag.name)

Output:

html
head
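
Besides regular expressions, find_all also accepts a plain Python function as a filter; the function receives each tag and should return True for the ones you want to keep. A sketch that keeps only the heading tags h1 to h6 (same soup object as above):

def is_heading(tag):
    # True only for <h1> ... <h6> tags
    return tag.name in ("h1", "h2", "h3", "h4", "h5", "h6")

for tag in soup.find_all(is_heading):
    print(tag.name, tag.text)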

 

If you want to get all HTML tags whose text contains a certain word:

from bs4 import BeautifulSoup
import requests
import re

html_doc = requests.get("https://www.firstpost.com/").text

soup = BeautifulSoup(html_doc, 'html.parser')

# To get all tags that contain a given text
tags = soup.find_all(True, text=re.compile('poll'))

for tag in tags:
    print(tag.name)

In the above example, we are trying to find all tags whose text contains the word poll.

Output:

li
a
li
a
li
a
div
div
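
Because find_all returns the matching tags themselves, you can also print their text to see the actual context around the word, not just the tag names, building on the tags list from the code above:

for tag in tags:
    # get_text(strip=True) returns the tag's text with surrounding whitespace removed
    print(tag.name, ":", tag.get_text(strip=True))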