Introduction to BeautifulSoup Module
In this tutorial we will learn how to use Python's BeautifulSoup module to parse the source code of a webpage (which we can fetch using the requests module) and extract useful information from it, such as all the HTML table headings or all the links on the page.
BeautifulSoup can search for and return all occurrences of an HTML tag, provided we describe the tag we are looking for (for example, its name and attributes).
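As a quick preview of what such a search looks like, here is a minimal sketch. It assumes a small hard-coded HTML snippet (made up for illustration, not taken from a real site) so that it runs without a network connection, and it uses the `find_all` method of the BeautifulSoup object.

```python
import bs4

# Hypothetical HTML snippet, used only for illustration
html = """
<html><body>
  <table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Tea</td><td>3</td></tr>
  </table>
  <a href="/home">Home</a>
  <a href="/about">About</a>
</body></html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order
headings = [th.text for th in soup.find_all("th")]
links = [a["href"] for a in soup.find_all("a")]

print(headings)  # ['Name', 'Price']
print(links)     # ['/home', '/about']
```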
Before we jump into searching HTML tags and accessing information from a webpage, let's see how we can format the HTTP response content received to make it more readable.
BeautifulSoup: Prettify Content
The `prettify` method provided by the BeautifulSoup module can be used to format the HTTP response received using the requests module.
Below is a code example, extending the example from the last tutorial:
```python
# import modules
import requests
from fake_useragent import UserAgent

# importing the beautifulsoup module
import bs4

# send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

# creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")

# using 'prettify' method to print the content
print(soup.prettify())
```
In the code above we did the following:
* Imported the modules: **requests**, **fake\_useragent** and **bs4**.
* Fetched the response from a URL (you can use any URL you like).
* Created a **BeautifulSoup** object using the `BeautifulSoup` class.
* Printed the formatted response by calling the `prettify` method on the BeautifulSoup object.
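If you want to try `prettify` without making a network request, the same steps can be sketched offline. The example below assumes a made-up one-line HTML document standing in for `response.content`; the variable names `raw_html` and `pretty` are only illustrative.

```python
import bs4

# Hypothetical one-line HTML document standing in for response.content (bytes)
raw_html = b"<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup = bs4.BeautifulSoup(raw_html, "html.parser")

# prettify returns a string in which tags and text sit on separate, indented lines
pretty = soup.prettify()
print(pretty)
```

Because `prettify` returns a regular string, you can also write it to a file instead of printing it.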
If you are coming here after reading the previous tutorial, you have already seen what the raw response from a GET request made using the `requests` module looks like.
When we format that response using the `prettify` method, tags and text are placed on separate lines and indented according to their nesting, which makes the source much easier to read.
Now that the response is formatted, let's learn how we can use BeautifulSoup to access various HTML tags and related information from the HTTP response (source code).
BeautifulSoup: Accessing HTML Tags
Using the BeautifulSoup module we can easily find and access the content of various HTML tags like head, title, div, p, h1 etc. Let's see a simple example where we will print the title tag of the webpage.
```python
# import modules
import requests
from fake_useragent import UserAgent

# importing the beautifulsoup module
import bs4

# send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

# creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")

# getting 'title' tag from the google BeautifulSoup -> 'soup'
title_tag = soup.title
print(title_tag)
```
**Output:**
\<title>Google\</title>
We can also get only the text enclosed within the opening and closing **title** tag:
```python
# import modules
import requests
from fake_useragent import UserAgent

# importing the beautifulsoup module
import bs4

# send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

# creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")

# getting the text inside the 'title' tag from the google BeautifulSoup -> 'soup'
title_text = soup.title.text
print(title_text)
```
**Output:**
Google
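To make the difference between a tag and its text more concrete, here is a small offline sketch. It assumes a hard-coded HTML string (hypothetical, chosen only for illustration) instead of a live response.

```python
import bs4

# Hypothetical HTML string, used only for illustration
html = "<html><head><title>Demo page</title></head><body></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

title_tag = soup.title

print(type(title_tag).__name__)  # Tag
print(title_tag.name)            # title
print(str(title_tag))            # <title>Demo page</title>
print(title_tag.text)            # Demo page
```

In other words, `soup.title` gives us a tag object that still carries its markup, while `.text` gives us only the string enclosed by the tag.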
The same pattern works for all HTML tags. For example, to get the **head** tag, we can use `soup.head` like this:
```python
# import modules
import requests
from fake_useragent import UserAgent

# importing the beautifulsoup module
import bs4

# send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

# creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")

# getting 'head' tag from the google BeautifulSoup -> 'soup'
print(soup.head)
```
This will return the complete **head** tag from the page's source code.
\<head> \<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/> \<meta content="/images/branding/googleg/1x/googleg\_standard\_color\_128dp.png" itemprop="image"/> \<title>Google\</title> \<script nonce="GWwjLi7M0YGkyNTLDmVPsQ=="> ... \<style> ... \</style> ... \</head>
We have not included the complete output as it is huge. But as you can see, the **title** tag is inside the **head** tag, and there is a **style** tag in there too.
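When the **head** tag is too large to read comfortably, one option is to list only the names of its direct children. The sketch below assumes a small made-up HTML string and uses `find_all(True, recursive=False)`, where `True` matches any tag and `recursive=False` limits the search to direct children.

```python
import bs4

# Hypothetical HTML string, used only for illustration
html = (
    "<html><head>"
    '<meta charset="utf-8"/>'
    "<title>Demo</title>"
    "<style>body { margin: 0; }</style>"
    "</head><body></body></html>"
)
soup = bs4.BeautifulSoup(html, "html.parser")

# Collect the tag names of the direct children of <head>
child_names = [child.name for child in soup.head.find_all(True, recursive=False)]

print(child_names)  # ['meta', 'title', 'style']
```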
We can also get the **title** tag content via the **head** tag:
```python
# getting the 'title' tag text via the 'head' tag of the BeautifulSoup object 'soup'
print(soup.head.title.text)
```
**Output:**
Google
This is just to show that, because BeautifulSoup uses **tree traversal** to navigate the parsed HTML code, we can also access tags by following their hierarchy.
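The hierarchy can be followed in both directions. As a minimal sketch, assuming a made-up HTML string, the example below shows that `soup.head.title` and `soup.title` refer to the same tag, and that the `parent` attribute walks back up the tree.

```python
import bs4

# Hypothetical HTML string, used only for illustration
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# Both expressions reach the same tag object in the parsed tree
same_tag = soup.head.title is soup.title
print(same_tag)  # True

# 'parent' moves one level up the hierarchy
print(soup.title.parent.name)         # head
print(soup.title.parent.parent.name)  # html
```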
Similarly let's access the **style** tag:
```python
# getting the 'style' tag text via the 'head' tag of the BeautifulSoup object 'soup'
print(soup.head.style.text)
```
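One thing to watch out for: not every page has a **style** tag inside its **head**. When a tag is missing, attribute-style access such as `soup.head.style` gives back `None`, and calling `.text` on it raises an `AttributeError`. A minimal sketch of a guard, assuming a made-up HTML string with no style tag:

```python
import bs4

# Hypothetical HTML string with no <style> tag, used only for illustration
html = "<html><head><title>Demo</title></head><body></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

style_tag = soup.head.style
print(style_tag)  # None

# Fall back to an empty string when the tag is not present
css = style_tag.text if style_tag is not None else ""
print(repr(css))  # ''
```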
Up until now we have covered basic HTML parsing and accessing the tags. In the next tutorial we will see some more methods of the BeautifulSoup module and some more ways of navigating through the HTML source code of any webpage to collect useful data.









