Exploring BeautifulSoup Methods
In this tutorial, we will learn different ways to access HTML tags using the BeautifulSoup module. For a basic introduction to the BeautifulSoup module, start with the previous tutorial.
### **BeautifulSoup: Accessing HTML Tags**
The methods covered in this section traverse the HTML tags of a document by treating the HTML code as a tree.
Create a file named `sample_webpage.html` and copy the following HTML code into it:
```html
<!DOCTYPE html>
<html>
<head>
  <title> Sample HTML Page</title>
  <style>
    * {
      margin: 0;
      padding: 0;
    }
    div {
      width: 95%;
      height: 75px;
      margin: 10px 2.5%;
      border: 1px dotted grey;
      text-align: center;
    }
    p {
      font-family: sans-serif;
      font-size: 18px;
      color: #000;
      line-height: 75px;
    }
    a {
      position: relative;
      top: 25px;
    }
  </style>
</head>
<body>
  <div id="first-div">
    <p class="first">First Paragraph</p>
  </div>
  <div id="second-div">
    <p class="second">Second Paragraph</p>
  </div>
  <div id="third-div">
    <a href="https://www.studytonight.com">Studytonight</a>
    <p class="third">Third Paragraph</p>
  </div>
  <div id="fourth-div">
    <p class="fourth">Fourth Paragraph</p>
  </div>
  <div id="fifth-div">
    <p class="fifth">Fifth Paragraph</p>
  </div>
</body>
</html>
```
Now, to read the content of the above HTML file, use the following Python code to store the content in a variable:
```python
# reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()
```
Now we will use different methods of the BeautifulSoup module and see how they work.
As a warm-up, let's start with the `prettify` method, which returns the parsed document as an indented string.
```python
import bs4

# reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()

# creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")
# prettify is a method, so it must be called with parentheses
print(soup.prettify())
```
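One detail that is easy to miss: `prettify` only produces the indented markup when it is called with parentheses. The following is a minimal sketch of the difference. It assumes a short inline HTML string (not the sample file) so that it runs on its own, and the variable names are chosen for illustration only.

```python
import bs4

# A small inline document, used here instead of sample_webpage.html
# so that the snippet runs on its own (an assumption for illustration).
html = '<div id="third-div"><a href="https://www.studytonight.com">Studytonight</a></div>'

soup = bs4.BeautifulSoup(html, "html.parser")

# Without parentheses we only get a reference to the method itself.
method_reference = soup.prettify

# With parentheses the method runs and returns the indented markup as a string.
pretty_html = soup.prettify()

print(callable(method_reference))
print(pretty_html)
```

Printing `method_reference` directly would show a description of the method object rather than the HTML, which is usually not what is wanted.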
### **BeautifulSoup: Accessing HTML Tag Attributes**
We can retrieve the attributes of any HTML tag using the following syntax:
```python
TagName["AttributeName"]
```
Let's extract the `href` attribute from the anchor tag in our HTML code.
```python
import bs4

# reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()

# creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

# getting the first anchor tag in the document
link = soup.a

# printing the 'href' attribute of the anchor tag
print(link["href"])
```
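Square-bracket access raises a `KeyError` when the attribute does not exist on the tag. As a hedged sketch of two alternatives, the snippet below uses `get()` and `attrs`, both part of the BeautifulSoup `Tag` object. The inline HTML string and the variable names are assumptions made so the example is self-contained.

```python
import bs4

# Inline stand-in for the third div of the sample page (illustrative only).
html = '<div id="third-div"><a href="https://www.studytonight.com">Studytonight</a></div>'
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.a

# Square brackets raise KeyError when the attribute is missing.
href = link["href"]

# get() returns None instead of raising when the attribute is missing.
missing_title = link.get("title")

# attrs holds every attribute of the tag as a dictionary.
all_attributes = link.attrs

print(href)
print(missing_title)
print(all_attributes)
```

`get()` is the safer choice when scraping pages where an attribute may or may not be present.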
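### **BeautifulSoup:** `contents` **attribute**
`contents` lists all the direct children of a tag. It is an attribute rather than a method, so it is accessed without parentheses, and it returns a list that includes text nodes (such as the newlines between tags) as well as tags. Let's list all the children of the **body** tag using `contents`.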
```python
# 'soup' is the BeautifulSoup object created earlier
body = soup.body

# getting all the children of 'body' using 'contents'
content_list = body.contents

# printing all the children using a for loop, skipping newline strings
for tag in content_list:
    if tag != "\n":
        print(tag)
        print("\n")
```
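The `tag != "\n"` check above is needed because `contents` also holds the text between tags. A small sketch of this, assuming a shortened inline version of the sample page, counts every child and then keeps only the real tags with an `isinstance` check against `bs4.Tag`:

```python
import bs4

# Shortened inline version of the sample page body (illustrative only).
html = """<body>
<div id="first-div"><p class="first">First Paragraph</p></div>
<div id="second-div"><p class="second">Second Paragraph</p></div>
</body>"""

soup = bs4.BeautifulSoup(html, "html.parser")
body = soup.body

# contents is a plain list holding both tags and the text between them.
all_children = body.contents

# Keep only real tags, dropping the newline strings.
tag_children = [child for child in all_children if isinstance(child, bs4.Tag)]

print(len(all_children))
print([tag["id"] for tag in tag_children])
```

Filtering by type rather than by comparing with `"\n"` also drops whitespace strings that are not exactly one newline.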
### **BeautifulSoup:** `children` **attribute**
`children` is similar to `contents`, but it returns an **iterator** over the direct children, while `contents` returns a **list** of them. Let's see an example:
```python
# 'soup' is the BeautifulSoup object created earlier
body = soup.body

# 'children' is an iterator; list(body.children) converts it into a list
for tag in body.children:
    if tag != "\n":
        print(tag)
        print("\n")
```
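To make the iterator-versus-list difference concrete, here is a minimal sketch. It assumes an inline copy of the third div from the sample page, and the variable names are illustrative.

```python
import bs4

# Inline copy of the third div from the sample page (illustrative only).
html = (
    '<div id="third-div">'
    '<a href="https://www.studytonight.com">Studytonight</a>'
    '<p class="third">Third Paragraph</p>'
    '</div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")
div = soup.div

children_iterator = div.children

# An iterator has no length and cannot be indexed, so convert it first.
children_list = list(children_iterator)

# The resulting list holds the same nodes as contents.
same_as_contents = children_list == div.contents

# The iterator is now exhausted; consuming it again yields nothing.
leftover = list(children_iterator)

print(len(children_list), same_as_contents, leftover)
```

In practice, `children` suits a single pass in a `for` loop, while `contents` is more convenient when the children need to be indexed or counted.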
### **BeautifulSoup:** `descendants` **attribute**
`descendants` retrieves every tag and string nested inside a parent tag, at any depth. This is where it differs from `contents` and `children`, which stop at the direct children. In simple words, if we use it on the **body** tag, it yields the first **div** tag, then the child of that **div** tag, then that child's own children, until it reaches the end of the branch; then it moves on to the next **div** tag, and so on.
`descendants` returns a **generator**. Let's see an example:
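```python
# 'soup' is the BeautifulSoup object created earlier
body = soup.body

# walking through every nested tag and string, skipping newline strings
for tag in body.descendants:
    if tag != "\n":
        print(tag)
        print("\n")
```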
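The traversal order described above can be checked on a small document. The sketch below assumes a two-div inline version of the sample page without whitespace between tags, and compares the direct children with the full list of descendants.

```python
import bs4

# Two-div inline version of the sample page, written without whitespace
# between tags so that only tags and paragraph text appear (illustrative only).
html = (
    '<body>'
    '<div id="first-div"><p class="first">First Paragraph</p></div>'
    '<div id="second-div"><p class="second">Second Paragraph</p></div>'
    '</body>'
)
soup = bs4.BeautifulSoup(html, "html.parser")
body = soup.body

direct_children = list(body.children)
all_descendants = list(body.descendants)

# Describe each node by its tag name, or by its text for string nodes.
visit_order = [
    node.name if isinstance(node, bs4.Tag) else str(node)
    for node in all_descendants
]

print(len(direct_children), len(all_descendants))
print(visit_order)
```

Each div is followed immediately by its paragraph and that paragraph's text before the next div appears, which is the depth-first order described above.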
Now you are familiar with the basic ways to move through the tags of a parsed HTML document. In the following tutorial, we will learn how to find a specific tag among a group of similar tags.