Beautiful Soup

Sowjanya Sadashiva
May 16, 2023

Beautiful Soup is a Python library for web scraping, i.e., for pulling data out of HTML and XML files. It makes reading the content of a web page quick and easy: it parses the HTML into a document tree that we can navigate, search and modify. The prettify() method converts the parsed tree back into markup as a nicely indented string, with each tag and each string on its own line.

Install Beautiful Soup

pip install beautifulsoup4

Reading an HTML file using html.parser

from bs4 import BeautifulSoup

with open(html_file) as file:  # html_file is the path to the HTML file
    soup = BeautifulSoup(file, "html.parser")
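To make the parsing step and prettify() concrete, here is a minimal sketch. It parses a small HTML string instead of a file; the markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here in place of a file on disk.
html = "<html><body><h1>Title</h1><p>Hello</p></body></html>"

# BeautifulSoup accepts a string as well as an open file handle.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with each tag and each string on its own line,
# indented according to its depth in the tree.
pretty = soup.prettify()
print(pretty)
```

The same two calls work on an open file handle, as in the snippet above.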

Attributes and common methods of Beautiful Soup

"find()" and "find_all()" locate the tag named in the call. E.g., find("a") returns the first anchor tag (or None if there is none), while find_all("a") returns a list of all anchor tags in the HTML file.
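The difference between the two calls can be sketched as follows. The markup is a made-up example, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two anchor tags and no table.
html = '<p>Links: <a href="/first">First</a> and <a href="/second">Second</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")      # a single tag, or None when nothing matches
all_links = soup.find_all("a")   # a list-like result holding every match
missing = soup.find("table")     # this markup has no <table>

print(first_link["href"])
print([a["href"] for a in all_links])
print(missing)
```

Because find() can return None, it is worth checking the result before reading attributes from it.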

"get_text()" returns the text inside a tag, including the text of any tags nested within it.

soup_object.outer_tag.inner_tag.string also gives the text of that tag, provided the tag contains a single string; if the tag has several children, .string is None.
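A small sketch of how get_text() and .string differ, using made-up markup in which one tag holds a single string and another holds mixed content:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <span> has one string, <p> has a string plus a nested tag.
html = "<div><p>Hello <b>world</b></p><span>Only text</span></div>"
soup = BeautifulSoup(html, "html.parser")

span_string = soup.div.span.string   # single string child, so .string returns it
p_string = soup.div.p.string         # several children, so .string is None
p_text = soup.div.p.get_text()       # get_text() joins all nested text

print(span_string)
print(p_string)
print(p_text)
```

When a tag may contain nested markup, get_text() is usually the safer choice.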

Modify the content

Beautiful Soup can also be used to modify the content of an HTML file, for example by changing a tag's text or attributes. An example is shown in the GitHub page attached to the file.
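As a rough illustration of what such a modification can look like (this is not the example from the GitHub page, just a minimal sketch on made-up markup):

```python
from bs4 import BeautifulSoup

# Hypothetical markup to modify.
soup = BeautifulSoup('<p class="old">Old text <a href="/x">link</a></p>', "html.parser")

soup.p["class"] = "new"          # change an attribute value
soup.a["href"] = "/y"            # attributes behave like dictionary entries
soup.a.string = "updated link"   # replace the text inside the tag

# Converting the soup back to a string gives the modified markup.
modified = str(soup)
print(modified)
```

The changes live only in the parsed tree; to keep them, the resulting string has to be written back to a file.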

Access parent tag:

The .parent attribute gives us access to the tag that directly contains the current tag.
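A minimal sketch of walking upwards with .parent, on made-up markup. The .parents iterator shown at the end is the related attribute that yields every ancestor in turn.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup: <ul> contains <li>, which contains <a>.
soup = BeautifulSoup('<ul><li><a href="/x">item</a></li></ul>', "html.parser")
link = soup.a

direct_parent = link.parent.name        # the tag that directly contains <a>
grandparent = link.parent.parent.name   # one level further up

# .parents walks all the way up to the document object itself.
ancestors = [p.name for p in link.parents]

print(direct_parent, grandparent)
print(ancestors)
```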

Access child tag:

The .children attribute lets us iterate over a tag's direct children, which also shows whether the tag has any child tags at all.
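A short sketch of iterating over .children, using made-up markup with one tag that has children and one that is empty:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list with two items and an empty paragraph.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul><p></p>", "html.parser")

# .children is an iterator, so wrap it in list() to count or index the children.
items = list(soup.ul.children)
item_texts = [child.get_text() for child in items]

# An empty tag yields no children at all.
empty = list(soup.p.children)

print(item_texts)
print(len(empty))
```

Note that in real pages the whitespace between tags is also parsed as strings, so those strings show up among the children.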

.previous_element:

This attribute gives us whatever was parsed immediately before the given element, which may be a tag or a string.
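The following sketch, on made-up markup, shows that the previous element is not necessarily a tag. The .previous_sibling attribute is included only for contrast: it returns the preceding node at the same level of the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs side by side.
soup = BeautifulSoup("<p>first</p><p>second</p>", "html.parser")
second = soup.find_all("p")[1]

# The last thing parsed before the second <p> is the string inside the first <p>.
prev_element = second.previous_element

# The previous sibling, by contrast, is the first <p> tag itself.
prev_sibling = second.previous_sibling

print(repr(prev_element))
print(prev_sibling)
```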

Examples of the commonly used attributes and functions of Beautiful Soup are shown in the GitHub repo attached above.

Reference:

  1. https://beautiful-soup-4.readthedocs.io/en/latest/#
  2. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

Written by Sowjanya Sadashiva

I am a computer science enthusiast with a Master's degree in Computer Science and a specialization in Data Science.