Web Scraping with Python
Opening HTML File
Now that you're acquainted with the fundamental aspects of HTML, let's explore a first way of working with it in Python.
One of the modules you can use to handle HTML files in Python is urllib.request. Import its urlopen function to access web pages, then pass the URL of the page you wish to open as an argument to this function.
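A minimal sketch of that call follows. So that it runs without internet access, it serves a small made-up page from a throwaway local server; PAGE, DemoHandler, and the local address are assumptions for illustration only. With a real site you would simply pass its address to urlopen.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical page content, used only so the example runs offline.
PAGE = b"<html><body><h1>Hello</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the sample page above."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the console free of per-request log lines


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# The part that matters: pass the URL to urlopen.
response = urlopen(url)
print(type(response))   # <class 'http.client.HTTPResponse'>
print(response.status)  # 200

response.close()
server.shutdown()
server.server_close()
```

Note that printing the response shows an object, not the page's HTML.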
For an HTTP or HTTPS URL, urlopen returns an http.client.HTTPResponse object, which is not yet the HTML we are after. To obtain the HTML structure, call .read() on that object to get the raw bytes, then call .decode("utf-8") on the result.
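The two steps can be sketched as a small helper. The name fetch_html and the sample markup are hypothetical. To keep the example runnable offline, it uses a data: URL, which embeds the page in the URL itself, as a stand-in for a real web address. For a data: URL the response object is not an HTTPResponse, but .read() behaves the same way.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, encoding="utf-8"):
    """Open `url` and return its content as a str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()    # raw bytes of the page
    return raw.decode(encoding)  # bytes -> str


# Stand-in for a real address: the page is percent-encoded into the URL.
sample = "<html><body><h1>Hello</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample)

html = fetch_html(url)
print(type(html))  # <class 'str'>
print(html)
```

Using a with block closes the response once the bytes have been read.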
Note
The .decode("utf-8") call converts the raw binary data into a human-readable string, assuming that the webpage's content is encoded using UTF-8. This conversion lets us work with the text of the webpage in a meaningful way, such as parsing or analyzing its content.
As a result of applying the .read() and .decode() methods, you obtain a string. This string contains the HTML structure as readable text, with its line breaks intact, so you can apply ordinary string methods to it.
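As a rough illustration of applying string methods, the sketch below works on a short literal that stands in for a decoded page. The markup is made up for the example.

```python
# Stand-in for the string returned by .read().decode("utf-8").
html = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>One</p><p>Two</p></body></html>"
)

# Count how many paragraphs the page contains.
print(html.count("<p>"))  # 2

# Slice out the text between the <title> tags.
start = html.find("<title>") + len("<title>")
end = html.find("</title>")
print(html[start:end])    # Sample Page

# Check whether a word appears anywhere in the page.
print("Sample" in html)   # True
```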
If the .decode() method weren't applied, you would receive a bytes object instead. When printed, it shows the entire HTML page on a single line, with line breaks and non-ASCII characters displayed as escape sequences such as \n. Feel free to experiment with it!
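To see the difference without fetching anything, the sketch below builds a bytes object by hand as a stand-in for what .read() returns. The markup is an assumption chosen to include a line break and a non-ASCII character.

```python
# Stand-in for response.read(): UTF-8 bytes of a tiny page fragment.
raw = "<p>Café\n</p>".encode("utf-8")

print(type(raw))   # <class 'bytes'>
print(raw)         # b'<p>Caf\xc3\xa9\n</p>'

# Decoding turns the bytes back into readable text.
text = raw.decode("utf-8")
print(type(text))  # <class 'str'>
print(text)        # prints the fragment with a real line break and "é"
```

In the bytes output, the line break appears as \n and the letter é appears as the two bytes \xc3\xa9. After decoding, both are displayed as ordinary text.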