The internet is a vast repository of information, much of which is presented in the form of websites. For developers, researchers, and data enthusiasts, extracting data from websites—commonly known as web scraping—opens up a treasure trove of possibilities. In this guide, we'll explore how to extract data from websites using Python, focusing on practical steps and real-world examples.

What is Web Scraping?

Web scraping is the process of automatically extracting data from websites. This involves fetching the web page's HTML content and parsing it to retrieve the desired information. Python, with its rich ecosystem of libraries, offers powerful tools for web scraping, making it a popular choice for this task.
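To make that fetch-then-parse cycle concrete before diving into the full example, here is a minimal sketch; https://example.com is just a placeholder page and is not part of the project below:

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML of a page (example.com is only a placeholder URL).
html = requests.get("https://example.com", timeout=10).text

# Parse the HTML into a searchable tree and pull out one piece of data.
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())  # prints the page's <title> text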

Tools of the Trade

To get started with web scraping in Python, you need the following tools:

  • Requests: A library for sending HTTP requests to fetch web pages.
  • BeautifulSoup: A library for parsing HTML and XML documents.
  • Pandas: A library for data manipulation and analysis (optional but useful for organizing the scraped data).

Step-by-Step Guide

Step 1: Installing Required Libraries

$ pip install requests
$ pip install beautifulsoup4
$ pip install pandas   # optional, used in Step 5 to organize the data
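If you want to confirm the libraries installed correctly, a quick import check is enough; the version numbers in the comments are only illustrative:

import requests
import bs4

print(requests.__version__)  # e.g. 2.31.0
print(bs4.__version__)       # e.g. 4.12.3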

Step 2: Fetching the Web Page

import requests
				
url = "https://www.imdb.com/chart/boxoffice/?ref_=hm_cht_sm"
# A browser-like User-Agent makes the request look like a normal browser visit.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

response = requests.get(url, headers=headers)

if response.status_code == 200:
	print("Successfully fetched the web page!")
else:
	print(f"Failed to retrieve the web page. Status code: {response.status_code}")

Step 3: Parsing the HTML Content

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
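Before extracting anything specific, it can help to poke around the parsed tree and confirm you got the page you expect; the checks below are only examples (the chart's movie titles happen to live in <h3> tags, as used in Step 4):

# Confirm we parsed the intended page by printing its <title>.
print(soup.title.get_text(strip=True))

# See how many <h3> headings the parsed page contains.
print(len(soup.find_all('h3')))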

Step 4: Extracting the Desired Data

import re  # needed to strip the leading rank number from each title

# Each entry in the box-office chart lives in this summary container.
get_top_10 = soup.find_all('div', class_='ipc-metadata-list-summary-item__tc')
movie_data = []

for rank, movie in enumerate(get_top_10, start=1):
    movie_dic = {'Rank': rank}

    # The title appears as "1. Movie Name"; strip the leading "1. " prefix.
    movie_name = movie.find('h3', class_='ipc-title__text')
    if movie_name:
        movie_dic['Movie Name'] = re.sub(r'^\d+\.\s', '', movie_name.text.strip())
    else:
        movie_dic['Movie Name'] = 'unknown movie name'

    # The box-office figures sit in a metadata list; each <li> holds a label
    # span followed by a value span.
    box_office_data = movie.find('ul', {'data-testid': 'title-metadata-box-office-data-container'})
    if box_office_data:
        for li in box_office_data.find_all('li'):
            key_span = li.find('span')
            value_span = li.find('span', class_='sc-8f57e62c-2 elpuzG')
            if key_span and value_span:
                key = key_span.get_text(strip=True).replace(':', '')
                movie_dic[key] = value_span.get_text(strip=True)

    # Fall back to placeholders for any figures the entry is missing.
    movie_dic['Weekend Gross'] = movie_dic.get('Weekend Gross', 'unknown weekend gross')
    movie_dic['Total Gross'] = movie_dic.get('Total Gross', 'unknown total gross')
    movie_dic['Weeks Released'] = movie_dic.get('Weeks Released', 'unknown weeks released')

    movie_data.append(movie_dic)
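At this point movie_data is a plain list of dictionaries, so it is worth a quick spot-check before building a DataFrame; the keys shown in the comment are just the ones this script happens to collect:

from pprint import pprint

# Preview the first scraped record to verify the keys and values look right.
if movie_data:
    pprint(movie_data[0])
# e.g. {'Rank': 1, 'Movie Name': '...', 'Weekend Gross': '...', 'Total Gross': '...', 'Weeks Released': '...'}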

Step 5: Organizing the Data

import pandas as pd
df = pd.DataFrame(movie_data)
print(df)
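From here it is easy to persist the table as well; for example, writing it out to a CSV file (the filename is arbitrary):

# Save the scraped table to a CSV file for later analysis.
df.to_csv('imdb_box_office.csv', index=False)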

Conclusion

Web scraping with Python is a powerful technique for automating data extraction from websites. By using libraries like Requests and BeautifulSoup, you can fetch and parse HTML content to retrieve the data you need. While this guide provides a basic introduction, web scraping can be as simple or as complex as your project requires. Always remember to respect a website's terms of service and the relevant legal considerations when scraping data. You can see my code here on GitHub.
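One lightweight way to honour a site's crawling rules is to check its robots.txt before fetching. This sketch uses Python's standard-library parser, with the IMDb chart URL from earlier purely as an example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.imdb.com/robots.txt")
rp.read()

# True if the generic '*' user agent is allowed to fetch this path.
print(rp.can_fetch("*", "https://www.imdb.com/chart/boxoffice/"))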

Happy scraping!