How to scrape this table using Python and Beautiful Soup?

I am trying to scrape the table at https://m.the-numbers.com/market/2018/top-grossing-movies into a CSV file. I am using Python and Beautiful Soup, but I am very new to this and would appreciate any tips or solutions. What are some simple ways to tackle this?

Thank you

This is my latest attempt:

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://m.the-numbers.com/market/2018/top-grossing-movies').text

soup = BeautifulSoup(source, 'lxml')

csv_file = open('cms_scrape.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['filmTitle', 'releaseDate', 'distributor', 'genre', 'gross', 'ticketsSold'])

for tbody in soup.find_all('a', class_='table-responsive'):

    filmTitle = tbody.tr.td.b.a.text
    print(filmTitle)

    csv_writer.writerow([filmTitle])

csv_file.close()

Assuming you already have the value of source, you could do this:

import pandas as pd
df = pd.read_html(source)[0]
df.to_csv('cms_scrape.csv', index=False)
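
Newer pandas releases (2.1 and later) warn when read_html is handed a literal HTML string, so it may be safer to wrap the text in StringIO. Below is a minimal sketch of that variant. The sample_source string is a made-up stand-in for the downloaded page so the snippet runs offline, and read_html still needs a parser such as lxml installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page text ("source" in the answer above).
sample_source = """
<table>
  <tr><th>Rank</th><th>Movie</th><th>Gross</th></tr>
  <tr><td>1</td><td>Film A</td><td>$300</td></tr>
  <tr><td>2</td><td>Film B</td><td>$200</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(sample_source))
df = tables[0]

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

With the real page, passing StringIO(source) in place of StringIO(sample_source) should behave the same way, provided the table you want is the first one on the page.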

Something like the code below would do the job.

import requests
from bs4 import BeautifulSoup
import csv

# Making get request
r = requests.get('https://m.the-numbers.com/market/2018/top-grossing-movies')

# Creating BeautifulSoup object
soup = BeautifulSoup(r.text, 'lxml')

# Locating the table in the BS object
table_soup = soup.find('div', id='page_filling_chart').find('div', class_='table-responsive').find('table')

# Iterating through all trs in the table except the first (header) row and the last two (summary) rows
movies = []
for tr in table_soup.find_all('tr')[1:-2]:
    tds = tr.find_all('td')

    # Creating dict for each row and appending it to the movies list
    movies.append({
        'rank': tds[0].text.strip(),
        'movie': tds[1].text.strip(),
        'release_date': tds[2].text.strip(),
        'distributor': tds[3].text.strip(),
        'genre': tds[4].text.strip(),
        'gross': tds[5].text.strip(),
        'tickets_sold': tds[6].text.strip(),
    })

# Writing movies list of dicts to file using csv.DictWriter
with open('movies.csv', 'w', encoding='utf-8', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=movies[0].keys())
    writer.writeheader()
    writer.writerows(movies)
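
The fixed slice [1:-2] depends on the page keeping exactly one header row and two summary rows. One possible alternative is to filter rows by their content instead. Below is a rough sketch of that idea. It runs against a made-up sample_html string that only imitates the div and table nesting used above; the three-column layout and the "rank must be a number" rule are assumptions for illustration, not facts about the real page:

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup that mimics the nesting used in the answer above.
sample_html = """
<div id="page_filling_chart">
  <div class="table-responsive">
    <table>
      <tr><th>Rank</th><th>Movie</th><th>Gross</th></tr>
      <tr><td>1</td><td> Film A </td><td>$300</td></tr>
      <tr><td>2</td><td>Film B</td><td>$200</td></tr>
      <tr><td></td><td>Total</td><td>$500</td></tr>
    </table>
  </div>
</div>
"""

# 'html.parser' ships with Python, so lxml is not needed for this sketch.
soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('div', id='page_filling_chart').find('table')

fieldnames = ['rank', 'movie', 'gross']
rows = []
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    # The header row has only <th> cells, so it yields no <td> and is skipped.
    if len(tds) != len(fieldnames):
        continue
    values = [td.get_text(strip=True) for td in tds]
    # Assumption: real movie rows start with a numeric rank; summary rows do not.
    if not values[0].isdigit():
        continue
    rows.append(dict(zip(fieldnames, values)))

# Writing to an in-memory buffer here; swap in open(..., newline='') for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Checking the cell count before indexing also avoids an IndexError if the site inserts a spacer or advertisement row with fewer cells.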
