
Using the `.find_next_siblings` function in Beautiful Soup

I am attempting to write the output of web scraping to a CSV file. Here is my code:

import bs4
import requests
import csv

#get webpage for Apple inc. September income statement
page = requests.get("https://au.finance.yahoo.com/q/is?s=AAPL")

#put into beautiful soup
soup = bs4.BeautifulSoup(page.content, "html.parser")

#select table that holds data of interest
table = soup.find("table", class_="yfnc_tabledata1")

#select the header row of the table
headers = table.find('tr', class_="yfnc_modtitle1")

#step through the sibling rows that follow the header row, one at a time
total_revenue = headers.next_sibling
cost_of_revenue = total_revenue.next_sibling
gross_profit = cost_of_revenue.next_sibling.next_sibling

#or grab all of the following rows at once as a list
wang = headers.find_next_siblings("tr")

#writes the header row and each data row to the CSV file
with open('/home/kwal0203/Desktop/Apple.csv', 'a') as csvfile:
    writer = csv.writer(csvfile, delimiter="|")
    writer.writerow([value.get_text(strip=True).encode("utf-8") for value in headers])
    writer.writerow([value.get_text(strip=True).encode("utf-8") for value in total_revenue])
    writer.writerow([value.get_text(strip=True).encode("utf-8") for value in cost_of_revenue])
    writer.writerow([value.get_text(strip=True).encode("utf-8") for value in gross_profit])
    for dude in wang:
        writer.writerow([dude.get_text(strip=True).encode("utf-8")])

The problem is that I am repeating a lot of code when creating and writing each row to the CSV file. As you can see, I keep calling next_sibling to get to the next row of values. I found the .find_next_siblings() function in Beautiful Soup, and it almost does what I want, but each row that the function reads ends up in a single cell of the CSV file.
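For illustration, here is a minimal sketch with made-up markup (not the actual Yahoo page): calling get_text() on a whole tr joins every cell into one string, which is why each row lands in a single CSV cell, whereas iterating the row's td children keeps the cells separate:

from bs4 import BeautifulSoup

#minimal sketch: get_text() on the whole <tr> joins the cells into one string,
#while iterating its <td> children keeps them as separate values
row = BeautifulSoup(
    "<tr><td>Total Revenue</td><td>42,123,000</td><td>37,432,000</td></tr>",
    "html.parser").tr

print(row.get_text(strip=True))                       #Total Revenue42,123,00037,432,000
print([td.get_text(strip=True) for td in row("td")])  #['Total Revenue', '42,123,000', '37,432,000']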

Any ideas? Let me know if the question is not clear.

Thanks.

Okay, this may not be a perfect solution, I suppose, but the idea is to check the sibling rows that follow the header row for amounts and skip the rows that have none:

import re

#collect the text of every td in each row that follows the header row
next_rows = [[td.get_text(strip=True).encode("utf-8") for td in row('td')]
             for row in headers.find_next_siblings("tr")]

#keep only the cells that look like amounts, then drop rows with no amounts left
pattern = re.compile(r'^[\d,]+$')
data = [[item for item in l if pattern.match(item)] for l in next_rows]
data = [l for l in data if l]

with open('/home/kwal0203/Desktop/Apple.csv', 'a') as csvfile:
    writer = csv.writer(csvfile, delimiter="|")
    writer.writerows(data)

Produces:

42,123,000|37,432,000|45,646,000|57,594,000
26,114,000|22,697,000|27,699,000|35,748,000
16,009,000|14,735,000|17,947,000|21,846,000
1,686,000|1,603,000|1,422,000|1,330,000
3,158,000|2,850,000|2,932,000|3,053,000
11,165,000|10,282,000|13,593,000|17,463,000
307,000|202,000|225,000|246,000
11,472,000|10,484,000|13,818,000|17,709,000
11,472,000|10,484,000|13,818,000|17,709,000
3,005,000|2,736,000|3,595,000|4,637,000
8,467,000|7,748,000|10,223,000|13,072,000
8,467,000|7,748,000|10,223,000|13,072,000
8,467,000|7,748,000|10,223,000|13,072,000

These are basically all the amounts from the table.
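If you also want to keep the row labels, here is a possible variation of the filtering step. This is a sketch only, written in the same Python 2 style as the snippet above; it assumes the label sits in the first td of each row and reuses headers and pattern from the code above:

#sketch only: assumes the row label is the first td of each row; keeps the
#label plus any cells that look like amounts, and still drops label-only rows
labelled = []
for row in headers.find_next_siblings("tr"):
    cells = [td.get_text(strip=True).encode("utf-8") for td in row('td')]
    amounts = [c for c in cells[1:] if pattern.match(c)]
    if amounts:
        labelled.append([cells[0]] + amounts)

with open('/home/kwal0203/Desktop/Apple.csv', 'a') as csvfile:
    writer = csv.writer(csvfile, delimiter="|")
    writer.writerows(labelled)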
