
Formatting data to a CSV file

I wrote this page scraper using Python and Beautiful Soup to extract data from a table, and now I want to save it. The area I scraped is the infobox table on the right-hand side of the Wikipedia page. I need the bold labels on the left side to correspond to the values on the right side, so that, for example, Key people corresponds to the CEO. I'm new to this and need some advice on the best way to format the output. Thank you.

import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup

# download the page
myurl = requests.get("https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company")
# create BeautifulSoup object
soup = BeautifulSoup(myurl.text, 'html.parser')

# pull the element with class 'logo', which contains the company name
name = soup.find(class_ = 'logo')
# pull the div in the class
nameinfo = name.find('div')

# just grab the text inside the div
nametext = nameinfo.text

# print information about goodyear logo on wiki page
#print(nameinfo)

# now, print type of company, private or public
#status  = soup.find(class_ = 'category')
#for link in soup.select('td.category a'):
    #print(link.text)

# now get the ceo information
#for employee in soup.select('td.agent a'):
    #print(employee.text)

# print area served
#area = soup.find(class_ = 'infobox vcard')
#print(area)


# grab information in bold on the left hand side
vcard = soup.find(class_ = 'infobox vcard')
rows = vcard.find_all('tr')
for row in rows:
    cols=row.find_all('th')
    cols=[x.text.strip() for x in cols]
    print(cols)

# grab information in bold on the right hand side
vcard = soup.find(class_ = 'infobox vcard')
rows = vcard.find_all('tr')
for row in rows:
    cols2=row.find_all('td')
    cols2=[x.text.strip() for x in cols2]
    print(cols2)

# save to csv file named index
with open('index.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file) # actually write to the file
    writer.writerow([cols, cols2, datetime.now()]) # append time

You need to reorder your code a bit. It is also possible to find both th and td at the same time, which would solve your problem of the two columns needing to stay in sync:

import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup

myurl = requests.get("https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company")
soup = BeautifulSoup(myurl.text, 'html.parser')
vcard = soup.find(class_='infobox vcard')

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    for row in vcard.find_all('tr')[1:]:
        cols = row.find_all(['th', 'td'])
        # strip newlines and non-ASCII characters from each cell, add a timestamp
        csv_output.writerow([x.text.strip().replace('\n', ' ').encode('ascii', 'ignore').decode('ascii') for x in cols] + [datetime.now()])

This would create an output.csv file such as:

Type,Public,2018-03-27 17:12:45.146000
Tradedas,NASDAQ:GT S&P 500 Component,2018-03-27 17:12:45.147000
Industry,Manufacturing,2018-03-27 17:12:45.147000
Founded,"August29, 1898; 119 years ago(1898-08-29) Akron, Ohio, U.S.",2018-03-27 17:12:45.147000
Founder,Frank Seiberling,2018-03-27 17:12:45.147000
Headquarters,"Akron, Ohio, U.S.",2018-03-27 17:12:45.148000
Area served,Worldwide,2018-03-27 17:12:45.148000
Key people,"Richard J. Kramer (Chairman, President and CEO)",2018-03-27 17:12:45.148000
Products,Tires,2018-03-27 17:12:45.148000
Revenue,US$ 15.158 billion[1](2016),2018-03-27 17:12:45.149000
Operating income,US$ 1.52 billion[1](2016),2018-03-27 17:12:45.149000
Net income,US$ 1.264 billion[1](2016),2018-03-27 17:12:45.149000
Total assets,US$ 16.511 billion[1](2016),2018-03-27 17:12:45.150000
Total equity,US$ 4.507 billion[1](2016),2018-03-27 17:12:45.150000
Number of employees,"66,000[1](2017)",2018-03-27 17:12:45.150000
Subsidiaries,List of subsidiaries,2018-03-27 17:12:45.151000
Website,goodyear.com,2018-03-27 17:12:45.151000
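If you later want to look values up by their label (e.g. find the CEO under Key people), the two-column layout reads back naturally into a dict with `csv.reader`. A minimal sketch, using a `StringIO` stand-in with two rows copied from the output above instead of opening `output.csv` directly:

```python
import csv
import io

# Stand-in for output.csv; in practice use open('output.csv', newline='')
sample = io.StringIO(
    "Type,Public,2018-03-27 17:12:45.146000\n"
    'Key people,"Richard J. Kramer (Chairman, President and CEO)",2018-03-27 17:12:45.148000\n'
)

# Each row is (label, value, timestamp); keep label -> value
info = {}
for label, value, _timestamp in csv.reader(sample):
    info[label] = value

print(info['Key people'])  # → Richard J. Kramer (Chairman, President and CEO)
```

Note that `csv.reader` handles the quoted value containing commas for you, which is why writing with `csv.writer` (rather than joining strings by hand) pays off on the way back in.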
