Formatting data to a CSV file

I wrote this page scraper using Python and Beautiful Soup to extract data from a table, and now I want to save it. The area I scraped is the table on the right-hand side of the page. I need the bold part on the left side to correspond to the right side, so that "Key people" corresponds to the CEO, for example. I'm new to this and need some advice on the best way to format the output. Thank you.

import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup

# download the page
myurl = requests.get("https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company")
# create BeautifulSoup object
soup = BeautifulSoup(myurl.text, 'html.parser')

# pull the class containing all tire name
name = soup.find(class_ = 'logo')
# pull the div in the class
nameinfo = name.find('div')

# just grab the text in between the div tags
nametext = nameinfo.text

# print information about goodyear logo on wiki page

# now, print type of company, private or public
#status  = soup.find(class_ = 'category')
#for link in soup.select('td.category a'):
    #print link.text

# now get the ceo information
#for employee in soup.select('td.agent a'):
    #print employee.text

# print area served
#area = soup.find(class_ = 'infobox vcard')

# grab information in bold on the left hand side
vcard = soup.find(class_ = 'infobox vcard')
rows = vcard.find_all('tr')
for row in rows:
    cols = row.find_all('th')  # header cells (the bold labels)
    cols=[x.text.strip() for x in cols]
    print cols

# grab information in bold on the right hand side
vcard = soup.find(class_ = 'infobox vcard')
rows = vcard.find_all('tr')
for row in rows:
    cols2 = row.find_all('td')  # data cells (the values)
    cols2=[x.text.strip() for x in cols2]
    print cols2

# save to csv file named index
with open('index.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)  # actually write to the file
    writer.writerow([cols, cols2, datetime.now()])  # append the time

You need to reorder your code a bit. It is also possible to find both the th and td cells at the same time, which would solve your problem of the two columns needing to be in sync:

import requests
import csv
from datetime import datetime
from bs4 import BeautifulSoup

myurl = requests.get("https://en.wikipedia.org/wiki/Goodyear_Tire_and_Rubber_Company")
soup = BeautifulSoup(myurl.text, 'html.parser')
vcard = soup.find(class_='infobox vcard')

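# Python 2: the csv module expects the file to be opened in binary mode
with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)

    # skip the first row of the infobox, then write one CSV line per row
    for row in vcard.find_all('tr')[1:]:
        # th is the bold label on the left, td is the value on the right
        cols = row.find_all(['th', 'td'])
        csv_output.writerow([x.text.strip().replace('\n', ' ').encode('ascii', 'ignore') for x in cols] + [datetime.now()])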

This would create an output.csv file such as:

Type,Public,2018-03-27 17:12:45.146000
Tradedas,NASDAQ:GT S&P 500 Component,2018-03-27 17:12:45.147000
Industry,Manufacturing,2018-03-27 17:12:45.147000
Founded,"August29, 1898; 119 years ago(1898-08-29) Akron, Ohio, U.S.",2018-03-27 17:12:45.147000
Founder,Frank Seiberling,2018-03-27 17:12:45.147000
Headquarters,"Akron, Ohio, U.S.",2018-03-27 17:12:45.148000
Area served,Worldwide,2018-03-27 17:12:45.148000
Key people,"Richard J. Kramer (Chairman, President and CEO)",2018-03-27 17:12:45.148000
Products,Tires,2018-03-27 17:12:45.148000
Revenue,US$ 15.158 billion[1](2016),2018-03-27 17:12:45.149000
Operating income,US$ 1.52 billion[1](2016),2018-03-27 17:12:45.149000
Net income,US$ 1.264 billion[1](2016),2018-03-27 17:12:45.149000
Total assets,US$ 16.511 billion[1](2016),2018-03-27 17:12:45.150000
Total equity,US$ 4.507 billion[1](2016),2018-03-27 17:12:45.150000
Number of employees,"66,000[1](2017)",2018-03-27 17:12:45.150000
Subsidiaries,List of subsidiaries,2018-03-27 17:12:45.151000
Website,goodyear.com,2018-03-27 17:12:45.151000
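
The script above is written for Python 2 (binary mode 'wb' and .encode()). Under Python 3 the csv module expects a text-mode file opened with newline='', and no encoding step is needed. Here is a minimal sketch of the writing step, assuming the label/value pairs have already been extracted. The names write_rows and read_mapping and the sample rows are hypothetical, and an in-memory buffer stands in for the file so the example runs without a network request. It also reads the CSV back as a dict to show the label-to-value correspondence asked about in the question.

```python
import csv
import io
from datetime import datetime


def write_rows(rows, out_file, stamp=None):
    """Write one CSV line per (label, value) pair, followed by a timestamp."""
    # Assumption: one timestamp for the whole run; pass `stamp` to fix it.
    stamp = stamp or datetime.now()
    writer = csv.writer(out_file)
    for label, value in rows:
        # Collapse newlines and other runs of whitespace into single spaces.
        cleaned = [" ".join(cell.split()) for cell in (label, value)]
        writer.writerow(cleaned + [stamp.isoformat(sep=" ")])


def read_mapping(in_file):
    """Read the CSV back as a dict mapping each label to its value."""
    return {row[0]: row[1] for row in csv.reader(in_file) if len(row) >= 2}


# Hypothetical sample rows, standing in for the scraped th/td text.
sample_rows = [
    ("Type", "Public"),
    ("Headquarters", "Akron, Ohio, U.S."),
    ("Key people", "Richard J. Kramer\n(Chairman, President and CEO)"),
]

# newline="" keeps the csv module in charge of line endings.
buffer = io.StringIO(newline="")
write_rows(sample_rows, buffer, stamp=datetime(2018, 3, 27, 17, 12, 45))
buffer.seek(0)
info = read_mapping(buffer)
print(info["Key people"])  # Richard J. Kramer (Chairman, President and CEO)
```

To write a real file instead of the buffer, something like open('output.csv', 'w', newline='', encoding='utf-8') should work as the out_file argument. Values that contain commas are quoted automatically by csv.writer, so they survive the round trip.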
