
python web scraping data to csv

I am trying to scrape a web page with Python and write the results to a CSV file, but the output does not match the CSV format I expect.

Current output: [image: current output]

How can I produce the expected result? [image: expected result]

Thanks

Below is my script:

import urllib.request as req
import bs4
import csv
import pandas as pd
import re
from datetime import date, timedelta

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2021, 12, 10)
end_date = date(2021, 12, 15)
url="https://hkgoldprice.com/history/"

with open('gprice.csv','w',newline="") as f1:
    for single_date in daterange(start_date, end_date):
        udate = single_date.strftime("%Y/%m/%d")
        urld = url + single_date.strftime("%Y/%m/%d")
        writer=csv.writer(f1,delimiter = '\t',lineterminator='\n',)
        writer.writerows(udate)

        print(udate)
        with req.urlopen(urld) as response:
            data=response.read().decode("utf-8")
            root=bs4.BeautifulSoup(data, "html.parser")
            prices=root.find_all("div",class_="gp")
            gshops=root.find_all("div",class_="gshop")
            gpdate=root.find_all("div",class_="gp_date")
            for price in prices:
                print(price.text)
                row = price
                writer.writerows(row)

The first problem is that you use "writerows", which treats its argument as a sequence of rows. A string is itself iterable, so when the text is "2021/12/23" it is split into ['2', '0', '2', '1', '/', '1', '2', '/', '2', '3'] and each character is written as its own row. The prices have the same problem. So we use "writerow" instead and collect the data for one row in a list, which stops csv from spreading it over multiple rows.
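To make the difference concrete, here is a minimal sketch that writes to an in-memory buffer instead of a file. The helper names and the price values are made up for illustration and are not taken from the site.

```python
import csv
import io

def write_with_writerows(text):
    """Pass a plain string to writerows: every character becomes its own row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(text)
    return buf.getvalue()

def write_with_writerow(fields):
    """Pass a list to writerow: the whole list becomes one comma-separated row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue()

# The date string is split into ten one-character rows
chars_output = write_with_writerows("2021/12/23")
# The list (hypothetical prices) is written as a single row
row_output = write_with_writerow(["2021/12/23", "17300", "17320"])

print(repr(chars_output))
print(repr(row_output))
```

The first call produces ten lines with one character each, while the second produces a single line with three fields.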

The second problem is that .text in BeautifulSoup returns all of the element's text, including whitespace and line breaks, which makes the CSV output unpredictable. So I delete all whitespace and the '#' character first to avoid this.

Here is the modified code:
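The cleanup step can be tried on its own without fetching the page. This is a small sketch that assumes the raw text looks roughly like the sample string below; the actual markup on the site may differ, and clean_cell is a hypothetical helper name.

```python
def clean_cell(text):
    """Remove '#' and every whitespace character (spaces, tabs, newlines)."""
    without_hash = text.replace("#", "")
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops all whitespace in one pass
    return "".join(without_hash.split())

# Hypothetical raw text, similar to what .text might return for a price div
raw = "\n   # 17,300 \n\t "
cleaned = clean_cell(raw)
print(repr(cleaned))
```

Note that this also removes spaces inside the value, which is fine for prices but would merge words in a shop name.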

# Uses the imports, daterange(), start_date, end_date and url from the script in the question
with open('gprice.csv', 'w', newline="") as f1:
    # we write one row per date, so the default csv.writer settings are enough
    writer = csv.writer(f1)
    for single_date in daterange(start_date, end_date):
        udate = single_date.strftime("%Y/%m/%d")
        urld = url + udate
        # start the row with the date
        row_list = [udate]

        with req.urlopen(urld) as response:
            data = response.read().decode("utf-8")
            root = bs4.BeautifulSoup(data, "html.parser")
            prices = root.find_all("div", class_="gp")
            gshops = root.find_all("div", class_="gshop")
            gpdate = root.find_all("div", class_="gp_date")
            for price in prices:
                # get the inner text and delete '#'
                row = price.text.replace('#', '')
                # delete all whitespace and append the price
                row_list.append("".join(row.split()))
        # each date is a single row of data, so use "writerow" instead of "writerows"
        writer.writerow(row_list)

