
Need help for web scraping and saving it to Excel using csv

I need to web scrape a url and save the results to Excel like the image I uploaded, but I don't know what is wrong with my code. I get only one row in my Excel file. Please help me.

import requests
from bs4 import BeautifulSoup
import csv


for i in range(10):
    payload={'pageIndex':i}
    r=requests.post(url, params=payload)
    soup=BeautifulSoup(r.text, 'html.parser')
    table=soup.find('table')
    rows=table.find('tbody').find_all('tr')

    for j in range(len(rows)):
        col=rows[j].find_all('td')
        result=[]
        for item in col:
            result.append(item.get_text())

with open(r"C:\Users\lwt04\Desktop\TheaterInfo.csv","w",newline='') as out:
    theater = csv.writer(out)

with open(r"C:\Users\lwt04\Desktop\TheaterInfo.csv","a",newline='') as out:
    theater = csv.writer(out)
    theater.writerow(result)

Save the results to another list and write that list to the csv file.

import requests
from bs4 import BeautifulSoup
import csv

url='http://www.kobis.or.kr/kobis/business/mast/thea/findTheaterInfoList.do'
headers = ['City','District','Code','Name','NumScreen','NumSeats', 
           'Permanent', 'Registered', 'License','OpenDate','Run']

data=[]
for i in range(1,10):
    payload={'pageIndex':i}
    r=requests.post(url, params=payload)
    soup=BeautifulSoup(r.text, 'html.parser')
    table=soup.find("table", class_="tbl_comm")
    rows=table.find('tbody').find_all('tr')
    for row in rows:
        result=[]
        for cell in row.find_all(['td', 'th']):
            result.append(cell.get_text())
        if result:
            data.append(result)

with open(r"C:\Users\lwt04\Desktop\TheaterInfo.csv", 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(headers)
    writer.writerows(data)

Your code only stores the last theater - it is a logical error. You need to store each theater's result row in a list covering all theaters, and write that list to the file:

# ... your code snipped for brevity ...

theaters = []  # collect all theaters here

for i in range(10):
    payload={'pageIndex':i}

    # ... snip ...

    for j in range(len(rows)):
        col=rows[j].find_all('td')
        result=[]
        for item in col:
            result.append(item.get_text())

        theaters.append(result)

    # ... snip ...

headers = ['City','District','Code','Name','NumScreen','NumSeats', 
           'Permanent', 'Registered', 'License','OpenDate','Run']

# no need for 2 context's unless you have an existing file you want to delete
# every time you run your script
with open(r"C:\Users\lwt04\Desktop\TheaterInfo.csv","w",newline='') as out:
    theater = csv.writer(out)
    theater.writerow(headers)
    theater.writerows(theaters)  # writerowS here

If you want to append to an existing file and otherwise create it, look into "Check a file exists or not without try-catch block", and consider setting the opening mode in a variable: 'w' or 'a' depending on whether the file already exists. If the mode is 'w', write the header first; otherwise only write the data.
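A rough sketch of that pattern (the `save_rows` helper name and its layout are made up here, just to illustrate the mode check):

```python
import csv
import os

def save_rows(path, headers, rows):
    # Fresh file -> we still need to write the header row
    write_header = not os.path.exists(path)
    # Appending is safe either way: it creates the file if missing
    with open(path, 'a', newline='') as out:
        writer = csv.writer(out)
        if write_header:
            writer.writerow(headers)
        writer.writerows(rows)
```

Calling this once per scraping run keeps earlier data and writes the header only the first time.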


Addendum - you are not writing to Excel, you are writing a CSV file that can be opened by Excel. To write Excel files directly, use an appropriate module, for example openpyxl: https://openpyxl.readthedocs.io/en/stable/
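For example, a minimal sketch with openpyxl (the `save_xlsx` name and the sheet title are made up here):

```python
from openpyxl import Workbook

def save_xlsx(path, headers, rows):
    wb = Workbook()
    ws = wb.active
    ws.title = 'Theaters'
    ws.append(headers)      # first row: column names
    for row in rows:
        ws.append(row)      # one theater per row
    wb.save(path)
```

This produces a real .xlsx file, so Excel opens it without any import dialog.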

HTH

You can also use pandas for this purpose. You just have to do the following for result:

import pandas as pd
df = pd.DataFrame([result], columns=['City','District','Code','Name','NumScreen','NumSeats', 'Permanent', 'Registered', 'License','OpenDate','Run'])

df.to_csv('filename.csv', sep=',')
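If you collect all result rows in a list first, the same approach handles multiple entries; a sketch (the `rows_to_csv` helper name is made up here, the column list is the one from the answer):

```python
import pandas as pd

headers = ['City', 'District', 'Code', 'Name', 'NumScreen', 'NumSeats',
           'Permanent', 'Registered', 'License', 'OpenDate', 'Run']

def rows_to_csv(rows, path):
    # One DataFrame row per scraped theater, not just the last result
    df = pd.DataFrame(rows, columns=headers)
    df.to_csv(path, sep=',', index=False)  # index=False drops the row numbers
    return df
```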

For CSV

You can simply use result as it is only a single row of data. If you use listofresult instead, multiple entries can be handled.

listofresult = []
for i in range(10):
    payload={'pageIndex':i}
    r=requests.post(url, params=payload)
    soup=BeautifulSoup(r.text, 'html.parser')
    table=soup.find('table')
    rows=table.find('tbody').find_all('tr')

    for j in range(len(rows)):
        col=rows[j].find_all('td')
        result=[]
        for item in col:
            result.append(item.get_text())
        listofresult.append(result)  # inside the loop, so every row is kept

with open('filename.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Write the headers
    headers = ['City','District','Code','Name','NumScreen','NumSeats', 
           'Permanent', 'Registered', 'License','OpenDate','Run']
    writer.writerow(headers)
    writer.writerows([result])        # if you only have the single result row
    writer.writerows(listofresult)    # if you collected multiple rows in the loop
