
Saving whole Beautifulsoup array into excel using dataframe and xlsxwriter inside for loop

After going through a lot of documentation and searching Stack Overflow for answers, I couldn't find a solution to my issue.

Basically, I am using BeautifulSoup to scrape a list of data from a website and then store it in Excel. Scraping works fine.

When I run my script, it prints all of the items to the terminal. However, when I try to put the result into a DataFrame and save it to Excel, only the last line ends up in the Excel file.

I've tried moving the saving code inside the loop, but I get the same result. I've tried converting the list back into an array inside the for loop, but the issue is the same: only the last line gets saved to Excel.

I think I am missing a logical approach here. If someone could point me to what to look for, I would appreciate it a lot.

        soup = BeautifulSoup(html, features="lxml")
        soup.find_all("div", {"id":"tbl-lock"})

        for listing in soup.find_all('tr'):

            listing.attrs = {}

            assetTime = listing.find_all("td", {"class": "locked"})
            assetCell = listing.find_all("td", {"class": "assetCell"})
            assetValue = listing.find_all("td", {"class": "assetValue"})

            for data in assetCell:

                array = [data.get_text()]

                ### Excel Heading + data
                df = pd.DataFrame({'Cell': array
                                    })
                print(array)
                # In here it will print all of the data


        ### Now we need to save the data to excel
        ### Create a Pandas Excel writer using XlsxWriter as the Engine
        writer = pd.ExcelWriter(filename+'.xlsx', engine='xlsxwriter')

        ### Convert the dataframe to an XlsxWriter Excel object and skip first row for custom header
        df.to_excel(writer, sheet_name='SheetName', startrow=1, header=False)

        ### Get the xlsxwriter workbook and worksheet objects

        workbook = writer.book
        worksheet = writer.sheets['SheetName']

        ### Custom header for Excel
        header_format = workbook.add_format({
            'bold': True,
            'text_wrap': True,
            'valign': 'top',
            'fg_color': '#D7E4BC',
            'border': 1
        })

        ### Write the column headers with the defined add_format
        print(df) ### In here it will print only 1 line
        for col_num, value in enumerate(df):

            worksheet.write(0, col_num + 1, value, header_format)

        ### Close Pandas Excel writer and output the Excel file
        writer.save()

This line is the problem: df = pd.DataFrame({'Cell': array}). It runs on every iteration and overwrites df each time, so only the last line is stored.
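To make the overwrite concrete, here is a minimal sketch in plain Python. The `cells` list is hypothetical stand-in data, not output from the real page; it only illustrates the difference between rebinding a name inside a loop and accumulating into a container created before the loop.

```python
cells = ["A1", "B2", "C3"]  # hypothetical scraped values

# Rebinding inside the loop: each pass throws away the previous value.
for text in cells:
    last_only = [text]

# Accumulating: create the container once, then add to it on each pass.
collected = []
for text in cells:
    collected.append(text)

print(last_only)   # ['C3']
print(collected)   # ['A1', 'B2', 'C3']
```

The question's loop follows the first pattern, which is why print(array) shows every item while df holds only the final one.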

Instead, initialize df once before the loop as df = pd.DataFrame(columns=['Cell']) (the column name must match the one used in the loop), and inside the loop do this:

df = df.append(pd.DataFrame({'Cell': array}),ignore_index=True)
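Note that DataFrame.append was removed in pandas 2.0, so the line above only works on older versions. A rough sketch of two equivalents follows, assuming a hypothetical `cells` list in place of the scraped values: either collect plain values and build the DataFrame once, or build one small frame per item and join them with pd.concat.

```python
import pandas as pd

cells = ["A1", "B2", "C3"]  # hypothetical scraped values

# Option 1: collect plain values, then build the DataFrame once after the loop.
rows = []
for text in cells:
    rows.append(text)
df = pd.DataFrame({"Cell": rows})

# Option 2: keep one small frame per item and concatenate them at the end.
frames = []
for text in cells:
    frames.append(pd.DataFrame({"Cell": [text]}))
df2 = pd.concat(frames, ignore_index=True)

print(df)
print(df2)
```

Option 1 is usually the cheaper choice, since growing a DataFrame row by row copies the data on every step.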

EDIT:

Try this:

soup = BeautifulSoup(html, features="lxml")
soup.find_all("div", {"id":"tbl-lock"})

# Create the DataFrame once, before the loop, so it is not overwritten
df = pd.DataFrame(columns=['Cell'])
for listing in soup.find_all('tr'):

    listing.attrs = {}

    assetTime = listing.find_all("td", {"class": "locked"})
    assetCell = listing.find_all("td", {"class": "assetCell"})
    assetValue = listing.find_all("td", {"class": "assetValue"})

    for data in assetCell:

        array = [data.get_text()]

        ### Excel Heading + data
        # Note: DataFrame.append was removed in pandas 2.0; newer versions need pd.concat
        df = df.append(pd.DataFrame({'Cell': array}), ignore_index=True)
        ## Or this
        # df = df.append(pd.DataFrame({'Cell': array}))

        print(array)
        # In here it will print all of the data

. . . . Rest of Code
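For reference, here is a hedged end-to-end sketch of how the pieces could fit together on current pandas. Everything specific in it is an assumption for illustration: the inline `html` string stands in for the real page, `html.parser` replaces lxml to avoid an extra dependency, and `out_path` is a hypothetical temporary file. It also uses pd.ExcelWriter as a context manager, because ExcelWriter.save() was removed in pandas 2.0.

```python
import os
import tempfile

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page.
html = """
<div id="tbl-lock"><table>
  <tr><td class="locked">10:00</td><td class="assetCell">Alpha</td><td class="assetValue">1</td></tr>
  <tr><td class="locked">10:05</td><td class="assetCell">Beta</td><td class="assetValue">2</td></tr>
  <tr><td class="locked">10:10</td><td class="assetCell">Gamma</td><td class="assetValue">3</td></tr>
</table></div>
"""

soup = BeautifulSoup(html, features="html.parser")

# Accumulate every cell text in a list created before the loops.
cells = []
for listing in soup.find_all("tr"):
    for data in listing.find_all("td", {"class": "assetCell"}):
        cells.append(data.get_text())

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame({"Cell": cells})

out_path = os.path.join(tempfile.mkdtemp(), "assets.xlsx")  # hypothetical file name

# The context manager closes the writer and writes the file on exit.
with pd.ExcelWriter(out_path, engine="xlsxwriter") as writer:
    # Skip the first row so the custom header can be written there.
    df.to_excel(writer, sheet_name="SheetName", startrow=1, header=False)

    workbook = writer.book
    worksheet = writer.sheets["SheetName"]
    header_format = workbook.add_format({
        "bold": True,
        "text_wrap": True,
        "valign": "top",
        "fg_color": "#D7E4BC",
        "border": 1,
    })

    # Column 0 holds the index, so the data headers start at column 1.
    for col_num, value in enumerate(df.columns):
        worksheet.write(0, col_num + 1, value, header_format)
```

The file is written once, after the header loop has finished, so every collected row and every header cell is included.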
