
Appending to next row in dataframe, from within a for loop

I've created a web scraper that scrapes the Yahoo Finance Summary and Statistics pages of a stock, for Python programming educational purposes only. It reads from the '1stocklist.csv' file in the program's directory, which looks like this:

Symbols
SNAP
KO

From there, it adds the new information as new columns in the dataframe, as it should. There are a lot of 'for' loops in there and I'm still tweaking it, since it isn't grabbing some data correctly, but that's fine for now.

My problem is saving the dataframe to a new .csv file. As you'll see, the current output looks something like this:

[screenshot: incorrect output]

The SNAP row should begin with 14.02 and everything to its right, and the next row should be KO, beginning with 51.39 onward.

Any ideas? Just create a 1stocklist.csv file that looks like the above and try it. Thanks!

# Import dependencies
from bs4 import BeautifulSoup
import re, random, time, requests, datetime, csv
import pandas as pd
import numpy as np


# Use Pandas to read the "1stocklist.csv" file. We'll use Pandas so that we can append a 'dataframe' with new
# information we get from the Zacks site to work with in the program and output to the 'data(date).csv' file later
maindf = pd.read_csv('1stocklist.csv', skiprows=1, names=[
    # The .csv header names
    "Symbols"
    ])  # , delimiter = ','

# Setting a random time delay will help keep scraping suspicion down and server load down when scraping
timeDelay = random.randrange(2, 8)

# Start scraping Yahoo
print('Beginning to scrape Yahoo Finance site for information ...')
tickerlist = len(maindf['Symbols'])  # for the progress counter

# Create a progress counter to display how far along the scraping is
zackscounter = 1

# For every ticker in the stocklist dataframe
for ticker in maindf['Symbols']:

    # Print the progress
    print(zackscounter, ' of ', tickerlist, ' - ', ticker)  # shows which stock it's currently on

    # The URLs for the stock's pages to scrape the information from
    summaryurl = 'https://ca.finance.yahoo.com/quote/' + ticker
    statsurl = 'https://ca.finance.yahoo.com/quote/' + ticker + '/key-statistics'

    # Define the headers to use with requests / Beautiful Soup 4
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

    # Employ the random time delay before starting with the (next) ticker
    time.sleep(timeDelay)

    # Use Beautiful Soup 4 to get the info from the first Summary URL page
    page = requests.get(summaryurl, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    counter = 0  # used to tell which 'span' it's currently looking at
    table = soup.find('div', {'id': 'quote-summary'})
    for i in table.find_all('span'):
        counter += 1
        if counter % 2 == 0:  # all even entries are the metrics/numbers we want
            data_point = i.text
            maindf[column_name] = data_point  # add the data point to the right column
        else:                 # all odd entries are the header names
            column_name = i.text

    # Use Beautiful Soup 4 to get the info from the second stats URL page
    page = requests.get(statsurl, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    time.sleep(timeDelay)

    # Get all the data in the tables
    counter = 0  # used to tell which 'td' it's currently looking at
    table = soup.find('section', {'data-test': 'qsp-statistics'})
    for i in table.find_all('td'):
        counter += 1
        if counter % 2 == 0:  # all even td's are the metrics/numbers we want
            data_point = i.text
            maindf[column_name] = data_point  # add the data point to the right column
        else:                 # all odd td's are the header names
            column_name = i.text

    # Write the header on the first pass, then append without it
    file_name = 'data_raw.csv'
    if zackscounter == 1:
        maindf.to_csv(file_name, index=False)
    else:
        maindf.to_csv(file_name, index=False, header=False, mode='a')

    zackscounter += 1
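Note that the header-once append pattern at the end of the loop only produces one CSV line per ticker if each frame written holds a single row; here `maindf` holds the full symbols column, so every pass rewrites the whole table. A minimal sketch of how the intended pattern behaves in isolation, using hypothetical hard-coded rows in place of scraped data:

```python
import os
import tempfile
import pandas as pd

# Hypothetical per-ticker rows standing in for the scraped values.
rows = [{"Symbols": "SNAP", "Open": "14.02"},
        {"Symbols": "KO", "Open": "51.39"}]

path = os.path.join(tempfile.mkdtemp(), "data_raw.csv")
for count, row in enumerate(rows, start=1):
    df = pd.DataFrame([row])          # exactly one row per write
    if count == 1:
        df.to_csv(path, index=False)  # header + first data row
    else:
        df.to_csv(path, index=False, header=False, mode="a")  # append row only

result = pd.read_csv(path)
print(result)  # two rows: SNAP/14.02 and KO/51.39
```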

UPDATE:

I know it's something to do with how I'm trying to append the dataframe to the .csv file at the end. My beginning dataframe is just one column with all the ticker symbols in it; the program then tries to add each new column to the dataframe as it goes along, filling it down to the bottom of the ticker list. What I want is to add the column_name header as it should, then append the data specific to that one ticker, and do that for each ticker in the "Symbols" column of my dataframe. Hope that provides some clarity on the issue?

I've tried using .loc in various ways, but with no success. Thanks!
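The symptom described above is consistent with how `maindf[column_name] = data_point` behaves: assigning a scalar to a column broadcasts it to every row, so each ticker's values overwrite the previous ticker's. A small sketch of the broadcast, and of a row-targeted `.loc` assignment that avoids it (the "Open" column here is an illustrative stand-in for a scraped metric):

```python
import pandas as pd

# Scalar column assignment broadcasts to EVERY row ...
df = pd.DataFrame({"Symbols": ["SNAP", "KO"]})
df["Open"] = "14.02"   # intended for SNAP only
df["Open"] = "51.39"   # ... so KO's value overwrites both rows

# Row-targeted assignment with .loc touches only the matching row.
df2 = pd.DataFrame({"Symbols": ["SNAP", "KO"]})
df2.loc[df2["Symbols"] == "SNAP", "Open"] = "14.02"
df2.loc[df2["Symbols"] == "KO", "Open"] = "51.39"

print(df)   # both rows now hold 51.39
print(df2)  # each row keeps its own value
```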

UPDATE WITH ANSWER

I was able to figure it out!

Basically, I changed the first dataframe that reads from 1stocklist.csv to be its own dataframe, then created a new blank one to work with from within the first for loop. Here is the updated head of the script:

# Use Pandas to read the "1stocklist.csv" file. We'll use Pandas so that we can append a 'dataframe' with new
# information we get from the Zacks site to work with in the program and output to the 'data(date).csv' file later
opening_dataframe = pd.read_csv('1stocklist.csv', skiprows=1, names=[
    # The .csv header names
    "Symbols"
    ])  # , delimiter = ','

# Setting a random time delay will help keep scraping suspicion down and server load down when scraping
timeDelay = random.randrange(2, 8)

# Start scraping Yahoo
print('Beginning to scrape Yahoo Finance site for information ...')
tickerlist = len(opening_dataframe['Symbols'])  # for the progress counter

# Create a progress counter to display how far along the scraping is
zackscounter = 1

# For every ticker in the stocklist dataframe
for ticker in opening_dataframe['Symbols']:

    # A fresh one-row dataframe for this ticker
    maindf = pd.DataFrame(columns=['Symbols'])
    maindf.loc[len(maindf)] = ticker

    # Print the progress
    print(zackscounter, ' of ', tickerlist, ' - ', ticker)  # shows which stock it's currently on

    # The URLs for the stock's pages to scrape the information from
    summaryurl = 'https://ca.finance.yahoo.com/quote/' + ticker
    statsurl = 'https://ca.finance.yahoo.com/quote/' + ticker + '/key-statistics'
......
......
......

Notice the "opening_dataframe = ..." name change, and the

    maindf = pd.DataFrame(columns=['Symbols'])
    maindf.loc[len(maindf)] = ticker

part. I also use .loc to add to the next available row in the dataframe. Hopefully this helps someone!
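Putting the answer's pieces together: with a fresh one-row frame per ticker, the scalar column assignments only ever touch that ticker's row, and the header-once append writes exactly one CSV line per symbol. A runnable sketch of the corrected structure, where `scrape_summary()` is a hypothetical stand-in for the BeautifulSoup scraping code:

```python
import os
import tempfile
import pandas as pd

def scrape_summary(ticker):
    """Hypothetical stand-in returning {header: value} pairs per ticker."""
    fake = {"SNAP": {"Open": "14.02"}, "KO": {"Open": "51.39"}}
    return fake[ticker]

opening_dataframe = pd.DataFrame({"Symbols": ["SNAP", "KO"]})
path = os.path.join(tempfile.mkdtemp(), "data_raw.csv")

for count, ticker in enumerate(opening_dataframe["Symbols"], start=1):
    # Fresh one-row frame: .loc[len(maindf)] fills the next available row
    maindf = pd.DataFrame(columns=["Symbols"])
    maindf.loc[len(maindf)] = ticker

    # Safe now: the frame has a single row, so column assignment
    # only affects this ticker
    for column_name, data_point in scrape_summary(ticker).items():
        maindf[column_name] = data_point

    # Header on the first pass, append-only afterwards
    if count == 1:
        maindf.to_csv(path, index=False)
    else:
        maindf.to_csv(path, index=False, header=False, mode="a")

result = pd.read_csv(path)
print(result)  # one row per ticker, each with its own values
```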
