

It prints only the last article from the site. Need to print all

Below is my code. print(news_csv) works fine and prints all the articles I want, but news_csv.to_csv('bbb.csv') writes out only the last article.

import pandas as pd
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    if article.a is None:
        continue
    headline = article.a.text
    summary = article.p.text
    link = "https://www.vanglaini.org" + article.a['href']
    #print(headline)
    #print(summary)
    #print(link)
    news_csv = pd.DataFrame({'Headline': [headline],
                             'Summary': [summary],
                             'Link': [link],
                             })
    print(news_csv)
    news_csv.to_csv('bbb.csv')

#print()


Only the last article is printed to the CSV. Help.

You have defined the variable news_csv inside the for loop. That means it is overwritten on each iteration over the articles, which is why only the last article ends up in the csv file. In fact, the file itself is constantly overwritten as well.

Instead, your content should be appended to a container object and then saved as csv only once the for loop has completed.
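If pandas is not strictly required for the final step, the same "collect first, write once" pattern can be expressed with the standard library's csv module. This is only a minimal sketch reusing the scraping loop from the question; the filename bbb.csv is kept from the original code:

import csv
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')

rows = []  # accumulate one (headline, summary, link) tuple per article
for article in soup.find_all('article'):
    if article.a is None:
        continue
    rows.append((article.a.text,
                 article.p.text,
                 "https://www.vanglaini.org" + article.a['href']))

# write the file once, after the loop has finished
with open('bbb.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Headline', 'Summary', 'Link'])
    writer.writerows(rows)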

If you really want to use a pandas DataFrame, you should follow the very last example provided in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html

Append all the article content to a list and then generate the DataFrame object using pd.concat().

Here is how I would write it:

import pandas as pd
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')

articles = []  # collects one (headline, summary, link) tuple per article
for article in soup.find_all('article'):
    if article.a is None:
        continue
    headline = article.a.text
    summary = article.p.text
    link = "https://www.vanglaini.org" + article.a['href']

    articles.append((headline, summary, link))
    print(f'Headline: {headline}\nSummary: {summary}\nLink: {link}')

# build the DataFrame once, after the loop has collected every article
news_dataframe = pd.concat(
    [pd.DataFrame([article], columns='Headline Summary Link'.split()) for article in articles],
    ignore_index=True,
)

news_dataframe.to_csv('bbb.csv')
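As a follow-up note (an alternative sketch, not part of the original answer): since articles is already a list of tuples, the DataFrame can also be built in a single constructor call, and the DataFrame.append method linked above has since been deprecated and removed in current pandas, so pd.concat or the constructor is the way to go. Passing index=False to to_csv keeps the numeric row index out of the file:

news_dataframe = pd.DataFrame(articles, columns=['Headline', 'Summary', 'Link'])
news_dataframe.to_csv('bbb.csv', index=False)  # index=False omits the numeric row index column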
