
I am trying to extract data from a website in Python

def convert():
    for url in url_list:
        news=Article(url)
        news.download()
        while news.download_state != 2:
            time.sleep(1)
        news.parse()
        l.append(
            {'Title':news.title, 'Text': news.text.replace('\n',' '), 'Date':news.publish_date, 'Author':news.authors}
        )

convert()
df = pd.DataFrame.from_dict(l)
df.to_csv('Amazon_try2'+'.csv',encoding='utf-8', index=False)

The function convert() goes through a list of URLs and processes each of them. Each URL is a link to an article. I fetch the important attributes of each article, such as the author and the text, and store them in a data frame. After that, I write the data frame to a CSV file. The script ran for about 5 hours, as there were 589 URLs in url_list, but I still couldn't get the CSV file. Can somebody spot where I am going wrong?

Assuming this is your whole program, you need to return l from convert() and assign the result before building the data frame.

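import time

import pandas as pd
from newspaper import Article  # assumption: Article comes from the newspaper3k package

def convert():
    l = []  # collect one dict per article
    for url in url_list:  # url_list is assumed to be defined earlier
        news = Article(url)
        news.download()
        while news.download_state != 2:
            time.sleep(1)
        news.parse()
        l.append(
            {'Title': news.title, 'Text': news.text.replace('\n', ' '), 'Date': news.publish_date, 'Author': news.authors}
        )
    return l

l = convert()
df = pd.DataFrame.from_dict(l)
df.to_csv('Amazon_try2.csv', encoding='utf-8', index=False)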

Your function probably stops here:

    while news.download_state != 2:
        time.sleep(1)

It waits for the download state to change, but that never happens (for example, if a download does not succeed, the state never becomes 2), so the loop runs forever. Your function should also return a list.
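If you do want to wait for a state change, a bounded wait avoids hanging forever. The following is only a minimal sketch: wait_until is a hypothetical helper (not part of any library), and the timeout and interval values are arbitrary assumptions.

```python
import time


def wait_until(predicate, timeout=30.0, interval=1.0):
    """Poll predicate() until it is true or `timeout` seconds have passed.

    Hypothetical helper for illustration. Returns True if the condition
    was met in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One last check so a condition met right at the deadline still counts.
    return bool(predicate())
```

In the loop from the question it could be used as `if not wait_until(lambda: news.download_state == 2): continue`, which skips a URL whose state never changes instead of blocking the whole run.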

Something like this should work:

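import pandas as pd
from newspaper import Article  # assumption: Article comes from the newspaper3k package

def convert():
    l = []  # collect one dict per article
    for url in url_list:  # url_list is assumed to be defined earlier
        news = Article(url)
        news.download()  # no polling loop: parse right after the download call

        news.parse()
        l.append(
            {'Title': news.title, 'Text': news.text.replace('\n', ' '), 'Date': news.publish_date, 'Author': news.authors}
        )
    return l

l = convert()
df = pd.DataFrame.from_dict(l)
df.to_csv('Amazon_try2.csv', encoding='utf-8', index=False)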
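With 589 URLs and a run that takes hours, it may also help to write each row as soon as it is ready and to skip URLs that fail, so one bad article does not cost the whole result. This is a sketch under assumptions, not the original code: fetch is a hypothetical callable that you would write yourself (for instance, wrapping the Article download and parse steps) and that returns a dict with the four keys used above; it uses the standard csv module instead of pandas.

```python
import csv

FIELDS = ["Title", "Text", "Date", "Author"]


def convert(url_list, fetch, out_path):
    """Write one CSV row per URL as soon as it is fetched.

    `fetch` is a hypothetical callable: fetch(url) -> dict with FIELDS keys.
    Returns a list of (url, error message) pairs for URLs that failed.
    """
    failed = []
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for url in url_list:
            try:
                row = fetch(url)
            except Exception as exc:  # keep going; record what went wrong
                failed.append((url, str(exc)))
                continue
            writer.writerow(row)
            fh.flush()  # partial results survive if the script is interrupted
    return failed
```

The returned list tells you which URLs to retry, and the CSV file already contains every article that was processed successfully, even if the script is stopped early.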


