I am trying to extract data from a website in Python
def convert():
    for url in url_list:
        news = Article(url)
        news.download()
        while news.download_state != 2:
            time.sleep(1)
        news.parse()
        l.append(
            {'Title': news.title, 'Text': news.text.replace('\n', ' '), 'Date': news.publish_date, 'Author': news.authors}
        )

convert()
df = pd.DataFrame.from_dict(l)
df.to_csv('Amazon_try2'+'.csv', encoding='utf-8', index=False)
The function convert() goes through a list of URLs and processes each of them. Each URL is a link to an article. I fetch the important attributes of each article, such as author and text, and store them in a data frame. After that, I convert the data frame to a CSV file. The script ran for about 5 hours, as there were 589 URLs in url_list, but I still did not get the CSV file. Can somebody spot where I am going wrong?
Assuming this is your whole program, you need to return l from convert.
def convert():
    l = []  # collect the rows here so the function can return them
    for url in url_list:
        news = Article(url)
        news.download()
        while news.download_state != 2:
            time.sleep(1)
        news.parse()
        l.append(
            {'Title': news.title, 'Text': news.text.replace('\n', ' '), 'Date': news.publish_date, 'Author': news.authors}
        )
    return l

l = convert()
df = pd.DataFrame.from_dict(l)
df.to_csv('Amazon_try2'+'.csv', encoding='utf-8', index=False)
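
To show why the return matters, here is a minimal sketch with a hypothetical collect() function and made-up titles standing in for parsed articles. A list built inside a function only reaches the caller if the function returns it.

```python
def collect(titles):
    """Build one row per title and hand the list back to the caller."""
    rows = []  # local list, created fresh on every call
    for title in titles:
        rows.append({'Title': title})
    return rows  # without this line the caller would receive None


rows = collect(['first', 'second'])
print(len(rows))
```

The same pattern applies to convert(): create the list inside the function, append to it in the loop, and return it at the end.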
Your function probably stops here:
while news.download_state != 2:
    time.sleep(1)
It is waiting for the download state to change, but that never happens. Your function should also return a list.
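
To make the hang concrete, here is a small simulation. It does not use the real library: FakeArticle is a hypothetical stand-in, and the state codes are assumptions (2 for success comes from the check in the question, the other values are made up for the demo). The idea is that if download() is a blocking call, the state is already final when it returns, so after a failed download a loop waiting for 2 can never exit. The sketch adds a deadline so the wait gives up and the difference is visible.

```python
import time

# Assumed state codes; only SUCCESS == 2 is taken from the question's check.
NOT_STARTED, FAILED, SUCCESS = 0, 1, 2


class FakeArticle:
    """Hypothetical stand-in for an article object; does no network I/O."""

    def __init__(self, url, will_fail):
        self.url = url
        self.will_fail = will_fail
        self.download_state = NOT_STARTED

    def download(self):
        # Blocking call: when it returns, the state is already final.
        self.download_state = FAILED if self.will_fail else SUCCESS


def wait_for_download(article, timeout=0.3, poll=0.05):
    """Poll like the original loop, but give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while article.download_state != SUCCESS:
        if time.monotonic() >= deadline:
            return False  # the unbounded loop would keep sleeping here
        time.sleep(poll)
    return True


ok = FakeArticle('https://example.com/ok', will_fail=False)
bad = FakeArticle('https://example.com/bad', will_fail=True)
ok.download()
bad.download()
print(wait_for_download(ok))   # True
print(wait_for_download(bad))  # False
```

With 589 URLs, a single failed download is enough to stall the whole run in the unbounded version of the loop.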
Something like this should work:
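# Assumption: Article comes from the newspaper package, whose API matches the calls below.
from newspaper import Article
import pandas as pd

def convert():
    l = []  # rows collected for the data frame
    for url in url_list:  # url_list is the list of article URLs from the question
        news = Article(url)
        news.download()
        news.parse()
        l.append(
            {'Title': news.title, 'Text': news.text.replace('\n', ' '), 'Date': news.publish_date, 'Author': news.authors}
        )
    return l

l = convert()
df = pd.DataFrame.from_dict(l)
df.to_csv('Amazon_try2'+'.csv', encoding='utf-8', index=False)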
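
Since the run takes hours, it may also be worth writing each row as soon as it is fetched and skipping URLs that fail, so one bad link does not cost you the whole file. Below is a rough sketch of that idea using only the standard csv module. The fetch argument is a hypothetical callable that returns a dict with the four fields or raises on failure; in your script it would wrap the download and parse calls. fake_fetch exists only for the demo.

```python
import csv
import io

FIELDS = ['Title', 'Text', 'Date', 'Author']


def convert(urls, fetch, fh):
    """Write one CSV row per URL as soon as it is fetched; return failed URLs."""
    failed = []
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for url in urls:
        try:
            row = fetch(url)
        except Exception as exc:  # broad on purpose: one bad URL must not stop the run
            failed.append((url, str(exc)))
            continue
        writer.writerow(row)
        fh.flush()  # rows written so far survive an interrupted run
    return failed


def fake_fetch(url):
    """Hypothetical fetcher used only for this demo."""
    if 'bad' in url:
        raise ValueError('download failed')
    return {'Title': 'T', 'Text': 'some text', 'Date': '', 'Author': 'A'}


buf = io.StringIO()  # stands in for a real file in this demo
failed = convert(['https://example.com/a', 'https://example.com/bad'], fake_fetch, buf)
print(failed)
print(buf.getvalue())
```

For a real run you would pass a file opened with open('Amazon_try2.csv', 'w', newline='', encoding='utf-8') instead of the StringIO buffer, and check the returned list afterwards to see which URLs need a retry.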