How to save multiple outputs to multiple files, where each file is named with a title taken from an object, in Python?
I'm scraping an RSS feed from a website ( http://www.gfrvitale.altervista.org/index.php/autismo-in?format=feed&type=rss ). I have written a script to extract and clean the text of every item in the feed. My main problem is saving the text of each item to a different file; I also need to name each file with the proper title extracted from the item. My code is:
for item in myFeed["items"]:
    time_structure = item["published_parsed"]
    dt = datetime.fromtimestamp(mktime(time_structure))
    if dt > t:
        link = item["link"]
        response = requests.get(link)
        doc = Document(response.text)
        doc.summary(html_partial=False)
        # extracting text
        h = html2text.HTML2Text()
        # converting
        h.ignore_links = True  # ignore links
        h.skip_internal_links = True  # ignore external links
        h.inline_links = True
        h.ignore_images = True  # ignore image links
        h.ignore_emphasis = True
        h.ignore_anchors = True
        h.ignore_tables = True
        testo = h.handle(doc.summary())  # extracted text
        s = doc.title() + "." + " " + testo  # content to write to the final file
        tit = item["title"]
        # save each file with its proper title
        with codecs.open("testo_%s", %tit "w", encoding="utf-8") as f:
            f.write(s)
            f.close()
The error is:
  File "<ipython-input-57-cd683dec157f>", line 34
    with codecs.open("testo_%s", %tit "w", encoding="utf-8") as f:
                                 ^
SyntaxError: invalid syntax
You need to put the comma after %tit. It should be:
# save each file with its proper title
with codecs.open("testo_%s" % tit, "w", encoding="utf-8") as f:
    f.write(s)
    f.close()
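To make the operator placement concrete, here is a small sketch (the title value is made up for illustration). The % operator has to sit directly between the format string and its value, so that the whole expression is evaluated before the comma that separates it from the next argument. Two other standard spellings are shown for comparison.

```python
tit = "esempio"  # hypothetical title, for illustration only

name_percent = "testo_%s" % tit        # printf-style formatting
name_format = "testo_{}".format(tit)   # str.format
name_fstring = f"testo_{tit}"          # f-string (Python 3.6+)
```

All three build the same string, which can then be passed as the first argument to open.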
However, if the file name contains invalid characters, it will raise an error (e.g. [Errno 22]). You can try this code:
...
tit = item["title"]
# Not the best way, but it may help for now (a list of characters to strip would be better)
tit = tit.replace(' ', '').replace("'", "").replace('?', '')
with codecs.open("testo_%s" % tit, "w", encoding="utf-8") as f:
    f.write(s)
    f.close()
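The "list of characters to strip" idea mentioned in the comment above could be sketched roughly as follows. The helper name safe_filename and the exact character set are assumptions, not part of the original code; the set used here is the one Windows rejects in file names, plus control characters.

```python
import re

# Assumption: characters that Windows rejects in file names, plus control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, replacement="_"):
    """Return a version of `title` that should be usable as a file name."""
    cleaned = _FORBIDDEN.sub(replacement, title)
    # Collapse runs of whitespace, then trim leading/trailing spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Fall back to a fixed name if nothing is left.
    return cleaned or "untitled"
```

With this in place, the title could be cleaned with tit = safe_filename(item["title"]) before building the file name. Replacing characters rather than deleting them keeps the words in the title readable.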
Another way, using nltk:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tit = item["title"]
tit = tokenizer.tokenize(tit)
tit = ''.join(tit)
with codecs.open("testo_%s" % tit, "w", encoding="utf-8") as f:
    f.write(s)
    f.close()
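If pulling in nltk only for this step feels heavy, roughly the same result can be had with the standard library, since the tokenizer above is just collecting runs of \w characters. This is a sketch under that assumption; the function name is made up for illustration.

```python
import re


def title_to_token_name(title):
    """Join all runs of word characters in `title`, dropping everything else."""
    return "".join(re.findall(r"\w+", title))
```

Like the nltk version, this removes spaces and punctuation entirely, so the words of the title run together in the file name.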
First off, you misplaced the comma: it should be after the %tit, not before. Secondly, you don't need to close the file, because the with statement you are using does that for you automatically. And where did codecs come from? I don't see it anywhere else... Anyway, the correct with statement would be:
with open("testo_%s" % tit, "w", encoding="utf-8") as f:
    f.write(s)
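Putting the pieces together, a minimal sketch of the saving step might look like this. It assumes Python 3 (where the built-in open accepts encoding) and items shaped as dictionaries with "title" and "text" keys; the function name save_items, the out_dir parameter, the ".txt" extension and the clean-up rule are all hypothetical choices, not something from the original script.

```python
import os
import re


def save_items(items, out_dir):
    """Write each item's text to out_dir/testo_<title>.txt and return the paths."""
    paths = []
    for item in items:
        # Replace every run of characters other than word characters or hyphens.
        name = re.sub(r"[^\w\-]+", "_", item["title"]).strip("_") or "untitled"
        path = os.path.join(out_dir, "testo_%s.txt" % name)
        # The with statement closes the file on exit, so no f.close() is needed.
        with open(path, "w", encoding="utf-8") as f:
            f.write(item["text"])
        paths.append(path)
    return paths
```

Note that two items whose titles clean up to the same name would overwrite each other; appending an index or the publication date to the name would be one way around that.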