Reading multiple urls from text file and processing web page
The input to the script is a text file containing multiple web page URLs. The intended steps of the script are as follows:
Here is the content of the input file urloutshort.txt:
Here is the script:
import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname = url.replace('http://', ' ')
        fname = fname.replace('/', ' ')
        print(fname)
        cln = clean_me(page)
        with open(fname + '.txt', 'w') as outfile:
            outfile.write(cln + "\n")
Here is the error message:
python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
File "webpage_A.py", line 43, in <module>
with open (fname +'.txt', 'w') as outfile:
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'
The problem seems to be related to reading the URLs from the text file, because if I bypass reading the input file and hard-code just one of the URLs, the script processes the web page and saves the result to a txt file whose name is extracted from the URL. I have searched SO on this topic but have not found a solution.
Any help with this problem would be greatly appreciated.
The problem is with the following code:
with open(fname + '.txt', 'w') as outfile:
    outfile.write(cln + "\n")
fname contains a trailing "\n", which makes it an invalid file name to open. All you need to do is change it to this:
with open(fname.rstrip() + '.txt', 'w') as outfile:
    outfile.write(cln + "\n")
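The root cause is that iterating over a file object yields each line with its trailing newline intact, and the newline survives the replace() calls. A minimal sketch (using an in-memory file for illustration, with a made-up example URL):

```python
import io

# Simulate reading URLs from a text file: each iterated line
# keeps its trailing "\n", which then leaks into the filename.
filein = io.StringIO("http://example.com/page\n")
for url in filein:
    fname = url.replace('http://', '').replace('/', ' ')
    assert fname.endswith('\n')               # newline survives the replaces
    assert not fname.rstrip().endswith('\n')  # rstrip() removes it
```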
The complete code with the fix:
import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import re
import html5lib

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        if "http" in url:
            page = requests.get(url.strip())
            fname = url.replace('http://', '')
            fname = fname.replace('/', ' ')
            print(fname)
            cln = clean_me(page)
            with open(fname.rstrip() + '.txt', 'w') as outfile:
                outfile.write(cln + "\n")
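Beyond the trailing newline, Windows also rejects several other characters in filenames (e.g. `?`, `*`, `:`). If the URL list may contain such characters, a more defensive option is to sanitize the whole name; the `safe_name` helper below is a hypothetical sketch, not part of the original script:

```python
import re

def safe_name(url):
    # Hypothetical helper: strip whitespace and the URL scheme, then
    # collapse any run of characters other than alphanumerics, dots,
    # dashes, and underscores into a single space.
    name = re.sub(r'^https?://', '', url.strip())
    return re.sub(r'[^A-Za-z0-9._-]+', ' ', name).strip()
```

For example, `safe_name('http://a/b\n')` yields `'a b'`, which is safe to use as a filename on both Windows and Unix.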
Hope this helps.