
Reading multiple urls from text file and processing web page

The input to the script is a text file containing multiple URLs of web pages. The intended steps in the script are as follows:

  • Read a URL from the text file
  • Strip the URL down so it can be used as the name of the output file (fname)
  • Clean the content of the URL / web page with the "clean_me" function
  • Write the content to the file (fname)
  • Repeat this for each URL in the input file

Here are the contents of the input file urloutshort.txt:

http://feedproxy.google.com/~r/autonews/ColumnistsAndBloggers/~3/6HV2TNAKqGk/diesel-with-no-nox-emissions-it-may-be-possible

http://feedproxy.google.com/~r/entire-site-rss/~3/3j3Hyq2TJt0/kyocera-corp-opens-its-largest-floating-solar-power-plant-in-japan.html

http://feedproxy.google.com/~r/entire-site-rss/~3/KRhGaT-UH_Y/crews-replace-rhode-island-pole-held-together-with-duct-tape.html

Here is the script:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname = (url.replace('http://', ' '))
        fname = fname.replace('/', ' ')
        print(fname)
        cln = clean_me(page)
        with open(fname + '.txt', 'w') as outfile:
            outfile.write(cln + "\n")

Here is the error message:

python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "webpage_A.py", line 43, in <module>
    with open (fname +'.txt', 'w') as outfile:                              
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk 
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'

The problem seems to be related to reading the URLs from the text file, because if I bypass reading the input file and just hard-code one of the URLs, the script processes the web page and saves the result to a txt file whose name is extracted from the URL. I have searched SO on this topic but have not found a solution.

Any help with this issue would be greatly appreciated.

The problem is in the following code:

    with open(fname + '.txt', 'w') as outfile:
        outfile.write(cln + "\n")

fname contains a trailing "\n", which makes it an invalid file name to open. All you need to do is change it to this:

    with open(fname.rstrip() + '.txt', 'w') as outfile:
        outfile.write(cln + "\n")
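The stray newline is easy to see in isolation: iterating over a text file yields each line with its trailing "\n" attached, and neither replace() call touches it. A minimal sketch (with a made-up URL, not one from the input file) of how the newline survives and why rstrip() removes it:

```python
# Each line read from a text file keeps its trailing newline, so a
# filename built from it ends in "\n" -- which open() rejects on Windows.
url = "http://example.com/page\n"  # as yielded by `for url in filein`

fname = url.replace('http://', '').replace('/', ' ')
print(repr(fname))           # newline is still present at the end

fname = fname.rstrip()       # strips trailing whitespace, including '\n'
print(repr(fname + '.txt'))  # now a valid file name
```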

The complete code with the fix:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import re
import html5lib

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)


with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        if "http" in url:
            page = requests.get(url.strip())
            fname = (url.replace('http://', ''))
            fname = fname.replace('/', ' ')
            print(fname)
            cln = clean_me(page)
            with open(fname.rstrip() + '.txt', 'w') as outfile:
                outfile.write(cln + "\n")
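As an optional hardening step (not part of the answer above), one can go further than rstrip() and strip every character that Windows rejects in file names. The helper below is a hypothetical sketch using re.sub; the name url_to_fname is made up for illustration:

```python
import re

def url_to_fname(url):
    # Drop the http/https scheme, then collapse any run of characters
    # that is not alphanumeric, dot, dash, or underscore into one space.
    name = re.sub(r'^https?://', '', url.strip())
    name = re.sub(r'[^A-Za-z0-9._-]+', ' ', name).strip()
    return name + '.txt'

print(url_to_fname('http://feedproxy.google.com/~r/autonews/some-post\n'))
# feedproxy.google.com r autonews some-post.txt
```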

Hope this helps.

