
Reading multiple urls from text file and processing web page

The input to the script is a text file containing multiple URLs of web pages. The intended steps in the script are as follows:

  • Read a URL from the text file
  • Strip the URL down so it can be used as the name of the output file (fname)
  • Clean the content of the URL / web page with the "clean_me" function
  • Write the content to the file (fname)
  • Repeat this for each URL in the input file

Here are the contents of the input file urloutshort.txt:

http://feedproxy.google.com/~r/autonews/ColumnistsAndBloggers/~3/6HV2TNAKqGk/diesel-with-no-nox-emissions-it-may-be-possible

http://feedproxy.google.com/~r/entire-site-rss/~3/3j3Hyq2TJt0/kyocera-corp-opens-its-largest-floating-solar-power-plant-in-japan.html

http://feedproxy.google.com/~r/entire-site-rss/~3/KRhGaT-UH_Y/crews-replace-rhode-island-pole-held-together-with-duct-tape.html

Here is the script:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname = (url.replace('http://', ' '))
        fname = fname.replace('/', ' ')
        print(fname)
        cln = clean_me(page)
        with open(fname + '.txt', 'w') as outfile:
            outfile.write(cln + "\n")

Here is the error message:

python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "webpage_A.py", line 43, in <module>
    with open (fname +'.txt', 'w') as outfile:                              
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk 
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'

The problem seems to be related to reading the URLs from the text file, because if I bypass reading the input file and just hard-code one of the URLs, the script processes the web page and saves the result to a txt file whose name is extracted from the URL. I have searched SO on this topic but have not found a solution.

Any help with this issue would be greatly appreciated.

The problem is in the following code:

    with open(fname + '.txt', 'w') as outfile:
        outfile.write(cln + "\n")

fname contains a trailing "\n", which makes it an invalid file name to open. All you need to do is change it to this:

    with open(fname.rstrip() + '.txt', 'w') as outfile:
        outfile.write(cln + "\n")
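The stray newline is easy to see in isolation: iterating over a text file yields each line with its trailing "\n" attached, and neither replace() call touches it. A minimal sketch (with a made-up URL, not one from the input file) of how the newline survives and why rstrip() removes it:

```python
# Each line read from a text file keeps its trailing newline, so a
# filename built from it ends in "\n" -- which open() rejects on Windows.
url = "http://example.com/page\n"  # as yielded by `for url in filein`

fname = url.replace('http://', '').replace('/', ' ')
print(repr(fname))           # newline is still present at the end

fname = fname.rstrip()       # strips trailing whitespace, including '\n'
print(repr(fname + '.txt'))  # now a valid file name
```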

The complete code with the fix:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import re
import html5lib

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)


with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        if "http" in url:
            page = requests.get(url.strip())
            fname = (url.replace('http://', ''))
            fname = fname.replace('/', ' ')
            print(fname)
            cln = clean_me(page)
            with open(fname.rstrip() + '.txt', 'w') as outfile:
                outfile.write(cln + "\n")
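As an optional hardening step (not part of the answer above), one can go further than rstrip() and strip every character that Windows rejects in file names. The helper below is a hypothetical sketch using re.sub; the name url_to_fname is made up for illustration:

```python
import re

def url_to_fname(url):
    # Drop the http/https scheme, then collapse any run of characters
    # that is not alphanumeric, dot, dash, or underscore into one space.
    name = re.sub(r'^https?://', '', url.strip())
    name = re.sub(r'[^A-Za-z0-9._-]+', ' ', name).strip()
    return name + '.txt'

print(url_to_fname('http://feedproxy.google.com/~r/autonews/some-post\n'))
# feedproxy.google.com r autonews some-post.txt
```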

Hope this helps.

