Keep non-Latin characters when scraping page in python
I have a program that scrapes a page, parses it for any links, then downloads the pages linked to (sounds like a crawler, but it's not) and saves each one in a separate file. The file name used to save is part of the URL of the page. So for instance, if I find a link to www.foobar.com/foo, I would download the page and save it in a file entitled foo.xml.

Later, I need to loop through all such files and re-download them, using the file name as the last part of the URL. (All pages are from a single site.)
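The naming scheme described above can be sketched like this (Python 3 for illustration; the directory name is a placeholder, not from the original code):

```python
import os
from urllib.parse import urlsplit

def filename_for(url, directory="pages"):
    # Use the last path segment of the URL as the file name.
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    return os.path.join(directory, last_segment + ".xml")

print(filename_for("http://www.foobar.com/foo"))  # pages/foo.xml
```

Going the other way, stripping the directory and the ".xml" suffix from the file name recovers the last URL segment.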
It works well until I encounter a non-Latin character in a URL. The site uses UTF-8, so when I download the original page and decode it, it works fine. But when I try to use the decoded URL to download the corresponding page, it doesn't work, because, I assume, the encoding is wrong. I've tried using .encode() on the filename to change it back, but it doesn't change anything.

I know this must be very simple and a result of my not understanding encoding issues properly, but I've been cracking my head on it for a long time. I've read Joel Spolsky's introduction to encoding several times, but I still can't quite work out what to do here. Can anyone help me?

Thanks a lot, bsg
Here's some code. I don't get any errors; but when I try to download the page using the pagename as part of the URL, I get told that the page doesn't exist. Of course it doesn't - there's no such page as abc/x54.

To clarify: I download the html of a page which includes a link to, e.g., www.foobar.com/Mehmet Kenan Dalbaşar, but it shows up as Mehmet_Kenan_Dalba%C5%9Far. When I try to download the page www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far, the page is blank. How do I keep www.foobar.com/Mehmet Kenan Dalbaşar and return it to the site when I need to?
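What the page source contains is the percent-encoded form of the name: the UTF-8 bytes of each non-ASCII character escaped as %XX. In Python 3 terms (Python 2's urllib.quote/urllib.unquote behave analogously on byte strings), the round trip looks like this:

```python
from urllib.parse import quote, unquote

name = "Mehmet_Kenan_Dalbaşar"

# quote() encodes the text as UTF-8 and percent-escapes the non-ASCII
# bytes, giving exactly the form seen in the page source.
encoded = quote(name)
print(encoded)           # Mehmet_Kenan_Dalba%C5%9Far

# unquote() reverses it, decoding the %XX escapes back as UTF-8.
print(unquote(encoded))  # Mehmet_Kenan_Dalbaşar
```

So the `%C5%9F` in the link is not garbage: it is the two UTF-8 bytes of `ş`, and either form can be converted to the other losslessly.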
import os
import urllib
import urllib2

try:
    params = urllib.urlencode({'title': 'Foo', 'action': 'submit'})
    req = urllib2.Request(url='foobar.com', data=params, headers=headers)
    f = urllib2.urlopen(req)
    encoding = f.headers.getparam('charset')
    temp = f.read().decode(encoding)
    # lots of code to parse out the links
    for line in links:
        try:
            pagename = line
            pagename = pagename.replace('\n', '')
            print pagename
            newpagename = pagename.replace(':', '_')
            newpagename = newpagename.replace('/', '_')
            final = os.path.join(fullpath, newpagename)
            print final
            final = final.encode('utf-8')
            print final
            # only download the page if it hasn't already been downloaded
            if not os.path.exists(final + ".xml"):
                print "doesn't exist"
                save = open(final + ".xml", 'w')
                save.write(f.read())
                save.close()
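One way to sidestep the problem (a sketch, not the asker's exact code; the base URL and directory are placeholders, and Python 3's urllib.parse is used in place of Python 2's urllib) is to keep the percent-encoded form as the on-disk file name. It is pure ASCII, so it is always a safe file name, and it can be re-attached to the base URL unchanged when re-downloading:

```python
import os
from urllib.parse import quote, unquote

BASE = "http://www.foobar.com/"  # placeholder site
FULLPATH = "pages"               # placeholder directory

def filename_from_link(link):
    # Normalize first (in case the link was already escaped), then
    # re-escape everything, including ':' and '/', with safe="".
    encoded = quote(unquote(link), safe="")
    return os.path.join(FULLPATH, encoded + ".xml")

def url_from_filename(path):
    # Strip the directory and extension; the rest is the encoded name,
    # which goes back onto the URL as-is.
    encoded = os.path.basename(path)[:-len(".xml")]
    return BASE + encoded

path = filename_from_link("Mehmet_Kenan_Dalba%C5%9Far")
print(path)               # pages/Mehmet_Kenan_Dalba%C5%9Far.xml
print(url_from_filename(path))
```

Because the stored name is already the encoded form, no decode/encode happens between saving and re-requesting, which removes the place where the corruption was occurring.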
As you said, you can use requests instead of urllib. Let's say you get the url "www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far"; just pass it to requests as an argument (with the scheme included) as follows:

import requests
r = requests.get("http://www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far")

Now you can get the content using r.text.
If you have a url containing, e.g., the code '%C5' and you want it back as the actual byte \xC5, call urllib.unquote() on the url.
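In Python 3 the byte-level equivalent of Python 2's urllib.unquote is urllib.parse.unquote_to_bytes; urllib.parse.unquote additionally decodes the resulting bytes as UTF-8:

```python
from urllib.parse import unquote, unquote_to_bytes

# Byte-level unescaping: %C5 %9F become the raw bytes 0xC5 0x9F.
print(unquote_to_bytes("Dalba%C5%9Far"))  # b'Dalba\xc5\x9far'

# Text-level unescaping decodes those bytes as UTF-8 in one step.
print(unquote("Dalba%C5%9Far"))           # Dalbaşar
```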