
Keep non-Latin characters when scraping a page in Python

I have a program that scrapes a page, parses it for any links, then downloads the pages linked to (sounds like a crawler, but it's not) and saves each one in a separate file. The file name used to save is part of the url of the page. So for instance, if I find a link to www.foobar.com/foo, I would download the page and save it in a file named foo.xml.

Later, I need to loop through all such files and re-download them, using the file name as the last part of the url. (All pages are from a single site.)

It works well until I encounter a non-Latin character in a url. The site uses utf-8, so when I download the original page and decode it, it works fine. But when I try to use the decoded url to download the corresponding page, it doesn't work, because, I assume, the encoding is wrong. I've tried using .encode() on the filename to change it back, but it doesn't change anything.

I know this must be very simple and a result of my not understanding encoding issues properly, but I've been racking my brain over it for a long time. I've read Joel Spolsky's introduction to encoding several times, but I still can't quite work out what to do here. Can anyone help me?

Thanks a lot, bsg

Here's some code. I don't get any errors, but when I try to download the page using the pagename as part of the url, I get told that that page doesn't exist. Of course it doesn't - there's no such page as abc/x54.

To clarify: I download the html of a page which includes a link to, e.g., www.foobar.com/Mehmet Kenan Dalbaşar, but it shows up as Mehmet_Kenan_Dalba%C5%9Far. When I try to download the page www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far, the page is blank. How do I keep www.foobar.com/Mehmet Kenan Dalbaşar and return it to the site when I need to?

import os
import urllib
import urllib2

# headers, links and fullpath are defined elsewhere in the program
try:
    params = urllib.urlencode({'title': 'Foo', 'action': 'submit'})
    req = urllib2.Request(url='foobar.com', data=params, headers=headers)
    f = urllib2.urlopen(req)

    # charset declared in the Content-Type response header
    encoding = f.headers.getparam('charset')

    temp = f.read().decode(encoding)

    # lots of code to parse out the links

    for line in links:
        try:
            pagename = line
            pagename = pagename.replace('\n', '')
            print pagename

            # ':' and '/' are replaced so the name can be used as a file name
            newpagename = pagename.replace(':', '_')
            newpagename = newpagename.replace('/', '_')
            final = os.path.join(fullpath, newpagename)
            print final
            final = final.encode('utf-8')
            print final

            # only download the page if it hasn't already been downloaded
            if not os.path.exists(final + ".xml"):
                print "doesn't exist"
                save = open(final + ".xml", 'w')
                save.write(f.read())
                save.close()
        # (exception handlers not shown)

As you said, you can use requests instead of urllib.

Let's say you get the url "www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far". Just pass it to requests as an argument, as follows:

import requests

# requests needs an explicit scheme such as http:// in the url
r = requests.get("http://www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far")

Now you can get the content using r.text.
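Once the decoded text is in hand (r.text above), it can be written to disk with an explicit encoding so the non-Latin characters survive the save. The following is only a sketch in Python 3, not the asker's code: `save_page` is a hypothetical helper, and the sample page text is made up for illustration.

```python
import os
import tempfile


def save_page(directory, name, text):
    """Write already-decoded page text to <directory>/<name>.xml as UTF-8.

    `name` is assumed to be the percent-encoded last part of the url,
    which is plain ASCII and therefore safe to use as a file name.
    """
    path = os.path.join(directory, name + ".xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return path


if __name__ == "__main__":
    # Demonstration with a temporary directory and made-up page content.
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_page(tmp, "Mehmet_Kenan_Dalba%C5%9Far", "<page>Dalbaşar</page>")
        with open(saved, encoding="utf-8") as fh:
            print(fh.read())
```

Passing the encoding explicitly avoids depending on the platform default, which may not be UTF-8.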

If you have a url containing, for example, the code '%C5' and want to get it with the actual character \xC5, call urllib.unquote() on the url.
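To make the round trip concrete, here is a small sketch in Python 3, where urllib.unquote() lives at urllib.parse.unquote(). It decodes the percent-encoded name from the question, derives a reversible file name from it, and rebuilds the url later. `BASE_URL`, `link_to_filename` and `filename_to_url` are hypothetical names chosen for illustration, and the sketch assumes the site encodes its urls as UTF-8, as the question states.

```python
from urllib.parse import quote, unquote

BASE_URL = "http://www.foobar.com/"  # assumed site root, for illustration only


def link_to_filename(link_part):
    """Turn the last part of a link into a reversible, ASCII-only file name.

    Quoting with safe="" also escapes '/' and ':', so nothing is lost the
    way it is when those characters are replaced with '_'.
    """
    title = unquote(link_part)  # percent escapes -> real characters (UTF-8)
    return quote(title, safe="") + ".xml"


def filename_to_url(filename):
    """Rebuild the page url from a file name made by link_to_filename()."""
    title = unquote(filename[:-len(".xml")])
    return BASE_URL + quote(title, safe="/:")


if __name__ == "__main__":
    name = "Mehmet_Kenan_Dalba%C5%9Far"
    print(unquote(name))                            # human-readable title
    print(link_to_filename(name))                   # file name on disk
    print(filename_to_url(link_to_filename(name)))  # url to request again
```

The file name stays pure ASCII, so no .encode() call is needed on it, and the url sent back to the site is the same percent-encoded form that appeared in the original link.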

