Keep non-Latin characters when scraping page in python
I have a program that scrapes a page, parses it for any links, then downloads the pages linked to (sounds like a crawler, but it's not) and saves each one in a separate file. The file name used to save is part of the URL of the page. So for instance, if I find a link to www.foobar.com/foo, I would download the page and save it in a file entitled foo.xml.

Later, I need to loop through all such files and re-download them, using the file name as the last part of the URL. (All pages are from a single site.)
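The naming scheme described above can be sketched like this (Python 3 for illustration; the directory name is a placeholder, not from the original code):

```python
import os
from urllib.parse import urlsplit

def filename_for(url, directory="pages"):
    # Use the last path segment of the URL as the file name.
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    return os.path.join(directory, last_segment + ".xml")

print(filename_for("http://www.foobar.com/foo"))  # pages/foo.xml
```

Going the other way, stripping the directory and the ".xml" suffix from the file name recovers the last URL segment.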
It works well until I encounter a non-Latin character in a URL. The site uses UTF-8, so when I download the original page and decode it, it works fine. But when I try to use the decoded URL to download the corresponding page, it doesn't work, because, I assume, the encoding is wrong. I've tried using .encode() on the filename to change it back, but it doesn't change anything.

I know this must be very simple and a result of my not understanding encoding issues properly, but I've been cracking my head on it for a long time. I've read Joel Spolsky's introduction to encoding several times, but I still can't quite work out what to do here. Can anyone help me?

Thanks a lot, bsg
Here's some code. I don't get any errors; but when I try to download the page using the pagename as part of the URL, I get told that the page doesn't exist. Of course it doesn't - there's no such page as abc/x54.

To clarify: I download the html of a page which includes a link to, e.g., www.foobar.com/Mehmet Kenan Dalbaşar, but it shows up as Mehmet_Kenan_Dalba%C5%9Far. When I try to download the page www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far, the page is blank. How do I keep www.foobar.com/Mehmet Kenan Dalbaşar and return it to the site when I need to?
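What the page source contains is the percent-encoded form of the name: the UTF-8 bytes of each non-ASCII character escaped as %XX. In Python 3 terms (Python 2's urllib.quote/urllib.unquote behave analogously on byte strings), the round trip looks like this:

```python
from urllib.parse import quote, unquote

name = "Mehmet_Kenan_Dalbaşar"

# quote() encodes the text as UTF-8 and percent-escapes the non-ASCII
# bytes, giving exactly the form seen in the page source.
encoded = quote(name)
print(encoded)           # Mehmet_Kenan_Dalba%C5%9Far

# unquote() reverses it, decoding the %XX escapes back as UTF-8.
print(unquote(encoded))  # Mehmet_Kenan_Dalbaşar
```

So the `%C5%9F` in the link is not garbage: it is the two UTF-8 bytes of `ş`, and either form can be converted to the other losslessly.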
import os
import urllib
import urllib2

try:
    params = urllib.urlencode({'title': 'Foo', 'action': 'submit'})
    req = urllib2.Request(url='foobar.com', data=params, headers=headers)
    f = urllib2.urlopen(req)
    encoding = f.headers.getparam('charset')
    temp = f.read().decode(encoding)
    # lots of code to parse out the links
    for line in links:
        try:
            pagename = line
            pagename = pagename.replace('\n', '')
            print pagename
            newpagename = pagename.replace(':', '_')
            newpagename = newpagename.replace('/', '_')
            final = os.path.join(fullpath, newpagename)
            print final
            final = final.encode('utf-8')
            print final
            # only download the page if it hasn't already been downloaded
            if not os.path.exists(final + ".xml"):
                print "doesn't exist"
                save = open(final + ".xml", 'w')
                save.write(f.read())
                save.close()
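One way to sidestep the problem (a sketch, not the asker's exact code; the base URL and directory are placeholders, and Python 3's urllib.parse is used in place of Python 2's urllib) is to keep the percent-encoded form as the on-disk file name. It is pure ASCII, so it is always a safe file name, and it can be re-attached to the base URL unchanged when re-downloading:

```python
import os
from urllib.parse import quote, unquote

BASE = "http://www.foobar.com/"  # placeholder site
FULLPATH = "pages"               # placeholder directory

def filename_from_link(link):
    # Normalize first (in case the link was already escaped), then
    # re-escape everything, including ':' and '/', with safe="".
    encoded = quote(unquote(link), safe="")
    return os.path.join(FULLPATH, encoded + ".xml")

def url_from_filename(path):
    # Strip the directory and extension; the rest is the encoded name,
    # which goes back onto the URL as-is.
    encoded = os.path.basename(path)[:-len(".xml")]
    return BASE + encoded

path = filename_from_link("Mehmet_Kenan_Dalba%C5%9Far")
print(path)               # pages/Mehmet_Kenan_Dalba%C5%9Far.xml
print(url_from_filename(path))
```

Because the stored name is already the encoded form, no decode/encode happens between saving and re-requesting, which removes the place where the corruption was occurring.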
As you said, you can use requests instead of urllib. Let's say you get the url "www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far"; just pass it to requests as an argument (with the scheme included) as follows:

import requests
r = requests.get("http://www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far")

Now you can get the content using r.text.
If you have a url containing, e.g., the code '%C5' and you want it back as the actual byte \xC5, call urllib.unquote() on the url.
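In Python 3 the byte-level equivalent of Python 2's urllib.unquote is urllib.parse.unquote_to_bytes; urllib.parse.unquote additionally decodes the resulting bytes as UTF-8:

```python
from urllib.parse import unquote, unquote_to_bytes

# Byte-level unescaping: %C5 %9F become the raw bytes 0xC5 0x9F.
print(unquote_to_bytes("Dalba%C5%9Far"))  # b'Dalba\xc5\x9far'

# Text-level unescaping decodes those bytes as UTF-8 in one step.
print(unquote("Dalba%C5%9Far"))           # Dalbaşar
```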