
Scrapy not accepting Japanese characters in spider

Here is part of the source code of the website I am trying to scrape.

<th>会社名</th>
<td colspan="2">
    <p class="realtorName">
        <ruby>株式会社エリア・エステート 川崎店</ruby>
    </p>
</td>

And this is just a test spider to see whether Scrapy is fetching any data:

# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["homes.co.jp"]
    start_urls = ['http://www.homes.co.jp/realtor/mid-122457hNYEJwIO7kDs/']

    def parse(self, response):
        yield {
            'FAX': response.xpath('//*[@id="anchor_realtorOutline"]/div[1]/table/tbody/tr/th[contains(text(), "FAX")]/following-sibling::td/text()').extract(),
            'Company_Name': response.xpath('//*[@id="anchor_realtorOutline"]/div[1]/table/tbody/tr/th[contains(text(), "会社名")]/following-sibling::td/p[1]/ruby/text()').extract(),
            'TEl': response.xpath('//*[@id="anchor_realtorOutline"]/div[1]/table/tbody/tr/th[contains(text(), "TEL")]/following-sibling::td/text()').extract(),
        }

The 'TEL' and 'FAX' fields return data, but Scrapy throws an error for the 'Company_Name' field.

Error:

All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters.

What I wanted to do was match that Japanese string and get the text from the sibling tag, as shown in the source code above.

Strangely, it ran yesterday and scraped data. Now it returns errors.

Do I need to do something to include the Japanese character set?

Try prefixing the string with u:

'Company_Name':response.xpath(u'//*[@id="anchor_realtorOutline"]/div[1]/table/tbody/tr/th[contains(text(), "会社名")]/following-sibling::td/p[1]/ruby/text()').extract(),
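To illustrate what the u prefix changes, here is a small sketch. As far as I know, an un-prefixed literal on Python 2 is a byte string, and the "All strings must be XML compatible" error is what lxml raises when handed non-ASCII bytes. On Python 3 every string literal is already Unicode, so the prefix is harmless there. The shortened XPath and the variable names are mine, for illustration only.

```python
# -*- coding: utf-8 -*-
# Unicode text: this is what the XPath engine expects to receive.
label = u"会社名"

# Shortened, illustrative XPath (not the full expression from the question).
xpath = u'//th[contains(text(), "%s")]/following-sibling::td/p[1]/ruby/text()' % label

# Roughly what an un-prefixed Python 2 literal would have held: raw UTF-8 bytes.
as_bytes = label.encode("utf-8")

print(len(label))     # 3 (characters)
print(len(as_bytes))  # 9 (bytes, three per character in UTF-8)
```

The three characters become nine non-ASCII bytes, which is the form the error message rejects.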

Your XPath doesn't work because of tbody. Remove it from the expression and check whether you get the result you want.

You can read about this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html

Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.
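A rough way to see the effect without running a crawl is sketched below. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector, so the path syntax is more limited than real XPath. The wrapping table and tr elements are my assumption about the page; only the th/td part comes from the question.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Markup as the server might send it: <tr> sits directly under <table>,
# with no <tbody>. The <table>/<tr> wrapper is assumed for illustration.
RAW = u"""
<table>
  <tr>
    <th>会社名</th>
    <td colspan="2">
      <p class="realtorName">
        <ruby>株式会社エリア・エステート 川崎店</ruby>
      </p>
    </td>
  </tr>
</table>
""".strip()

root = ET.fromstring(RAW)  # root is the <table> element

# A path that goes through <tbody> matches nothing, because the raw
# markup has no such element.
with_tbody = root.findall("tbody/tr/th")

# Dropping <tbody> finds the row whose <th> text is the Japanese label.
matches = root.findall(u"tr[th='会社名']/td/p/ruby")
name = matches[0].text.strip()

print(len(with_tbody))
print(name)
```

The same idea applies to the spider: drop /tbody from each expression (or use // to skip over it) so the path matches the HTML that is actually downloaded rather than the tree the browser shows.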
