
How to convert a CSS-selected field into a normal Python string

My Scrapy project is giving me a strange encoding for items when using CSS selectors.

Here is the relevant code:

Once the Scrapy request is made and the webpage is downloaded, parse_page is called with the response:

    def parse_page(self, response):

        # Using Selenium WebDriver to select elements
        records = self.driver.find_elements_by_css_selector('#searchResultsTable > tbody > tr')

        for record in records:

            # Convert selenium object into scrapy.Selector object (necessary to use .add_css methods)  
            sel = Selector(text=record.get_attribute('outerHTML'))

            # Instantiate RecordLoader (a custom item loader)
            il = RecordLoader(item=Record(), selector=sel)

            # Select element and pass to example_field's input processor
            il.add_css('example_field', 'td:nth-child(2) > a::text')

il.add_css() passes the result of the CSS selector to example_field's input processor, which for demonstration purposes contains only print statements that show the issue:

def example_field_input_processor(text_html):
    print(text_html)
    print(type(text_html))
    print(text_html.encode('utf-8'))

Output:

'\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'

<class 'str'>

b'\xc2\xa0\xc2\xa004/29/2020 10:50:24 AM,\xc2\xa0\xc2\xa0\xc2\xa0'

Here are my questions:

1) Why didn't the CSS selector just give me a normal Python string? Does it have to do with the CSS selector casting to text with ::text? Is it because the webpage is in a different encoding? I checked whether there was a <meta> tag that specified the site's encoding, but there wasn't one.

2) When I force an encoding of 'utf-8', why don't I get a normal Python string instead of a bytes string that shows all the Unicode characters?

3) My goal is to have just a normal Python string (no prepended b, no weird characters) that I can parse. How?

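While scraping, you sometimes have to clean unwanted Unicode characters out of your results.

They usually come from spaces, tabs, and sometimes span elements in the page markup. In this case the \xa0 characters are non-breaking spaces (written as &nbsp; in HTML), and the value you get back is already a normal Python str.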
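To see what the "weird characters" actually are, a quick inspection along these lines can help. This is a minimal sketch using only the standard library; the sample value is copied from the output in the question, and everything else is illustrative.

```python
import unicodedata

# Sample value copied from the output shown in the question.
text_html = '\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'

# The selector result is already a normal Python 3 str.
print(type(text_html))

# '\xa0' is how repr() displays U+00A0, the non-breaking space.
print(unicodedata.name(text_html[0]))

# Python treats U+00A0 as whitespace, so a plain strip() removes it.
print(text_html[0].isspace())
print(repr(text_html.strip()))

# encode() turns a str into bytes, which is where the b'...' prefix comes from.
print(type(text_html.encode('utf-8')))
```

So calling .encode('utf-8') moves away from a string rather than toward one: it produces bytes, and the two-byte sequence \xc2\xa0 is simply the UTF-8 encoding of the same non-breaking space.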

As a common practice, clean all the text you scrape:

def string_cleaner(rouge_text):
    # strip() trims leading/trailing whitespace, including non-breaking spaces (\xa0).
    # Encoding to ASCII with errors='ignore' then drops any remaining non-ASCII characters.
    return rouge_text.strip().encode('ascii', 'ignore').decode('ascii')

Explanation:

First, strip the whitespace from both ends of the string. Python counts the non-breaking space \xa0 as whitespace, so it is removed too.

This part of the code: rouge_text.strip()

Then encode the result to ASCII, ignoring anything that cannot be encoded, and decode it back to a str to remove any remaining special characters.

This part of the code: .encode('ascii', 'ignore').decode('ascii')
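One caveat with the ASCII round-trip is that it also drops legitimate non-ASCII characters, such as accented letters. If the scraped text may contain those, a variant like the following could be used instead. It is only a sketch: clean_keep_unicode is a hypothetical helper name, and it relies on Unicode NFKC normalization, which maps the non-breaking space to a regular space but also rewrites other compatibility characters (ligatures, full-width forms), so it may not suit every field.

```python
import unicodedata

def clean_keep_unicode(text):
    """Hypothetical helper: tidy whitespace without discarding non-ASCII text."""
    # NFKC replaces compatibility characters such as U+00A0 (non-breaking space)
    # with their plain equivalents, here an ordinary space.
    normalized = unicodedata.normalize('NFKC', text)
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace; join() puts single spaces back.
    return ' '.join(normalized.split())

print(clean_keep_unicode('\xa0\xa0Café\xa0au lait\xa0'))
print('Café'.encode('ascii', 'ignore').decode('ascii'))
```

The second print shows the difference: the ASCII approach silently loses the accented letter, while the normalization approach keeps it.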

How you would apply it in your code:

print(string_cleaner(text_html))
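Since the stated goal is a string that can be parsed, and the sample value looks like a timestamp, the cleaned text could then be handed to datetime.strptime. The sketch below is an assumption-laden example: parse_timestamp is a hypothetical helper, the format string is inferred from the single sample in the question (month/day/year with a 12-hour clock), and %p depends on the locale recognizing "AM"/"PM".

```python
from datetime import datetime

def parse_timestamp(raw):
    """Hypothetical helper: clean a scraped cell and parse it as a datetime."""
    # Remove surrounding whitespace (including \xa0), then the trailing comma.
    cleaned = raw.strip().rstrip(',')
    # Format inferred from the sample value '04/29/2020 10:50:24 AM'.
    return datetime.strptime(cleaned, '%m/%d/%Y %I:%M:%S %p')

ts = parse_timestamp('\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0')
print(ts.isoformat())
```

A function like this could also serve as the body of example_field's input processor, so that the item stores a parsed value rather than raw text.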
