How to convert CSS selected field into normal python string
My Scrapy project gives me a strange encoding for item fields when I use CSS selectors.
Here is the relevant code. Once the Scrapy request is made and the webpage is downloaded, parse_page is called with the response:
    def parse_page(self, response):
        # Using Selenium WebDriver to select elements
        records = self.driver.find_elements_by_css_selector('#searchResultsTable > tbody > tr')
        for record in records:
            # Convert selenium object into scrapy.Selector object (necessary to use .add_css methods)
            sel = Selector(text=record.get_attribute('outerHTML'))
            # Instantiate RecordLoader (a custom item loader)
            il = RecordLoader(item=Record(), selector=sel)
            # Select element and pass to example_field's input processor
            il.add_css('example_field', 'td:nth-child(2) > a::text')
il.add_css() passes the result of the CSS selector to example_field's input processor, which for demonstration purposes contains only print statements and shows the issue:
    def example_field_input_processor(text_html):
        print(text_html)
        print(type(text_html))
        print(text_html.encode('utf-8'))
Output:
'\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'
<class 'str'>
b'\xc2\xa0\xc2\xa004/29/2020 10:50:24 AM,\xc2\xa0\xc2\xa0\xc2\xa0'
Here are my questions:
1) Why didn't the CSS selector just give me a normal Python string? Does it have to do with the CSS selector casting to text with ::text? Is it because the webpage is in a different encoding? I checked whether there was a <meta> tag that specified the site's encoding, but there wasn't one.
2) When I force an encoding of 'utf-8', why do I get a bytes string that shows all the Unicode characters instead of a normal Python string?
3) My goal is to have just a normal Python string (no prepended b, no weird characters) that I can parse. How?
While scraping, you sometimes have to clean Unicode characters out of your results. They usually come from spaces, tabs, and sometimes span elements in the page markup. The \xa0 in the output above, for example, is a non-breaking space (&nbsp; in HTML).
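To see what these characters are, the following minimal sketch inspects the value from the question's output. It uses only the standard library; the variable names raw and encoded are illustrative and not part of the original code. It suggests that the selector result is already a normal str, that \xa0 is simply how Python displays a non-breaking space, and that .encode() is what produces the bytes object.

```python
import unicodedata

# Value copied from the question's output; '\xa0' is how repr() shows U+00A0.
raw = '\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'

print(type(raw).__name__)        # str: already a normal Python string
print(unicodedata.name(raw[0]))  # NO-BREAK SPACE
print(raw[0].isspace())          # True, so str.strip() will remove it

# str.encode() always returns bytes, which is where the b'...' prefix comes from.
encoded = raw.encode('utf-8')
print(type(encoded).__name__)          # bytes
print(encoded.decode('utf-8') == raw)  # True: decoding gives the same str back
```

In other words, the text is not in a strange encoding; it only contains whitespace characters that are not plain ASCII spaces.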
As a common practice, clean all the text you scrape:
    def string_cleaner(rouge_text):
        # Strip surrounding whitespace, then drop any remaining non-ASCII characters
        return "".join(rouge_text.strip()).encode('ascii', 'ignore').decode("utf-8")
Explanation:
First, strip() removes the leading and trailing whitespace, which includes non-breaking spaces (\xa0). That is this part of the code:
    "".join(rouge_text.strip())
The surrounding "".join(...) only rejoins the stripped string character by character, so it does not change the result.
Then encode the text to ascii, ignoring characters that cannot be encoded, and decode it back to a string to remove any remaining special characters. That is this part of the code:
    .encode('ascii', 'ignore').decode("utf-8")
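One caveat worth illustrating: encoding to ASCII with 'ignore' drops every non-ASCII character, including accented letters and non-breaking spaces between words. The sketch below compares the answer's function with a hypothetical alternative, clean_keep_unicode, that relies only on str.split() and str.join(); that helper and the sample string are assumptions added for illustration, not part of the original answer.

```python
def string_cleaner(rouge_text):
    # Same cleaning steps as the answer's function.
    return "".join(rouge_text.strip()).encode('ascii', 'ignore').decode("utf-8")


def clean_keep_unicode(text):
    # Hypothetical alternative: str.split() with no arguments splits on any
    # Unicode whitespace (including '\xa0'), so joining the pieces with a
    # single space normalizes whitespace and keeps all other characters.
    return " ".join(text.split())


# Made-up sample with an accented letter and an interior non-breaking space.
sample = '\xa0\xa0Café\xa0du Monde,\xa0'

print(string_cleaner(sample))      # Cafdu Monde,
print(clean_keep_unicode(sample))  # Café du Monde,
```

For purely ASCII data such as the timestamp in the question, both approaches give the same result; the difference only matters when the scraped text legitimately contains non-ASCII characters.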
How you would apply it in your code:
    print(string_cleaner(text_html))
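As a rough end-to-end sketch, here is the cleaner applied to the exact value from the question, followed by one possible way to parse the result. The date format string is an assumption inferred from the sample value (month/day/year with a 12-hour clock), and the variable names cleaned and parsed are illustrative.

```python
from datetime import datetime


def string_cleaner(rouge_text):
    # Same cleaning steps as the answer's function.
    return "".join(rouge_text.strip()).encode('ascii', 'ignore').decode("utf-8")


# Value copied from the question's output.
text_html = '\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'

cleaned = string_cleaner(text_html)
print(repr(cleaned))  # '04/29/2020 10:50:24 AM,'

# Assumed format; the trailing comma is removed before parsing.
parsed = datetime.strptime(cleaned.rstrip(','), '%m/%d/%Y %I:%M:%S %p')
print(parsed.isoformat())  # 2020-04-29T10:50:24
```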