Strip \n \t \r in Scrapy

Question

I'm trying to strip \r, \n and \t characters in a Scrapy spider and then write the result to a JSON file.

I have a "description" field that is full of newlines, so the output doesn't do what I want: match each description to a title.

I tried map(unicode.strip, ...), but it doesn't really work. Being new to Scrapy, I don't know whether there's a simpler way, or how map with unicode.strip really works.

This is my code:

def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())

I also tried:

item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()

But it raised an error. What's the best way?

Answer 1

unicode.strip only deals with whitespace characters at the beginning and end of strings:

Return a copy of the string with the leading and trailing characters removed.

It does not touch \n, \r, or \t in the middle.

You can either use a custom method to remove those characters inside the string (using the regular expression module), or use XPath's normalize-space(), which

returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

Example Python shell session:

>>> text='''<html>
... <body>
... <div class="d-grid-main">
... <p class="class-name">
... 
...  This is some text,
...  with some newlines \r
...  and some \t tabs \t too;
... 
... <a href="http://example.com"> and a link too
...  </a>
... 
... I think we're done here
... 
... </p>
... </div>
... </body>
... </html>'''
>>> response = scrapy.Selector(text=text)
>>> response.xpath('//div[@class="d-grid-main"]')
[<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
>>> div = response.xpath('//div[@class="d-grid-main"]')[0]
>>> 
>>> # you'll want to use relative XPath expressions, starting with "./"
>>> div.xpath('.//p[@class="class-name"]/text()').extract()
[u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
 u"\n\nI think we're done here\n\n"]
>>> 
>>> # only leading and trailing whitespace is removed by strip()
>>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
[u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
>>> 
>>> # normalize-space() will get you a single string on the whole element
>>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
[u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
>>>
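
For strings that have already been extracted, roughly the same normalization can be done in plain Python. The sketch below is an approximation, not what Scrapy does internally: normalize_space is a hypothetical helper name, and it treats only space, tab, CR and LF as whitespace, as XPath 1.0 does.

```python
import re

# XPath 1.0 whitespace: space, tab, carriage return, line feed.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Hypothetical helper approximating XPath's normalize-space() for one string."""
    # Collapse every whitespace run to one space, then trim both ends.
    return _XPATH_WS.sub(" ", text).strip(" ")


raw = "\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n"
cleaned = normalize_space(raw)
print(cleaned)
```

Unlike strip(), this also collapses the whitespace in the middle of the string, so it changes spacing that may have been intentional.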

Answer 2

I'm a Python and Scrapy newbie. I had a similar issue today and solved it with the help of the function w3lib.html.replace_escape_chars. I created a default input processor for my item loader and it works without any issues. You can also bind it to a specific scrapy.Field(). The good thing is that it works with CSS selectors and CSV feed exports:

from itemloaders.processors import MapCompose  # older Scrapy: scrapy.loader.processors
from w3lib.html import replace_escape_chars

yourloader.default_input_processor = MapCompose(replace_escape_chars)
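
One thing to watch for: as far as I know, replace_escape_chars replaces \n, \t and \r with an empty string by default, so words separated only by a newline end up glued together, and passing replace_by=' ' avoids that. The plain-Python stand-in below illustrates the difference without requiring w3lib. strip_escape_chars is a hypothetical name and is not w3lib's implementation.

```python
def strip_escape_chars(text, which_ones=("\n", "\t", "\r"), replace_by=""):
    """Hypothetical stand-in: replace each listed character with replace_by."""
    for char in which_ones:
        text = text.replace(char, replace_by)
    return text


sample = "first line\nsecond\tline\r\n"

# Default behaviour: characters are deleted, so neighbouring words are joined.
glued = strip_escape_chars(sample)

# Replacing with a space keeps the words apart; strip() trims the ends.
spaced = strip_escape_chars(sample, replace_by=" ").strip()

print(repr(glued))
print(repr(spaced))
```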

Answer 3

As paul trmbrth suggests in his answer,

div.xpath('normalize-space(.//p[@class="class-name"])').extract()

is likely to be what you want. However, normalize-space also condenses whitespace contained within the string into a single space. If you only want to remove \r, \n, and \t without disturbing the other whitespace, you can use translate() to remove those characters.

trans_table = {ord(c): None for c in u'\r\n\t'}
item['DESCRIPTION'] = ' '.join(s.translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())

This will still leave leading and trailing whitespace that is not in the set \r, \n, or \t. If you also want to be rid of that, just insert a call to strip():

item['DESCRIPTION'] = ' '.join(s.strip().translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())
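
On Python 3 the same translation table works with str.translate. A small sketch with made-up sample strings (standing in for the result of .extract()), showing that inner spaces survive while \r, \n and \t are dropped:

```python
# Map each unwanted character's code point to None so translate() deletes it.
trans_table = {ord(c): None for c in "\r\n\t"}

# Made-up strings standing in for the list returned by .extract().
parts = ["\n  two  spaces kept\t\r\n", "\tsecond  part\n"]

# Without strip(): leading spaces of the first string are preserved.
joined = " ".join(s.translate(trans_table) for s in parts)

# With strip(): leading and trailing whitespace is removed as well.
stripped = " ".join(s.strip().translate(trans_table) for s in parts)

print(repr(joined))
print(repr(stripped))
```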

Answer 4

The simplest example, extracting a price from alibris.com, is:

response.xpath('normalize-space(//td[@class="price"]//p)').get()

Answer 5

When I used Scrapy to crawl a web page, I ran into the same problem. I have two ways to solve it. The first is to replace the characters (here with re.sub). response.xpath(...) gives back a list, but the replacement works on a single string, so I loop over the list, remove the '\n' and '\t' characters from each item, and then append the result to a new list.

import re
test_string =["\n\t\t", "\n\t\t\n\t\t\n\t\t\t\t\t", "\n", "\n", "\n", "\n", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic.", "\n", "What do you usually shop for?", "\n", "I usually shop for clothes. I\u2019m a big fashion fan.", "\n", "Where do you go shopping?", "\n", "At some fashion boutiques in my neighborhood.", "\n", "Are there many shops in your neighborhood?", "\n", "Yes. My area is the city center, so I have many choices of where to shop.", "\n", "Do you spend much money on shopping?", "\n", "Yes and I\u2019m usually broke at the end of the month.", "\n", "\n\n\n", "\n", "\t\t\t\t", "\n\t\t\t\n\t\t\t", "\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t"]
print(test_string)
# remove \t and \n
a = re.compile(r'(\t)+')     
b = re.compile(r'(\n)+')
text = []
for n in test_string:
    n = a.sub('',n)
    n = b.sub('',n)
    text.append(n)
print(text)
# remove all empty strings
while '' in text:
    text.remove('')
print(text)

The second method uses map() and strip(). map() processes the list directly and keeps its original shape. Use unicode.strip in Python 2 and str.strip in Python 3, as follows:

text = list(map(str.strip, test_string))
print(text)

strip() only removes \n, \t and \r from the beginning and end of a string, not from the middle, unlike the replacement approach in the first method.
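
Both steps of the first method (removing the characters, then dropping the empty strings) can be combined into one pass. A sketch with a shortened, made-up sample list; clean_fragments is a hypothetical helper name:

```python
import re

# One pattern for runs of tabs, newlines and carriage returns.
_ESCAPES = re.compile(r"[\t\n\r]+")


def clean_fragments(fragments):
    """Remove tabs/newlines from each string and drop strings left empty."""
    cleaned = (_ESCAPES.sub("", fragment) for fragment in fragments)
    return [fragment for fragment in cleaned if fragment]


sample = ["\n\t\t", "Do you like shopping?", "\n", "Yes, I'm a\n shopaholic.", "\t\t\t\t"]
result = clean_fragments(sample)
print(result)
```

This avoids the repeated list.remove('') calls, each of which rescans the list from the start.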

Answer 6

If you want to keep a list instead of a single joined string, there is no need to add extra steps: simply call getall() instead of get():

response.xpath('normalize-space(.//td[@class="price"]/text())').getall()

Also, you should add text() at the end.

Hope it helps somebody!

6 answers

Answer 1
22 accepted 2016-02-09 09:54:41

Answer 2
7 2017-09-23 20:30:00

Answer 3
3 2016-02-09 10:16:56

Answer 4
1 2020-02-09 17:39:16

Answer 5
0 2020-03-10 06:50:59

Answer 6
0 2020-11-25 05:00:18
