
Strip \n \t \r in scrapy

I'm trying to strip \r, \n, and \t characters with a Scrapy spider, and then write the output to a JSON file.

I have a "description" field which is full of newlines, and it doesn't do what I want: match each description to a title.

I tried map(unicode.strip, ...) but it doesn't really work. Being new to Scrapy, I don't know whether there is a simpler way, or how map and unicode.strip actually behave.

This is my code:

def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())

I tried also with:

item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()

But it raised an error. What's the best way?

unicode.strip only deals with whitespace characters at the beginning and end of a string:

Return a copy of the string with the leading and trailing characters removed.

not with \n, \r, or \t in the middle.

You can either use a custom method to remove those characters inside the string (using the regular expression module; a short sketch of that follows the shell session below), or even use XPath's normalize-space()

returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

Example python shell session:

>>> text='''<html>
... <body>
... <div class="d-grid-main">
... <p class="class-name">
... 
...  This is some text,
...  with some newlines \r
...  and some \t tabs \t too;
... 
... <a href="http://example.com"> and a link too
...  </a>
... 
... I think we're done here
... 
... </p>
... </div>
... </body>
... </html>'''
>>> response = scrapy.Selector(text=text)
>>> response.xpath('//div[@class="d-grid-main"]')
[<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
>>> div = response.xpath('//div[@class="d-grid-main"]')[0]
>>> 
>>> # you'll want to use relative XPath expressions, starting with "./"
>>> div.xpath('.//p[@class="class-name"]/text()').extract()
[u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
 u"\n\nI think we're done here\n\n"]
>>> 
>>> # only leading and trailing whitespace is removed by strip()
>>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
[u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
>>> 
>>> # normalize-space() will get you a single string on the whole element
>>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
[u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
>>> 
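For the first option mentioned earlier (a custom cleanup method built on the regular expression module), a minimal sketch could look like the following; it is meant to drop into the parse() method from the question, the helper name clean_whitespace is made up for illustration, and the pattern simply collapses any run of whitespace (including \r, \n, and \t) into a single space:

import re

def clean_whitespace(text):
    # collapse any run of whitespace into one space, then trim the ends
    return re.sub(r'\s+', ' ', text).strip()

item['DESCRIPTION'] = [clean_whitespace(t) for t in sel.xpath('.//p[@class="class-name"]/text()').extract()]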

I'm a Python/Scrapy newbie and ran into a similar issue today. I solved it with the help of the module/function w3lib.html.replace_escape_chars. I created a default input processor for my item loader and it works without any issues. You can also bind this to a specific scrapy.Field(), and the good thing is that it works with CSS selectors and CSV feed exports:

from scrapy.loader.processors import MapCompose
from w3lib.html import replace_escape_chars

yourloader.default_input_processor = MapCompose(replace_escape_chars)
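Fleshed out a bit, an item loader set up this way might look like the following sketch; the item and loader class names (XItem, XLoader) and the XPath expressions are illustrative, not from the original question:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join
from w3lib.html import replace_escape_chars

class XItem(scrapy.Item):
    TITLE = scrapy.Field()
    DESCRIPTION = scrapy.Field()

class XLoader(ItemLoader):
    # every extracted string passes through replace_escape_chars,
    # which strips \n, \t and \r by default
    default_input_processor = MapCompose(replace_escape_chars)
    default_output_processor = Join(' ')

Inside the spider's parse() you would then do something like:

def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        loader = XLoader(item=XItem(), selector=sel)
        loader.add_xpath('DESCRIPTION', './/p[@class="class-name"]/text()')
        yield loader.load_item()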

As paul trmbrth suggests in his answer,

div.xpath('normalize-space(.//p[@class="class-name"])').extract()

is likely to be what you want. However, normalize-space() also condenses whitespace contained within the string into a single space. If you want only to remove \r, \n, and \t without disturbing the other whitespace, you can use translate() to remove those characters.

trans_table = {ord(c): None for c in u'\r\n\t'}
item['DESCRIPTION'] = ' '.join(s.translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())

This will still leave leading and trailing whitespace that is not in the set \r, \n, \t. If you also want to be rid of that, just insert a call to strip():

item['DESCRIPTION'] = ' '.join(s.strip().translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())
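A quick check in the shell of what translate() does with that table (Python 2 syntax to match the snippets above; the sample string is made up):

>>> trans_table = {ord(c): None for c in u'\r\n\t'}
>>> u' some \t text \r\n with newlines \n '.translate(trans_table)
u' some  text  with newlines  '
>>> u' some \t text \r\n with newlines \n '.strip().translate(trans_table)
u'some  text  with newlines'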

The simplest example, extracting the price from alibris.com, is

response.xpath('normalize-space(//td[@class="price"]//p)').get()

When I used Scrapy to crawl a web page, I ran into the same problem. I have two ways to solve it. The first uses substitution with the re module: response.xpath() returns a list, while the substitution only operates on individual strings, so I fetch each item of the list as a string with a for loop, remove '\n' and '\t' from each item, and then append the result to a new list.

import re
test_string =["\n\t\t", "\n\t\t\n\t\t\n\t\t\t\t\t", "\n", "\n", "\n", "\n", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic.", "\n", "What do you usually shop for?", "\n", "I usually shop for clothes. I\u2019m a big fashion fan.", "\n", "Where do you go shopping?", "\n", "At some fashion boutiques in my neighborhood.", "\n", "Are there many shops in your neighborhood?", "\n", "Yes. My area is the city center, so I have many choices of where to shop.", "\n", "Do you spend much money on shopping?", "\n", "Yes and I\u2019m usually broke at the end of the month.", "\n", "\n\n\n", "\n", "\t\t\t\t", "\n\t\t\t\n\t\t\t", "\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t"]
print(test_string)
# remove \t \n
a = re.compile(r'(\t)+')     
b = re.compile(r'(\n)+')
text = []
for n in test_string:
    n = a.sub('',n)
    n = b.sub('',n)
    text.append(n)
print(text)
# remove all ''
while '' in text:
    text.remove('')
print(text)

The second method uses map() and strip(). The map() function processes the list directly and keeps the original list format. unicode is used in Python 2 and was changed to str in Python 3, so the call looks like this:

text = list(map(str.strip, test_string))
print(text)

The strip function only removes \n, \t, and \r from the beginning and end of the string, not from the middle of the string; this is different from the removal approach in the first method.
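For example (Python 3 syntax, with made-up strings):

>>> '\n\t Do you like shopping? \n'.strip()
'Do you like shopping?'
>>> 'first line\nsecond line'.strip()
'first line\nsecond line'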

If you want to preserve the list instead of joining everything into one string, there is no need to add extra steps; you can simply call getall() instead of get():

response.xpath('normalize-space(.//td[@class="price"]/text())').getall()

Also, you should add text() at the end of the XPath expression.
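For illustration, a minimal shell sketch of the difference between get() and getall() (the price markup here is invented for the example):

>>> import scrapy
>>> html = '<table><tr><td class="price"><p> 9.99 \n</p></td><td class="price"><p>\t12.50 </p></td></tr></table>'
>>> sel = scrapy.Selector(text=html)
>>> sel.xpath('//td[@class="price"]//p/text()').get()
' 9.99 \n'
>>> sel.xpath('//td[@class="price"]//p/text()').getall()
[' 9.99 \n', '\t12.50 ']
>>> sel.xpath('normalize-space(//td[@class="price"]//p)').get()
'9.99'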

Hope it helps anybody!
