
Using scrapy to extract and structure table data

I'm new to Python and Scrapy and thought I'd try out a simple review site to scrape. While most of the site structure is straightforward, I'm having trouble extracting the content of the reviews. This portion is visually laid out in sets of 3 (the text to the right of the 良 (good), 悪 (bad), and 感 (impressions) fields), but I'm having trouble pulling this content and associating it with a reviewer or a section of a review due to the use of generic divs, <br> tags, \n newlines, and other formatting.

Any help would be appreciated.

Here's the site, and the code I've tried for grabbing the reviews, with some results: http://www.psmk2.net/ps2/soft_06/rpg/p3_log1.html

(1):

response.xpath('//tr//td[@valign="top"]//text()').getall()

This returns the entire set of reviews, but it contains newline markup and, more problematically, it returns each line of text as a separate entry. Because of this, I can't tell where the good, bad, and impressions portions end, nor can I easily parse each review on its own, as entry lengths vary.

['\n弱点をついた時のメリット、つかれたときのデメリットがはっきりしてて良い', '\nコミュをあげるのが楽しい', '\n仲間が多くて誰を連れてくか迷う', '\n難易度はやさしめなので遊びやすい', '\nタルタロスしかダンジョンが無くて飽きる。'........and so forth
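For what it's worth, the stray newlines from (1) can be tamed by stripping and filtering the extracted strings; a minimal sketch of that cleanup (it still doesn't solve the attribution problem):

    # Strip whitespace and drop the entries that are only newlines.
    lines = response.xpath('//tr//td[@valign="top"]//text()').getall()
    cleaned = [t.strip() for t in lines if t.strip()]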

(2) As an alternative, I tried:

response.xpath('//tr//td[@valign="top"]')[0].get()

This actually comes close to what I'd like, save for the markup. It seems to return the entire field of a review section, and every third element should be the "good" points of a separate review (I've replaced the <> with () to show the raw return).

(td valign="top")\n精一杯考えました(br)\n(br)\n戦闘が面白いですね\n主人公だけですが・・・・(br)\n従来のプレスターンバトルの進化なので(br)\n(br)\n以上です(/td)
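One way to keep approach (2) but lose the markup: XPath's string() function returns the concatenated text content of a node, so each td comes back as a single string. A sketch, assuming the same selector as above (the sections name is my own):

    # string(.) concatenates all text descendants of each td, dropping the tags.
    sections = [
        sel.xpath('string(.)').get()
        for sel in response.xpath('//tr//td[@valign="top"]')
    ]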

(3) Figuring I might be able to get just the text, I then tried:

response.xpath('//tr//td[@valign="top"]//text()')[0].get()

But that only provides one line at a time, with the \n at the front. As with (1), a line-by-line rendering makes it difficult to attribute reviews to reviewers and to the appropriate section of each review.

Of these, (2) seems the closest to what I want, and I was hoping to get some direction on how to grab each section of each review without the markup. Since these sections come in sets of 3, putting them in a list would make pulling them apart easier later (i.e. all "good" sections at indices 0, 3, 6, ...; all "bad" ones at 1, 4, 7, ...; and so on, as sketched below)... but first I need to actually get the elements.
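A minimal sketch of that sets-of-3 idea, assuming a sections list holding one cleaned string per td in document order (as built in the sketch above):

    # Extended slicing splits the flat list into the three repeating fields.
    goods = sections[0::3]        # indices 0, 3, 6, ...
    bads = sections[1::3]         # indices 1, 4, 7, ...
    impressions = sections[2::3]  # indices 2, 5, 8, ...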

I've thought about, and tried, iterating over each entry with a counter, something like:

sections = response.xpath('//tr//td[@valign="top"]')
i = 0
while i < len(sections):                  # loop, rather than a single "if"
    yield {'section': sections[i].get()}  # yield a dict, not a bare set
    i += 1                                # advance the counter

to pull these out, but I'm a bit lost on how to implement something like this and where it should go. I've briefly looked at Item Loaders, but as I'm new to this, I'm still trying to figure them out.
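For reference, the counter loop above can be written as a plain for loop over the selectors, which is the more idiomatic shape inside a Scrapy callback (a sketch; the 'section' key is my own naming):

    # Iterate over the td selectors directly instead of indexing with a counter.
    for td in response.xpath('//tr//td[@valign="top"]'):
        yield {'section': td.xpath('string(.)').get()}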

Here's the block where the review code is.

    def parse(self, response):
        for table in response.xpath('body'):
            yield {
                # code for other elements in the review
                'date': response.xpath('//td//div[@align="left"]//text()').getall(),
                'name': response.xpath('//td//div[@align="right"]//text()').getall(),

                # these include the above elements, and are regular enough that I
                # can systematically extract what I want
                'categories': response.xpath('//tr//td[@class="koumoku"]//text()').getall(),
                'scores': response.xpath('//tr//td[@class="tokuten_k"]//text()').getall(),
                'play_time': response.xpath('//td[@align="right"]//span[@id="setumei"]//text()').getall(),
                # reviews code here
            }

This is a pretty simple task using part of the text as an anchor (I used the XPath string() function to get the text content of a whole td):

for review_node in response.xpath('//table[@width="645"]'):
    good = review_node.xpath('string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])').get()
    bad = review_node.xpath('string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])').get()
...............
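Expanding that answer into a fuller, runnable sketch (the 感 anchor for impressions, the default-empty get(''), and the strip() calls are my additions, not part of the original answer):

    def parse(self, response):
        # One table per review; the bold 良/悪/感 labels anchor each section.
        for review_node in response.xpath('//table[@width="645"]'):
            yield {
                'good': review_node.xpath(
                    'string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])'
                ).get('').strip(),
                'bad': review_node.xpath(
                    'string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])'
                ).get('').strip(),
                'impressions': review_node.xpath(
                    'string(.//td[b[starts-with(., "感")]]/following-sibling::td[1])'
                ).get('').strip(),
            }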
