
Obtaining correct data using Regex/List

I am parsing the following code using a regex (not ideal I know, but that is a story for another day):

data:{
            url: 'stage-team-stat'
        },
        defaultParams: {
            stageId : 9155,
            field: 2,
            teamId: 32
        }
    };

This is being parsed using the following code (where var is the above code):

import re

stagematch = re.compile(r"data:\s*{\s*url:\s*'stage-team-stat'\s*},\s*defaultParams:\s*{\s*(.*?),.*},", re.S)

stagematch2 = re.search(stagematch, var)

if stagematch2 is not None:
    stagematch3 = stagematch2.group(1)

    stageid = int(stagematch3.split(':', 1)[1])
    stageid = str(stageid)

    teamid = int(stagematch3.split(':', 3)[1])
    teamid = str(teamid)

    print stageid
    print teamid

In this example I would expect stageid to be '9155' and teamid to be '32', however they are both coming back as '9155'.

Can anyone see what I am doing wrong?

Thanks
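The immediate bug is in the capture group: the lazy (.*?) stops at the first comma, so group(1) only ever contains stageId : 9155. Both split(':', ...) calls then operate on a string with a single colon, which is why stageid and teamid come back identical. A minimal sketch of that behavior, using the captured text directly:

captured = 'stageId : 9155'           # what group(1) actually contains

print captured.split(':', 1)          # ['stageId ', ' 9155']
print captured.split(':', 3)          # identical: there is only one colon
print int(captured.split(':', 3)[1])  # 9155 -- hence teamid == stageid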

An alternative solution would be not to dive into regexes, but to parse the JavaScript code with a JavaScript parser. Example using slimit:

SlimIt is a JavaScript minifier written in Python. It compiles JavaScript into more compact code so that it downloads and runs faster.

SlimIt also provides a library that includes a JavaScript parser, lexer, pretty printer and a tree visitor.

from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = """
var defaultTeamStatsConfigParams = {
        data:{
            url: 'stage-team-stat'
        },
        defaultParams: {
            stageId : 9155,
            field: 2,
            teamId: 32
        }
    };

    DataStore.prime('stage-team-stat', defaultTeamStatsConfigParams.defaultParams, [{"RegionId":252,"RegionCode":"gb-eng","TournamentName":"Premier League","TournamentId":2,"StageId":9155,"Field":{"Value":2,"DisplayName":"Overall"},"TeamName":"Manchester United","TeamId":32,"GamesPlayed":4,"Goals":6,"Yellow":7,"Red":0,"TotalPasses":2480,"Possession":247,"AccuratePasses":2167,"AerialWon":61,"AerialLost":49,"Rating":7.01,"DefensiveRating":7.01,"OffensiveRating":6.79,"ShotsConcededIBox":13,"ShotsConcededOBox":21,"TotalTackle":75,"Interceptions":71,"Fouls":54,"WasFouled":46,"TotalShots":49,"ShotsBlocked":9,"ShotsOnTarget":19,"Dribbles":44,"Offsides":3,"Corners":17,"Throws":73,"Dispossesed":36,"TotalClearance":78,"Turnover":0,"Ranking":0}]);

    var stageStatsConfig = {
        id: 'team-stage-stats',
        singular: true,
        filter: {
                instanceType: WS.Filter,
                id: 'team-stage-stats-filter',
                categories: { data: [{ value: 'field' }] },
                singular: true
        },
        params: defaultTeamStatsConfigParams,
        content: {
            instanceType: TeamStageStats,
            view: {
                renderTo: 'team-stage-stats-content'
            }
        }
    };

    var stageStats = new WS.Panel(stageStatsConfig);
    stageStats.load();
"""

parser = Parser()
tree = parser.parse(data)
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}

print fields['stageId'], fields['field'], fields['teamId']

Prints 9155 2 32.

Here we are iterating over the syntax tree nodes and constructing a dictionary from all assignments. Among them we have stageId, field and teamId.
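Note that slimit hands back the literal source tokens, so every value in fields is a string ('9155', not 9155). A small follow-up sketch, assuming the fields dict built above:

# slimit stores raw token text, so cast numeric values explicitly
stage_id = int(fields['stageId'])  # 9155 as an int
team_id = int(fields['teamId'])    # 32 as an int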


Here is how you can apply the solution to your scrapy spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


def get_fields(data):
    parser = Parser()
    tree = parser.parse(data)
    return {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
            for node in nodevisitor.visit(tree)
            if isinstance(node, ast.Assign)}


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United'),deny=('/News', '/Graphics', '/Articles', '/Live', '/Matches', '/Explanations', '/Glossary', 'ContactUs', 'TermsOfUse', 'Jobs', 'AboutUs', 'RSS'),), follow=False, callback='parse_item')]

    def parse_item(self, response):
        sel = Selector(response)
        titles = sel.xpath("normalize-space(//title)")
        myheader = titles.extract()[0]

        script = sel.xpath('//div[@id="team-stage-stats"]/following-sibling::script/text()').extract()[0]
        script_fields = get_fields(script)
        print script_fields['stageId'], script_fields['field'], script_fields['teamId']
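The XPath here assumes the config script is a following sibling of div#team-stage-stats. If the markup differs, one hypothetical fallback (not part of the original answer) is to scan every inline script on the page for the variable name before parsing:

# fallback inside parse_item (an assumption about the page markup):
# scan all inline scripts for the config variable
for script in sel.xpath('//script/text()').extract():
    if 'defaultTeamStatsConfigParams' in script:
        script_fields = get_fields(script)
        break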
