Python Regular Expression Extract Lookahead

I have been trying to extract transport node names and location coordinate strings from a scraped webpage (which I have permission to scrape). The names and locations are in CDATA blocks of JavaScript. See here: http://pastebin.com/6Vtup2dE

Using this regular expression in Python:

re.findall("(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?(?=new\ssimpleInfo\(\\\'))(.+?(?=\\)))", test_str)

I get:

[(u'55.86527,-4.2517133',
  u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"),
 (u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''"),
 (u'51.492653,-0.14765126',
  u"new simpleInfo('Victoria, Buckingham Palace Rd, Stop 10','London Victoria, Buckingham Palace Road - at the corner of Elizabeth Bridge and diagonally across from the main entrance to Victoria Coach Station. megabus Oxford Tube services leave from Stop 10.'"),
 (u'51.492596,-0.14985295',
  u"new simpleInfo('Victoria Coach Station','London Victoria Coach Station is situated on Buckingham Palace Rd at the junction with Elizabeth St. megabus services depart from Stands 15-20, located in the departures area of North West terminal '"),
 (u'51.503437,-0.112076715',
  u"new simpleInfo('Waterloo Train Station','London Waterloo - London Waterloo Station is located on Station Approach, SE1 London - just behind the London Eye. The station is a terminus for trains serving the south-west of England and Eurostar services. Waterloo is the largest station in the UK by area. Its spacious, curved concourse is lined with shops and all the modern amenities.\\n'"),
 (u'51.53062,-0.12585254',
  u"new simpleInfo('St Pancras International Train Station','For East Midlands Trains services only. London St Pancras International, London - St Pancras Station is located on Pancras Rd NW1 between the national Library and Kings Cross station. The station is the terminus for trains serving East Midlands and South Yorkshire. It is also the new London terminal for the Eurostar services to the continent. Kings Cross St Pancras tube station provides links via the London underground to other London destinations.'"),
 (u'51.52678,-0.13297649',
  u"new simpleInfo('Euston Train Station','For Virgin Trains Services Only. London Euston - The station is the main terminal for trains to London from the West Midlands and North West England. It is connected to Euston Tube Station for easy access to the London Underground network'"),
 (u'51.52953,-0.12506014',
  u"new simpleInfo('St Pancras, Coach Road','In some instances megabusplus services which operate as coach only will pick up from Coach Road, outside London St Pancras.'"),
 (u'55.86527,-4.2517133',
  u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"),
 (u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''")]

But what I am trying to get is just:

[(u'55.86527,-4.2517133','Buchanan Bus Station'),
     (u'55.86068,-4.257852', 'Central Train Station'),
     (u'51.492653,-0.14765126','Victoria, Buckingham Palace Rd, Stop 10'),
     (u'51.492596,-0.14985295','Victoria Coach Station')....etc]

I've written plenty of regex in my time but I've never had problems like this. I am trying (believe it or not) to skip everything up to and including "new simpleInfo('" and then grab the string up to the next "'", but I can't work it out. Help!

Try this:

re.findall(r"(?:\(new\sMicrosoft\.Maps\.Location\(([^)]+)\).+?new\ssimpleInfo\(\\?'(.+?)\\?')", test_str)

This regex finds all occurrences, whether the quotes around the name are backslash-escaped (\'Buchanan Bus Station\') or plain ('Buchanan Bus Station').
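As a rough check of that claim, the sketch below runs the same pattern over a short sample. The sample string is hypothetical: it is a cut-down imitation of the page source from the question, with one marker using plain quotes and one using backslash-escaped quotes.

```python
import re

# Pattern copied from the answer: group 1 is the coordinate pair,
# group 2 is the first argument of simpleInfo (the node name).
pattern = re.compile(
    r"(?:\(new\sMicrosoft\.Maps\.Location\(([^)]+)\).+?new\ssimpleInfo\(\\?'(.+?)\\?')"
)

# Hypothetical, shortened sample (not the real page source).
# Line 1 uses plain quotes; line 2 contains literal backslashes before each quote.
sample = (
    "addMarker(map, new Microsoft.Maps.Pushpin(new Microsoft.Maps.Location(55.86527,-4.2517133), {}),"
    "new simpleInfo('Buchanan Bus Station','Glasgow'));\n"
    "addMarker(map, new Microsoft.Maps.Pushpin(new Microsoft.Maps.Location(55.86068,-4.257852), {}),"
    "new simpleInfo(\\'Central Train Station\\',\\'\\'));\n"
)

# findall returns one (coordinates, name) tuple per marker.
pairs = pattern.findall(sample)
print(pairs)
```

Because `.` does not match a newline by default, each match stays within a single line, which assumes every marker's `Location(...)` and `simpleInfo(...)` sit on the same line, as they appear to in the question's data.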

Here is the demo.

(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\'([^'\\]+)

Try this. This should give you what you want.

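import re
# Plain r'' prefix (ur'' is a syntax error in Python 3). The backslash before the
# opening quote is optional (\\?) because Python reads \' in test_str below as a plain '.
p = re.compile(r'(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\?\'([^\'\\]+)')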
test_str = u"jQuery(function(){ jQuery(\'#JourneyPlanner_txtOutboundDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1, beforeShowDay: function(dte){ return [((dte >= new Date(2014,9,29) && dte <= new Date(2015,0,4)) || false)]; }, minDate: new Date(2014,9,29), maxDate: new Date(2015,0,4),buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\njQuery(function(){ jQuery(\'#JourneyPlanner_txtReturnDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1,buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\nEmperorBing.addCallback(function(){ var map = new Microsoft.Maps.Map(document.getElementById(\'confirm1_Map1\'), {credentials:\'Aodb7Wd7D9Kq5gKNryfW6V29yf8aw2Sbu-tXAlkH7OLJtm8zG2bQzzhDKK5zM9FE\',height: 320,width: 299, zoom: 13, mapTypeId: Microsoft.Maps.MapTypeId.auto, enableClickableLogo: false , enableSearchLogo: false , showDashboard: true, showCopyright: true, showScalebar: true, showMapTypeSelector: true});\r\nEmperorBing.addMarker(map, new Microsoft.Maps.Pushpin(new Microsoft.Maps.Location(55.86527,-4.2517133), { undefined: undefined, icon:\'/images/mapmarker.gif\', width:42, height:42, anchor: new Microsoft.Maps.Point(21,21)}),new simpleInfo(\'Buchanan Bus Station\',\'Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information "

re.findall(p, test_str)

See demo.

http://regex101.com/r/dP9rO4/9
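A likely reason the pattern in the question kept the "new simpleInfo('" prefix is that a lookahead is zero-width: it checks that the text follows but does not consume it, so the next capture group starts at the prefix rather than after it. The sketch below contrasts the two approaches on a hypothetical one-marker sample modelled on the question's data; both patterns are simplified for illustration and are not the exact ones used above.

```python
import re

# Hypothetical one-marker sample (not the real page source).
sample = (
    "new Microsoft.Maps.Location(55.86068,-4.257852), {}),"
    "new simpleInfo('Central Train Station','')"
)

# Lookahead only: "new simpleInfo('" is asserted but not consumed,
# so the capture group begins with that prefix and runs to the next ")".
with_lookahead = re.findall(
    r"Location\(.+?(?=new\ssimpleInfo\(')(.+?)(?=\))", sample
)

# Consuming the prefix: the capture group begins right after the opening
# quote and stops before the closing one.
consuming = re.findall(r"Location\(.+?new\ssimpleInfo\('([^']+)", sample)

print(with_lookahead)
print(consuming)
```

Matching the prefix outside any capture group, as both answers do, is enough to keep it out of the result; a lookahead is only needed for text that must remain available to a later part of the pattern.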
