
Python Regular Expression Extract Lookahead

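I have been trying to extract transport node names and location coordinate strings from a scraped web page (I have permission to scrape it). The names and locations are in a CDATA block inside JavaScript. See here: http://pastebin.com/6Vtup2dE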

Using this regular expression in Python:

re.findall("(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?(?=new\ssimpleInfo\(\\\'))(.+?(?=\\)))", test_str)

I get:

[(u'55.86527,-4.2517133',
  u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"),
 (u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''"),
 (u'51.492653,-0.14765126',
  u"new simpleInfo('Victoria, Buckingham Palace Rd, Stop 10','London Victoria, Buckingham Palace Road - at the corner of Elizabeth Bridge and diagonally across from the main entrance to Victoria Coach Station. megabus Oxford Tube services leave from Stop 10.'"),
 (u'51.492596,-0.14985295',
  u"new simpleInfo('Victoria Coach Station','London Victoria Coach Station is situated on Buckingham Palace Rd at the junction with Elizabeth St. megabus services depart from Stands 15-20, located in the departures area of North West terminal '"),
 (u'51.503437,-0.112076715',
  u"new simpleInfo('Waterloo Train Station','London Waterloo - London Waterloo Station is located on Station Approach, SE1 London - just behind the London Eye. The station is a terminus for trains serving the south-west of England and Eurostar services. Waterloo is the largest station in the UK by area. Its spacious, curved concourse is lined with shops and all the modern amenities.\\n'"),
 (u'51.53062,-0.12585254',
  u"new simpleInfo('St Pancras International Train Station','For East Midlands Trains services only. London St Pancras International, London - St Pancras Station is located on Pancras Rd NW1 between the national Library and Kings Cross station. The station is the terminus for trains serving East Midlands and South Yorkshire. It is also the new London terminal for the Eurostar services to the continent. Kings Cross St Pancras tube station provides links via the London underground to other London destinations.'"),
 (u'51.52678,-0.13297649',
  u"new simpleInfo('Euston Train Station','For Virgin Trains Services Only. London Euston - The station is the main terminal for trains to London from the West Midlands and North West England. It is connected to Euston Tube Station for easy access to the London Underground network'"),
 (u'51.52953,-0.12506014',
  u"new simpleInfo('St Pancras, Coach Road','In some instances megabusplus services which operate as coach only will pick up from Coach Road, outside London St Pancras.'"),
 (u'55.86527,-4.2517133',
  u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"),
 (u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''")]

But what I want to get is just:

[(u'55.86527,-4.2517133','Buchanan Bus Station'),
     (u'55.86068,-4.257852', 'Central Train Station'),
     (u'51.492653,-0.14765126','Victoria, Buckingham Palace Rd, Stop 10'),
     (u'51.492596,-0.14985295','Victoria Coach Station')....etc]

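I have written a fair amount of regex before, but I have never had this kind of problem with it. I am trying (believe it or not) to hide everything up to and including "new simpleInfo('" and then keep the string up to the next "'", but I can't work it out. Help!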

Try this:

re.findall(r"(?:\(new\sMicrosoft\.Maps\.Location\(([^)]+)\).+?new\ssimpleInfo\(\\?'(.+?)\\?')", test_str)

This regex finds all occurrences, whether the name appears with backslash-escaped quotes (\'Buchanan Bus Station\') or with plain quotes ('Buchanan Bus Station').

Here is a demo.
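As a rough, self-contained check of the pattern above, the sketch below runs it on a shortened, made-up excerpt (the `sample` string is an assumption, not the real page content) that contains one entry with plain quotes and one with backslash-escaped quotes. Because the `new simpleInfo('` prefix is consumed outside the capturing groups, only the coordinates and the name should be returned.

```python
import re

# Hypothetical, shortened sample: the first marker uses plain quotes, the
# second uses backslash-escaped quotes (written as \\' in this Python literal).
sample = (
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(55.86527,-4.2517133), { width:42 }),"
    "new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station'));\n"
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(55.86068,-4.257852), { width:42 }),"
    "new simpleInfo(\\'Central Train Station\\',\\'\\'));\n"
)

# Same pattern as in the answer: group 1 is the coordinate pair, group 2 is
# the first argument of simpleInfo; \\? allows an optional backslash before
# each quote.
pattern = re.compile(
    r"(?:\(new\sMicrosoft\.Maps\.Location\(([^)]+)\).+?new\ssimpleInfo\(\\?'(.+?)\\?')"
)

pairs = pattern.findall(sample)
print(pairs)
```

Note that `.` does not match a newline unless `re.DOTALL` is passed, so this relies on each `Location(...)` and its `simpleInfo(...)` sitting on the same line, as they do in this sample.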

(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\'([^'\\]+)

Try this. It should give you what you want.

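import re
# Raw string without the Python 2-only `ur` prefix. The backslash before the
# quote is optional (\\?) so the pattern also matches the plain quotes in test_str below.
p = re.compile(r"(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\?'([^'\\]+)")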
test_str = u"jQuery(function(){ jQuery(\'#JourneyPlanner_txtOutboundDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1, beforeShowDay: function(dte){ return [((dte >= new Date(2014,9,29) && dte <= new Date(2015,0,4)) || false)]; }, minDate: new Date(2014,9,29), maxDate: new Date(2015,0,4),buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\njQuery(function(){ jQuery(\'#JourneyPlanner_txtReturnDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1,buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\nEmperorBing.addCallback(function(){ var map = new Microsoft.Maps.Map(document.getElementById(\'confirm1_Map1\'), {credentials:\'Aodb7Wd7D9Kq5gKNryfW6V29yf8aw2Sbu-tXAlkH7OLJtm8zG2bQzzhDKK5zM9FE\',height: 320,width: 299, zoom: 13, mapTypeId: Microsoft.Maps.MapTypeId.auto, enableClickableLogo: false , enableSearchLogo: false , showDashboard: true, showCopyright: true, showScalebar: true, showMapTypeSelector: true});\r\nEmperorBing.addMarker(map, new Microsoft.Maps.Pushpin(new Microsoft.Maps.Location(55.86527,-4.2517133), { undefined: undefined, icon:\'/images/mapmarker.gif\', width:42, height:42, anchor: new Microsoft.Maps.Point(21,21)}),new simpleInfo(\'Buchanan Bus Station\',\'Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information "

re.findall(p, test_str)

See the demo:

http://regex101.com/r/dP9rO4/9
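For context on why the original attempt kept the `new simpleInfo('` prefix in its second group: a lookahead `(?=...)` only asserts that something follows and does not consume it, so the next group starts matching at the very text the lookahead just checked. The sketch below contrasts the two approaches on a made-up one-line string (`text` and both simplified patterns are illustrative assumptions, not the page's real content).

```python
import re

# Hypothetical one-line sample, not taken from the scraped page.
text = "Location(1.5,-2.5), new simpleInfo('Stop A','details')"

# Lookahead version: (?=...) checks that the prefix is coming up but leaves
# it unconsumed, so group 2 begins at "new simpleInfo('" and runs up to the
# next closing parenthesis.
with_lookahead = re.findall(
    r"Location\((.+?)\).+?(?=new\ssimpleInfo\(')(.+?)(?=\))", text
)

# Consuming version: the prefix is matched outside any capturing group, so
# group 2 starts right after the opening quote and stops at the next quote.
consuming = re.findall(
    r"Location\((.+?)\).+?new\ssimpleInfo\('(.+?)'", text
)

print(with_lookahead)
print(consuming)
```

In other words, "hiding" a prefix is done by matching it outside the capturing groups (or inside a non-capturing group), not by wrapping it in a lookahead.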

