简体   繁体   English

如何使用正则表达式提取信息页面

[英]How to extract pages of information using regular expression

I'm having trouble capturing the contents of "name": he often appears before "pluralName" other later. 我在捕获“名称”的内容时遇到了麻烦:他经常出现在“ pluralName”之前。 What better way of doing this? 有什么更好的方法? (best way in terms of performance). (就性能而言最好的方法)。 Thank you for your help! 谢谢您的帮助!

Note: I am using python 注意:我正在使用python

The chunk of the page that has the information I need: 页面中包含我需要的信息的块:

{"count":0,"items":[]},"shortUrl":"http:\/\/4sq.com\/11nP13T","likes":{"count":22,"groups":[{"type":"others","count":22,"items":[]}],"summary":"22 Likes"},"ratingColor":"FF9600","id":"5172311be4b0ecc0a12a9953","canonicalPath":"\/v\/kee-hiong-klang-bak-kut-teh\/5172311be4b0ecc0a12a9953","canonicalUrl":"https:\/\/foursquare.com\/v\/kee-hiong-klang-bak-kut-teh\/5172311be4b0ecc0a12a9953","rating":5.3,"categories":[**{"pluralName":"Chinese Restaurants","name":"Chinese Restaurant",**"icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/food\/asian_","mapPrefix":"https:\/\/ss3.4sqi.net\/img\/categories_map\/food\/chinese","suffix":".png"},"id":"4bf58dd8d48988d145941735","shortName":"Chinese","primary":true},{"pluralName":"Asian Restaurants","name":"Asian Restaurant","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/food\/asian_","mapPrefix":"https:\/\/ss3.4sqi.net\/img\/categories_map\/food\/asian","suffix":".png"},"id":"4bf58dd8d48988d142941735","shortName":"Asian"}],"createdAt":1366438171,"tips":{"count":25,"groups":[{"count":25,"items":[{"logView":true,"text":"Portion is quite small and expensive. Service attitude is so so. The BKT taste is not my preference.One of the up car restaurants in SS2 which I'll never go back again. 👎","likes":{"count":1,"groups":[{"type":"others","count":1,"items":[{"photo":{"prefix":"https:\/\/irs0.4sqi.net\/img\/user\/","suffix":"\/43964080-5LYADRF2EEP2RWPL.jpg"},"lastName":".w","firstName":"Jackie","id":"43964080","canonicalPath":"\/user\/43964080","canonicalUrl":"https:\/\/foursquare.com\/user\/43964080","gender":"female"}]}],"summary":"1 like"},"id":"541c2b73498eb0cfe1f76b9e","canonicalPath":"\/item\/541c2b73498eb0cfe1f76b9e","canonicalUrl":"https:\/\/foursquare.com\/item\/541c2b73498eb0cfe1f76b9e","createdAt":1.411132275E9,"todo":{"count":0},"user":{"photo":{"prefix":"https:\/\/irs1.4sqi.net\/img\/user\/","suffix":"\/5765949-NW4BAJWFBCVLRR1M.jpg"}
(?:"pluralName":"[^"]*","name":"([^"]*))|(?:"name":"([^"]*)","pluralName")

Try this with re.findall .See demo. re.findall一起尝试。请re.findall演示。

https://regex101.com/r/hR7tH4/4 https://regex101.com/r/hR7tH4/4

print re.findall(r'(?:"pluralName":"[^"]*","name":"([^"]*))|(?:"name":"([^"]*)","pluralName")',test_str)

Don't use a regexp at all. 根本不要使用正则表达式。

Instead, use a JSON parser, and access the resulting object. 而是使用JSON解析器,并访问生成的对象。 That is much more robust. 那更健壮。

import json # part of python
o = json.loads(str)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM