Python: Regular expressions - extract Chinese text

I'm trying to pull the province and city names out of the following text (it is HTML, but I removed some of the escape characters). However, the regular expression I wrote returns a list of empty strings.

When I test the pattern on a regex testing site (for example, https://regex101.com/), it seems to work, but it does not work when I run it in my script.

Here is a shortened version of my code (the HTML dump is much longer).

Any help would be appreciated.

import re
text = 'try  window.getAreaStat = [provinceName:湖北省,provinceShortName:湖北,confirmedCount:3554,suspectedCount:0,curedCount:80,deadCount:125,comment:待明确地区:治愈 30,cities:[cityName:武汉,confirmedCount:1905,suspectedCount:0,curedCount:47,deadCount:104,cityName:黄冈,confirmedCount:324,suspectedCount:0,curedCount:2,deadCount:5,cityName:孝感,confirmedCount:274,suspectedCount:0,curedCount:0,deadCount:3,cityName:荆门,confirmedCount:142,suspectedCount:0,curedCount:0,deadCount:4,cityName:襄阳,confirmedCount:131,suspectedCount:0,curedCount:0,deadCount:0,cityName:随州,confirmedCount:116,suspectedCount:0,curedCount:0,deadCount:0,cityName:咸宁,confirmedCount:112,suspectedCount:0,curedCount:0,deadCount:0,cityName:荆州,confirmedCount:101,suspectedCount:0,curedCount:1,deadCount:2,cityName:十堰,confirmedCount:88,suspectedCount:0,curedCount:0,deadCount:0,cityName:黄石,confirmedCount:86,suspectedCount:0,curedCount:0,deadCount:1,cityName:鄂州,confirmedCount:84,suspectedCount:0,curedCount:0,deadCount:1,cityName:宜昌,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:1,cityName:恩施州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:天门,confirmedCount:34,suspectedCount:0,curedCount:0,deadCount:3,cityName:仙桃,confirmedCount:32,suspectedCount:0,curedCount:0,deadCount:0,cityName:潜江,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:1,cityName:神农架林区,confirmedCount:3,suspectedCount:0,curedCount:0,deadCount:0],provinceName:浙江省,provinceShortName:浙江,confirmedCount:296,suspectedCount:0,curedCount:3,deadCount:0,comment:,cities:[cityName:温州,confirmedCount:114,suspectedCount:0,curedCount:3,deadCount:0,cityName:杭州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:台州,confirmedCount:40,suspectedCount:0,curedCount:0,deadCount:0,cityName:宁波,confirmedCount:20,suspectedCount:0,curedCount:0,deadCount:0,cityName:绍兴,confirmedCount:19,suspectedCount:0,curedCount:0,deadCount:0,cityName:嘉兴,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:金华,confirmedCount:13,suspectedCount:0,curedCount:0,deadCount:0,cityName:衢州,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:0,cityName:舟山,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:丽水,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:湖州,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0],provinceName:广东省,provinceShortName:广东,confirmedCount:241,suspectedCount:0,curedCount:5,deadCount:0,comment:,cities:[cityName:广州,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:0,cityName:深圳,confirmedCount:63,suspectedCount:0,curedCount:4,deadCount:0,cityName:佛山,confirmedCount:18,suspectedCount:0,curedCount:0,deadCount:0,cityName:珠海,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:惠州,confirmedCount:12,suspectedCount:0,curedCount:1,deadCount:0,cityName:中山,confirmedCount:12,suspectedCount:0,curedCount:0,deadCount:0,cityName:阳江,confirmedCount:10,suspectedCount:0,curedCount:0,deadCount:0,cityName:湛江,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:东莞,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:清远,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕头,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:揭阳,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:肇庆,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0,cityName:韶关,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:梅州,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:茂名,confirmedCount:2,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕尾,confirmedCount:1,suspectedCount:0,curedCount:0,deadCount:0,cityName:河源'

regex = "((?<=provinceName:)|(?<=cityName:)).*?(?=,)"
province = re.findall(regex, text)

print(province)
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

According to this answer, re.findall returns the captured groups whenever the pattern contains a capturing group. I tried your regex on https://regexr101.com and every match returned an empty captured group, because the only group in the pattern holds nothing but zero-width lookbehinds.
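A minimal sketch of that behaviour. The `sample` string is a shortened stand-in for the question's data, made up for illustration; the pattern is the one from the question. Comparing `re.findall` with `re.finditer` shows that the whole match does contain the names, while group 1 is always empty.

```python
import re

# Shortened stand-in for the question's text (an assumption for illustration).
sample = (
    "provinceName:湖北省,provinceShortName:湖北,"
    "cities:[cityName:武汉,confirmedCount:1905,"
    "cityName:黄冈,confirmedCount:324"
)

# Pattern from the question: group 1 contains only zero-width lookbehinds.
pattern = re.compile(r"((?<=provinceName:)|(?<=cityName:)).*?(?=,)")

# With one capturing group, findall returns the text of that group,
# which is the empty string for every match.
groups = pattern.findall(sample)

# finditer exposes the whole match (group 0), which holds the names.
whole = [m.group(0) for m in pattern.finditer(sample)]

print(groups)
print(whole)
```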

You can make the group non-capturing by writing it as (?:...):

regex = "(?:(?<=provinceName:)|(?<=cityName:)).*?(?=,)"
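As a rough check, the non-capturing pattern can be run against a short sample. The `sample` string below is a made-up stand-in for the question's data, and the second pattern, `alt`, is my own variation rather than part of the answer: it captures the value directly instead of using lookarounds, so it also picks up a final entry that has no trailing comma (as with the last city in the question's text).

```python
import re

# Shortened stand-in for the question's text (an assumption for illustration).
# Like the original data, it ends with a city name that has no trailing comma.
sample = (
    "provinceName:广东省,provinceShortName:广东,"
    "cities:[cityName:广州,confirmedCount:63,cityName:河源"
)

# Non-capturing version from the answer: findall returns the whole matches.
fixed = r"(?:(?<=provinceName:)|(?<=cityName:)).*?(?=,)"
names = re.findall(fixed, sample)

# Hypothetical alternative: match the key, then capture everything up to
# the next comma or closing bracket.
alt = r"(?:provinceName|cityName):([^,\]]*)"
alt_names = re.findall(alt, sample)

print(names)
print(alt_names)
```

The lookahead `(?=,)` requires a comma after the value, so the answer's pattern skips a value at the very end of the string; whether that matters depends on how the real HTML dump ends.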

Preview on Repl.it
