Regex: How can I capture multiple text from this string?
I have text from a log file in the following format:
{s:9:\\\\"batch_num\\\\";s:16:\\\\"4578123645712459\\\\";s:9:\\\\"full_name\\\\";s:8:\\\\"John Doe\\\\";s:6:\\\\"mobile\\\\";s:12:\\\\"123456784512\\\\";s:7:\\\\"address\\\\";s:5:\\\\"Redacted"\\\\";s:11:\\\\"create_time\\\\";s:19:\\\\"2017-09-10 12:45:01\\\\";s:6:\\\\"gender\\\\";s:1:\\\\"1\\\\";s:9:\\\\"birthdate\\\\";s:10:\\\\"1996-03-09\\\\";s:11:\\\\"contact_num\\\\";s:1:\\\\"0\\\\";s:8:\\\\"identity\\\\";s:1:\\\\"2\\\\";s:6:\\\\"school\\\\";N;s:14:\\\\"school_city_id\\\\";N;s:17:\\\\"profile_pic\\\\";s:43:\\\\"profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\\\";s:14:\\\\"school_address\\\\";N;s:17:\\\\"enter_school_date\\\\";N;s:10:\\\\"speciality\\\\";}
So far I can only extract batch_num, using this regex:
(?<=batch_num\\\\\\\\";s:16:\\\\\\\\")([0-9]{1,16})(?=\\\\\\\\)
Question
I want to extract the values of batch_num, full_name, and profile_pic. My expected output is:
4578123645712459
John Doe
profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg
How can I write a regex that produces this output?
Thanks in advance.
A solution that extracts the values cleanly by converting the string to JSON.
Step 1: Clean the string
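import itertools
import re
# `text` is assumed to hold the raw log line from the question. Note that replace('N', ...) rewrites every capital N, so it only suits data like this sample.
str_text = text.replace('\\', '').replace(';', '').replace('""', '"').replace(':"', '"').replace('N', ',""')
# Turn each length prefix (s:16, s:9, ...) into a comma
str_text = re.sub(r's:\d+', ',', str_text)
# Drop the comma after the opening brace and give the final, value-less key an empty string
str_text = re.sub(r'^\{,', '{', str_text)
str_text = re.sub(r'\}$', ':""}', str_text)
# Commas now alternate key,value: turn the 1st, 3rd, 5th, ... into colons
str_text = re.sub(',', lambda m, c=itertools.count(): m.group() if next(c) % 2 else ':', str_text)
str_text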
#'{"batch_num":"4578123645712459","full_name":"John Doe","mobile":"123456784512","address":"Redacted","create_time":"2017-09-10 12:45:01","gender":"1","birthdate":"1996-03-09","contact_num":"0","identity":"2","school":"","school_city_id":"","profile_pic":"profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg","school_address":"","enter_school_date":"","speciality":""}'
Step 2: Convert the string to JSON and extract the values
import json
str_json = json.loads(str_text)
print(str_json['batch_num'])
print(str_json['full_name'])
print(str_json['profile_pic'])
#4578123645712459
#John Doe
#profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg
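As a rough alternative to the chain of replace calls, the key/value tokens can be read straight into a dict. This is only a sketch under a few assumptions: the name parse_pairs and the shortened sample string are made up for illustration, every backslash in the input is escaping debris that may be dropped (so profile\/... becomes profile/...), and the data is a flat run of s:<len>:"<text>"; and N; tokens.

```python
import re

def parse_pairs(raw):
    """Return a dict built from alternating key/value tokens (hypothetical helper)."""
    cleaned = raw.replace("\\", "")  # assumption: every backslash is escaping debris
    # Each token is either s:<len>:"<text>"; or the null marker N;
    tokens = re.findall(r's:\d+:"(.*?)";|(N);', cleaned)
    flat = [None if null else text for text, null in tokens]
    # Tokens alternate key, value, key, value; zip silently drops a trailing key
    return dict(zip(flat[0::2], flat[1::2]))

# Shortened sample in the two-backslash form, for illustration only
sample = r'{s:9:\\"batch_num\\";s:16:\\"4578123645712459\\";s:9:\\"full_name\\";s:8:\\"John Doe\\";s:6:\\"school\\";N;s:11:\\"profile_pic\\";s:43:\\"profile\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\";}'
record = parse_pairs(sample)
print(record["batch_num"], record["full_name"], record["profile_pic"])
```

Null entries come back as None rather than an empty string, and no JSON round trip is needed.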
Using multiple regular expressions:
(?<="batch_num)\\{3}";s:\d+:\\{3}"(\d+)
(?<="full_name)\\{3}";s:\d+:\\{3}"(\w+\s\w+)
(?<="full_name)\\{3}";s:\d+:\\{3}"([\w+\s]{1,})
(?<="profile_pic)\\{3}";s:\d+:\\{3}"(\w+\\{2}\/\w+\.\w+)
import re
regex_batch = r'(?<="batch_num)\\{3}";s:\d+:\\{3}"(\d+)'
regex_name = r'(?<="full_name)\\{3}";s:\d+:\\{3}"(\w+\s\w+)'
regex_prof = r'(?<="profile_pic)\\{3}";s:\d+:\\{3}"(\w+\\{2}\/\w+\.\w+)'
test_str = "{s:9:\\\\\\\"batch_num\\\\\\\";s:16:\\\\\\\"4578123645712459\\\\\\\";s:9:\\\\\\\"full_name\\\\\\\";s:8:\\\\\\\"John Doe\\\\\\\";s:6:\\\\\\\"mobile\\\\\\\";s:12:\\\\\\\"123456784512\\\\\\\";s:7:\\\\\\\"address\\\\\\\";s:5:\\\\\\\"Redacted\"\\\\\\\";s:11:\\\\\\\"create_time\\\\\\\";s:19:\\\\\\\"2017-09-10 12:45:01\\\\\\\";s:6:\\\\\\\"gender\\\\\\\";s:1:\\\\\\\"1\\\\\\\";s:9:\\\\\\\"birthdate\\\\\\\";s:10:\\\\\\\"1996-03-09\\\\\\\";s:11:\\\\\\\"contact_num\\\\\\\";s:1:\\\\\\\"0\\\\\\\";s:8:\\\\\\\"identity\\\\\\\";s:1:\\\\\\\"2\\\\\\\";s:6:\\\\\\\"school\\\\\\\";N;s:14:\\\\\\\"school_city_id\\\\\\\";N;s:17:\\\\\\\"profile_pic\\\\\\\";s:43:\\\\\\\"profile\\\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\\\\\\";s:14:\\\\\\\"school_address\\\\\\\";N;s:17:\\\\\\\"enter_school_date\\\\\\\";N;s:10:\\\\\\\"speciality\\\\\\\";}"
m_batch = re.findall(regex_batch, test_str, re.MULTILINE)[0]
m_name = re.findall(regex_name, test_str, re.MULTILINE)[0]
m_prof = re.findall(regex_prof, test_str, re.MULTILINE)[0]
print(m_batch, m_name, m_prof)
4578123645712459 John Doe profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg
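The three patterns differ only in the key name and in how strictly they describe the value, so they could be folded into one helper. The sketch below is an assumption-laden illustration, not the answer's method: extract_field is a hypothetical name, the sample string is shortened, and \\* is used so that the number of backslashes before each quote (two or three, depending on how the log was copied) does not matter.

```python
import re

def extract_field(text, key):
    """Return the string stored under key, or None if it is absent or null."""
    # \\* accepts any run of escaping backslashes before each quote
    pattern = '"' + re.escape(key) + r'\\*";s:\d+:\\*"(.*?)\\*";'
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Shortened sample in the three-backslash form, for illustration only
sample = r'{s:9:\\\"batch_num\\\";s:16:\\\"4578123645712459\\\";s:9:\\\"full_name\\\";s:8:\\\"John Doe\\\";s:6:\\\"school\\\";N;s:11:\\\"profile_pic\\\";s:43:\\\"profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\\";}'

for name in ("batch_num", "full_name", "profile_pic"):
    print(extract_field(sample, name))
```

Because the value is captured lazily up to the next quote-semicolon pair, the helper does not need a separate character class per field.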
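I think I have one for you. In the jpeg match, group 2 excludes the two // characters, which is why they are shown in pink; they belong to the same match group: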
https://regex101.com/r/OBaOY0/2
import itertools, re
a = '{s:9:\\"batch_num\\";s:16:\\"4578123645712459\\";s:9:\\"full_name\\";s:8:\\"John Doe\\";s:6:\\"mobile\\";s:12:\\"123456784512\\";s:7:\\"address\\";s:5:\\"Redacted"\\";s:11:\\"create_time\\";s:19:\\"2017-09-10 12:45:01\\";s:6:\\"gender\\";s:1:\\"1\\";s:9:\\"birthdate\\";s:10:\\"1996-03-09\\";s:11:\\"contact_num\\";s:1:\\"0\\";s:8:\\"identity\\";s:1:\\"2\\";s:6:\\"school\\";N;s:14:\\"school_city_id\\";N;s:17:\\"profile_pic\\";s:43:\\"profile\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\";s:14:\\"school_address\\";N;s:17:\\"enter_school_date\\";N;s:10:\\"speciality\\";}'.replace("\\","")
list(filter(None, list(itertools.chain.from_iterable(re.findall(r'(?:s:16:\")(\d+)|(?:s:8:\")(\w+ \w+)|(?:s:43:\")(\w+/\w+\.\w+)', a)))))
Output:
['4578123645712459',
'John Doe',
'profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg']
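The chain and filter calls are needed because re.findall returns one tuple per match with a slot for every capture group, and the groups that did not take part are empty strings. A small sketch of the same idea with re.finditer and Match.lastindex, assuming exactly one group participates in each match; the shortened text and the variable names are illustrative only.

```python
import re

# Shortened, already unescaped sample for illustration
text = 's:16:"4578123645712459";s:8:"John Doe";s:43:"profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg";'
pattern = r's:16:"(\d+)|s:8:"(\w+ \w+)|s:43:"(\w+/\w+\.\w+)'

# findall: one 3-tuple per match, with '' for the groups that did not match
rows = re.findall(pattern, text)
print(rows)

# finditer: lastindex is the number of the group that matched in this alternative
values = [m.group(m.lastindex) for m in re.finditer(pattern, text)]
print(values)
```

Like the answer's pattern, this still depends on the length prefixes 16, 8, and 43, so it only fits records whose values have exactly those lengths.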
You can get all 3 matches for the example data using an alternation and a capture group:
\b(?:batch_num|full_name|profile_pic)\b\\\\\\";s:\d+:\\\\\\"([^"]+)\\\\\\"
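In parts:
\b(?:batch_num|full_name|profile_pic)\b    Match one of the alternatives between word boundaries
\\\\\\";s:\d+:    Match \\\";s: followed by 1+ digits and :
\\\\\\"    Match \\\"
(    Capture group 1
[^"]+    Match 1+ times any character except "
)    Close the group
\\\\\\"    Match \\\"
For example: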
import re
regex = r'\b(?:batch_num|full_name|profile_pic)\b\\\\\\";s:\d+:\\\\\\"([^"]+)\\\\\\"'
test_str = r'''{s:9:\\\"batch_num\\\";s:16:\\\"4578123645712459\\\";s:9:\\\"full_name\\\";s:8:\\\"John Doe\\\";s:6:\\\"mobile\\\";s:12:\\\"123456784512\\\";s:7:\\\"address\\\";s:5:\\\"Redacted"\\\";s:11:\\\"create_time\\\";s:19:\\\"2017-09-10 12:45:01\\\";s:6:\\\"gender\\\";s:1:\\\"1\\\";s:9:\\\"birthdate\\\";s:10:\\\"1996-03-09\\\";s:11:\\\"contact_num\\\";s:1:\\\"0\\\";s:8:\\\"identity\\\";s:1:\\\"2\\\";s:6:\\\"school\\\";N;s:14:\\\"school_city_id\\\";N;s:17:\\\"profile_pic\\\";s:43:\\\"profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\\";s:14:\\\"school_address\\\";N;s:17:\\\"enter_school_date\\\";N;s:10:\\\"speciality\\\";}'''
print(re.findall(regex, test_str))
Output:
['4578123645712459', 'John Doe', 'profile\\\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg']
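If it also matters which value belongs to which key, the same alternation can be written with named groups so that each match carries its key. This is a sketch rather than part of the answer: KEYS, PAIR, fields, and the shortened sample are illustrative names and data, and re.VERBOSE is used only so the parts explained above can be commented in place.

```python
import re

KEYS = ("batch_num", "full_name", "profile_pic")  # the fields asked for in the question

PAIR = re.compile(
    r'''
    \b(?P<key>%s)\b      # one of the wanted key names, now captured
    \\\\\\";             # three backslashes, the quote closing the key, a semicolon
    s:\d+:               # length prefix of the value
    \\\\\\"              # three backslashes and the quote opening the value
    (?P<value>[^"]+)     # the value: 1+ characters other than a double quote
    \\\\\\"              # three backslashes and the quote closing the value
    ''' % "|".join(map(re.escape, KEYS)),
    re.VERBOSE,
)

# Shortened sample in the three-backslash form, for illustration only
sample = r'{s:9:\\\"batch_num\\\";s:16:\\\"4578123645712459\\\";s:9:\\\"full_name\\\";s:8:\\\"John Doe\\\";s:6:\\\"gender\\\";s:1:\\\"1\\\";s:17:\\\"profile_pic\\\";s:43:\\\"profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\\";}'

fields = {m.group("key"): m.group("value") for m in PAIR.finditer(sample)}
print(fields)
```

A dict keyed by field name avoids relying on the order in which the fields happen to appear in the log line.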