Extracting names and numbers from a string

Question

Similar to this question, I have a string of names and numbers separated by colons:

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

I'm trying to split this to get:

 ('Waz D', '5'),
 ('l gu l', '5'),
 ('GrinVe', '3'),
 ('P LUK', '2'),
 ('Cubbi', '1'),
 ('2 nd dok', '1'),
 ('maf 74', '1'),
 ('abr12', '1')

I have tried two regular expressions so far, with mixed success:

re.findall(r"(.*?)[a-zA-Z0-9]+: (\d+)*", s)
[('Waz ', '5'),
 (' l gu ', '5'),
 (' ', '3'),
 (' P ', '2'),
 (' ', '1'),
 (' 2 nd ', '1'),
 (' maf ', '1'),
 (' ', '1')]

And:

re.findall(r"(.*?)([a-zA-Z0-9]+): (\d+)*", s)
[('Waz ', 'D', '5'),
 (' l gu ', 'l', '5'),
 (' ', 'GrinVe', '3'),
 (' P ', 'LUK', '2'),
 (' ', 'Cubbi', '1'),
 (' 2 nd ', 'dok', '1'),
 (' maf ', '74', '1'),
 (' ', 'abr12', '1')]

How can I adjust this to get the output I'm after?

Answer 1

Consume the whitespace greedily and keep it out of the capturing groups.

>>> import re
>>> s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'
>>> 
>>> re.findall(r'([^:]+?):\s*(\d+)\s*', s)
[('Waz D', '5'), ('l gu l', '5'), ('GrinVe', '3'), ('P LUK', '2'), ('Cubbi', '1'), ('2 nd dok', '1'), ('maf 74', '1'), ('abr12', '1')]

Answer 2

If we assume that each name is always followed by a colon-space-number-space sequence, you can do it like this:

re.findall(r"(.+?):\s(\d+)\s", s)

[('Waz D', '5'),
 ('l gu l', '5'),
 ('GrinVe', '3'),
 ('P LUK', '2'),
 ('Cubbi', '1'),
 ('2 nd dok', '1'),
 ('maf 74', '1'),
 ('abr12', '1')]
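
One consequence of the trailing \s is worth keeping in mind: a final name: number pair that ends the string has no whitespace after it, so it is skipped. The sketch below illustrates this with a shortened input (tail, a hypothetical string that is not from the question) and a variation that swaps the last \s for (?:\s|$). The variation is an assumption of this example, not part of the answer as posted.

```python
import re

# Hypothetical shortened input whose last pair ends the string.
tail = 'Waz D: 5 l gu l: 5'

# Pattern from the answer: requires whitespace after every number.
strict = re.findall(r"(.+?):\s(\d+)\s", tail)

# Variation: accept either whitespace or end of string after the number.
relaxed = re.findall(r"(.+?):\s(\d+)(?:\s|$)", tail)

print(strict)   # [('Waz D', '5')]
print(relaxed)  # [('Waz D', '5'), ('l gu l', '5')]
```

On the string from the question both patterns give the same result, because the last colon-separated pair there is followed by a space.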

Answer 3

It comes down to splitting on the combination : \d , nothing else (besides suppressing leading and trailing whitespace here and there). All it needs is a group of any length that does not contain a colon, followed by that colon and then a single run of digits.

import re
s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

print(re.findall(r'([^:]+):\s*(\d+)\s+', s))

Result:

[('Waz D', '5'),
 ('l gu l', '5'),
 ('GrinVe', '3'),
 ('P LUK', '2'),
 ('Cubbi', '1'),
 ('2 nd dok', '1'),
 ('maf 74', '1'),
 ('abr12', '1')]

Answer 4

You could match zero or more whitespace characters, then capture in a group one or more characters that are not a colon, using a negated character class ([^:]+) .

Then match a colon, zero or more whitespace characters \s* , and capture in a group one or more digits (\d+) :

\s*([^:]+):\s*(\d+)
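
As a rough sketch of how this pattern could be used from Python, the example below passes it to re.findall on the string from the question. The variable name pairs and the loop are only for illustration.

```python
import re

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

# \s*      skip whitespace left over from the previous pair (not captured)
# ([^:]+)  name: one or more characters that are not a colon
# :\s*     the colon and optional whitespace
# (\d+)    number: one or more digits
pairs = re.findall(r"\s*([^:]+):\s*(\d+)", s)

for name, number in pairs:
    print(repr(name), repr(number))
```

The trailing Waza D 5 contains no colon, so it produces no match.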

Answer 5

In your sample a name generally starts with a letter, but in one case it starts with a digit.

So the first capturing group, for the name, should:

start with [a-z\d] (remember the re.I flag at the end),
then contain [^:]* - a sequence of characters other than a colon.

Your solution ( [a-zA-Z0-9]+ ) is wrong, because the name can contain spaces.

The second group, matching the number, is simple: just \d+ .

Between these two groups there should be :\s* - a colon and a sequence of whitespace characters.

The code is a single call to re.findall , as follows:

re.findall(r"([a-z\d][^:]*):\s*(\d+)", s, flags=re.I)

But I have doubts about Cubbi: 1 2 in your sample. Should the 2 really be part of the next name?

If not, consider changing the regex to: ([a-z][^:]*):\s*(\d+(?: \d+)?) . Differences:

The name must start with a letter (not a digit),
The number can contain a "second part", preceded by a single space - (?: \d+)? .

Then 1 2 will be the "number" for Cubbi and the next name will start from "nd".

And what about Waza D 5 at the end of your sample? Did you forget to put the colon before 5 ?
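
To make the effect of the alternative regex concrete, here is a minimal sketch that runs it on the string from the question. The variable name alt is only for illustration.

```python
import re

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

# The name must start with a letter; the number may carry a second part
# separated by a single space, as in "1 2".
alt = re.findall(r"([a-z][^:]*):\s*(\d+(?: \d+)?)", s, flags=re.I)

for name, number in alt:
    print(repr(name), repr(number))
```

With this pattern Cubbi is paired with '1 2' and the following name becomes 'nd dok'; the other pairs are unchanged.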

Answer 6

My solution

I've added a ':' after Waza D because I think there should be one (I assume it was a typo, since the rule should be name: number). The pattern, for me, is a name that starts with a letter and is followed by other letters, digits and spaces up to the colon, then a space and a number.

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D: 5'

import re

# \w        one word character (letter, digit or underscore) that starts the name
# [\w\s]+   followed by one or more word characters or whitespace
# :         followed by a colon
# \s[0-9]   then one whitespace character and a single digit
x = re.findall(r"\w[\w\s]+:\s[0-9]", s)
print(*x, sep="\n")

Output:

Waz D: 5
l gu l: 5
GrinVe: 3
P LUK: 2
Cubbi: 1
2 nd dok: 1
maf 74: 1
abr12: 1
Waza D: 5
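
The call above returns whole "name: number" strings and not the (name, number) tuples the question asks for. A possible adaptation, sketched here as an assumption and not as part of the original answer, wraps the two parts in capturing groups and uses [0-9]+ so that numbers with more than one digit are also accepted.

```python
import re

# Same modified string as above, with the colon added after "Waza D".
s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D: 5'

# (\w[\w\s]+)  name: a word character followed by word characters or whitespace
# :\s          the colon and one whitespace character
# ([0-9]+)     number: one or more digits
pairs = re.findall(r"(\w[\w\s]+):\s([0-9]+)", s)

print(*pairs, sep="\n")
```

Because re.findall returns the groups when a pattern has more than one, each element of pairs is a tuple such as ('Waz D', '5').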

6 answers

Answer 1
1 2018-07-04 09:45:33

Answer 2
1 2018-07-04 09:52:19

Answer 3
1 accepted 2018-07-04 09:53:55

Answer 4
1 2018-07-04 09:56:28

Answer 5
0 2018-07-04 11:17:36

Answer 6
0 2018-07-05 17:09:04
