Extracting data according to a list
I'm trying to figure out how to extract some data from a string according to this list:
check_list = ['E1', 'E2', 'E7', 'E3', 'E9', 'E10', 'E12', 'IN1', 'IN2', 'IN4', 'IN10']
For example, for this string:
s1 = "apto E1-E10 tower 1-2 sanit"
I would get: ['E1', 'E10']
s2 = "apto IN2-IN1-IN4-E12-IN10 mamp"
For this I would get: ['IN2', 'IN1', 'IN4', 'E12', 'IN10']
And then it gets tricky:
s3 = "E-2-7-3-9-12; IN1-4-10 T 1-2 inst. hidr."
I would get: ['E2', 'E7', 'E3', 'E9', 'E12', 'IN1', 'IN4', 'IN10']
Can you give me some advice on how to solve this?
The following should work:
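import re

def extract_data(s):
    # Tokens that are allowed to appear in the result.
    check_set = set(['E1', 'E2', 'E7', 'E3', 'E9', 'E10', 'E12',
                     'IN1', 'IN2', 'IN4', 'IN10'])
    result = []
    # Find each run: a prefix (E or IN) followed by digits and dashes.
    for match in re.finditer(r'\b(E|IN)[-\d]+', s):
        # Every digit group in the run shares the run's prefix.
        for digits in re.findall(r'\d+', match.group(0)):
            item = match.group(1) + digits
            if item in check_set:
                result.append(item)
    return result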
Examples:
>>> extract_data("apto E1-E10 tower 1-2 sanit")
['E1', 'E10']
>>> extract_data("apto IN2-IN1-IN4-E12-IN10 mamp")
['IN2', 'IN1', 'IN4', 'E12', 'IN10']
>>> extract_data("E-2-7-3-9-12; IN1-4-10 T 1-2 inst. hidr.")
['E2', 'E7', 'E3', 'E9', 'E12', 'IN1', 'IN4', 'IN10']

import re

def parse(string):
    result = []
    # Groups: prefix, first number, then the trailing '-N-N...' run.
    for match in re.findall(r'(E|IN)-{0,1}([\d]+)((-[\d]+)*)', string):
        letter = match[0]
        # split('-') on a run like '-7-3' yields a leading '', hence the [1:].
        numbers = [int(i) for i in [match[1]] + match[2].split('-')[1:]]
        for number in numbers:
            result.append('%s%d' % (letter, number))
    return result

print(parse('apto E1-E10 tower 1-2 sanit'))
print(parse('apto IN2-IN1-IN4-E12-IN10 mamp'))
print(parse('E-2-7-3-9-12; IN1-4-10 T 1-2 inst. hidr.'))
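Note that parse() returns every token it finds and never consults check_list. If only listed tokens are wanted, a filter along the following lines could be applied to its result. This is a minimal sketch; keep_known is a hypothetical helper name that does not appear in the answer above.

```python
# Tokens from the question's check_list, stored as a set for fast lookup.
check_set = {'E1', 'E2', 'E7', 'E3', 'E9', 'E10', 'E12',
             'IN1', 'IN2', 'IN4', 'IN10'}

def keep_known(items, allowed=check_set):
    """Return the items that are in `allowed`, preserving their order.

    `keep_known` is a hypothetical helper, not part of the original answer.
    """
    return [item for item in items if item in allowed]
```

It would be used as keep_known(parse(s)), so that a token such as E5, which is not in check_list, is dropped.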

This is a partial answer, more an indication of how I might start to solve your issue.
Using the "keys" IN and E, I'd search the strings for patterns matching the key followed by any number of digits or dashes.
For example:
import re

S = ['apto E1-E10 tower 1-2 sanit',
     'apto IN2-IN1-IN4-E12-IN10 mamp',
     'E-2-7-3-9-12; IN1-4-10 T 1-2 inst. hidr.']
for s in S:
    print(s)
    # Runs that start with IN, then runs that start with E.
    M = re.findall(r'(IN[\d\-]*)', s)
    for m in M:
        print(m)
    M = re.findall(r'(E[\d\-]*)', s)
    for m in M:
        print(m)
Produces:
$ python extract.py
apto E1-E10 tower 1-2 sanit
E1-
E10
apto IN2-IN1-IN4-E12-IN10 mamp
IN2-
IN1-
IN4-
IN10
E12-
E-2-7-3-9-12; IN1-4-10 T 1-2 inst. hidr.
IN1-4-10
E-2-7-3-9-12
I'd then take each m and parse it further, so that E1- results in [E1] and E-2-7-3-9-12 results in [E2,E7,E3,E9,E12].
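One way that further parsing step might look is sketched below. The function name expand is hypothetical, and the sketch assumes each run starts with its uppercase key (E or IN) and that every digit group after it is one number.

```python
import re

def expand(run):
    """Split a matched run such as 'E-2-7-3-9-12' into ['E2', 'E7', ...].

    `expand` is a hypothetical helper; it assumes the run begins with
    uppercase letters (the key) followed by digits and dashes.
    """
    # The leading letters are the key shared by every number in the run.
    key = re.match(r'[A-Z]+', run).group(0)
    # Each group of digits becomes one token with that key.
    return [key + digits for digits in re.findall(r'\d+', run)]
```

The resulting tokens could then be checked against check_list before being kept.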

I tried to make this as general as possible:
import re

def make_relist(l):
    relist = []
    for a in l:
        # Split a token such as 'IN10' into its letters and its number.
        alpha, num = re.match(r'([a-zA-Z]+)(\d+)', a).groups()
        # Match the token directly, or as a later number in a dashed run.
        re_string = r'\b{0}({1}|\d*-(\d+-)*{1})\b'.format(alpha, num)
        relist.append((a, re.compile(re_string)))
    return relist

def extract(s, relist):
    return [v for v, r in relist if r.search(s)]
Test:
>>> tokens = ['E1', 'E2', 'E7', 'E3', 'E9', 'E10', 'E12', 'IN1', 'IN2', 'IN4', 'IN10']
>>> relist = make_relist(tokens)
>>> extract("apto E1-E10 tower 1-2 sanit", relist)
['E1', 'E10']
>>> extract("apto IN2-IN1-IN4-E12-IN10 mamp", relist)
['E12', 'IN1', 'IN2', 'IN4', 'IN10']
>>> extract("E-2-7-3-9-12; IN1-4-10 T 1-2 inst. hidr.", relist)
['E2', 'E7', 'E3', 'E9', 'E12', 'IN1', 'IN4', 'IN10']
Note that this becomes more efficient if you have a large number of strings to extract from: the patterns are compiled once and reused, so the compilation overhead becomes insignificant.
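In the test above, the second string comes back in token-list order (E12 first) rather than in the order of appearance that the question asks for. If that matters, one possible variant is to sort the hits by where each match ends in the string. This is only a sketch: extract_in_order is a hypothetical name, and make_relist is repeated here so the block runs on its own.

```python
import re

def make_relist(tokens):
    # Same construction as in the answer above, repeated for self-containment.
    relist = []
    for token in tokens:
        alpha, num = re.match(r'([a-zA-Z]+)(\d+)', token).groups()
        pattern = r'\b{0}({1}|\d*-(\d+-)*{1})\b'.format(alpha, num)
        relist.append((token, re.compile(pattern)))
    return relist

def extract_in_order(s, relist):
    """Hypothetical variant of extract() that follows the string's order."""
    hits = []
    for token, regex in relist:
        m = regex.search(s)
        if m:
            # Tokens in one dashed run share a start position, so the end
            # position of the match is used as the sort key instead.
            hits.append((m.end(), token))
    return [token for _, token in sorted(hits)]
```

The end position is used because every number in a run such as E-2-7-3-9-12 is matched starting from the same E.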