Extract words from a string

Sample Input:

'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'

Expected Output:

['D3H6', 'X30G', 'Y2A', '12H89']

My code:

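import re

note = 'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'
split_note = re.split(r'[.;,\s]\s*', note)
# matches any purely alphanumeric token, including plain words such as 'Part'
pattern = re.compile("^[a-zA-Z0-9]+$")
alphaList = []
for a in split_note:
    if pattern.match(a):
        alphaList.append(a)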

I need to extract all the alphanumeric words from a split string and store them in a list.

The above code does not give the expected output.

Maybe this can solve the problem:

import re 

# input string
stri = "Part model D3H6 with specifications X30 and Y2 is having features 12H89"
# word tokenization
split = re.findall(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+", stri)
# keep only the words that contain both letters and digits
print([word for word in split if re.match(r'^(?=.*[a-zA-Z])(?=.*[0-9])', word)])

# output: ['D3H6', 'X30', 'Y2', '12H89']
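
The same tokenize-then-filter idea can also be sketched with plain string methods and run against the question's exact input. The helper name mixed_tokens and the simpler tokenizer [A-Za-z0-9]+ are assumptions made for illustration, not part of the answer above:

```python
import re

def mixed_tokens(text):
    """Return tokens that contain at least one letter and at least one digit."""
    # Simplified tokenizer (an assumption): runs of ASCII letters and digits.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in tokens
            if any(c.isalpha() for c in t) and any(c.isdigit() for c in t)]

note = ("note - Part model D3H6 with specifications X30G and Y2A "
        "is having features 12H89.")
print(mixed_tokens(note))  # ['D3H6', 'X30G', 'Y2A', '12H89']
```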

^ and $ anchor the beginning and end of a line, not of a word. Besides, your example words don't include lowercase letters, so why add a-z?

Considering your example, if what you need is a word that always contains at least one letter and at least one digit and always ends with a digit, this is the pattern:

\b\d*[A-Z][0-9A-Z]*\d\b
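
As a quick check, here is a minimal sketch of how an ends-with-a-digit pattern behaves on the sample sentence with re.findall. The pattern used here, \b\d*[A-Z][0-9A-Z]*\d\b, spells out the required letter so that digit-only tokens are skipped. The names note and ENDS_WITH_DIGIT are illustrative assumptions:

```python
import re

# Optional leading digits, one required uppercase letter, then any mix of
# uppercase letters and digits, ending on a digit at a word boundary.
ENDS_WITH_DIGIT = re.compile(r"\b\d*[A-Z][0-9A-Z]*\d\b")

note = ("note - Part model D3H6 with specifications X30G and Y2A "
        "is having features 12H89.")
print(ENDS_WITH_DIGIT.findall(note))                  # ['D3H6', '12H89']
print(ENDS_WITH_DIGIT.findall("batch 2024 code A7"))  # ['A7']
```

X30G and Y2A are skipped here because they end with a letter.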

If it may end with a letter rather than a digit, but still requires at least one digit and one letter, then it gets more complex:

\b(?:\d+[A-Z]|[A-Z]+\d)[0-9A-Z]*\b
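
A minimal sketch of the letter-and-digit case, assuming the alternation is wrapped in a non-capturing group so that both word boundaries apply to every branch. The pattern \b(?:\d+[A-Z]|[A-Z]+\d)[0-9A-Z]*\b and the name LETTER_AND_DIGIT are illustrative assumptions:

```python
import re

# A word must start with either digits followed by a letter, or letters
# followed by a digit; after that, any mix of uppercase letters and digits.
LETTER_AND_DIGIT = re.compile(r"\b(?:\d+[A-Z]|[A-Z]+\d)[0-9A-Z]*\b")

note = ("note - Part model D3H6 with specifications X30G and Y2A "
        "is having features 12H89.")
print(LETTER_AND_DIGIT.findall(note))           # ['D3H6', 'X30G', 'Y2A', '12H89']
print(LETTER_AND_DIGIT.findall("AND 123 B52"))  # ['B52']
```

The non-capturing group (?:...) matters with re.findall, which would otherwise return only the captured group instead of the whole word.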

\b stands for a word boundary.
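
The difference between the anchors and \b can be seen in a small sketch. The variable names and the short test string are assumptions chosen for illustration. Note that an anchored pattern still works when it is applied to one already-split token at a time:

```python
import re

text = "model D3H6 ok"

# Anchored: the whole string must be a single uppercase/digit token.
anchored_on_sentence = re.findall(r"^[A-Z0-9]+$", text)
# Word boundaries: each word inside the string is tested on its own.
bounded_on_sentence = re.findall(r"\b[A-Z0-9]+\b", text)
# Anchored pattern applied to a single token that was split out beforehand.
anchored_on_token = re.findall(r"^[A-Z0-9]+$", "D3H6")

print(anchored_on_sentence)  # []
print(bounded_on_sentence)   # ['D3H6']
print(anchored_on_token)     # ['D3H6']
```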
