Finding patterns using regular expressions

Question

This is an example of the text I have:

JT  - American journal of public health
JID - 1254074
SB  - AIM
SB  - IM
MH  - Adult
MH  - Biomedical Research/*organization & administration
MH  - Female
MH  - Health Care Reform/*history/*methods
AB  - OBJECTIVES: We assessed whether a 2-phase labeling and choice 
AB-  architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR  - 20170220
IS  - 1541-0048 (Electronic)

How can I write a regular expression that picks out only the terms on lines starting with "MH", and then import them into an Excel sheet? The output should look like this:

[Adult, Biomedical Research, organization & administration, Female, Health Care Reform, history, methods].

This is my attempt:

import re
Path = "MH\s*.*" 
re.findall(Path,file)

I know this is wrong, but I do not know how to solve it.

Thank you

Answer 1

Using re.findall

Demo:

import re
s = """JT  - American journal of public health
JID - 1254074
SB  - AIM
SB  - IM
MH  - Adult
MH  - Biomedical Research/*organization & administration
MH  - Female
MH  - Health Care Reform/*history/*methods
AB  - OBJECTIVES: We assessed whether a 2-phase labeling and choice 
AB-  architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR  - 20170220
IS  - 1541-0048 (Electronic)"""

res = []
# Anchor with ^ so re.MULTILINE matches only lines that start with MH.
for i in re.findall(r"^MH\s+-\s+(.*)", s, flags=re.MULTILINE):
    res.extend(i.split("/*"))
print(res)

Output:

['Adult', 'Biomedical Research', 'organization & administration', 'Female', 'Health Care Reform', 'history', 'methods']
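
The loop above splits on the literal "/*". As a hedged variation, the sketch below assumes that some records may carry a subheading without the asterisk (for example "/methods") or a heading that itself starts with "*". Neither case appears in the question's sample, so the second and third lines of `sample` and the helper name `extract_mh_terms` are made up for illustration. It splits on "/" and strips a leading "*" from each piece.

```python
import re

# Hypothetical sample: lines 2 and 3 are assumptions, not taken from the question.
sample = """MH  - Adult
MH  - Health Care Reform/*history/methods
MH  - *Public Health
AB  - MH  - not a heading"""

def extract_mh_terms(text):
    """Collect heading and subheading terms from lines whose tag is MH."""
    terms = []
    # [ \t]* instead of \s* so the pattern cannot run past the end of a line.
    for value in re.findall(r"^MH[ \t]*-[ \t]*(.*)$", text, flags=re.MULTILINE):
        for part in value.split("/"):
            part = part.strip().lstrip("*")  # drop the optional leading asterisk
            if part:
                terms.append(part)
    return terms

result = extract_mh_terms(sample)
print(result)
```

Because the pattern is anchored with ^, the "AB" line is ignored even though it contains "MH  - " further along.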

Answer 2

It looks like you'll need to do a few regexes, since you also want to split on /* for some of the rows. This should do the trick!

import re

my_file = """JT  - American journal of public health
JID - 1254074
SB  - AIM
SB  - IM
MH  - Adult
MH  - Biomedical Research/*organization & administration
MH  - Female
MH  - Health Care Reform/*history/*methods
AB  - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB-  architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR  - 20170220
IS  - 1541-0048 (Electronic)"""

my_list = my_file.splitlines()

new_list = []

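for item in my_list:
    # Only process lines whose tag is MH.
    if re.search(r"^MH\s*-", item):
        # Strip the "MH  - " prefix only, so hyphens inside a term survive.
        item = re.sub(r"^MH\s*-\s*", "", item)
        item = item.split("/*")
        new_list = new_list + item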

print(new_list)

Output:

['Adult', 'Biomedical Research', 'organization & administration', 'Female', 'Health Care Reform', 'history', 'methods']

I'm taking that string and splitting it into a list of lines. I figure there is a good chance you'll already have it as a list of lines once it gets imported. I also like working on one line at a time when doing regexes; it's easier to troubleshoot later.

I'm matching the lines that start with MH and stripping the prefix. I then split each one on /* and put all the pieces together into a single list you can use for your Excel export.
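
None of the answers shows the Excel step itself, so here is a minimal sketch of one way to do it: write the list to a CSV file, which Excel can open. It uses only the standard-library csv module. The helper name `write_terms_csv`, the column header "term", and the file name in the comment are assumptions made for illustration.

```python
import csv
import io

def write_terms_csv(terms, handle):
    """Write one term per row under a single header column."""
    writer = csv.writer(handle)
    writer.writerow(["term"])  # header name is an assumption
    writer.writerows([t] for t in terms)

terms = ['Adult', 'Biomedical Research', 'organization & administration',
         'Female', 'Health Care Reform', 'history', 'methods']

# In-memory demo so the sketch runs without touching the disk.
# For a real file, something like this should work:
#   with open("mh_terms.csv", "w", newline="", encoding="utf-8-sig") as f:
#       write_terms_csv(terms, f)
buffer = io.StringIO()
write_terms_csv(terms, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing newline="" when opening a real file is what the csv module documentation recommends. The utf-8-sig encoding is an optional choice that tends to help Excel detect UTF-8.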

Answer 3

Just posting the code I tried; I only noticed afterwards that a nicer answer had been posted while I was writing it.
Please don't judge. That just happens on SO.

s = """
JT  - American journal of public health
JID - 1254074
MH  - Adult
MH  - Biomedical Research/*organization & administration
MH  - Health Care Reform/*history/*methods
AB  - OBJECTIVES: We assessed whether a 2-phase labeling and choice
"""

import re
import itertools
matches = re.findall(r"^MH[\s-]+(.*)$", s, re.MULTILINE)
splitmatches = [i.split(r"/*") for i in matches]
flattenedmatches = list(itertools.chain(*splitmatches))

print(flattenedmatches)

Output:

['Adult', 'Biomedical Research', 'organization & administration', 'Health Care Reform', 'history', 'methods']

3 answers

Answer 1
2 2018-06-05 14:57:54

Answer 2
2 accepted 2018-06-05 15:01:41

Answer 3
1 2018-06-05 15:47:52