This is an example of the text I have:
JT - American journal of public health
JID - 1254074
SB - AIM
SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Female
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB- architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR - 20170220
IS - 1541-0048 (Electronic)
How can I write a regular expression that extracts only the terms from the lines that start with "MH", and then import them into an Excel sheet? The output should look like this:
[Adult, Biomedical Research, organization & administration, Female, Health Care Reform, history, methods].
This is my attempt:
import re
Path = "MH\s*.*"
re.findall(Path,file)
I know this is wrong, but I do not know how to solve it.
Thank you
Using re.findall
Demo:
import re
s = """JT - American journal of public health
JID - 1254074
SB - AIM
SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Female
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB- architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR - 20170220
IS - 1541-0048 (Electronic)"""
res = []
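# "^" with re.MULTILINE anchors the match to the start of each line
for i in re.findall(r"^MH\s+-\s+(.*)", s, flags=re.MULTILINE):
    # Split the heading from its subheadings
    res.extend(i.split("/*"))
print(res)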
Output:
['Adult', 'Biomedical Research', 'organization & administration', 'Female', 'Health Care Reform', 'history', 'methods']
It looks like you'll need to do a few regexes since you also want to split on /* for some of the rows. This should do the trick!
import re
my_file = """JT - American journal of public health
JID - 1254074
SB - AIM
SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Female
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB- architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR - 20170220
IS - 1541-0048 (Electronic)"""
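my_list = my_file.splitlines()
new_list = []
for item in my_list:
    # Only process lines whose tag is MH
    if re.search(r"^MH\s*-", item):
        # Strip only the leading "MH - " tag, so hyphens inside a term survive
        item = re.sub(r"^MH\s*-\s*", "", item)
        # Split the heading from its subheadings
        new_list.extend(item.split("/*"))
print(new_list)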
Output:
['Adult', 'Biomedical Research', 'organization & administration', 'Female', 'Health Care Reform', 'history', 'methods']
I split the string into a list of lines, since there is a good chance you will already have it as a list once the file is read in. I also prefer applying regexes one line at a time; it is easier to troubleshoot later.
I match the lines that start with MH, strip the tag, split each one on /*, and collect everything into a single list that you can use for your Excel export.
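The question also asks about getting the list into Excel, which none of the code above covers. Here is a minimal sketch using only the standard library: it writes the terms to a CSV file, which Excel can open. The function names, the file name mh_terms.csv and the "term" header are made up for the illustration. Writing a native .xlsx file would need a third-party package instead.

```python
import csv
import re


def extract_mh_terms(text):
    """Return the MH terms in order, splitting subheadings on '/*'."""
    terms = []
    # [ \t]* instead of \s* so the pattern never runs over a line break
    for value in re.findall(r"^MH[ \t]*-[ \t]*(.*)$", text, flags=re.MULTILINE):
        terms.extend(part.strip() for part in value.split("/*"))
    return terms


def write_terms_csv(terms, path):
    """Write one term per row to a CSV file that Excel can open."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["term"])  # hypothetical header name
        writer.writerows([term] for term in terms)


sample = """SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
"""

terms = extract_mh_terms(sample)
write_terms_csv(terms, "mh_terms.csv")  # hypothetical output file name
print(terms)
```

One term per row keeps the file easy to filter and sort in Excel. If you would rather have all terms in a single row, pass the whole list to writer.writerow instead.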
I'm just posting the code I tried; while I was writing it, a nicer answer was posted.
Please don't judge. That just happens on SO.
s = """
JT - American journal of public health
JID - 1254074
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
"""
import re
import itertools
matches = re.findall(r"^MH[\s-]+(.*)$", s, re.MULTILINE)
splitmatches = [i.split(r"/*") for i in matches]
flattenedmatches = list(itertools.chain(*splitmatches))
print(flattenedmatches)
Output:
['Adult', 'Biomedical Research', 'organization & administration', 'Health Care Reform', 'history', 'methods']
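All of the answers split on the literal "/*", which works for the sample in the question. The sketch below is a slightly more forgiving variant, assuming (the sample does not show this) that some MH lines put a subheading after a plain "/" with no asterisk, or start with an asterisk. The sample lines and function names are invented for the illustration.

```python
import re


def split_mh_value(value):
    """Split one MH value on '/' or '/*' and drop a leading '*'."""
    parts = re.split(r"/\*?", value.lstrip("*"))
    return [part.strip() for part in parts if part.strip()]


def mh_terms(text):
    """Collect the terms from every line whose tag is MH."""
    terms = []
    for value in re.findall(r"^MH[ \t]*-[ \t]*(.*)$", text, flags=re.MULTILINE):
        terms.extend(split_mh_value(value))
    return terms


# Made-up lines covering the assumed variations
sample = """MH - *Adult
MH - Health Care Reform/history/*methods
MH - Anti-Bacterial Agents/*therapeutic use
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
"""

print(mh_terms(sample))
# ['Adult', 'Health Care Reform', 'history', 'methods', 'Anti-Bacterial Agents', 'therapeutic use']
```

Because the tag is matched only at the start of the line, a hyphen inside a term such as "Anti-Bacterial Agents" is left alone.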