Find a pattern using a regular expression
This is my test sample:
JT - American journal of public health
JID - 1254074
SB - AIM
SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Female
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB- architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR - 20170220
IS - 1541-0048 (Electronic)
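How do I write a regular expression that picks out only the vocabulary terms that follow "MH" on every line starting with "MH", so that I can then import them into an Excel worksheet? The output should look like this: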
[Adult, Biomedical Research, organization & administration, Female, Health Care Reform, history, methods].
Here is my attempt:
import re
Path = "MH\s*.*"
re.findall(Path,file)
I know this is wrong, but I don't know how to fix it.
Thanks
Use re.findall
Demo:
import re
s = """JT - American journal of public health
JID - 1254074
SB - AIM
SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Female
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB- architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR - 20170220
IS - 1541-0048 (Electronic)"""
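res = []
# Capture everything after "MH - " on each line that starts with MH
for i in re.findall(r"^MH\s+-\s+(.*)", s, flags=re.MULTILINE):
    # Split "Heading/*sub1/*sub2" into separate terms
    res.extend(i.split("/*"))
print(res)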
Output:
['Adult', 'Biomedical Research', 'organization & administration', 'Female', 'Health Care Reform', 'history', 'methods']
It looks like you will need a bit of regex here, since you also want to split some of the lines on /*. This should do the trick!
import re
my_file = """JT - American journal of public health
JID - 1254074
SB - AIM
SB - IM
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Female
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
AB- architecture intervention
OWN - NLM
STAT- MEDLINE
DCOM- 20120417
LR - 20170220
IS - 1541-0048 (Electronic)"""
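my_list = my_file.splitlines()
new_list = []
for item in my_list:
    # Keep only the lines that start with the MH tag
    if re.search(r"^MH\s*-", item):
        # Strip only the leading "MH - " prefix, so hyphens inside a term survive
        item = re.sub(r"^MH\s*-\s*", "", item)
        # Split "Heading/*sub1/*sub2" into separate terms
        new_list.extend(item.split("/*"))
print(new_list)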
Output:
['Adult', 'Biomedical Research', 'organization & administration', 'Female', 'Health Care Reform', 'history', 'methods']
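I take the string and split it into a list of lines. I think there is a good chance the data will already be a list of lines by the time you import it. I also prefer to have each regex work on one line at a time, simply because that makes troubleshooting easier later.
I match the items that start with MH and capture them. Then I split each item on /* and put all the pieces into one flat list that can be used for the Excel export.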
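As a rough sketch of the "already a list of lines" case mentioned above, the same per-line logic can be wrapped in a function that accepts any iterable of lines, including an open file. The names `extract_mh_terms` and `MH_LINE` are made up for illustration, and `io.StringIO` merely stands in for a real file.

```python
import io
import re

# Hypothetical helper; the pattern mirrors the "^MH\s*-" check used above.
MH_LINE = re.compile(r"^MH\s*-\s*(.*)$")

def extract_mh_terms(lines):
    """Collect MH terms from any iterable of lines (a list, an open file, ...)."""
    terms = []
    for line in lines:
        # Lines read from a file keep their trailing newline, so drop it first.
        match = MH_LINE.match(line.rstrip("\n"))
        if match:
            # Split "Heading/*sub1/*sub2" into its individual terms.
            terms.extend(match.group(1).split("/*"))
    return terms

# io.StringIO stands in for an open file in this example.
sample = io.StringIO(
    "SB  - IM\n"
    "MH  - Adult\n"
    "MH  - Health Care Reform/*history/*methods\n"
    "AB  - OBJECTIVES: We assessed whether a 2-phase labeling and choice\n"
)
print(extract_mh_terms(sample))
```

Because the function only iterates, it should behave the same whether it is handed `my_file.splitlines()` or a file object directly.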
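Just posting the code I tried; I only noticed afterwards that better answers had already been posted while I was writing it.
Please don't judge. That just happens on SO.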
s = """
JT - American journal of public health
JID - 1254074
MH - Adult
MH - Biomedical Research/*organization & administration
MH - Health Care Reform/*history/*methods
AB - OBJECTIVES: We assessed whether a 2-phase labeling and choice
"""
import re
import itertools
matches = re.findall(r"^MH[\s-]+(.*)$", s, re.MULTILINE)
splitmatches = [i.split(r"/*") for i in matches]
flattenedmatches = list(itertools.chain(*splitmatches))
print(flattenedmatches)
Output:
['Adult', 'Biomedical Research', 'organization & administration', 'Health Care Reform', 'history', 'methods']
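None of the snippets here cover the Excel step from the question. One possible route, shown only as a sketch, is to write the flattened list to a CSV file, which Excel can open. The function name `terms_to_csv`, the "Term" header, and the file name in the comment are all assumptions, not part of the original question.

```python
import csv
import io

def terms_to_csv(terms, handle):
    """Write one term per row to an already-open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["Term"])  # header row; the column name is an assumption
    for term in terms:
        writer.writerow([term])

terms = ['Adult', 'Biomedical Research', 'organization & administration']

# Demonstrated on an in-memory buffer. For a real file, a handle such as
# open("mh_terms.csv", "w", newline="", encoding="utf-8") could be passed instead
# (the file name is hypothetical).
buffer = io.StringIO()
terms_to_csv(terms, buffer)
print(buffer.getvalue())
```

Writing one term per row keeps each term in its own cell of the first column; a native .xlsx file would need a third-party library, which this sketch deliberately avoids.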