How to extract data from a dataset using regular expressions in Python?

Question

I have a dataset, and I want to extract certain features from it.

در
همین
حال
،
<coref coref_coref_class="set_0" coref_mentiontype="ne" markable_scheme="coref" coref_coreftype="ident">
نجیب
 الله
خواجه
عمری
 ,
 </coref>
<coref coref_coref_class="set_0" coref_mentiontype="np" markable_scheme="coref" coref_coreftype="atr">
سرپرست
وزارت
تحصیلات
عالی
افغانستان
</coref>
گفت
که
در
سه
ماه
گذشته
در
۳۳
ولایت
کشور
<coref coref_coreftype="ident" coref_coref_class="empty" coref_mentiontype="ne" markable_scheme="coref">
خدمات
ملکی
</coref>
از
حدود
۱۴۹
هزار

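I want to store the data from this dataset in two lists. In the find_atr list, I store the content of every coref tag that contains coref_coreftype="atr". In the find_ident list, I want to store the content of the tags with coref_coreftype="ident". However, the last coref tag in this dataset also has coref_coreftype="ident" but carries coref_coref_class="empty", and I do not want to store anything from tags with coref_coref_class="empty". My regular expression says the tag should contain coref_coref_class="set_.*?" and not coref_coref_class="empty", yet it still stores the data from the coref_coref_class="empty" tag, when it should only store data from tags with coref_coref_class="set_.*?".

How can I avoid this?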

import re

# read_dataset is assumed to hold the entire dataset file as a single string.
i_ident = []
j_atr = []
find_ident = re.findall(r'<coref.*?coref_coref_class="set_.*?coref_mentiontype="ne".*?coref_coreftype="ident".*?>(.*?)</coref>', read_dataset, re.S)
ident_list = list(map(lambda x: x.replace('\n', ' '), find_ident))
for i in range(len(ident_list)):
    i_ident.append(str(ident_list[i]))

find_atr = re.findall(r'<coref.*?coref_coreftype="atr".*?>(.*?)</coref>', read_dataset, re.S)
atr_list = list(map(lambda x: x.replace('\n', ' '), find_atr))
#print(coref_list)
for i in range(len(atr_list)):
    j_atr.append(str(atr_list[i]))

print(i_ident)
print()
print(j_atr)

Answer 1

I reduced your dataset file to:

A
<coref coref_coref_class="set_0" coref_mentiontype="ne" markable_scheme="coref" coref_coreftype="ident">
B
</coref>
<coref coref_coref_class="set_0" coref_mentiontype="np" markable_scheme="coref" coref_coreftype="atr">
C
</coref>
D
<coref coref_coreftype="ident" coref_coref_class="empty" coref_mentiontype="ne" markable_scheme="coref">
E
</coref>
F

and tried this code, which is almost identical to the code you provided:

import re

with open("test_dataset.log", "r") as myfile:
    read_dataset = myfile.read()

i_ident = []
j_atr = []
find_ident = re.findall(r'<coref.*?coref_coref_class="set_.*?coref_mentiontype="ne".*?coref_coreftype="ident".*?>(.*?)</coref>', read_dataset, re.S)
ident_list = list(map(lambda x: x.replace('\n', ' '), find_ident))
for i in range(len(ident_list)):
    i_ident.append(str(ident_list[i]))

find_atr = re.findall(r'<coref.*?coref_coreftype="atr".*?>(.*?)</coref>', read_dataset, re.S)
atr_list = list(map(lambda x: x.replace('\n', ' '), find_atr))
#print(coref_list)
for i in range(len(atr_list)):
    j_atr.append(str(atr_list[i]))

print(i_ident)
print()
print(j_atr)

and got the following output, which seems correct to me:

[' B ']

[' C ']
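
One possible reason the full dataset behaves differently from this reduced file: with re.S, `.*?` is allowed to run past the `>` that closes a tag, so a match can start inside one `<coref ...>` tag and pick up its remaining attributes from a later tag. The sketch below is not the code from the question. It keeps every attribute check inside a single opening tag by using lookaheads restricted to `[^>]*`, which also makes the attribute order irrelevant. It assumes that attribute values are always double-quoted and that coref elements are not nested. The helper names coref_pattern and extract, and the two-element `leaky` input, are made up for illustration. Unlike the code above, it collapses all whitespace, so it yields 'B' rather than ' B '.

```python
import re


def coref_pattern(**attrs):
    """Build a regex for one <coref ...>...</coref> element.

    Each keyword maps an attribute name to a regex for its value. Every
    requirement is a lookahead limited to [^>]*, so it can only be satisfied
    inside the current opening tag, in any attribute order.
    """
    lookaheads = "".join(
        rf'(?=[^>]*\s{name}="{value}")' for name, value in attrs.items()
    )
    return re.compile(rf"<coref\b{lookaheads}[^>]*>(.*?)</coref>", re.S)


def extract(pattern, text):
    """Return the body of every matching element with whitespace collapsed."""
    return [" ".join(body.split()) for body in pattern.findall(text)]


IDENT = coref_pattern(
    coref_coref_class=r'set_[^"]*',
    coref_mentiontype="ne",
    coref_coreftype="ident",
)
ATR = coref_pattern(coref_coreftype="atr")

# The reduced dataset from this answer.
sample = """A
<coref coref_coref_class="set_0" coref_mentiontype="ne" markable_scheme="coref" coref_coreftype="ident">
B
</coref>
<coref coref_coref_class="set_0" coref_mentiontype="np" markable_scheme="coref" coref_coreftype="atr">
C
</coref>
D
<coref coref_coreftype="ident" coref_coref_class="empty" coref_mentiontype="ne" markable_scheme="coref">
E
</coref>
F
"""

i_ident = extract(IDENT, sample)
j_atr = extract(ATR, sample)
print(i_ident)  # ['B']
print(j_atr)    # ['C']

# Hypothetical input where the pattern from the question leaks across tags:
# it starts in the "set_1" tag and finishes in the "empty" tag.
leaky = (
    '<coref coref_coref_class="set_1" coref_mentiontype="np" coref_coreftype="atr">X</coref>'
    '<coref coref_coref_class="empty" coref_mentiontype="ne" coref_coreftype="ident">Y</coref>'
)
original = re.compile(
    r'<coref.*?coref_coref_class="set_.*?coref_mentiontype="ne"'
    r'.*?coref_coreftype="ident".*?>(.*?)</coref>',
    re.S,
)
print(original.findall(leaky))  # ['Y'], the body of the "empty" element
print(extract(IDENT, leaky))    # []
```

If the real file contains nested coref elements or unquoted attributes, a regular expression of this kind will likely not be enough and a proper markup parser would be the safer choice.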

Solution 1
0 Accepted 2018-10-11 10:12:47
