Creating filtered datasets from multiple data frames
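I would like to create filtered datasets based on multiple data frames (the data frames differ from one another because they cover different topics). For each data frame I need to filter rows based on some key words. For example, for the first data frame I only need the rows that contain certain words (e.g. Michael and Andrew); for the second data frame I only need the rows containing the word Laura, and so on.
Example of the original data frames: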
df["0"]
Names Surnames
Michael Connelly
John Smith
Andrew Star
Laura Parker
df["1"]
Names Surnames
Laura Bistro
Lisa Roberts
Luke Gary
Norman Loren
To do this, I wrote the following:
for i in range(0,1): # I have more than 50 data frames, but I am considering only two for this example
    key_words = []
    while True:
        key_word = input("Key word : ")
        if key_word!='0':
            list_key_words.append(key_word)
            dataframe[str(i)].Filter= dataframe[str(i)]..str.contains('|'.join(key_word), case=False, regex=True) # Creates a new column where with boolean values
            dataframe[str(i)].loc[dataframe[str(i)].Filter != False]
            filtered=dataframe[str(i)][dataframe[str(i)]. Filter != False] # Create a dataframe/dataset with only filtered rows
            filtered_surnames=filtered['Names'].tolist() # this should select only the column called Names, existing in each dataframe, just for analysing them
Expected output:
df["0"]
Names Surnames Filter
Michael Connelly 1
John Smith 0
Andrew Star 1
Laura Parker 0
df["1"]
Names Surnames Filter
Laura Bistro 1
Lisa Roberts 0
Luke Gary 0
Norman Loren 0
The filtered datasets should then have 2 rows and 1 row, respectively.
filtered["0"]
Names Surnames Filter
Michael Connelly 1
Andrew Star 1
filtered["1"]
Names Surnames Filter
Laura Bistro 1
However, the filtering lines in my code seem to be wrong. Could you take a look at them and let me know where the mistake is?
# BUG 1: range() includes the first index and excludes the last, so reaching index 1 needs range(0, 2)
for i in range(0,2): # I have more than 50 data frames, but I am considering only two for this example
    list_key_words = [] # reset for each data frame so key words do not carry over
    while True:
        key_word = input("Key word : ")
        if key_word!='0':
            list_key_words.append(key_word)
            # BUG 2.1: you can't apply ".str.contains" to an entire row, you need to indicate the column by name, e.g. "Names".
            # If you want to test all the columns, you need multiple filter columns which you OR at the end
            # BUG 2.2: you can't create a column using ".Filter", use ["Filter"] instead
            # Join every key word entered so far; passing only key_word would keep just the latest match
            dataframe[str(i)]["Filter"]=dataframe[str(i)]["Names"].str.contains('|'.join(list_key_words), case=False, regex=True) # Creates a new column with boolean values
            # BUG 3: this line does nothing (its result is discarded)
            dataframe[str(i)].loc[dataframe[str(i)].Filter != False]
            # BUG 5: you need a way to save these or they will be overwritten each time
            filtered=dataframe[str(i)][dataframe[str(i)].Filter != False] # Create a dataframe/dataset with only filtered rows
            filtered_surnames=filtered['Names'].tolist() # this should select only the column called Names, existing in each dataframe, just for analysing them
        # BUG 6: you need to actually leave the "while True" loop at some point
        else:
            break
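Notes on the fixes are in the code comments. The biggest problem is bug 2.1: you cannot apply a regular expression to all the fields of a row at once. If you want to check every field, create a new filter column for each field and recombine them at the end with boolean logic, e.g. df["Filter 1"] | df["Filter 2"]...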
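The column-by-column approach described above could look roughly like the sketch below. It is an illustration under assumptions rather than the answerer's own code: the frame is rebuilt from the question's df["0"] sample, and the key words, the `pattern` variable and the `in_names` / `in_surnames` names are hypothetical.

```python
import re

import pandas as pd

# Sample data copied from df["0"] in the question.
df = pd.DataFrame({
    "Names": ["Michael", "John", "Andrew", "Laura"],
    "Surnames": ["Connelly", "Smith", "Star", "Parker"],
})

# Hypothetical key words: one matches a first name, one matches a surname.
key_words = ["Michael", "Star"]
# re.escape keeps any regex metacharacters in the key words literal.
pattern = "|".join(re.escape(word) for word in key_words)

# One boolean Series per column, each built with the column-level .str.contains.
in_names = df["Names"].str.contains(pattern, case=False, regex=True)
in_surnames = df["Surnames"].str.contains(pattern, case=False, regex=True)

# Recombine the per-column filters with boolean OR.
df["Filter"] = in_names | in_surnames

# Keep only the rows where at least one column matched.
filtered = df[df["Filter"]]
```

With more columns, the same idea extends by OR-ing one extra Series per column.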
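Avoid writing loops over a data frame wherever possible, because pandas and numpy provide vectorised (faster) methods for many common problems. The solution below pairs the search words with the corresponding data frames, runs the search, and collects the results in a list called coll.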
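import numpy as np

# create one list of search words per data frame
list1 = ['Michael', 'Andrew']
list2 = ['Laura']
coll = []
# pair each data frame with its word list (df1 and df2 are the two data frames from the question)
for df, names in zip([df1, df2], (list1, list2)):
    # 1 where Names contains any of the words, 0 otherwise
    df['Extract'] = np.where(df.Names.str.contains('|'.join(names)), 1, 0)
    coll.append(df)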
coll[0]
Names Surnames Extract
0 Michael Connelly 1
1 John Smith 0
2 Andrew Star 1
3 Laura Parker 0
coll[1]
Names Surnames Extract
0 Laura Bistro 1
1 Lisa Roberts 0
2 Luke Gary 0
3 Norman Loren 0
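The output above flags the matching rows but does not yet split them out into the filtered datasets the question asks for. A minimal sketch of that last step, assuming the flag column is called Extract as in this answer: the frames are rebuilt here from the question's sample data, the dictionary name `filtered` and its string keys mirror the question's expected output, and `word_lists` and `filtered_names` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question's data.
df1 = pd.DataFrame({"Names": ["Michael", "John", "Andrew", "Laura"],
                    "Surnames": ["Connelly", "Smith", "Star", "Parker"]})
df2 = pd.DataFrame({"Names": ["Laura", "Lisa", "Luke", "Norman"],
                    "Surnames": ["Bistro", "Roberts", "Gary", "Loren"]})

# One list of search words per data frame, in the same order as the frames.
word_lists = [["Michael", "Andrew"], ["Laura"]]

filtered = {}
for i, (df, words) in enumerate(zip([df1, df2], word_lists)):
    # Flag rows whose Names value contains any of the words.
    df["Extract"] = np.where(df["Names"].str.contains("|".join(words)), 1, 0)
    # Keep only flagged rows; .copy() makes the result independent of df.
    filtered[str(i)] = df[df["Extract"] == 1].copy()

# Names column of each filtered dataset as a plain list, for later analysis.
filtered_names = {key: frame["Names"].tolist() for key, frame in filtered.items()}
```

Storing each result under its own key avoids the overwriting problem that arises when a single `filtered` variable is reassigned on every pass of the loop.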