Creating filtered datasets from multiple data frames

Question

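I want to create filtered datasets based on multiple data frames (the data frames differ from one another because they cover different topics). For each data frame I need to filter rows based on some keywords. For example, for the first data frame I only need the rows that contain certain words (e.g. Michael and Andrew); for the second data frame I only need the rows that contain the word Laura, and so on.

Example of the original data frames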

df["0"]

Names Surnames
Michael Connelly
John    Smith
Andrew   Star
Laura   Parker

df["1"]

Names Surnames
Laura  Bistro
Lisa    Roberts
Luke    Gary
Norman  Loren

To do this, I wrote the following:

for i in range(0,1): # I have more than 50 data frames, but I am considering only two for this example
    key_words = [] 

    while True:
        key_word = input("Key word : ")

        if key_word!='0':
            list_key_words.append(key_word)
            dataframe[str(i)].Filter= dataframe[str(i)]..str.contains('|'.join(key_word), case=False, regex=True) # Creates a new column where with boolean values
            dataframe[str(i)].loc[dataframe[str(i)].Filter != False]

            filtered=dataframe[str(i)][dataframe[str(i)]. Filter != False] # Create a dataframe/dataset with only filtered rows
            filtered_surnames=filtered['Names'].tolist() # this should select only the column called Names, existing in each dataframe, just for analysing them

Expected output:

df["0"]

Names Surnames  Filter
Michael Connelly 1
John    Smith    0
Andrew   Star    1
Laura   Parker   0

df["1"]

Names Surnames   Filter
Laura  Bistro     1
Lisa    Roberts   0
Luke    Gary      0
Norman  Loren     0

The filtered datasets should then have 2 rows and 1 row respectively.

filtered["0"]

Names Surnames  Filter
Michael Connelly 1
Andrew   Star    1


filtered["1"]

Names Surnames   Filter
Laura  Bistro     1

However, the lines in my code that do the filtering seem to be wrong. Could you take a look at them and let me know where the error is?

Answer 1

# Assumes `dataframe` is a dict of pandas DataFrames keyed by "0", "1", ...
filtered = {}        # filtered rows, one entry per data frame
filtered_names = {}  # "Names" values of the filtered rows, one entry per data frame

# BUG 1: range() includes the first index and excludes the last; to reach index 1 you need range(0, 2)
for i in range(0, 2): # I have more than 50 data frames, but I am considering only two for this example
    list_key_words = []  # start with an empty keyword list for each data frame

    while True:
        key_word = input("Key word : ")

        if key_word != '0':
            list_key_words.append(key_word)

            # BUG 2.1: you can't apply ".str.contains" to an entire row, you need to indicate the column by name, e.g. "Names".
            # If you want to test all the columns, you need multiple filter columns which you OR at the end
            # BUG 2.2: You can't create a column using ".Filter", it needs to be ["Filter"]
            # Join the collected keywords (not the characters of the last one) so that any of them matches
            dataframe[str(i)]["Filter"] = dataframe[str(i)]["Names"].str.contains('|'.join(list_key_words), case=False, regex=True) # Creates a new column with boolean values

            # BUG 3: this line does nothing, because its result is never assigned
            dataframe[str(i)].loc[dataframe[str(i)]["Filter"] != False]

            # BUG 4: You need a way to save these or they will be overwritten each time
            filtered[str(i)] = dataframe[str(i)][dataframe[str(i)]["Filter"]] # Keep only the filtered rows
            filtered_names[str(i)] = filtered[str(i)]['Names'].tolist() # Only the column called Names, which exists in each data frame

        # BUG 5: you need to actually leave the "while True" loop at some point
        else:
            break

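Notes on the fixes are in the code comments. The biggest problem is bug 2.1: you cannot apply a regular expression to all fields of a row at once. If you want to check every field, create a new filter column for each field and combine them at the end with boolean logic, e.g. df["Filter 1"] | df["Filter 2"] ...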
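To illustrate the last point, here is a minimal sketch of OR-ing per-column filters. The sample data mirrors the first data frame from the question; the pattern "Michael|Star" and the variable names `in_names` and `in_surnames` are only assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's first data frame
df = pd.DataFrame({
    "Names": ["Michael", "John", "Andrew", "Laura"],
    "Surnames": ["Connelly", "Smith", "Star", "Parker"],
})

# Illustrative keywords joined with the regex alternation operator
pattern = "Michael|Star"

# One boolean Series per column, then combine them element-wise with |
in_names = df["Names"].str.contains(pattern, case=False, regex=True)
in_surnames = df["Surnames"].str.contains(pattern, case=False, regex=True)
df["Filter"] = in_names | in_surnames

# Keep only the rows where at least one column matched
filtered = df[df["Filter"]]
print(filtered)
```

A row is kept as soon as any one of the checked columns matches, which is what "checking all fields" amounts to here.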

Answer 2

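Avoid writing loops over a dataframe wherever possible, because pandas and numpy provide vectorized (faster) methods for many common cases. The solution below pairs the search words with the corresponding data frames, runs the search, and collects the results in a list called coll.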

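import numpy as np

# Create the list of words needed for each data frame
list1 = ['Michael', 'Andrew']
list2 = ['Laura']

coll = []
# Pair each data frame with its word list (df1 and df2 are the two example data frames)
for df, names in zip([df1, df2], (list1, list2)):
    # 1 where "Names" contains any of the words, 0 otherwise
    df['Extract'] = np.where(df['Names'].str.contains('|'.join(names)), 1, 0)
    coll.append(df)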

coll[0]

   Names    Surnames    Extract
0   Michael Connelly    1
1   John    Smith       0
2   Andrew  Star        1
3   Laura   Parker      0

coll[1]

   Names    Surnames    Extract
0   Laura   Bistro        1
1   Lisa    Roberts       0
2   Luke    Gary          0
3   Norman  Loren         0
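The question also asks for the filtered subsets themselves, which the flag column alone does not give. Here is one possible sketch, assuming the data frames and keyword lists are kept in dicts with matching keys; the names `frames`, `keywords` and `filtered` are hypothetical. `re.escape` is used so that keywords containing regex metacharacters are matched literally.

```python
import re
import pandas as pd

# Hypothetical sample data mirroring the question's two data frames
frames = {
    "0": pd.DataFrame({"Names": ["Michael", "John", "Andrew", "Laura"],
                       "Surnames": ["Connelly", "Smith", "Star", "Parker"]}),
    "1": pd.DataFrame({"Names": ["Laura", "Lisa", "Luke", "Norman"],
                       "Surnames": ["Bistro", "Roberts", "Gary", "Loren"]}),
}

# Keywords per data frame, keyed the same way as `frames`
keywords = {"0": ["Michael", "Andrew"], "1": ["Laura"]}

filtered = {}
for key, df in frames.items():
    # Escape each keyword, then join them with the regex alternation operator
    pattern = "|".join(re.escape(word) for word in keywords[key])
    mask = df["Names"].str.contains(pattern, case=False, regex=True)
    df["Filter"] = mask.astype(int)   # 1/0 flag column, as in the expected output
    filtered[key] = df[mask]          # only the matching rows

print(filtered["0"])
print(filtered["1"])
```

Keeping the results in a dict keyed like the inputs avoids overwriting one data frame's result with the next, and scales to the 50+ data frames mentioned in the question.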

Answer 1
0 accepted 2020-05-10 22:13:51

Answer 2
0 2020-05-11 00:16:27
