How do I find if a string is in a list in a specific column of a dataframe?

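I have 2 large dataframes that I want to compare against each other. I ran .split(" ") on one of the columns and put the result in a new column of the dataframe. I now want to check whether a value exists in that new column, instead of using .contains() on the original column, so that I don't pick up the value when it only appears inside a longer word.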

Here is what I have tried and why I am frustrated.

row['company'][i] == 'nom'

L_df['Name split'][7126853] == "['nom', '[this', 'is', 'nom]']"

row['company'][i] in L_df['Name split'][7126853] == True   (this is the index where I know the specific value occurs)

row['company'][i] in L_df['Name split'] #WHAAT? == False (my attempt to check the entire column); why is this false when I've shown it exists?

L_df[L_df['Name split'].isin([row['company'][i]])] == [empty]

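Edit: I should also add that I am trying to build a process in which I can iteratively check entries in the smaller dataset against entries in the larger one.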

result = L_df[ #The [9] is a placeholder for our iterable 'i' that will go row by row
    L_df['Company name'].str.contains(row['company'][i], na=False) #Can be difficult with names like 'Nom'
    #(row['company'][i] in L_df['Name split'])
    & L_df['Industry'].str.contains('marketing', na=False) #Unreliable currently, need to get looser matches; min. reduction
    & L_df['Locality'].str.contains(row['city'][i], na=False)  #Reliable, but not always great at reducing results
    & ((row['workers'][i] >= L_df['Emp Lower bound']) & (row['workers'][i] <= L_df['Emp Upper bound'])) #Unreliable
]

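The first line of the filter is what I am trying to replace with this new process, so that I don't get a match when "nom" appears in the middle of a word.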
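One possible way to keep the str.contains filter while avoiding mid-word hits is a word-boundary regex. The following is only a minimal sketch: the sample rows and the variable names needle, pattern and mask are made up for illustration, and matching is case-sensitive as written.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the large dataframe L_df
L_df = pd.DataFrame({'Company name': ['nom', 'Economy Inc', '[this is nom]', None]})
needle = 'nom'  # stands in for row['company'][i]

# \b matches only at word boundaries; re.escape guards against regex
# metacharacters that may appear in a company name
pattern = r'\b' + re.escape(needle) + r'\b'
mask = L_df['Company name'].str.contains(pattern, regex=True, na=False)

print(L_df[mask])
```

Unlike an exact comparison against tokens produced by .split(" "), this also matches a token such as 'nom]', because the bracket counts as a word boundary.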

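Here is a solution that first concatenates the two dataframes into one and then uses a lambda to process the columns of interest. The result is placed in a new column, found.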

import pandas as pd

df1 = pd.DataFrame(data={'company': ['findme', 'asdf']})
df2 = pd.DataFrame(data={'Name split': ["here is a string including findme and then some".split(" "), "something here".split(" ")]})
# Put the two frames side by side; rows are aligned on the index
combined_df = pd.concat([df1, df2], axis=1)
# True where the company string is one of the tokens in the same row's list
combined_df['found'] = combined_df.apply(lambda row: row['company'] in row['Name split'], axis=1)

Result:

  company                                         Name split  found
0  findme  [here, is, a, string, including, findme, and, ...   True
1    asdf                                  [something, here]  False
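The row-wise check works because `in` is applied to each list. Applied to a whole Series, as in the question's fourth attempt, `in` tests the index labels rather than the values, which would explain the False. A small sketch with made-up data (the Series s and its index labels are illustrative only):

```python
import pandas as pd

# Toy Series of token lists with arbitrary index labels
s = pd.Series([['nom', 'this', 'is'], ['something', 'else']], index=[10, 20])

in_series = 'nom' in s      # tests the index labels (10, 20), not the values
label_in_series = 10 in s   # True, because 10 is an index label
in_cell = 'nom' in s[10]    # ordinary list membership on a single cell

# Element-wise membership over the whole column
has_nom = s.apply(lambda words: 'nom' in words)
print(in_series, label_in_series, in_cell)
print(has_nom)
```

Note that the cell shown in the question looks like a string ("['nom', ...]") rather than a list. If the column really holds strings, `in` on a cell is a substring test and would still match inside longer words.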

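Edit: To compare every value in the company column against every cell of the Name split column in the other dataframe, and to access the whole row from the latter dataframe, I would simply iterate over the rows of each dataframe, see here: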

import pandas as pd

df1 = pd.DataFrame(data={'company': ['findme', 'asdf']})
df2 = pd.DataFrame(data={'Name split': ["random text".split(" "), "here is a string including findme and then some".split(" "), "somethingasdfq here".split(" ")], 'another column': [3, 1, 2]})
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1['company'] in row2['Name split']:
            # do something here with row2
            print(row2)

This is probably not very efficient, but if we only need one match it can be improved by breaking out of the inner loop as soon as a match is found.
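For large dataframes, the nested loop can probably be avoided by exploding the token lists into one row per token and joining on equality. This is a sketch of an alternative, not the approach given above; the names tokens, matches and result are illustrative, and it assumes a pandas version that provides Series.explode (0.25 or later).

```python
import pandas as pd

df1 = pd.DataFrame({'company': ['findme', 'asdf']})
df2 = pd.DataFrame({
    'Name split': [
        'random text'.split(' '),
        'here is a string including findme and then some'.split(' '),
        'somethingasdfq here'.split(' '),
    ],
    'another column': [3, 1, 2],
})

# One row per token; the original df2 row label is kept in the 'index' column
tokens = df2['Name split'].explode().rename('token').reset_index()

# Inner join keeps only tokens that are exactly equal to a company name
matches = tokens.merge(df1, left_on='token', right_on='company')

# Full rows of df2 that contain at least one matching token
result = df2.loc[matches['index'].unique()]
print(result)
```

Because the join compares whole tokens, 'asdf' does not match 'somethingasdfq', which is the behaviour the question asks for.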

