
How to match string of two different dataframe python

How do I match body text across two different dataframes? I wrote the code below in Python, but for some reason every value in the Match column comes out False, even though there is matching text content between dataframe 1 and dataframe 2.

Here is my code:

import pandas as pd
from nltk.tokenize import word_tokenize  # assuming word_tokenize is NLTK's tokenizer

# List of search keywords
search_term = ["Gempa AND #gempa cianjur AND #gempa maluku", 
               "Sambo AND #ferdy sambo AND #brigadir j",
               "Lukas Enembe AND #lukas enembe tersangka AND #gubernur papua",
               "Puan Maharani AND #pdip AND #pilpres2024",
               "Putri Candrawathi AND #LPSK AND #brigadir yosua",
               "Resesi AND #resesi AND #APBD DKI",
               "IKN AND #ikn AND #ibu kota nusantara",
               "Piala AFF 2022 AND #piala aff 2022 AND #pssi",
               "Pemilu 2024 AND #partai politik pemilu 2024",
               "BMKG AND #BMKG",
               "Kripto AND #kripto AND #investasi",
               "Ekonomi AND #ekonomi indonesia AND #jokowi",
               "Elon Musk AND #elon musk",
               "Jokowi AND #Jokowi",
               "Puan AND #puan",
               "Ganjar Pranowo AND #Ganjar Pranowo AND #Pilpres 2024"]

# Calling DataFrame constructor on list
# with indices and columns specified
searc_term_df = pd.DataFrame(search_term,columns =['Search Term'])
searc_term_df['Search Term'] = searc_term_df['Search Term'].str.replace('AND','')
searc_term_df['Search Term'] = searc_term_df['Search Term'].str.replace('#','')

# Tokenize a sentence into a piece of words
def tokenize_data(tweet):
    return word_tokenize(tweet)
searc_term_df['Search Term'] = searc_term_df['Search Term'].apply(tokenize_data)

# Remove brackets from string
searc_term_df['Search Term'] = searc_term_df.astype(str).apply(lambda col:col.str.strip('[]'))
# Remove single quotes from string
searc_term_df['Search Term'] = searc_term_df['Search Term'].str.replace('\'', '')
searc_term_df

The output looks like this:

[image: searc_term_df after preprocessing]
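Since the screenshot is not reproduced here, this is a hedged sketch of what the first processed row should print as, assuming NLTK's word_tokenize splits this text on whitespace as usual:

# Expected output (approximately): Gempa, gempa, cianjur, gempa, maluku
print(searc_term_df['Search Term'].iloc[0])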

I want to match it against dataframe 2, which should produce the dataframe 2 shown below:

[image: expected dataframe 2 with a Match column]

Here is the code I use to match them, but the results I get are all False:

df_all['Match'] = df_all['Text'].isin(searc_term_df['Search Term'])

Here is the incorrect output:

[image: df_all with every value in the Match column equal to False]
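A likely reason for the all-False column: Series.isin tests exact, whole-string equality rather than substring containment, so a full tweet never equals one of the processed search-term strings. A minimal sketch with made-up values:

import pandas as pd

# A full tweet vs. a processed search-term string (hypothetical values)
tweets = pd.Series(["Lorem ipsum gempa sit amet."])
terms = pd.Series(["Gempa, gempa, cianjur, gempa, maluku"])

# Exact-equality check: the tweet is never literally equal to a term -> False
print(tweets.isin(terms))

# Substring check: the tweet does contain the keyword "gempa" -> True
print(tweets.str.contains("gempa", case=False))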

The idea I had to solve this problem uses the bag-of-words approach from scikit-learn with CountVectorizer.

First I created a dataset to simulate your df_all['Text'], adding some of the words that appear in the searc_term_df you defined.

test_text = [
    "Lorem ipsum gempa sit amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum Puan sit amet.",
    "Lorem ipsum Jokowi sit amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum Elon Musk amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum Pranowo sit amet.",
    "Lorem ipsum dolor, 2024 amet.",
]

df_all = pd.DataFrame(data=test_text, columns=["text"])

Then I instantiated the bag-of-words (CountVectorizer()) model.

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# here I removed the commas from the search terms because they were also being
# interpreted as a match, but you can come up with another fancy pre-processing strategy
search_terms = searc_term_df["Search Term"].replace({",": ""}, regex=True)

# fits the model to your text data
vectorizer = CountVectorizer().fit(df_all["text"])

# creates a bag of words (BoW)
bow_text = vectorizer.transform(df_all["text"]).toarray()

# creates a bag of words of the search terms too
bow_search_terms = vectorizer.transform(search_terms).toarray()

# creates a sparse matrix that corresponds to the number of matches
# for each search term. A sum equal to 0 means there is no match.
is_a_match = (np.dot(bow_text, bow_search_terms.transpose()).sum(axis=1) != 0)
df_all["match"] = is_a_match

df_all

Here is the output:

    text                            match
0   Lorem ipsum gempa sit amet.      True
1   Lorem ipsum dolor sit amet.     False
2   Lorem ipsum Puan sit amet.       True
3   Lorem ipsum Jokowi sit amet.     True
4   Lorem ipsum dolor sit amet.     False
5   Lorem ipsum dolor sit amet.     False
6   Lorem ipsum Elon Musk amet.      True
7   Lorem ipsum dolor sit amet.     False
8   Lorem ipsum dolor sit amet.     False
9   Lorem ipsum Pranowo sit amet.    True
10  Lorem ipsum dolor, 2024 amet.    True
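As a small follow-up sketch reusing the variables defined above (not part of the original answer), the same dot product can also report how many vocabulary words each tweet shares with the search terms, which helps when inspecting borderline rows:

# Number of overlapping vocabulary words per tweet; 0 means no match at all
match_counts = np.dot(bow_text, bow_search_terms.transpose()).sum(axis=1)
df_all["match_count"] = match_counts
print(df_all[["text", "match", "match_count"]])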

