Cast topic modeling outcome to dataframe
I am using BERTopic and KeyBERT to extract some topics from some docs:
from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto", verbose=True, n_gram_range=(1, 4), calculate_probabilities=True, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size= 3)
topics, probs = topic_model.fit_transform(docs)
Now I can access the topic names:
freq = topic_model.get_topic_info()
print("Number of topics: {}".format( len(freq)))
freq.head(30)
   Topic  Count                              Name
0     -1      1  -1_default_greenbone_gmp_manager
1      0     14            0_http_tls_ssl tls_ssl
2      1      8   1_jboss_console_web_application
and inspect the topics:
[('http', 0.0855701486234524),
('tls', 0.061977919455444744),
('ssl tls', 0.061977919455444744),
('ssl', 0.061977919455444744),
('tcp', 0.04551718585531556),
('number', 0.04551718585531556)]
[('jboss', 0.14014705432060262),
('console', 0.09285308122803233),
('web', 0.07323749337563096),
('application', 0.0622930523123512),
('management', 0.0622930523123512),
('apache', 0.05032395169459188)]
What I want is a final dataframe in which one column is the topic name and another column holds the elements of the topic.
expected outcome:
                           class             entities
0           http_tls_ssl tls_ssl           HTTP...etc
1  jboss_console_web_application  JBoss, console, etc
and a second dataframe with one column per topic name:
  http_tls_ssl tls_ssl  jboss_console_web_application
0                 http                          JBoss
1                  tls                        console
2                  etc                            etc
I don't know how to do this. Is there a way?
Here is one way to do it:
import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs[:1_000])
df = topic_model.get_topic_info()
print(df)
# Output
Topic Count Name
0 0 875 0_the_to_of_and
1 1 93 1_the_to_and_in
2 2 32 2_testing_deletion_hello_was
Using Pandas string methods:
df = df.rename(columns={"Name": "class"}).drop(columns=["Topic", "Count"])
df["class"] = df["class"].str.replace(r"^-?\d+_", "", regex=True)  # remove the leading topic id ('-1_', '0_', '12_', ...)
df["entities"] = df["class"].str.split("_")
print(df)
# Output
class entities
0 the_to_of_and [the, to, of, and]
1 the_to_and_in [the, to, and, in]
2 testing_deletion_hello_was [testing, deletion, hello, was]
Using Pandas transpose:
other_df = df.T.reset_index(drop=True)
new_col_labels = other_df.iloc[0] # save first row
other_df = other_df[1:] # remove first row
other_df.columns = new_col_labels
other_df = pd.DataFrame({col: other_df.loc[1, col] for col in other_df.columns})
print(other_df)
# Output
the_to_of_and the_to_and_in testing_deletion_hello_was
0 the the testing
1 to to deletion
2 of and hello
3 and in was
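As a possible alternative to the transpose step, both frames can also be built directly from the (word, score) lists shown in the question. This is only a sketch: the `topic_words` dictionary below is a hand-written stand-in (scores rounded, lists shortened) for what you would collect yourself from `topic_model.get_topic(...)`, and the names `topic_words`, `words_only`, `long_df` and `wide_df` are illustrative, not part of BERTopic.

```python
import pandas as pd

# Hypothetical stand-in for {topic_name: topic_model.get_topic(topic_id)};
# words and (rounded) scores are taken from the output shown in the question.
topic_words = {
    "http_tls_ssl tls_ssl": [
        ("http", 0.0856), ("tls", 0.0620), ("ssl tls", 0.0620), ("ssl", 0.0620),
    ],
    "jboss_console_web_application": [
        ("jboss", 0.1401), ("console", 0.0929), ("web", 0.0732),
    ],
}

# Keep only the words and drop the scores.
words_only = {
    name: [word for word, _score in pairs] for name, pairs in topic_words.items()
}

# First frame: one row per topic, with the topic's words stored as a list.
long_df = pd.DataFrame(
    {"class": list(words_only), "entities": list(words_only.values())}
)

# Second frame: one column per topic. Wrapping each list in a Series lets
# pandas pad topics that have fewer words with NaN instead of raising.
wide_df = pd.DataFrame({name: pd.Series(words) for name, words in words_only.items()})

print(long_df)
print(wide_df)
```

Wrapping each word list in `pd.Series` is the main design choice here: a plain dict of lists only works when every topic has the same number of words, whereas Series are aligned on their index and shorter topics are filled with NaN.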