
Cast topic modeling outcome to dataframe

I have used BERTopic with KeyBERT to extract some topics from some docs:

from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto", verbose=True, n_gram_range=(1, 4), calculate_probabilities=True, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size=3)
topics, probs = topic_model.fit_transform(docs)

Now I can access the topic names:

freq = topic_model.get_topic_info()
print(f"Number of topics: {len(freq)}")
freq.head(30)

   Topic    Count   Name
0   -1       1     -1_default_greenbone_gmp_manager
1    0      14      0_http_tls_ssl tls_ssl
2    1      8       1_jboss_console_web_application

and inspect the topics:

[('http', 0.0855701486234524),          
 ('tls', 0.061977919455444744),
 ('ssl tls', 0.061977919455444744),
 ('ssl', 0.061977919455444744),
 ('tcp', 0.04551718585531556),
 ('number', 0.04551718585531556)]

[('jboss', 0.14014705432060262),
 ('console', 0.09285308122803233),
 ('web', 0.07323749337563096),
 ('application', 0.0622930523123512),
 ('management', 0.0622930523123512),
 ('apache', 0.05032395169459188)]

What I want is a final dataframe that has the topic name in one column and the elements of the topic in another column.

expected outcome:

  class                         entities
0 http_tls_ssl tls_ssl           HTTP...etc
1 jboss_console_web_application  JBoss, console, etc

and one dataframe with the topic names as separate columns:

  http_tls_ssl tls_ssl           jboss_console_web_application
0 http                           JBoss
1 tls                            console
2 etc                            etc

I could not find out how to do this. Is there a way?

Here is one way to do it:

Setup

import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs[:1_000])

df = topic_model.get_topic_info()
print(df)
# Output
   Topic  Count                          Name
0      0    875               0_the_to_of_and
1      1     93               1_the_to_and_in
2      2     32  2_testing_deletion_hello_was

First dataframe

Using Pandas string methods:

df = df.rename(columns={"Name": "class"}).drop(columns=["Topic", "Count"])
df["class"] = df["class"].str.replace(r"^-?\d+_", "", regex=True)  # strip the topic-id prefix, e.g. '-1_' or '12_'
df["entities"] = df["class"].str.split("_")

print(df)
# Output
                        class                         entities
0               the_to_of_and               [the, to, of, and]
1               the_to_and_in               [the, to, and, in]
2  testing_deletion_hello_was  [testing, deletion, hello, was]
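
The expected outcome in the question shows the entities as comma-separated text rather than as a Python list. A minimal sketch of that variant, assuming the same class names as in the output above (they are hard-coded here in a frame called demo, an illustrative name, so that the snippet runs on its own), joins each list with Series.str.join:

```python
import pandas as pd

# Hard-coded stand-in for the frame built above (values copied from its output).
demo = pd.DataFrame(
    {"class": ["the_to_of_and", "the_to_and_in", "testing_deletion_hello_was"]}
)

# Split each name into its words, then join the words into one string per row.
demo["entities"] = demo["class"].str.split("_").str.join(", ")

print(demo)
```

Whether a list or a string is more convenient depends on what is done with the column afterwards; the list form is easier to expand into columns later.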

Second dataframe

Using Pandas transpose:

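other_df = df.T.reset_index(drop=True)  # row 0: class names, row 1: entity lists
new_col_labels = other_df.iloc[0]  # save first row (the class names)
other_df = other_df[1:]  # remove first row, keeping only the entity lists
other_df.columns = new_col_labels
# expand each list into its own column (all lists must have the same length)
other_df = pd.DataFrame({col: other_df.loc[1, col] for col in other_df.columns})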

print(other_df)
# Output
  the_to_of_and the_to_and_in testing_deletion_hello_was
0           the           the                    testing
1            to            to                   deletion
2            of           and                      hello
3           and            in                        was
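
In the outputs above, the Name column carries only four words per topic, and splitting it on "_" would also break any word that itself contains an underscore. As a hedged alternative, the word lists can be taken from the (word, score) tuples shown in the question instead. The sketch below hard-codes those tuples (scores rounded) in a dictionary called topic_words, and the topic names in a dictionary called names; both are illustrative names, not part of BERTopic. With a fitted model, topic_model.get_topics() should return a dictionary of the same shape, keyed by topic id.

```python
import pandas as pd

# Hypothetical stand-in for topic_model.get_topics(): topic id -> [(word, score), ...].
# Words and (rounded) scores are copied from the output shown in the question.
topic_words = {
    0: [("http", 0.0856), ("tls", 0.0620), ("ssl tls", 0.0620),
        ("ssl", 0.0620), ("tcp", 0.0455), ("number", 0.0455)],
    1: [("jboss", 0.1401), ("console", 0.0929), ("web", 0.0732),
        ("application", 0.0623), ("management", 0.0623), ("apache", 0.0503)],
}

# Topic names from the question's get_topic_info() output, without the id prefix.
names = {0: "http_tls_ssl tls_ssl", 1: "jboss_console_web_application"}

# Keep only the words, dropping the scores.
words = {names[t]: [w for w, _ in pairs] for t, pairs in topic_words.items()}

# First frame: one row per topic, entities as comma-separated text.
long_df = pd.DataFrame(
    {"class": list(words), "entities": [", ".join(ws) for ws in words.values()]}
)

# Second frame: one column per topic; wrapping each list in pd.Series
# pads shorter lists with NaN instead of raising on unequal lengths.
wide_df = pd.DataFrame({name: pd.Series(ws) for name, ws in words.items()})

print(long_df)
print(wide_df)
```

Building the wide frame from pd.Series objects is a design choice that tolerates topics with different numbers of words, which the dictionary-of-lists construction does not.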
