
Wanting to get_dummies for the most frequent values in a column - Pandas

I am working on a program to go through tweets and predict whether the author falls into one of two categories. I want to get_dummies for whether or not a tweet contains any of the top 10 hashtags or if it contains 'other'. (In the end I will probably be using the top 500 or so hashtags, not just 10; the data set is over 500,000 columns in total with over 50,000 unique hashtags.)

This is my first time using pandas, so apologies if my question is unclear. I think what I'm expecting is that each row in the data set would be given a set of new columns, one for each hashtag, and the value of that [row][column] pair would be 1 if the row contains that hashtag or 0 if it does not. There would also be an 'other' column indicating that the row has hashtags not in the top 10.

I already know how to determine the most frequently occurring values in the column:

counts = df.hashtags.value_counts()
counts.nlargest(10)

I also understand how to get dummies; I just don't know how to restrict it so that it does not make a column for every hashtag.

dummies = pd.get_dummies(df, columns=['hashtags'])

Please let me know if I could be clearer or provide more info. Appreciate the help!

I don't have time to generate data and work it all out, but I thought I'd give you this idea in case it helps.

The idea is to leverage .isin() to select the rows whose values you need to build the dummies from, then leverage the index to match the result back to the source rows.

Something like:

pd.get_dummies(df.loc[df['hashtags'].isin(counts.nlargest(10).index)], columns=['hashtags']) 

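You will have to check whether the resulting index gives you what you need. Note that the filter drops every row whose hashtag is not in the top 10, so the result has fewer rows than df.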
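If you also want the 'other' column, one possible variation is to relabel everything outside the top N before calling get_dummies, instead of filtering rows out. This is only a sketch on made-up data: the toy DataFrame, the tweet_id column and the top_n name are assumptions for illustration, and it assumes each row holds a single hashtag string in df['hashtags'], as the value_counts() call in the question suggests.

```python
import pandas as pd

# Hypothetical toy data: one hashtag per row, mirroring df.hashtags in the question
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5, 6],
    "hashtags": ["a", "b", "a", "c", "a", "b"],
})

top_n = 2  # stand-in for the 10 (or 500) mentioned in the question
top = df["hashtags"].value_counts().nlargest(top_n).index

# Keep the top-N hashtags as they are and relabel everything else as "other"
bucketed = df["hashtags"].where(df["hashtags"].isin(top), "other")

# One indicator column per kept hashtag, plus one for "other"
dummies = pd.get_dummies(bucketed, prefix="hashtags", dtype=int)

# The index is unchanged, so the dummies line up with the source rows
result = df.join(dummies)
print(result)
```

Because no rows are dropped, the dummies keep the same index as df and can be joined straight back. Be aware that missing values in the hashtags column would also end up in the 'other' bucket with this approach, which may or may not be what you want.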
