
Extract features from text data in Python

I have a pandas DataFrame like this:

ID            email
1            abc@google.com
2            abc@facebook.com
3            abc@GOOGLE.COM
4            abc@tesla.com
5            abc@hilton.com
6            abc@FaceBook.com
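
For anyone who wants to run the answers, the table above can be rebuilt as follows. This is only a sketch of the sample data; the variable name df is the one the answers assume.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "email": [
        "abc@google.com",
        "abc@facebook.com",
        "abc@GOOGLE.COM",
        "abc@tesla.com",
        "abc@hilton.com",
        "abc@FaceBook.com",
    ],
})
print(df)
```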

I want to determine the company from each email address (the part after @). Sample output:

ID            email                WorkGoogle     WorkFacebook    etc.....
1            abc@google.com          Yes             No              ..
2            abc@facebook.com        No              Yes             .. 
3            abc@GOOGLE.com          Yes             No               ..   
4            abc@tesla.com           No              No              ..
5            abc@hilton.com          No              No              ..
6            abc@FaceBook.com        No              Yes             ..

The match needs to be case-insensitive.

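import pandas as pd

# Domain after '@', lower-cased so GOOGLE.COM and google.com match
workplace = df.email.rename("workplace").apply(lambda x: x.split("@")[1].lower())
# One {domain: "Yes"} dict per row becomes one column per domain; gaps are filled with "No"
pd.concat([df,
           pd.DataFrame(workplace.apply(lambda x: {x: "Yes"}).to_list(), index=df.index)],
          axis=1).fillna("No")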

#    ID             email google.com facebook.com tesla.com hilton.com
# 0   1    abc@google.com        Yes           No        No         No
# 1   2  abc@facebook.com         No          Yes        No         No
# 2   3    abc@GOOGLE.COM        Yes           No        No         No
# 3   4     abc@tesla.com         No           No       Yes         No
# 4   5    abc@hilton.com         No           No        No        Yes
# 5   6  abc@FaceBook.com         No          Yes        No         No
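
A similar one-column-per-domain table can probably also be built with pd.get_dummies. The sketch below is an alternative, not the approach used above; the three-row frame is a shortened copy of the question's data, and the names domain, dummies, flags and result are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "email": ["abc@google.com", "abc@facebook.com", "abc@GOOGLE.COM"],
})

# Lower-case first so GOOGLE.COM and google.com collapse into one domain
domain = df["email"].str.lower().str.split("@").str[1]

# One indicator column per distinct domain (cast to bool for older pandas versions)
dummies = pd.get_dummies(domain).astype(bool)

# Turn the indicators into Yes/No strings, keeping the same index and columns
flags = pd.DataFrame(
    np.where(dummies.to_numpy(), "Yes", "No"),
    index=dummies.index,
    columns=dummies.columns,
)

result = pd.concat([df, flags], axis=1)
print(result)
```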

Alternatively, you could add a single column instead of several:

df["workplace"] = df.email.rename("workplace").str.lower().str.split("@").str[1]
# Then you could do
df.groupby("workplace").agg(list)
#                   ID                                 email
# workplace                                                 
# facebook.com  [2, 6]  [abc@facebook.com, abc@FaceBook.com]
# google.com    [1, 3]      [abc@google.com, abc@GOOGLE.COM]
# hilton.com       [5]                      [abc@hilton.com]
# tesla.com        [4]                       [abc@tesla.com]

Here is a dynamic way without looping, using pivot_table:

df['domain'] = df['email'].str.split('@').str[1].str.split('.').str[0].str.lower()
df = df.pivot_table(index=['ID','email'], columns='domain',aggfunc=lambda x: 'Yes' if len(x)> 0 else 'No', fill_value='No')

Output:

domain              facebook google hilton tesla
ID email                                        
1  abc@google.com         No    Yes     No    No
2  abc@facebook.com      Yes     No     No    No
3  abc@GOOGLE.COM         No    Yes     No    No
4  abc@tesla.com          No     No     No   Yes
5  abc@hilton.com         No     No    Yes    No
6  abc@FaceBook.com      Yes     No     No    No
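
If you would rather not depend on how pivot_table chooses its value columns, pd.crosstab seems to produce the same shape. This is a sketch under the assumption that df holds the question's ID and email columns; the names domain, counts and table are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "email": ["abc@google.com", "abc@facebook.com", "abc@GOOGLE.COM",
              "abc@tesla.com", "abc@hilton.com", "abc@FaceBook.com"],
})

# Text between '@' and the first '.', lower-cased
domain = (
    df["email"].str.lower().str.split("@").str[1].str.split(".").str[0]
    .rename("domain")
)

# Count rows per (ID, email) and domain, then turn counts into Yes/No
counts = pd.crosstab([df["ID"], df["email"]], domain)
table = counts.gt(0).apply(lambda col: col.map({True: "Yes", False: "No"}))
print(table)
```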

Assuming df is your DataFrame, try this:

import numpy as np
# Text between '@' and the first '.', lower-cased so GOOGLE and google give one column
df['workplace'] = df['email'].str.lower().str.split('@', n=1).str[1].str.split('.', n=1).str[0]
for workplace in df['workplace'].unique():
  # e.g. 'google' -> column 'WorkGoogle'
  df['Work' + workplace.capitalize()] = np.where(df['workplace'] == workplace, 'Yes', 'No')
df = df.drop(columns=['workplace'])

If you are OK with creating the new column for each domain manually, you can use this:

df['WorkGoogle'] = df['email'].str.lower().str.contains('google')
df['WorkFacebook'] = df['email'].str.lower().str.contains('facebook')
# etc etc

That gives you True/False rather than Yes/No. If you want Yes/No, you can map the values:

df['WorkGoogle'] = df['email'].str.lower().str.contains('google').map({True:'Yes',False:'No'})

FYI: this solution is not efficient. You may find a more efficient one in the comments on this answer.
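
One caveat with substring matching on the whole address: 'google' also matches when it appears before the @. A minimal sketch that restricts the search to the part after @ is shown below; the address google.fan@tesla.com is invented for illustration and is not in the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["abc@google.com", "google.fan@tesla.com", "abc@GOOGLE.COM"],
})

# Substring search on the whole address: the second row is a false positive
naive = df["email"].str.contains("google", case=False)

# Search only the text after the last '@'
domain = df["email"].str.split("@").str[-1]
df["WorkGoogle"] = (
    domain.str.contains("google", case=False).map({True: "Yes", False: "No"})
)
print(df)
```

Passing case=False avoids the separate str.lower() call.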

I would first build the set of all companies:

companies = set([email.split('@')[1].split('.')[0].lower() for email in df['email']])

Then simply iterate over it:

for company in companies:
    df['Work'+company.capitalize()] = df['email'].apply(lambda x: x.split("@")[1].lower()).str.contains(company)
