Python Pandas - Add a new column with value based on first and last name in multiple columns
Although I'm still a beginner myself, I'm trying to explain some Pandas fundamentals to colleagues who usually manipulate CSV files with Excel.
I've hit a wall trying to find a "good" solution to a problem I'd like to use as an example.
I have a CSV file like this:
"Id","First","Last"
"109","Karl","Evans"
"113","Louise","Hudson"
"106","Catherine","Johnson"
and I'm importing it into Python like this:
import pandas
df = pandas.read_csv('C:\\example.csv')
I want to add a new column to df called "StartsWithJOrK".
It should say "Yay!" for anyone whose lowercased first name OR lowercased last name starts with a "j" or a "k". It should say "BooHiss" for anyone for whom neither lowercased name starts with a "j" or a "k".
(It's a rather overwrought example, but I feel like it packs in a lot of things I either don't know how to do or don't know how to combine "pythonically.")
What's the most pythonic, fewest-lines-of-code way to do this?
Not the easiest introduction to Pandas...
df['StartsWithJorK'] = 'BooHiss'
starting_letters = ['j', 'k']
df.loc[(df.First.str[0].str.lower().isin(starting_letters)) |
       (df.Last.str[0].str.lower().isin(starting_letters)), 'StartsWithJorK'] = 'Yay!'
>>> df
    Id      First     Last StartsWithJorK
0  109       Karl    Evans           Yay!
1  113     Louise   Hudson        BooHiss
2  106  Catherine  Johnson           Yay!
df.First.str[0] finds the first character of each first name.
.str.lower() converts this series of letters to lower case.
.isin(starting_letters) checks whether each lower case letter is in our list of starting letters, i.e. 'j' and 'k'.
.loc is for label and boolean based indexing; here it sets the column StartsWithJorK to Yay! for every row where either condition is true.
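To see how these pieces fit together, here is a minimal sketch that evaluates each step separately. The inline io.StringIO copy of the CSV and the intermediate variable names (first_initial, either_matches, and so on) are assumptions made for illustration; they are not part of the original answer.

```python
import io

import pandas as pd

# Inline copy of the sample CSV from the question, so no file on disk is needed.
csv_text = '''"Id","First","Last"
"109","Karl","Evans"
"113","Louise","Hudson"
"106","Catherine","Johnson"
'''
df = pd.read_csv(io.StringIO(csv_text))

starting_letters = ['j', 'k']

# Step 1: first character of each first name.
first_initial = df['First'].str[0]

# Step 2: the same characters, lower-cased.
first_initial_lower = first_initial.str.lower()

# Step 3: boolean Series, True where the letter is 'j' or 'k'.
first_matches = first_initial_lower.isin(starting_letters)
last_matches = df['Last'].str[0].str.lower().isin(starting_letters)

# Step 4: element-wise OR of the two boolean Series.
either_matches = first_matches | last_matches

# Step 5: default value everywhere, then overwrite the matching rows via .loc.
df['StartsWithJorK'] = 'BooHiss'
df.loc[either_matches, 'StartsWithJorK'] = 'Yay!'

print(df)
```

Printing the intermediate Series one at a time may be easier for Excel users to follow than the single chained expression.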
If you don't mind importing numpy too, you can do:
import numpy as np
import pandas as pd
mask = df['Last'].str.match('[JjKk]') | df['First'].str.match('[JjKk]')
df['StartsWithJOrK'] = np.where(mask, 'Yay!', 'BooHiss')
Output:
    Id      First     Last StartsWithJOrK
0  109       Karl    Evans           Yay!
1  113     Louise   Hudson        BooHiss
2  106  Catherine  Johnson           Yay!
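The pattern '[JjKk]' only needs to describe one character because .str.match anchors the regular expression at the start of each string. The sketch below rebuilds the sample data by hand so that it runs on its own, and adds a small probe Series whose names ('Mike', 'kim') are hypothetical and serve only to show the anchoring.

```python
import numpy as np
import pandas as pd

# Sample data from the question, built directly instead of read from a file.
df = pd.DataFrame({
    'Id': [109, 113, 106],
    'First': ['Karl', 'Louise', 'Catherine'],
    'Last': ['Evans', 'Hudson', 'Johnson'],
})

# True where either name begins with J, j, K or k.
mask = df['Last'].str.match('[JjKk]') | df['First'].str.match('[JjKk]')

# np.where picks 'Yay!' where mask is True and 'BooHiss' elsewhere.
df['StartsWithJOrK'] = np.where(mask, 'Yay!', 'BooHiss')

# Hypothetical probe values: 'Mike' contains a 'k' but does not start with one.
probe = pd.Series(['Mike', 'kim'])
probe_mask = probe.str.match('[JjKk]')

print(df)
print(probe_mask)
```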
There are other ways of creating the above mask. Here is one:
mask = (df[['First', 'Last']]
        .apply(lambda x: x.str.match('[JjKk]'), axis=1)
        .any(axis=1))
Or, taking a cue from @Alexander's answer's use of .str.lower():
mask = (df[['First', 'Last']]
        .apply(lambda x: x.str.lower().str.match('[jk]'), axis=1)
        .any(axis=1))
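The apply(..., axis=1) versions above call the lambda once per row. A possible variant, offered here as an assumption rather than as part of the original answer, drops axis=1 so the lambda runs once per column, then combines the columns with any(axis=1). The final map call is just another way to turn the boolean mask into labels without importing numpy.

```python
import pandas as pd

# Sample data from the question, built directly instead of read from a file.
df = pd.DataFrame({
    'Id': [109, 113, 106],
    'First': ['Karl', 'Louise', 'Catherine'],
    'Last': ['Evans', 'Hudson', 'Johnson'],
})

# Default axis=0: the lambda receives each whole column as a Series.
mask = (df[['First', 'Last']]
        .apply(lambda col: col.str.lower().str.match('[jk]'))
        .any(axis=1))

# Translate True/False into the two labels.
df['StartsWithJOrK'] = mask.map({True: 'Yay!', False: 'BooHiss'})

print(df)
```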