Taking 2 columns with equal value x or differing values x/y to create a 3rd column with value x

I have this DataFrame:

import pandas as pd

technologies = {'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
 'Location1': ['LA', 'RIV', 'CHV', 'LA'],
 'Area1': ['L', 'R', 'C', 'L'],
 'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
 'Location2': ['RIV', 'RIV', 'LA', 'RIV'],
 'Area2': ['R', 'R', 'L', 'R']}
df = pd.DataFrame(technologies)

I want to create a new Location column that draws from Location1/Location2, but only takes a location whose Area1/Area2 is "R" or "C".

So I would want:

technologies = {'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
 'Location1': ['LA', 'RIV', 'CHV', 'LA'],
 'Area1': ['L', 'R', 'C', 'L'],
 'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
 'Location2': ['RIV', 'RIV', 'LA', 'RIV'],
 'Area2': ['R', 'R', 'L', 'R'],
 'Location3': ['RIV', 'RIV', 'CHV', 'RIV']}

Is this possible? I am stuck and can't think of what would work for so many requirements.

Any help appreciated. Thank you.

===EDIT

Sorry, I did not include a vital detail. I would like the location to be indexed together with Writer1/Writer2. For example, if I index PySpark with RIV, I also want Dil to index with RIV. The code should not skip a Writer when both of them are in RIV or CHV.

Use Series.where to turn the Location1/Location2 values whose Area is not R or C into missing values, then fill the missing values of s1 from s2 with Series.fillna:

df = pd.DataFrame(technologies)

s1 = df['Location1'].where(df['Area1'].isin(['R','C']))
s2 = df['Location2'].where(df['Area2'].isin(['R','C']))
df['Location3'] = s1.fillna(s2)
print(df)
   Writer1 Location1 Area1 Writer2 Location2 Area2 Location3
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R       RIV
2   Hadoop       CHV     C    Chee        LA     L       CHV
3   Python        LA     L  Python       RIV     R       RIV
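
To make the two steps easier to follow, here is a small sketch that prints the intermediate Series next to the result. It reuses the question's data; the names allowed and location3 are only illustrative and do not appear in the answer.

```python
import pandas as pd

technologies = {
    'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
    'Location1': ['LA', 'RIV', 'CHV', 'LA'],
    'Area1': ['L', 'R', 'C', 'L'],
    'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
    'Location2': ['RIV', 'RIV', 'LA', 'RIV'],
    'Area2': ['R', 'R', 'L', 'R'],
}
df = pd.DataFrame(technologies)

allowed = ['R', 'C']  # illustrative name for the accepted areas

# Keep Location1 only where Area1 is allowed; the other rows become missing.
s1 = df['Location1'].where(df['Area1'].isin(allowed))
# Same masking for the second Location/Area pair.
s2 = df['Location2'].where(df['Area2'].isin(allowed))

# fillna only touches rows where s1 is missing, so Location1 takes
# precedence whenever both pairs qualify (row 1 in this data).
location3 = s1.fillna(s2)

print(pd.DataFrame({'s1': s1, 's2': s2, 'Location3': location3}))
```

A row where neither area qualifies would stay missing in the result, since there is nothing to fill it with.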

Solution for multiple values: if both pairs match, the two values are joined:

technologies = {'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
 'Location1': ['LA', 'RIV', 'CHV', 'RIV'],
 'Area1': ['L', 'R', 'C', 'L'],
 'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
 'Location2': ['RIV', 'RIV', 'RIV', 'RIV'],
 'Area2': ['R', 'R', 'L', 'R'],}
df = pd.DataFrame(technologies)

s1 = df['Location1'].add(', ').where(df['Area1'].isin(['R','C']), '')
s2 = df['Location2'].where(df['Area2'].isin(['R','C']), '')

df['Location3'] = s1.add(s2).str.strip(', ')
print(df)
   Writer1 Location1 Area1 Writer2 Location2 Area2 Location3
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R  RIV, RIV
2   Hadoop       CHV     C    Chee       RIV     L       CHV
3   Python       RIV     L  Python       RIV     R       RIV
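
If a repeated value such as "RIV, RIV" is not wanted, one possible variation is to join only the distinct qualifying locations per row. This is a sketch and not part of the answer above: it leaves out the Writer columns for brevity, and the fifth row (CHV/C with RIV/R) is hypothetical data added to show two different locations being joined.

```python
import pandas as pd

# Location/Area columns from the data above, plus one hypothetical
# fifth row where both pairs qualify with different locations.
df = pd.DataFrame({
    'Location1': ['LA', 'RIV', 'CHV', 'RIV', 'CHV'],
    'Area1': ['L', 'R', 'C', 'L', 'C'],
    'Location2': ['RIV', 'RIV', 'RIV', 'RIV', 'RIV'],
    'Area2': ['R', 'R', 'L', 'R', 'R'],
})
allowed = ['R', 'C']

# Mask out the locations whose area is not allowed.
s1 = df['Location1'].where(df['Area1'].isin(allowed))
s2 = df['Location2'].where(df['Area2'].isin(allowed))
kept = pd.concat([s1, s2], axis=1)

# dict.fromkeys drops repeats while preserving left-to-right order,
# so each row becomes a comma-separated list of distinct locations.
df['Location3'] = kept.apply(
    lambda row: ', '.join(dict.fromkeys(row.dropna())), axis=1
)
print(df)
```

The row-wise apply is slower than the vectorised string version, which should only matter on large frames.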

For a generic method that works with any number of Location/Area pairs (taken in column order), you can use:

lst = ['R', 'C']
df = pd.DataFrame(technologies)
df2 = (df.filter(like='Location')
      .where(df.filter(like='Area').isin(lst).values)
      )
df['Location3'] = df2.stack().groupby(level=0).first()
# first() keeps the first qualifying value when several pairs match

or, using a MultiIndex:

# raw string so that \w and \d are not treated as invalid escape sequences
idx = pd.MultiIndex.from_frame(df.columns.str.extract(r'(\w+)(\d+)'))
df2 = df.set_axis(idx, axis=1)
df['LocationX'] = (df2['Location'].where(df2['Area'].isin(lst)).stack()
                   .groupby(level=0).first()
                   )

output:

   Writer1 Location1 Area1 Writer2 Location2 Area2 Location3
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R       RIV
2   Hadoop       CHV     C    Chee        LA     L       CHV
3   Python        LA     L  Python       RIV     R       RIV
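
As a rough illustration of the "any number of pairs" claim, the filter/where/stack approach can be wrapped in a helper and run on three pairs. The function name first_allowed_location and the Writer3/Location3/Area3 columns are hypothetical, added only for this sketch. It also swaps filter(like=...) for a regex filter, so that a result column such as LocationX is not picked up as an input if the helper is run again on the same frame.

```python
import pandas as pd

def first_allowed_location(df, allowed):
    """Per row, return the first Location<n> whose Area<n> is in allowed."""
    # Select only columns named Location<digits> / Area<digits>.
    locations = df.filter(regex=r'^Location\d+$')
    areas = df.filter(regex=r'^Area\d+$')
    # .values drops the column labels, so the mask is applied by position:
    # the n-th Area column masks the n-th Location column.
    kept = locations.where(areas.isin(allowed).values)
    # stack() reshapes the kept values into one long Series indexed by
    # (row, column); first() returns the first non-missing value per row.
    return kept.stack().groupby(level=0).first()

# The question's data plus a hypothetical third pair.
df = pd.DataFrame({
    'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
    'Location1': ['LA', 'RIV', 'CHV', 'LA'],
    'Area1': ['L', 'R', 'C', 'L'],
    'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
    'Location2': ['RIV', 'RIV', 'LA', 'RIV'],
    'Area2': ['R', 'R', 'L', 'R'],
    'Writer3': ['A', 'B', 'C', 'D'],
    'Location3': ['CHV', 'LA', 'RIV', 'LA'],
    'Area3': ['C', 'L', 'R', 'L'],
})

df['LocationX'] = first_allowed_location(df, ['R', 'C'])
print(df[['Location1', 'Location2', 'Location3', 'LocationX']])
```

The positional masking assumes the Location and Area columns appear in the same relative order and in equal numbers.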

duplicates

If there are multiple possibilities and you want to keep all of them

as concatenated values:

lst = ['R', 'C']
df = pd.DataFrame(technologies)
df2 = (df.filter(like='Location')
      .where(df.filter(like='Area').isin(lst).values)
      )
# dropna() removes the masked entries on pandas versions where stack() keeps NaN
df['LocationX'] = df2.stack().dropna().groupby(level=0).agg(','.join)

output:

   Writer1 Location1 Area1 Writer2 Location2 Area2 LocationX
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R   RIV,RIV
2   Hadoop       CHV     C    Chee        LA     L       CHV
3   Python        LA     L  Python       RIV     R       RIV

or, as multiple rows:

lst = ['R', 'C']
df = pd.DataFrame(technologies)
df2 = (df.filter(like='Location')
      .where(df.filter(like='Area').isin(lst).values)
      )
# dropna() removes the masked entries on pandas versions where stack() keeps NaN
df = df.join(df2.stack().dropna().rename('LocationX').droplevel(1))

output:

   Writer1 Location1 Area1 Writer2 Location2 Area2 LocationX
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R       RIV
2   Hadoop       CHV     C    Chee        LA     L       CHV
3   Python        LA     L  Python       RIV     R       RIV
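
The question's edit asks for each location to stay attached to its writer. One way this might be approached, offered as a sketch and not taken from the answers above, is to reshape the frame with pd.wide_to_long so that every Writer/Location/Area triple becomes its own row before filtering. The names row, pair, long_df and result are hypothetical.

```python
import pandas as pd

technologies = {
    'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
    'Location1': ['LA', 'RIV', 'CHV', 'LA'],
    'Area1': ['L', 'R', 'C', 'L'],
    'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
    'Location2': ['RIV', 'RIV', 'LA', 'RIV'],
    'Area2': ['R', 'R', 'L', 'R'],
}
df = pd.DataFrame(technologies)
allowed = ['R', 'C']

# Turn the wide frame into one row per (original row, pair number).
# 'row' holds the original row label, 'pair' the numeric column suffix.
long_df = pd.wide_to_long(
    df.rename_axis('row').reset_index(),
    stubnames=['Writer', 'Location', 'Area'],
    i='row',
    j='pair',
).sort_index()

# Keep only the triples whose Area is allowed; each remaining row
# pairs a writer with that writer's own location.
result = long_df.loc[long_df['Area'].isin(allowed), ['Writer', 'Location']]
print(result)
```

With this layout PySpark and Dil both appear with RIV, because neither writer is dropped when both pairs of a row qualify.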
