Create a new column in pandas using a row value

Question

First of all, this is not a duplicate! I have searched several SO questions (like this and this) as well as the pandas docs, and I have not found anything conclusive on how to create a new column from a row value.

Imagine I have the following table: I open an .xls and create a dataframe from it. Since this is a small example derived from the real problem, I created this simple Excel table, which is easy to reproduce:

What I want now is to find the row that contains "Population Month Year". (I will be looking at different .xls files, but the structure is always the same: population, month and year.)

import pandas as pd

xls = 'population_example.xls'
sheet_name = 'Sheet1'
df = pd.read_excel(xls, sheet_name=sheet_name, header=0, skiprows=2)
df

What I thought of doing:

Get the value of that row with startswith.
Create a column by processing that value in Python to extract the month and year.

I have tried several things similar to this:

dff=df[s.str.startswith('Population')]
dff

But the errors keep coming. For the code above, specifically:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
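
One way this error can arise is when the boolean mask comes from a Series whose index differs from the DataFrame's index. The sketch below is only an illustration under that assumption: the question does not show how `s` was built, so the `s` here is a hypothetical stand-in with a deliberately mismatched index.

```python
import pandas as pd

df = pd.DataFrame({"Area": ["Whatever1", "Population April 2017."]})

# Hypothetical stand-in for `s`: same values, but a different index (10, 11).
s = pd.Series(["Whatever1", "Population April 2017."], index=[10, 11])

try:
    # The mask is indexed 10, 11 while df is indexed 0, 1, so they cannot align.
    df[s.str.startswith("Population")]
    error_name = None
except Exception as exc:  # the import path of IndexingError varies by pandas version
    error_name = type(exc).__name__

print(error_name)

# Building the mask from the frame's own column keeps both indexes aligned.
aligned = df[df["Area"].str.startswith("Population")]
print(aligned)
```

Deriving the mask from a column of the same DataFrame, as in the last two lines, avoids the alignment problem.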

I have several guesses:

I don't properly understand how Series work in pandas, even after reading the docs. I had not even thought of using them, but startswith looks like what I need.
If I handle this properly, I might hit a NaN error, but I cannot use df.dropna() yet, because I would lose that row value (Population April 2017)!

Edit:

The problem with using this:

df[df['Area'].str.startswith('Population')] is that it also checks the NaN values.

And this:

df['Area'].str.startswith('Population')

gives me a set of True/False/NaN values, which I am not sure how to use.

Answer 1

Thanks to @Erfan, I got to the solution:

By using the line of code from the comments properly, rather than the way I was trying, I managed to do this:

dff = df[df['Area'].str.startswith('Population', na=False)]
dff

Which would output:

Population and household forecasts, 2016 to 20... NaN NaN NaN NaN NaN NaN
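
To illustrate what `na=False` changes, here is a minimal sketch on a made-up three-row frame (the data is an assumption, not the asker's spreadsheet). The column is forced to object dtype because the result for missing values without `na=` can depend on the dtype and the pandas version.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in the 'Area' column.
df = pd.DataFrame(
    {"Area": pd.Series(["Whatever1", np.nan, "Population April 2017."], dtype=object)}
)

# Without na=..., the row holding NaN may yield a missing value in the mask,
# which cannot be used directly for boolean indexing.
raw_mask = df["Area"].str.startswith("Population")
print(raw_mask)

# With na=False, missing entries count as "does not start with", so the
# mask is purely boolean and safe to index with.
clean_mask = df["Area"].str.startswith("Population", na=False)
dff = df[clean_mask]
print(dff)
```

With a clean boolean mask there is no need to call `df.dropna()` first, so the row of interest is not lost.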

Now I can access this value like this:

value = dff.iloc[0, 0]
value

This gives me the string I was looking for: 'Population and household forecasts, 2016 to 2041, prepared by .id , the population experts, April 2019.' From here I can process it in Python to create the desired column. Thank you!
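
The answer stops before the column is actually created. One possible way to finish, shown only as a sketch: pull the month and year out of the string with a regular expression and broadcast them to new columns. The `MONTHS` list, the column names `Month` and `Year`, and the small data frame are assumptions made for illustration.

```python
import re

import pandas as pd

value = ("Population and household forecasts, 2016 to 2041, "
         "prepared by .id , the population experts, April 2019.")

# Assumption: the header always contains an English month name followed by a year.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
pattern = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b")

match = pattern.search(value)
if match is None:
    raise ValueError("no 'Month Year' found in the header string")
month, year = match.group(1), int(match.group(2))

# Made-up data rows; assigning a scalar fills every row of the new column.
df = pd.DataFrame({"Area": ["Whatever1", "Whatever2"],
                   "Population": [3867, 1675]})
df["Month"] = month
df["Year"] = year
print(df)
```

Matching on a month name rather than on the first four-digit number matters here, because the string also contains the years 2016 and 2041.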

Answer 2

You could try:

import pandas as pd
import numpy as np

pd.DataFrame({'Area': [f'Whatever{i+1}' for i in range(3)] + [np.nan, 'Population April 2017.'],
              'Population': [3867, 1675, 1904, np.nan, np.nan]}).to_excel('population_example.xls', index=False)

df = pd.read_excel('population_example.xls').fillna('')

population_date = df[df.Area.str.startswith('Population')].Area.values[0].lstrip('Population ').rstrip('.').split()

Result:

['April', '2017']

Or (if Population Month Year is always on the last row):

df.iloc[-1, 0].lstrip('Population ').rstrip('.').split()
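
A caveat on the snippets above: `str.lstrip('Population ')` strips any leading characters from that set, not the literal prefix. It works for the sample because capitalised month names do not start with one of those characters. A prefix-safe variant, assuming Python 3.9+ for `str.removeprefix`, could look like this:

```python
text = "Population April 2017."

# removeprefix drops the exact leading substring and nothing else.
population_date = text.removeprefix("Population ").rstrip(".").split()
print(population_date)

# lstrip treats its argument as a set of characters, so it can eat too much:
# every character of "total" is also in "Population ".
risky = "Population total".lstrip("Population ")
print(repr(risky))
```

On older Python versions, slicing with `text[len("Population "):]` after a `startswith` check would be an equivalent alternative.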

Answer 1
1 accepted 2019-07-01 16:08:04

Answer 2
1 2019-07-01 17:14:46
