
Trouble when adding values for NaN in DataFrame

I have this DataFrame:

    manufacturer    description
0   toyota          toyota, gmc 10 years old.
1   NaN             gmc, Motor runs and drives good.
2   NaN             Motor old, in pieces.
3   NaN             2 owner 0 rust. Cadillac.

I want to fill the NaN values with a keyword taken from the description. To that end I created a list with the keywords I want:

keyword = ['gmc', 'toyota', 'cadillac']

Finally, I want to loop over each row in the DataFrame, split the contents of the "description" column into words and, if a word is also in the "keyword" list, put it in the "manufacturer" column. As an example, the result would look like this:

    manufacturer    description
0   toyota          toyota, gmc 10 years old.
1   gmc             gmc, Motor runs and drives good.
2   NaN             Motor old, in pieces.
3   cadillac        2 owner 0 rust. Cadillac.

Thanks to someone in this community I could improve my code to this:

import re
keyword = ['gmc', 'toyota', 'cadillac']
bag_of_words = []
for i, description in enumerate(test3['description']):
    bag_of_words = re.findall(r"""[A-Za-z\-]+""", test3["description"][i])
    for word in bag_of_words:
        if word.lower() in keyword:
            test3.loc[i, 'manufacturer'] = word.lower()

But I realized that the first row also changed, even though its value was not NaN:

  manufacturer  description
0   gmc         toyota, gmc 10 years old.
1   gmc         gmc, Motor runs and drives good.
2   NaN         Motor old, in pieces.
3   cadillac    2 owner 0 rust. Cadillac.

I would like to change only the NaN values, but when I try to add:

if word.lower() in keyword and test3.loc[i, 'manufacturer'] == np.nan:

It doesn't have any effect.

np.nan == np.nan is False. A bit counter-intuitive perhaps =) It means that your last conditional can never be true, so with it in place no value should ever be set. It's not really clear from your question whether you see the same result as before or no change at all.
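A quick check of that behaviour, as a minimal sketch. The variable `value` is made up here to stand in for test3.loc[i, 'manufacturer'] on a NaN row; pd.isna is the pandas function for testing a value for NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for test3.loc[i, 'manufacturer'] on a NaN row.
value = np.nan

# NaN never compares equal to anything, including itself.
equal_to_nan = (value == np.nan)

# pd.isna is a reliable way to test a scalar for NaN.
is_missing = pd.isna(value)

print(equal_to_nan, is_missing)  # False True
```

So a condition written with pd.isna(...) in place of `== np.nan` can actually become true on the NaN rows.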

If you changed

for i, description in enumerate(test3['description']):

to

for i, description in zip(test3.loc[test3['manufacturer'].isna(), :].index, test3.loc[test3['manufacturer'].isna(), 'description']):

then I think it should work fine, because the loop would only visit the rows in which 'manufacturer' is NaN. With that change the extra == np.nan check has to go, since it is always False. If you prefer to keep an explicit check inside the condition, use pd.isna(test3.loc[i, 'manufacturer']) instead. Relying on truthiness does not work here: non-empty strings evaluate to True, but so does np.nan, because it is a non-zero float.
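Put together, the loop restricted to the NaN rows could look like the sketch below. The DataFrame is rebuilt from the sample data in the question, and Series.items() is used in place of the zip(...) call above as an equivalent way to get (index, description) pairs; that substitution is my own choice, not part of the original suggestion.

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question.
test3 = pd.DataFrame({
    'manufacturer': ['toyota', np.nan, np.nan, np.nan],
    'description': ['toyota, gmc 10 years old.',
                    'gmc, Motor runs and drives good.',
                    'Motor old, in pieces.',
                    '2 owner 0 rust. Cadillac.'],
})
keyword = ['gmc', 'toyota', 'cadillac']

# Boolean mask of the rows whose manufacturer is missing.
missing = test3['manufacturer'].isna()

# Iterate over (index label, description) for the NaN rows only.
for i, description in test3.loc[missing, 'description'].items():
    for word in re.findall(r"[A-Za-z\-]+", description):
        if word.lower() in keyword:
            test3.loc[i, 'manufacturer'] = word.lower()

print(test3['manufacturer'].tolist())  # ['toyota', 'gmc', nan, 'cadillac']
```

Row 0 keeps 'toyota' because the loop never visits it.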

There are a lot of ways in which your code could look nicer ;) but focus on learning to debug and the rest will come. And as long as it does what you want it to do, who cares.

One way you could have debugged this would have been to print the truth value of each part of your conditional inside the loop:

print(bool(word.lower() in keyword))
print(bool(test3.loc[i, 'manufacturer'] == np.nan))

Best wishes!

Edit: okay, I should probably add how I would do this myself.

import numpy as np
import pandas as pd

df = pd.DataFrame({'manufacturer': ['toyota', np.nan, np.nan, np.nan],
                   'description': ['toyota, gmc 10 years old.', 'gmc, Motor runs and drives good.', 'Motor old, in pieces.', '2 owner 0 rust. Cadillac.']})
keyword = ['gmc', 'toyota', 'cadillac']

def first_keyword(s):
    # Keywords (in list order) that occur as substrings of the lower-cased description.
    matches = [word for word in keyword if word in s.lower()]
    return matches[0] if matches else np.nan

filler = df['description'].map(first_keyword)
# fillna only touches the rows where 'manufacturer' is NaN.
df['manufacturer'] = df['manufacturer'].fillna(filler)

Not sure whether you want the first or the last keyword when several appear in the string, though. I set it to the first match here using index 0; note that this is the first match in the order of the keyword list, not the first one in the description.
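If what you want is the keyword that appears first in the description text itself, one possible alternative is Series.str.extract with an alternation pattern. This is only a sketch under my own assumptions: the names `pattern` and `first_in_text` are made up for illustration, and the \b word boundaries make it match whole words, which is slightly stricter than the substring test above.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'manufacturer': ['toyota', np.nan, np.nan, np.nan],
    'description': ['toyota, gmc 10 years old.',
                    'gmc, Motor runs and drives good.',
                    'Motor old, in pieces.',
                    '2 owner 0 rust. Cadillac.'],
})
keyword = ['gmc', 'toyota', 'cadillac']

# One alternation pattern with a single capture group: \b(gmc|toyota|cadillac)\b
pattern = r'\b(' + '|'.join(re.escape(word) for word in keyword) + r')\b'

# str.extract keeps the first match in each string (NaN when nothing matches).
first_in_text = (
    df['description']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)

# As before, only the NaN rows of 'manufacturer' are filled.
df['manufacturer'] = df['manufacturer'].fillna(first_in_text)
```

For the first row this picks 'toyota' from the text rather than 'gmc', although that row is left alone anyway because its manufacturer is not NaN.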
