Pandas DataFrame: converting mixed-type string values to float while keeping track of the true string values

Question

I've scraped a table from a web page. Some columns hold dollar amounts or percentages (as strings), but some entries contain notes instead, e.g. '(n)'. Before I can convert the numeric strings to float, I need to inspect the notes to find out whether they should become 0 or NA.

What I need is to output the row and the note for each entry that holds a note, e.g. (4, '(4)'), (13, '(4)'), or the same information as vectors.

Using Python 3.5.4 and pandas 0.22.0.

I've reproduced a smaller dataframe:

import pandas as pd

df = pd.DataFrame({'A': ['$104.64', '$73.04', '(4)', '$82.95', '$92.45', '$95.09',
                         '$79.20', '$63.66', '$90.27', '$98.80', '$33.82', '(8)', '$56.74', '$49.22',
                         '$75.74'],
                   'B': ['%28.90', '%73.36', '(3)', '%104.64', '%73.04', '%82.95',
                         '%79.20', '(9)', '%63.66', '%90.27', '%98.80', '%33.82', '%56.74', '%49.22',
                         '%75.74']})
df

          A        B
0   $104.64   %28.90
1    $73.04   %73.36
2       (4)      (3)
3    $82.95  %104.64
4    $92.45   %73.04
5    $95.09   %82.95
6    $79.20   %79.20
7    $63.66      (9)
8    $90.27   %63.66
9    $98.80   %90.27
10   $33.82   %98.80
11      (8)   %33.82
12   $56.74   %56.74
13   $49.22   %49.22
14   $75.74   %75.74

out = df['A'].where(df['A'] >= '(')   # 1. how to get rid of the NaN?
out

out = out.astype(dtype=str)           # 2. found that NaN is of type float,
                                      #    so now all entries are str
out

What I want to get is:

2  '(4)'
11 '(8)'

I tried this, but it doesn't help because the note value is replaced by True:

df['A'].where( df['A']>='(' ).isna() == False

0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
12    False
13    False
14    False

The closest answer I found, "Skipping char-string values while comparing to a column values of mixed type to int or float in pandas Dataframe", does not help:

pd.to_numeric(df.A.str.strip('$'), errors='coerce')

This performs the conversion, but turns the '(n)' note values into NaN.

To sum up: the problem is the mixed content of the columns caused by the notes. I cannot just strip the '$' or '%' and convert to float; I also need to document where the notes were.

I am probably blind to a simple solution...

Answer 1

Since I don't have enough reputation to comment... Can you use df.loc[] to extract the entries that contain '('?

withnotes = df.loc[df['A'].str.contains(r'\(')]  # raw string so that \( is passed to the regex unchanged
output = [(i, row.A) for i, row in withnotes.iterrows()]
output

The above example parses only column A and returns a list of tuples: output = [(2, '(4)'), (11, '(8)')]
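The same idea can be extended to every column instead of only A. The following is a minimal sketch, not part of the original answer: it uses a shortened, hypothetical version of the question's dataframe and assumes that a note is any entry starting with an opening parenthesis. The `notes` dictionary is a name introduced here for illustration.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the dataframe in the question.
df = pd.DataFrame({
    'A': ['$104.64', '(4)', '$82.95', '(8)'],
    'B': ['%28.90', '(3)', '%104.64', '%73.04'],
})

# Assumption: a "note" is any entry that starts with '('.
notes = {}
for col in df.columns:
    mask = df[col].str.startswith('(')
    # Series.items() yields (row label, value) pairs.
    notes[col] = list(df.loc[mask, col].items())

print(notes)
```

Using `str.startswith('(')` avoids the regular-expression escaping that `str.contains` needs, at the cost of missing notes that appear later in a string.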

Answer 2

You can use the .loc accessor for this:

res = df.loc[df['A'].str[0] == '(', 'A']

This results in a series:

2     (4)
11    (8)
Name: A, dtype: object

If you need a dataframe:

res = df.loc[df['A'].str[0] == '(', 'A'].to_frame()
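Once the notes have been extracted and inspected, the column can be converted to float. The sketch below is one possible way to do it, not something stated in the original answer: the frame is a shortened, hypothetical version of the question's data, and `note_values` is an assumed mapping that stands in for whatever the asker decides after reading the notes (here '(4)' becomes 0 and every other note stays NA).

```python
import pandas as pd

# Shortened, hypothetical stand-in for column A of the question's dataframe.
df = pd.DataFrame({'A': ['$104.64', '(4)', '$82.95', '(8)']})

# Same test as in the answer: a note starts with '('.
is_note = df['A'].str[0] == '('
notes = df.loc[is_note, 'A']            # row labels and notes, kept for the record

# Assumed decision after inspecting the notes: '(4)' -> 0, anything else -> NA.
note_values = {'(4)': 0.0}

# Strip the leading currency/percent sign; notes cannot be parsed and become NaN.
numeric = pd.to_numeric(df['A'].str.lstrip('$%'), errors='coerce')

# Keep the parsed numbers, and fill note positions from the mapping.
values = numeric.where(~is_note, df['A'].map(note_values))

print(notes)
print(values)
```

Keeping `notes` as a separate series means the original positions of the notes remain documented even after the column itself holds only floats.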


Answer 1
1 accepted 2018-03-31 20:50:22

Answer 2
0 2018-03-31 21:39:50
