
Pandas dataframe: converting mixed-type string values to float while keeping track of the true string values

I've scraped a table from a web page. Some columns hold dollar amounts or percentages (as strings), but some entries hold a note instead, e.g. '(n)'. Before I can convert the numeric strings to float, I need to inspect the notes to find out whether they should become 0 or NA.

What I need is to output the row and the note for each entry that holds a note, e.g. (4, '(4)'), (13, '(4)'), or the equivalent vectors.

Using Python 3.5.4 and pandas 0.22.0.

I've reproduced the problem with a smaller dataframe:

import pandas as pd

df = pd.DataFrame({'A': ['$104.64', '$73.04', '(4)', '$82.95', '$92.45', '$95.09',
                         '$79.20', '$63.66', '$90.27', '$98.80', '$33.82', '(8)',
                         '$56.74', '$49.22', '$75.74'],
                   'B': ['%28.90', '%73.36', '(3)', '%104.64', '%73.04', '%82.95',
                         '%79.20', '(9)', '%63.66', '%90.27', '%98.80', '%33.82',
                         '%56.74', '%49.22', '%75.74']})
df

          A        B
0   $104.64   %28.90
1    $73.04   %73.36
2       (4)      (3)
3    $82.95  %104.64
4    $92.45   %73.04
5    $95.09   %82.95
6    $79.20   %79.20
7    $63.66      (9)
8    $90.27   %63.66
9    $98.80   %90.27
10   $33.82   %98.80
11      (8)   %33.82
12   $56.74   %56.74
13   $49.22   %49.22
14   $75.74   %75.74

out = df['A'].where( df['A']>='(' )   # 1. how to get rid of the NaN?
out

out = out.astype(dtype=str)           # 2. found that NaN is of type float,
                                      #    so now all entries are str
out

The goal is to get:

2  '(4)'
11 '(8)'

I tried this, but it does not help because the note value is replaced by True:

df['A'].where( df['A']>='(' ).isna() == False

0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
12    False
13    False
14    False

The closest answer I found, "Skipping char-string values while comparing to a column values of mixed type to int or float in pandas Dataframe", does not help:

pd.to_numeric(df.A.str.strip('$'), errors='coerce')

This performs the conversion, but it turns each '(n)' note value into NaN.
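
As a side note, the NaN positions produced by errors='coerce' can themselves serve as a mask over the original strings, so the notes are not necessarily lost. The following is a minimal sketch under that assumption; the short Series `s` and the names `nums` and `notes` are illustrative only and are not part of the original post. It also assumes every non-numeric entry is a note.

```python
import pandas as pd

# Illustrative subset of column A; the index mimics the original row labels
s = pd.Series(['$104.64', '(4)', '$82.95', '(8)'], index=[0, 2, 3, 11])

# Coerce to float: anything that is not a number after stripping '$' becomes NaN
nums = pd.to_numeric(s.str.strip('$'), errors='coerce')

# Rows that failed to convert still hold their original text in `s`
notes = s[nums.isna()]
print(list(notes.items()))
```

The numeric column and the record of where the notes were then come from the same conversion step.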

To sum up: the problem is the mixed types in the columns caused by the notes. I cannot just strip the '$' or '%' and then convert to float; I also need to document where the notes were.

I am probably blind to a simple solution...

Since I don't have enough reputation to comment... Can you use df.loc[] to extract the entries that contain '('?

# Raw string so that '\(' reaches the regex engine as a literal parenthesis
withnotes = df.loc[df['A'].str.contains(r'\(')]
output = [(i, row.A) for i, row in withnotes.iterrows()]
output

The example above parses only column A and returns a list of tuples: output = [(2, '(4)'), (11, '(8)')]
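
To cover every column at once rather than only column A, one possible approach is to stack the frame into a single Series indexed by (row, column) and filter that. This is a sketch, not part of the original answer; the shortened `df` and the names `stacked` and `notes` are assumptions made for illustration.

```python
import pandas as pd

# Shortened stand-in for the dataframe in the question
df = pd.DataFrame({'A': ['$104.64', '(4)', '$82.95'],
                   'B': ['%28.90', '(3)', '(9)']})

# One long Series with a (row, column) MultiIndex
stacked = df.stack()

# Keep only the entries that start with an opening parenthesis
notes = stacked[stacked.str.startswith('(')]

# (row, column, note) tuples for every note in the frame
output = [(row, col, val) for (row, col), val in notes.items()]
print(output)
```

Using str.startswith avoids the regex escaping needed with str.contains.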

You can use the .loc accessor for this:

res = df.loc[df['A'].str[0] == '(', 'A']

This results in a series:

2     (4)
11    (8)
Name: A, dtype: object

If you need a dataframe:

res = df.loc[df['A'].str[0] == '(', 'A'].to_frame()
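
Once the notes have been inspected, the same mask could be reused to finish the conversion the question asks about. The sketch below is only one way to do it: the dictionary `note_values`, with its choice of 0 for '(4)' and NA for '(8)', is a hypothetical decision invented for illustration, and the short Series `s` stands in for a real column.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for one column of the dataframe
s = pd.Series(['$104.64', '(4)', '$82.95', '(8)'])

# Same test as above: entries whose first character is '(' are notes
is_note = s.str[0] == '('

# Hypothetical outcome of inspecting the notes: '(4)' means 0, '(8)' means missing
note_values = {'(4)': 0.0, '(8)': np.nan}

# Strip the leading currency or percent sign and convert; notes become NaN for now
converted = pd.to_numeric(s.str.lstrip('$%'), errors='coerce')

# Overwrite the note positions with the values decided above
converted.loc[is_note] = s[is_note].map(note_values)
print(converted)
```

The Series `s[is_note]` remains available as the record of where each note was.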
