Pandas dataframe: converting mixed-typed string values to float while keeping track of the true string values
I've scraped a table from a web page. Some columns hold dollar amounts or percentages (as strings), but some entries contain a note instead, e.g. '(n)'. Before I can convert the str(numbers) to float, I need to inspect the notes to find out whether they should become 0 or NA.
What I need is to output the row and the note for each entry that has a note, e.g. (4, '(4)'), (13, '(4)'), or vectors.
Using python: 3.5.4; pandas: 0.22.0
I've reproduced a smaller dataframe:
import pandas as pd
df = pd.DataFrame({'A':[ '$104.64', '$73.04', '(4)', '$82.95', '$92.45', '$95.09',
'$79.20', '$63.66', '$90.27', '$98.80', '$33.82', '(8)', '$56.74', '$49.22',
'$75.74'],
'B':['%28.90', '%73.36', '(3)', '%104.64', '%73.04', '%82.95',
'%79.20', '(9)', '%63.66', '%90.27', '%98.80', '%33.82', '%56.74', '%49.22',
'%75.74']})
df
A B
0 $104.64 %28.90
1 $73.04 %73.36
2 (4) (3)
3 $82.95 %104.64
4 $92.45 %73.04
5 $95.09 %82.95
6 $79.20 %79.20
7 $63.66 (9)
8 $90.27 %63.66
9 $98.80 %90.27
10 $33.82 %98.80
11 (8) %33.82
12 $56.74 %56.74
13 $49.22 %49.22
14 $75.74 %75.74
out = df['A'].where( df['A']>='(' ) # 1. how to get rid of the NaN?
out
out = out.astype(dtype=str) # 2. found that NaN is of type float,
                            #    so now all entries are str
out
to get:
2 '(4)'
11 '(8)'
I tried this, but it does not help because the note value is changed to True:
df['A'].where( df['A']>='(' ).isna() == False
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 True
12 False
13 False
14 False
The closest answer I found, "Skipping char-string values while comparing to a column values of mixed type to int or float in pandas Dataframe", does not help:
pd.to_numeric(df.A.str.strip('$'), errors='coerce')
performs the conversion, but transforms the '(n)' note values to NaN.
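That said, the coerced result can still be used to locate the notes. A minimal sketch, assuming the notes are the only values that fail to parse and that the column has no missing values to begin with (the shortened dataframe and the names `numeric`, `notes` and `note_pairs` are illustrative only):

```python
import pandas as pd

# Shortened, illustrative version of column A.
df = pd.DataFrame({'A': ['$104.64', '$73.04', '(4)', '$82.95', '(8)']})

# Strip the currency symbol; notes such as '(4)' fail to parse and become NaN.
numeric = pd.to_numeric(df['A'].str.strip('$'), errors='coerce')

# The rows that failed to parse are the notes, so the NaN mask
# selects the original strings together with their row labels.
notes = df.loc[numeric.isna(), 'A']
note_pairs = list(notes.items())
print(note_pairs)
```

Here `note_pairs` holds (row label, note) tuples taken from the original column, so the note text is preserved while `numeric` carries the float values.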
To sum up: the problem is the mixed types in the columns, caused by the notes. I cannot just strip the '$' or '%' and then convert to float; I also need to document where the notes were.
I am probably blind to a simple solution...
Since I don't have enough reputation to comment... Can you use df.loc[] to extract the ones with '('?
withnotes = df.loc[df['A'].str.contains(r'\(')]  # raw string: '\(' is a regex escape for a literal '('
output = [(i, row.A) for i, row in withnotes.iterrows()]
output
The example above parses only column A and returns a list of tuples: output = [(2, '(4)'), (11, '(8)')]
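To cover every column rather than only A, one possible extension is to loop over the columns. This is a sketch, not part of the original answer: the shortened dataframe and the name `notes_by_column` are made up for illustration, and it assumes every cell is a string (a missing value would break the boolean mask).

```python
import pandas as pd

# Shortened, illustrative version of the dataframe from the question.
df = pd.DataFrame({
    'A': ['$104.64', '(4)', '$82.95', '(8)'],
    'B': ['%28.90', '(3)', '(9)', '%33.82'],
})

# For each column, collect (row label, note) pairs for cells starting with '('.
notes_by_column = {}
for col in df.columns:
    is_note = df[col].str.startswith('(')  # literal prefix match, no regex
    notes_by_column[col] = list(df.loc[is_note, col].items())

print(notes_by_column)
```

Using `str.startswith` avoids the regex escaping that `str.contains` needs for a parenthesis.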
You can use the .loc accessor for this:
res = df.loc[df['A'].str[0] == '(', 'A']
This results in a series:
2 (4)
11 (8)
Name: A, dtype: object
If you need a dataframe:
res = df.loc[df['A'].str[0] == '(', 'A'].to_frame()
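Once the notes have been inspected, the same mask could also drive the conversion to float. The following is only a sketch under stated assumptions: the mapping `note_values` is hypothetical (it stands in for whatever 0-or-NA decision is made per note), and the dataframe is shortened for illustration.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative version of column A.
df = pd.DataFrame({'A': ['$104.64', '(4)', '$82.95', '(8)']})

# Hypothetical decisions taken after inspecting each note.
note_values = {'(4)': 0.0, '(8)': np.nan}

# Mask of the cells that hold a note rather than a number.
is_note = df['A'].str[0] == '('

# Strip the leading '$' or '%' and convert; notes become NaN for now.
converted = pd.to_numeric(df['A'].str.lstrip('$%'), errors='coerce')

# Replace the note positions with the value chosen for each note.
converted = converted.where(~is_note, df['A'].map(note_values))
print(converted)
```

Keeping `is_note` around documents where the notes were, even after the column itself has become purely numeric.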