How do I remove all non-numeric characters from an entire data frame: Debugging
I am attempting to remove all non-numeric characters from my dataframe (i.e. characters like ]$^M# etc.) with a single line of code. The data frame is a Google Play Store apps dataset.
import pandas as pd
df = pd.read_csv("googleplaystore.csv")
df['Rating'].fillna(value = '0.0', inplace = True)
#sample data#
Rating Reviews Size Installs Type Price \
0 4.1 159 19M 10,000+ Free 0
1 3.9 967 14M 500,000+ Free 0
2 4.7 87510 8.7M 5,000,000+ Free 0
3 4.5 215644 25M 50,000,000+ Free 0
4 4.3 967 2.8M 100,000+ Free 0
... ... ... ... ... ... ...
10836 4.5 38 53M 5,000+ Free 0
10837 5 4 3.6M 100+ Free 0
10838 0.0 3 9.5M 1,000+ Free 0
10839 4.5 114 Varies with device 1,000+ Free 0
10840 4.5 398307 19M 10,000,000+ Free 0
Content Rating Genres Last Updated \
0 Everyone Art & Design January 7, 2018
1 Everyone Art & Design;Pretend Play January 15, 2018
2 Everyone Art & Design August 1, 2018
3 Teen Art & Design June 8, 2018
4 Everyone Art & Design;Creativity June 20, 2018
... ... ... ...
10836 Everyone Education July 25, 2017
10837 Everyone Education July 6, 2018
10838 Everyone Medical January 20, 2017
10839 Mature 17+ Books & Reference January 19, 2015
10840 Everyone Lifestyle July 25, 2018
clean_data = df.replace('[^\d.]', '', regex = True).astype(float)
Essentially, I am trying to remove the 'M' after the digits in the Size column as well as the '+' sign in the Installs column.
But I get this error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-325-887d47a9889e> in <module>
----> 1 data_ = df.replace('[^\d.]', '', regex = True).astype(float)
~\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
5696 else:
5697 # else, only a single dtype is given
-> 5698 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
5699 return self._constructor(new_data).__finalize__(self)
5700
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
580
581 def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 582 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
583
584 def convert(self, **kwargs):
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, filter, **kwargs)
440 applied = b.apply(f, **kwargs)
441 else:
--> 442 applied = getattr(b, f)(**kwargs)
443 result_blocks = _extend_blocks(applied, result_blocks)
444
~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
623 vals1d = values.ravel()
624 try:
--> 625 values = astype_nansafe(vals1d, dtype, copy=True)
626 except (ValueError, TypeError):
627 # e.g. astype_nansafe can fail on object-dtype of strings
~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
895 if copy or is_object_dtype(arr) or is_object_dtype(dtype):
896 # Explicit copy, or required since NumPy can't view from / to object.
--> 897 return arr.astype(dtype, copy=True)
898
899 return arr.view(dtype)
ValueError: could not convert string to float:
Kindly assist in debugging if possible. I would really like to keep it to one line of code for the entire data frame. Thank you in advance.
I think the problem is that you need to specify the columns to run replace on, and then replace the resulting empty strings with NaN or 0 wherever a value is not numeric, like the second-to-last Size value:
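import numpy as np
cols = ['Size', 'Installs']
# strip everything except digits and dots, turn empty strings into NaN, then cast
df[cols] = df[cols].replace(r'[^\d.]', '', regex=True).replace('', np.nan).astype(float)
print(df)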
Rating Reviews Size Installs Type Price
0 4.1 159 19.0 10000.0 Free 0
1 3.9 967 14.0 500000.0 Free 0
2 4.7 87510 8.7 5000000.0 Free 0
3 4.5 215644 25.0 50000000.0 Free 0
4 4.3 967 2.8 100000.0 Free 0
10836 4.5 38 53.0 5000.0 Free 0
10837 5.0 4 3.6 100.0 Free 0
10838 0.0 3 9.5 1000.0 Free 0
10839 4.5 114 NaN 1000.0 Free 0
10840 4.5 398307 19.0 10000000.0 Free 0
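As a possible variation on the snippet above, here is a minimal sketch that lets pd.to_numeric handle the empty strings instead of replacing them with NaN by hand. The small DataFrame is a stand-in built from a few of the sample rows, not the real CSV, and the name cleaned is only for illustration.

```python
import pandas as pd

# Stand-in for the Play Store data, using values from the sample rows.
df = pd.DataFrame({
    "Size": ["19M", "8.7M", "Varies with device"],
    "Installs": ["10,000+", "5,000,000+", "1,000+"],
    "Type": ["Free", "Free", "Free"],
})

cols = ["Size", "Installs"]

# Strip everything except digits and dots, then convert column by column.
# errors="coerce" turns anything that cannot be parsed into NaN.
cleaned = (
    df[cols]
    .replace(r"[^\d.]", "", regex=True)
    .apply(pd.to_numeric, errors="coerce")
)

print(cleaned)
```

The "Varies with device" row comes out as NaN in Size, the same as in the output above, while the Type column is left alone because it is not in cols.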
The problem is that you are replacing all non-numeric characters in your dataframe with "".
This means that a purely non-numeric string ends up as "", a zero-length string. That can't be interpreted as a float, so you get the error.
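To see this in isolation, here is a small sketch in plain Python with no pandas involved. The helper name strip_non_numeric is made up for this illustration; it applies the same pattern as the question to single strings.

```python
import re

def strip_non_numeric(text):
    """Remove every character that is not a digit or a dot."""
    return re.sub(r"[^\d.]", "", text)

for raw in ["19M", "10,000+", "Free"]:
    cleaned = strip_non_numeric(raw)
    try:
        print(repr(raw), "->", float(cleaned))
    except ValueError as exc:
        # "Free" has no digits at all, so cleaned is "" and float("") raises.
        print(repr(raw), "->", repr(cleaned), "fails:", exc)
```

The first two values convert fine, while "Free" is reduced to an empty string and raises the same kind of ValueError as in the traceback.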
If you run the replace over just your Rating column:
df["Rating"].replace('[^\d.]', '', regex = True).astype(float)
then it works, because removing non-numeric characters from that column leaves only strings that can be converted into numbers.
However, running it over the whole dataframe doesn't work because so many of your values are purely non-numeric. The Genres column, for example, will end up as just a column of empty strings, which throws the error.
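If a single line over the whole frame is still what you want, one option is to coerce instead of cast. The sketch below is only an illustration on a tiny stand-in frame (not the real CSV), and it also shows why doing this blindly is risky: text columns become all NaN, and dates collapse into meaningless digit strings.

```python
import pandas as pd

# Stand-in frame with a mix of column types from the sample data.
df = pd.DataFrame({
    "Size": ["19M", "Varies with device"],
    "Installs": ["10,000+", "1,000+"],
    "Genres": ["Art & Design", "Education"],
    "Last Updated": ["January 7, 2018", "July 25, 2018"],
})

# One line over every column: strip, then coerce unparseable values to NaN.
numeric = df.replace(r"[^\d.]", "", regex=True).apply(pd.to_numeric, errors="coerce")

print(numeric)
```

Here Genres ends up entirely NaN, and "January 7, 2018" turns into the number 72018, so for columns like these it is probably safer to list the columns you actually want converted.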