How do I remove all non-numeric characters from an entire data frame: Debugging

I am attempting to remove all non-numeric characters from my dataframe (i.e. characters like ]$^M# etc.) with a single line of code. The data frame is a Google Play Store apps dataset.


import pandas as pd

df = pd.read_csv("googleplaystore.csv")

df['Rating'].fillna(value = '0.0', inplace = True)

Sample data:

   Rating    Reviews                Size     Installs  Type Price  \
0        4.1     159                 19M      10,000+  Free     0   
1        3.9     967                 14M     500,000+  Free     0   
2        4.7   87510                8.7M   5,000,000+  Free     0   
3        4.5  215644                 25M  50,000,000+  Free     0   
4        4.3     967                2.8M     100,000+  Free     0   
...      ...     ...                 ...          ...   ...   ...   
10836    4.5      38                 53M       5,000+  Free     0   
10837      5       4                3.6M         100+  Free     0   
10838    0.0       3                9.5M       1,000+  Free     0   
10839    4.5     114  Varies with device       1,000+  Free     0   
10840    4.5  398307                 19M  10,000,000+  Free     0   


      Content Rating                     Genres      Last Updated  \
0           Everyone               Art & Design   January 7, 2018   
1           Everyone  Art & Design;Pretend Play  January 15, 2018   
2           Everyone               Art & Design    August 1, 2018   
3               Teen               Art & Design      June 8, 2018   
4           Everyone    Art & Design;Creativity     June 20, 2018   
...              ...                        ...               ...   
10836       Everyone                  Education     July 25, 2017   
10837       Everyone                  Education      July 6, 2018   
10838       Everyone                    Medical  January 20, 2017   
10839     Mature 17+          Books & Reference  January 19, 2015   
10840       Everyone                  Lifestyle     July 25, 2018

clean_data = df.replace('[^\d.]', '', regex = True).astype(float)

Essentially, I am trying to remove the 'M' that follows the digits in the Size column, as well as the '+' sign in the Installs column.

But I get this error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-325-887d47a9889e> in <module>
----> 1 data_ = df.replace('[^\d.]', '', regex = True).astype(float)

~\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
   5696         else:
   5697             # else, only a single dtype is given
-> 5698             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
   5699             return self._constructor(new_data).__finalize__(self)
   5700 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
    580 
    581     def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 582         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    583 
    584     def convert(self, **kwargs):

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, filter, **kwargs)
    440                 applied = b.apply(f, **kwargs)
    441             else:
--> 442                 applied = getattr(b, f)(**kwargs)
    443             result_blocks = _extend_blocks(applied, result_blocks)
    444 

~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
    623             vals1d = values.ravel()
    624             try:
--> 625                 values = astype_nansafe(vals1d, dtype, copy=True)
    626             except (ValueError, TypeError):
    627                 # e.g. astype_nansafe can fail on object-dtype of strings

~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
    895     if copy or is_object_dtype(arr) or is_object_dtype(dtype):
    896         # Explicit copy, or required since NumPy can't view from / to object.
--> 897         return arr.astype(dtype, copy=True)
    898 
    899     return arr.view(dtype)

ValueError: could not convert string to float: 

Kindly assist in debugging if possible. I would really like to keep it to one line of code for the entire data frame. Thank you in advance.

I think the problem is that you need to specify the columns for the replacement, and then replace the resulting empty strings with NaN (or 0) wherever a value is not numeric, like the second-to-last Size value:

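import numpy as np
cols = ['Size', 'Installs']
# Strip non-numeric characters, turn empty strings into NaN, then cast to float
df[cols] = df[cols].replace(r'[^\d.]', '', regex=True).replace('', np.nan).astype(float)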

print (df)
       Rating  Reviews  Size    Installs  Type  Price
0         4.1      159  19.0     10000.0  Free      0
1         3.9      967  14.0    500000.0  Free      0
2         4.7    87510   8.7   5000000.0  Free      0
3         4.5   215644  25.0  50000000.0  Free      0
4         4.3      967   2.8    100000.0  Free      0
10836     4.5       38  53.0      5000.0  Free      0
10837     5.0        4   3.6       100.0  Free      0
10838     0.0        3   9.5      1000.0  Free      0
10839     4.5      114   NaN      1000.0  Free      0
10840     4.5   398307  19.0  10000000.0  Free      0
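
As a variation on the snippet above, `pd.to_numeric` with `errors="coerce"` can stand in for the `.replace('', np.nan).astype(float)` step, since it turns unparsable cells into NaN by itself. This is only a sketch: the three-row frame below is a made-up miniature of the dataset, not the real CSV.

```python
import pandas as pd

# Hypothetical miniature of the dataset, just enough rows to show the behaviour
df = pd.DataFrame({
    "Size": ["19M", "8.7M", "Varies with device"],
    "Installs": ["10,000+", "5,000,000+", "1,000+"],
    "Type": ["Free", "Free", "Free"],
})

cols = ["Size", "Installs"]
# Strip everything except digits and dots from the chosen columns
stripped = df[cols].replace(r"[^\d.]", "", regex=True)
# Convert column by column; cells that cannot be parsed become NaN
df[cols] = stripped.apply(pd.to_numeric, errors="coerce")

print(df)
```

The "Varies with device" row ends up as NaN in Size, and the untouched Type column keeps its text.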

The problem is that you are replacing all non-numeric characters in your dataframe with "".

This means that a purely non-numeric string ends up as "", a zero-length string. That can't be interpreted as a float, so you get the error.
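
The same failure can be reproduced without pandas. The sketch below applies the question's pattern to one sample Genres value with the standard `re` module and then tries to convert what is left:

```python
import re

# Apply the same pattern to a value that contains no digits or dots
leftover = re.sub(r"[^\d.]", "", "Art & Design")
print(repr(leftover))

try:
    float(leftover)
    converted = True
except ValueError as exc:
    # float() refuses a zero-length string
    converted = False
    print("ValueError:", exc)
```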

If you run the replace over just your Rating column:

df["Rating"].replace(r'[^\d.]', '', regex=True).astype(float)

then it works, because removing the non-numeric characters from that column leaves only strings that can be converted to numbers.

However, running it over the whole dataframe doesn't work, because many of your values are purely non-numeric. The Genres column, for example, ends up as a column of empty strings, which raises the error.
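
If a single line over the whole frame is still the goal, one possible approach (not from the answers above) is to swap `.astype(float)` for `pd.to_numeric` with `errors="coerce"`, applied column by column. The frame below is a hypothetical three-row sample shaped like the question's data.

```python
import pandas as pd

# Hypothetical three-row sample shaped like the question's data
df = pd.DataFrame({
    "Rating": ["4.1", "5", "0.0"],
    "Size": ["19M", "3.6M", "Varies with device"],
    "Installs": ["10,000+", "100+", "1,000+"],
    "Genres": ["Art & Design", "Education", "Medical"],
    "Content Rating": ["Everyone", "Teen", "Mature 17+"],
})

# One line over the whole frame: strip, then coerce unparsable cells to NaN
clean = df.replace(r"[^\d.]", "", regex=True).apply(pd.to_numeric, errors="coerce")

print(clean)
```

Be aware of the side effects: purely textual columns such as Genres become all NaN, and text that happens to contain digits is silently turned into a number ("Mature 17+" becomes 17, and a date like "January 7, 2018" would become 72018). Restricting the operation to the columns you actually want to convert avoids this.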
