Make this data cleaning loop more Pythonic

I am cleaning some human-classified data with Python 2.7, mainly using pandas, but using numpy.isreal() to check for floats because some people apparently entered floats in fields like 'background_color'. Below is an example of what my current setup looks like for one color. It works; it just doesn't look very Pythonic. By the end of the loop, blueShapes is a list of all the indexes where background_color is, ignoring case, 'BLUE':

blueShapes=[]
for i in range(imageData.shape[0]):
    if not (np.isreal(imageData.loc[i,'background_color'])):
        if imageData.loc[i,'background_color'].upper()=='BLUE':
            blueShapes.append(i)

It seems like I could use the map function to make this more Pythonic and prettier. Like I said, it functions as intended, but it just seems too much like C or Java to be written in Python. Thanks in advance for any responses.

Edit: I removed the count because it was a relic from an old loop.

You could create a new column with the upper-cased values:

imageData['background_color_2'] = map(lambda x: x.upper(), imageData['background_color'].astype(str))

subset = imageData[imageData['background_color_2']=='BLUE']

For the count:

len(subset['background_color'])
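A sketch of the same idea without map, assuming pandas' vectorized .str accessor. The small imageData frame here is made up for illustration. Note that on Python 3, map returns an iterator rather than a list, so the .str form is also the more portable spelling.

```python
import pandas as pd

# Hypothetical sample data standing in for the real imageData frame.
imageData = pd.DataFrame(
    {'background_color': ['blue', 3.5, 'Blue', 'red', 'BLUE']}
)

# astype(str) turns stray floats into strings such as '3.5',
# so .str.upper() has only strings to work on.
upper = imageData['background_color'].astype(str).str.upper()

# Rows whose color is 'BLUE' regardless of the original casing.
subset = imageData[upper == 'BLUE']

# Summing a boolean mask counts the True entries.
blue_count = int((upper == 'BLUE').sum())

print(list(subset.index), blue_count)
```

This leaves the original column untouched, which helps if the invalid float entries need to be inspected later.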

You can define a lambda function that returns the index of rows with a specific string value:

getRowIndexWithStringColor = lambda df, color: [i for i in range(df.shape[0]) if (not np.isreal(df.loc[i,'background_color'])) and df.loc[i,'background_color'].upper()==color]
rowIndexWithBlue = getRowIndexWithStringColor(imageData, 'BLUE')

As a general rule, if you are looping in pandas, you are doing it wrong.

It should look something like this (though it is untested, so you will need to adapt it!):

strings = (~imageData.background_color.apply(np.isreal))
blue = (imageData.background_color.str.upper()=="BLUE")
blueshapes = imageData[strings & blue].index           
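Here is a self-contained sketch of that boolean-mask approach on made-up data. It swaps np.isreal for an isinstance check, which is my own assumption about what "not a float" should mean here; the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: a mix of strings and stray floats.
imageData = pd.DataFrame(
    {'background_color': ['blue', 2.0, 'Blue', 'green', 'BLUE', 7.5]}
)

col = imageData['background_color']

# True where the entry is an actual string (assumption: anything else is invalid).
is_string = col.apply(lambda v: isinstance(v, str))

# .str.upper() yields NaN for non-string entries, and NaN == 'BLUE' is False.
is_blue = col.str.upper() == 'BLUE'

# Combine the masks and keep only the index labels of matching rows.
blueshapes = imageData[is_string & is_blue].index

print(list(blueshapes))
```

Since non-string entries already compare as False in is_blue, the is_string mask is mostly there to make the intent explicit.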

Thanks all! I used a minor adaptation of Steven G's answer. I have all of this backed up in a master .csv, so I had no qualms about simply overwriting the background_color column with its string equivalent. Any non-string entries are invalid anyway, but they are not the only invalid ones, so I will find them later as leftover indexes after I concatenate the indexes of all the colors. Each list will be extracted as follows:

imageData['background_color']=map(lambda x: x.upper(), imageData['background_color'].astype(str))

blueShapes=imageData[imageData['background_color']=='BLUE'].index
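The leftover-index step described above might look like the following sketch. The sample frame and the known_colors list are hypothetical, and .str.upper() is used in place of map.

```python
import pandas as pd

# Hypothetical sample data, including a float and a misspelled color.
imageData = pd.DataFrame(
    {'background_color': ['blue', 3.5, 'Red', 'BLUE', 'gren', 'red']}
)

# Overwrite the column with its upper-cased string equivalent.
imageData['background_color'] = (
    imageData['background_color'].astype(str).str.upper()
)

# Hypothetical list of valid colors.
known_colors = ['BLUE', 'RED']

# One index of matching rows per color.
shapes_by_color = {
    color: imageData[imageData['background_color'] == color].index
    for color in known_colors
}

# Concatenate the per-color indexes, then take what is left over.
indexes = list(shapes_by_color.values())
matched = indexes[0].append(indexes[1:])
leftovers = imageData.index.difference(matched)

print(list(leftovers))
```

The leftovers are the rows that matched no known color, which covers both the stray floats and any misspelled entries.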

I would turn it into a function and return an array.

Google: Zen of Python

As a quick reference: in Python, prefer list/dict/set comprehensions over map/filter.

They give better readability and cleaner code.

def colorShapes(color):
    # Indexes of rows whose background_color is a string equal to `color`, ignoring case.
    return [i
            for i in range(imageData.shape[0])
            if not np.isreal(imageData.loc[i, 'background_color'])
            and imageData.loc[i, 'background_color'].upper() == color]
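A variant of the function above, sketched under a few assumptions of my own: the frame is passed in as an argument instead of read from a global, the string check uses isinstance rather than np.isreal, and the name color_shapes and the column parameter are hypothetical.

```python
import pandas as pd


def color_shapes(df, color, column='background_color'):
    """Return the index labels of rows whose `column` is a string equal to `color`, ignoring case."""
    target = color.upper()
    return [
        idx
        for idx, value in df[column].items()  # iterate over (label, value) pairs
        if isinstance(value, str) and value.upper() == target
    ]


# Hypothetical sample data for a quick check.
imageData = pd.DataFrame(
    {'background_color': ['blue', 1.0, 'RED', 'Blue', 'red']}
)
blue_shapes = color_shapes(imageData, 'blue')
red_shapes = color_shapes(imageData, 'RED')
print(blue_shapes, red_shapes)
```

Iterating over items() avoids the repeated .loc lookups and also works when the index is not a plain 0..n-1 range.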
