Addition based on the column name in a huge dataframe pandas

I have thousands of columns in the alldata dataframe. The column names consist of elements like A_B_C or A_B_D and so on. I already have A, B, C in the same dataframe, and separately in other dataframes such as df_A, df_B, df_C, df_D.

Iterating to get the sum of A, B and C, and then checking whether the sum of A, B, C or A, B, D is less than 0 on any row, doesn't look like a good idea, as it is taking forever. I'm not sure where the issue is.

Here's my code. How should it be optimized?

res1 is the list of combinations such as A_B_C and more.

    for i in res1:
        x = i.split("_")
        alldata['sum'] = alldata[x[0]]+alldata[x[1]]+alldata[x[2]]
        if sum(n < 0 for n in alldata['sum']) >0:
            c=""
            print("nah")
        else:
            nice = [x[0],x[1],x[2]]
            good = good.append(nice)
            print(nice)
        alldata = alldata.drop([i], axis=1)
        print("dropped," + str(len(alldata.columns)) + "columns remaining")

Let's look at what your code is doing, line by line:

res1 = ['A_B_C', 'A_B_D']
for i in res1:

How long is your actual res1? If it is very long (many thousands), this for loop is always going to take a while. By the way, i is a terrible variable name for a multipart string (how about grouped_names or something?).

    x = i.split("_")

That won't take much time at all, assuming each string is short as in your example.

    alldata['sum'] = alldata[x[0]] + alldata[x[1]] + alldata[x[2]]

The good news is that the above is vectorized, so it will run at native speed (pandas Series addition and assignment are implemented in compiled code). But how many rows does alldata have? That's the main factor in the performance of the above line.

    if sum(n < 0 for n in alldata['sum']) > 0:
        c=""
        print("nah")

You don't use c anywhere else, so remove it. And what's the print statement about? I'd remove that too.
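Note also that the generator expression in that if line visits every element of the Series in a Python-level loop. A minimal sketch of a vectorized equivalent follows, assuming the component columns hold numeric data; the helper name has_negative_sum and the toy values are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for alldata.
alldata = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [0, -1, 2],
    "C": [4, 5, -6],
    "D": [1, 1, 1],
})

def has_negative_sum(df, grouped_name):
    """Return True if the row-wise sum of the named columns is < 0 on any row."""
    cols = grouped_name.split("_")
    # skipna=False keeps NaN handling the same as chaining the columns with "+".
    row_sums = df[cols].sum(axis=1, skipna=False)
    # The comparison and the any() reduction both run in compiled code.
    return bool((row_sums < 0).any())

abc_negative = has_negative_sum(alldata, "A_B_C")
abd_negative = has_negative_sum(alldata, "A_B_D")
print(abc_negative, abd_negative)
```

With these toy values, A + B + C goes negative on the last row while A + B + D never does, so only the second combination would be kept.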

    else:
        nice = [x[0],x[1],x[2]]
        good = good.append(nice)
        print(nice)

Just say nice = x[:3] to take the first three elements instead of listing them one by one.

    alldata = alldata.drop([i], axis=1)

This doesn't seem great. You're creating a new DataFrame every time. Instead, how about this:

    drop_cols.append(i)

And then do alldata.drop(drop_cols, axis=1, inplace=True) at the very end, just once.
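As a rough sketch of that pattern (the two-part column names and values here are hypothetical, chosen only to keep the example small):

```python
import pandas as pd

# Hypothetical frame: two base columns plus two combined columns to discard.
alldata = pd.DataFrame({"A": [1], "B": [2], "A_B": [3], "B_A": [4]})
res1 = ["A_B", "B_A"]

drop_cols = []
for grouped_name in res1:
    # ... per-combination work would go here ...
    drop_cols.append(grouped_name)  # only remember the name; no DataFrame copy

# A single drop after the loop instead of one new DataFrame per iteration.
alldata.drop(drop_cols, axis=1, inplace=True)
remaining = list(alldata.columns)
print("dropped,", len(remaining), "columns remaining")
```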

    print("dropped," + str(len(alldata.columns)) + "columns remaining")

You can simplify that a little:

    print("dropped,", len(alldata.columns), "columns remaining")

Once you've tried the above, let us know how long the code takes on some reasonably sized DataFrame (one which contains real data, but perhaps not all the rows). Then tell us how much more speedup you need to make the solution acceptable.
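One possible way to take that measurement is sketched below. The function name time_check, the n_rows parameter and the toy data are assumptions made for illustration; in practice you would pass your real alldata and res1.

```python
import time
import pandas as pd

def time_check(df, names, n_rows):
    """Run the sum check on the first n_rows rows and report the elapsed time."""
    sample = df.head(n_rows)  # a subset of rows, so a trial run stays short
    good = []
    start = time.perf_counter()
    for grouped_names in names:
        cols = grouped_names.split("_")
        total = sample[cols[0]] + sample[cols[1]] + sample[cols[2]]
        if not (total < 0).any():  # keep combinations with no negative row sum
            good.append(cols[:3])
    elapsed = time.perf_counter() - start
    return good, elapsed

# Hypothetical toy data standing in for alldata and res1.
alldata = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [0, -1, 2],
    "C": [4, 5, -6],
    "D": [1, 1, 1],
})
good, elapsed = time_check(alldata, ["A_B_C", "A_B_D"], n_rows=3)
print(good, f"{elapsed:.6f} s")
```

Running it with a few increasing values of n_rows gives a rough idea of how the time scales with the number of rows.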

I think @John Zwinck covered all the right points...

I suggest using df.apply() to compute the sum and raising an Exception to stop at the first failure, which should speed things up when a combination fails. If every combination succeeds, then summing the columns directly is the better approach, IMO.

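def testColumns(row, colName):
    # Sum the component columns named in e.g. 'A_B_C' for a single row.
    s_ = sum(row[c] for c in colName.split('_'))
    if s_ < 0:
        # Raising here aborts apply() at the first row with a negative sum.
        raise Exception('nah')
    return True

# columnSet holds the combined names (res1 in the question); good is a list defined beforehand.
for col in columnSet:
    sName = 's_%s' % col
    try:
        alldata[sName] = alldata.apply(lambda r: testColumns(r, col), axis=1)
        msg = "col kept"
        nice = col.split('_')
        good.append(nice)
    except Exception as err:
        msg = "col dropped"
        # sName was never assigned when apply() raised, so only col is dropped.
        alldata.drop([col], axis=1, inplace=True)
    print(msg, len(alldata.columns), "columns remaining")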
