
Assemble and analyze a list of lists from dataframe in python

I've got a .csv file that looks a bit like this:

COL_A    COL_B                   COL_C
1        2020-05-26T00:01:01     99999
2        2020-05-26T00:01:02     99999
3        2020-05-26T00:01:03     99999
4        2020-05-26T00:01:04     2.3
5        2020-05-26T00:01:05     2.3
6        2020-05-26T00:01:06     2.3
7        2020-05-26T00:01:07     99999
8        2020-05-26T00:01:08     99999
9        2020-05-26T00:01:09     3.4
10       2020-05-26T00:01:10     3.4
11       2020-05-26T00:01:11     99999
12       2020-05-26T00:01:12     99999

I'd like to be able to identify the longest continuous span of rows where COL_C is < 5 and return that list of rows. The desired output would be something like:

[
    [4        2020-05-26T00:01:04     2.3,
     5        2020-05-26T00:01:05     2.3,
     6        2020-05-26T00:01:06     2.3]
], 3

The way I have approached this in theory is building a list of lists that meet the criteria, and then using max over the lists with len as the key. I've attempted this:

import pandas as pd

def max_c(csv_file):
    row_list = []
    df = pd.read_csv(csv_file)
    for i, row in df.iterrows():
        while row[2] < 5:
            span = [*row]
            row_list.append(span)
    return max(row_list, key=len)

I know enough to know that this isn't correct syntax for what I'm trying to do, and I can even explain why, but I don't know enough to get the desired output.
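For what it's worth, the list-of-lists idea from the question can be made to work in pure Python with itertools.groupby; this is only a sketch of the asker's stated approach, with the rows hand-built from the sample data rather than read from the .csv:

```python
from itertools import groupby

# Stand-in for the .csv contents from the question (COL_A, COL_B, COL_C).
rows = [
    (1, "2020-05-26T00:01:01", 99999),
    (2, "2020-05-26T00:01:02", 99999),
    (3, "2020-05-26T00:01:03", 99999),
    (4, "2020-05-26T00:01:04", 2.3),
    (5, "2020-05-26T00:01:05", 2.3),
    (6, "2020-05-26T00:01:06", 2.3),
    (7, "2020-05-26T00:01:07", 99999),
    (8, "2020-05-26T00:01:08", 99999),
    (9, "2020-05-26T00:01:09", 3.4),
    (10, "2020-05-26T00:01:10", 3.4),
    (11, "2020-05-26T00:01:11", 99999),
    (12, "2020-05-26T00:01:12", 99999),
]

# Group consecutive rows by whether COL_C < 5, keep only the qualifying
# runs, then take the longest run with max(..., key=len).
runs = [list(g) for below, g in groupby(rows, key=lambda r: r[2] < 5) if below]
longest = max(runs, key=len)
print(longest, len(longest))
```

groupby only merges *consecutive* items with the same key, which is exactly the "continuous span" semantics the question asks for.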

Similar to Quang's approach: flag the values greater than or equal to 5 and use the cumulative sum of that flag to create sub-groups. Then we filter out the values of 5 or more, count each remaining group with transform('count'), and pick the rows at the indices of the max count.

s=df.COL_C.ge(5)
s=df.loc[~s,'COL_A'].groupby(s.cumsum()).transform('count')
target=df.loc[s[s==s.max()].index]
Out[299]: 
   COL_A                COL_B  COL_C
3      4  2020-05-26T00:01:04    2.3
4      5  2020-05-26T00:01:05    2.3
5      6  2020-05-26T00:01:06    2.3
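A self-contained version of the snippet above might look like this (the DataFrame is rebuilt from the question's sample data, since the original df isn't shown):

```python
import pandas as pd

# Rebuild the question's sample data.
df = pd.DataFrame({
    "COL_A": range(1, 13),
    "COL_B": [f"2020-05-26T00:01:{i:02d}" for i in range(1, 13)],
    "COL_C": [99999, 99999, 99999, 2.3, 2.3, 2.3,
              99999, 99999, 3.4, 3.4, 99999, 99999],
})

s = df.COL_C.ge(5)                     # True where COL_C >= 5
# Each run of sub-5 rows shares one value of s.cumsum(); count run lengths.
counts = df.loc[~s, "COL_A"].groupby(s.cumsum()).transform("count")
target = df.loc[counts[counts == counts.max()].index]
print(target)
```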

I'll use cumsum() to identify blocks and do a groupby:

s = df['COL_C'].lt(5)
sizes = s.groupby([s,(~s).cumsum()]).transform('size') * s

# max block 1 size
# max_size == 0 means all values are >= 5
max_size = sizes.max()


df[sizes==max_size]

Output:

   COL_A                COL_B  COL_C
3      4  2020-05-26T00:01:04    2.3
4      5  2020-05-26T00:01:05    2.3
5      6  2020-05-26T00:01:06    2.3
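Put together as a runnable whole (again rebuilding the DataFrame from the question's sample, which is an assumption about the original data):

```python
import pandas as pd

df = pd.DataFrame({
    "COL_A": range(1, 13),
    "COL_B": [f"2020-05-26T00:01:{i:02d}" for i in range(1, 13)],
    "COL_C": [99999, 99999, 99999, 2.3, 2.3, 2.3,
              99999, 99999, 3.4, 3.4, 99999, 99999],
})

s = df["COL_C"].lt(5)                  # True where COL_C < 5
# Group by (s, block id); transform('size') gives each run's length,
# and multiplying by s zeroes out the runs where COL_C >= 5.
sizes = s.groupby([s, (~s).cumsum()]).transform("size") * s

max_size = sizes.max()                 # 0 would mean all values are >= 5
result = df[sizes == max_size]
print(result)
```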

Details:

s is:

0     False
1     False
2     False
3      True
4      True
5      True
6     False
7     False
8      True
9      True
10    False
11    False
Name: COL_C, dtype: bool

If we just do s.cumsum(), then the True rows obviously end up in different groups. Instead, with (~s).cumsum() we get:

0     1
1     2
2     3
3     3
4     3
5     3
6     4
7     5
8     5
9     5
10    6
11    7
Name: COL_C, dtype: int64

Almost there, but each group of True is now preceded by a row of False. That suggests we group by both s and the negated cumsum.
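The composite grouping key described above can be sketched directly; here s is rebuilt from the boolean values shown earlier, and the (s, block) pairs show why each run of True gets its own group:

```python
import pandas as pd

# The boolean mask s printed above.
s = pd.Series([False] * 3 + [True] * 3 + [False] * 2
              + [True] * 2 + [False] * 2)

blocks = (~s).cumsum()

# Pairing s with the negated cumsum separates each True run from both the
# False rows and the other True runs: the leading False row of a block has
# the same cumsum value but a different s value.
keys = list(zip(s, blocks))
print(keys)
```

Rows 3–5 all share the key (True, 3) and rows 8–9 share (True, 5), while no False row carries either key, so groupby([s, (~s).cumsum()]) yields exactly the continuous sub-5 runs.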
