Assemble and analyze a list of lists from a dataframe in Python
I've got a .csv file that looks a bit like this:
COL_A  COL_B                COL_C
1      2020-05-26T00:01:01  99999
2      2020-05-26T00:01:02  99999
3      2020-05-26T00:01:03  99999
4      2020-05-26T00:01:04  2.3
5      2020-05-26T00:01:05  2.3
6      2020-05-26T00:01:06  2.3
7      2020-05-26T00:01:07  99999
8      2020-05-26T00:01:08  99999
9      2020-05-26T00:01:09  3.4
10     2020-05-26T00:01:10  3.4
11     2020-05-26T00:01:11  99999
12     2020-05-26T00:01:12  99999
I'd like to be able to identify the longest continuous span of rows where COL_C is < 5 and return that list of rows. The desired output would be something like:
[
[4 2020-05-26T00:01:04 2.3,
5 2020-05-26T00:01:05 2.3,
6 2020-05-26T00:01:06 2.3]
], 3
The way I have approached this in theory is building a list of lists that meet the criteria, and then taking max over the lists with len as the key. I've attempted this:
import pandas as pd

def max_c(csv_file):
    row_list = []
    df = pd.read_csv(csv_file)
    for i, row in df.iterrows():
        while row[2] < 5:
            span = [*row]
            row_list.append(span)
    return max(row_list, key=len)
I know enough to know that this isn't correct syntax for what I'm trying to do, and I can even explain why, but I don't know enough to get the desired output.
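The list-of-lists idea from the question can be made to work directly, without any grouping tricks: collect consecutive passing rows into a run, close the run when a row fails, and take the longest run at the end. A minimal sketch (the helper name max_run is mine, and the sample data is inlined instead of read from a file):

```python
import pandas as pd

def max_run(df):
    """Return the longest consecutive run of rows with COL_C < 5, plus its length."""
    runs, current = [], []
    for _, row in df.iterrows():
        if row['COL_C'] < 5:
            current.append(list(row))   # row passes: extend the current run
        elif current:
            runs.append(current)        # run broken: store it and start a new one
            current = []
    if current:
        runs.append(current)            # keep a run that reaches the end of the file
    longest = max(runs, key=len) if runs else []
    return longest, len(longest)

df = pd.DataFrame({
    'COL_A': range(1, 13),
    'COL_B': [f'2020-05-26T00:01:{i:02d}' for i in range(1, 13)],
    'COL_C': [99999, 99999, 99999, 2.3, 2.3, 2.3,
              99999, 99999, 3.4, 3.4, 99999, 99999],
})
longest, n = max_run(df)
print(n)  # 3
```

This is O(n) in the number of rows, though iterrows() is slow on large frames; the vectorized answers below the question avoid the Python-level loop entirely.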
Similar to Quang: find the values greater than or equal to 5 and use them to create sub-groups, then filter those values out, count each remaining group with transform('count'), and pick the rows at the max-count index:
s = df.COL_C.ge(5)
s = df.loc[~s, 'COL_A'].groupby(s.cumsum()).transform('count')
target = df.loc[s[s == s.max()].index]
Out[299]:
COL_A COL_B COL_C
3 4 2020-05-26T00:01:04 2.3
4 5 2020-05-26T00:01:05 2.3
5 6 2020-05-26T00:01:06 2.3
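Put together on the sample data so it runs end to end (the DataFrame construction is mine; the three lines in the middle are the answer's). Note that on the right-hand side of the second line, s still refers to the boolean mask, so s.cumsum() labels each block delimited by a >= 5 row:

```python
import pandas as pd

df = pd.DataFrame({
    'COL_A': range(1, 13),
    'COL_B': [f'2020-05-26T00:01:{i:02d}' for i in range(1, 13)],
    'COL_C': [99999, 99999, 99999, 2.3, 2.3, 2.3,
              99999, 99999, 3.4, 3.4, 99999, 99999],
})

s = df.COL_C.ge(5)                      # True where COL_C >= 5
# keep only the < 5 rows, then count how many share each block label
s = df.loc[~s, 'COL_A'].groupby(s.cumsum()).transform('count')
target = df.loc[s[s == s.max()].index]  # rows of the biggest block
print(target)
```

For the sample above, target is the three rows with COL_A 4, 5, 6.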
I'll use cumsum() to identify blocks and do a groupby:
s = df['COL_C'].lt(5)
sizes = s.groupby([s,(~s).cumsum()]).transform('size') * s
# max block 1 size
# max_size == 0 means all values are >= 5
max_size = sizes.max()
df[sizes==max_size]
Output:
COL_A COL_B COL_C
3 4 2020-05-26T00:01:04 2.3
4 5 2020-05-26T00:01:05 2.3
5 6 2020-05-26T00:01:06 2.3
Details:

s is:
0 False
1 False
2 False
3 True
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 False
Name: COL_C, dtype: bool
If we just do s.cumsum(), then the True values obviously end up in different groups (every True increments the sum, so consecutive Trues get distinct labels). If instead we do (~s).cumsum(), we get:
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 5
8 5
9 5
10 6
11 7
Name: COL_C, dtype: int64
Almost there, but each group of True is now preceded by a row of False that shares the same label. That suggests we group by both s and the negated cumsum.