Python: Identify breaking points in a data frame column

I am interested in identifying when different trips take place in a dataset. There are two lock states: "locked" means the vehicle is stationary, and "unlocked" means the vehicle is being used.

As the same vehicle could be used by the same user multiple times, I first isolate the vehicle and a unique user through their IDs. From a chronologically sorted datetime column I can then see when the vehicle was used. To identify the different trips taken in the same vehicle by the same user, I thought of using my lock_state variable.

I've been trying to find out how this could be done. Percolation is something I came across, but it seems too complex to understand and implement. I was wondering if there is an easier way of achieving this.

My end goal is to identify the number of trips (should be 2 in this example), add them to a new df alongside the user ID and start/end datetimes (let's pretend all of this is the random column), and give them unique IDs. So the final output should be something like this (random made-up example):

trip_id      start_time  end_time  user_id
jk3b4334kjh  x           x         093723
nbnmvn829nk  x           x         234380

Assuming the following sample data is in chronological order, how could I identify different trips through the state variable? (There should be 2 trips identified, as the array contains two continuous runs of "unlocked" separated by "locked" states.)

import random
import pandas as pd

lock_state = ["locked", "unlocked", "unlocked", "unlocked", "locked", "locked", "unlocked", "unlocked"]
# should be 2 trips

random_values = random.sample(range(2,20), 8)

df = pd.DataFrame(
    {'state': lock_state,
     'random': random_values
    })

df

>>
    state   random
0   locked      5
1   unlocked    12
2   unlocked    17
3   unlocked    13
4   locked      18
5   locked      6
6   unlocked    4
7   unlocked    9

Shift your state column by 1 row, compare against the original, and filter on state "unlocked". That identifies runs of consecutive "unlocked" states. The shift-compare portion is

df.state.eq(df.state.shift())

Given that hint, I expect that you can finish the coding, eh?
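For readers who want the hint spelled out, here is a minimal sketch of one way to finish it. It uses `ne` instead of `eq` to mark where a run begins, then counts those starts with `cumsum`. The `trip` column and the helper names `unlocked`, `run_start` and `trip_start` are made up for this illustration and are not part of the original answer.

```python
import pandas as pd

lock_state = ["locked", "unlocked", "unlocked", "unlocked",
              "locked", "locked", "unlocked", "unlocked"]
df = pd.DataFrame({"state": lock_state})

# True on rows where the vehicle is in use.
unlocked = df["state"].eq("unlocked")

# True where the state differs from the previous row, i.e. a new run begins.
# The first row is compared against NaN, so it always counts as a run start.
run_start = df["state"].ne(df["state"].shift())

# A trip begins on an "unlocked" row that also begins a run.
trip_start = unlocked & run_start

# Running count of trip starts numbers the trips; "locked" rows get 0.
df["trip"] = trip_start.cumsum().where(unlocked, 0)

n_trips = int(trip_start.sum())
print(df)
print("number of trips:", n_trips)
```

On real data with several vehicles and users, the same idea would presumably need to run per group (for example inside a `groupby` on the vehicle and user IDs) so that a run does not carry over from one vehicle to the next.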

I came up with this implementation of 1D Hoshen-Kopelman cluster labelling.

import random
import pandas as pd
import numpy as np

lock_state = ["locked", "unlocked", "unlocked", "unlocked", "locked", "locked", "unlocked", "unlocked"]

random_values = random.sample(range(2,20), 8) 

df = pd.DataFrame(
    {'state': lock_state,
     'random': random_values
    })
    

def hoshen_kopelman_1d(grid, occupied_label):
    """
    Hoshen Kopelman implementation for 1D graphs.
    
    Parameters:
            grid (pd.DataFrame): The 1D grid, with a "state" column and a default integer index (0, 1, 2, ...).
            occupied_label (str): the label that identifies occupied nodes.

    Returns:
            labeled_grid (pd.DataFrame): grid with cluster labeled nodes.
    """
    
    # create labeled_grid from the grid argument and assign all nodes to cluster 0
    labeled_grid = grid.assign(cluster=0)
    cluster_count = 0
    
    # iterate through the grid's nodes left to right
    for index, node in grid.iterrows():
        # check if node is occupied
        if node["state"] == occupied_label: # node is occupied
            if index == 0:
                # initialize new cluster
                cluster_count += 1
                labeled_grid.loc[0, "cluster"] = cluster_count
            else:
                # check if left-neighbour node is occupied
                if labeled_grid.loc[index-1, "cluster"] != 0: # left-neighbour node is occupied
                    # assign node to the same cluster as left-neighbour node
                    labeled_grid.loc[index, "cluster"] = labeled_grid.loc[index-1, "cluster"]
                else: # left-neighbour node is unoccupied
                    # initialize new cluster
                    cluster_count += 1
                    labeled_grid.loc[index, "cluster"] = cluster_count
                    
    return labeled_grid
                

M = hoshen_kopelman_1d(grid=df, occupied_label="unlocked")

It returns a new pandas.DataFrame with an extra "cluster" column, which indicates the cluster to which the node belongs (0 means the node is unoccupied and does not belong to any cluster).

Having this, it becomes pretty straightforward to retrieve the rows from, e.g., trip 1. We could do

trip_1 = M.loc[M['cluster'] == 1]
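Going one step further towards the table the question asks for, the labelled frame could be collapsed into one row per trip. The following is only a sketch under assumptions: the sample data has no datetime or user columns, so the `timestamp` and `user_id` columns below are invented stand-ins, `M` is rebuilt by hand with the cluster labels the function would produce, and `uuid4` is just one possible way to generate unique trip IDs.

```python
import uuid
import pandas as pd

# Hand-built stand-in for the labelled frame M.
# "timestamp" and "user_id" are hypothetical columns added for illustration.
M = pd.DataFrame({
    "state": ["locked", "unlocked", "unlocked", "unlocked",
              "locked", "locked", "unlocked", "unlocked"],
    "timestamp": pd.date_range("2021-01-01 08:00", periods=8, freq="min"),
    "user_id": ["093723"] * 8,
    "cluster": [0, 1, 1, 1, 0, 0, 2, 2],
})

# Drop unoccupied rows (cluster 0), then summarise each cluster as one trip.
trips = (
    M[M["cluster"] != 0]
    .groupby("cluster")
    .agg(start_time=("timestamp", "first"),
         end_time=("timestamp", "last"),
         user_id=("user_id", "first"))
    .reset_index(drop=True)
)

# Attach a random unique identifier to every trip.
trips.insert(0, "trip_id", [uuid.uuid4().hex for _ in range(len(trips))])
print(trips)
```

Taking the first and last timestamp of each cluster relies on the rows being in chronological order, as the question states.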
