
Python: splitting trajectories into steps

I have trajectories created from moves between clusters such as these:

user_id,trajectory
11011,[[86], [110], [110]]
2139671,[[89], [125]]
3945641,[[36], [73], [110], [110]]
10024312,[[123], [27], [97], [97], [97], [110]]
14270422,[[0], [110], [174]]
14283758,[[110], [184]]
14317445,[[50], [88]]
14331818,[[0], [22], [36], [131], [131]]
14334591,[[107], [19]]
14373703,[[35], [97], [97], [97], [17], [58]]

I would like to split the trajectories with multiple moves into individual segments, but I am unsure how.

Example:

14373703,[[35], [97], [97], [97], [17], [58]]

into

14373703,[[35,97], [97,97], [97,17], [17,58]]

The purpose is to then use these as edges in NetworkX to analyse them as a graph and identify dense movements (edges) between the individual clusters (nodes).
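For reference, the downstream NetworkX step could count repeated moves as edge weights. This is a minimal stdlib sketch (the weighting scheme is an assumption, not from the question); the resulting counts can be handed to `nx.DiGraph` via `add_weighted_edges_from`:

```python
from collections import Counter

# Edge segments for user 14373703 from the example above
edges = [(35, 97), (97, 97), (97, 97), (97, 17), (17, 58)]

# Count repeated moves; each count can become an edge weight in NetworkX, e.g.
# G.add_weighted_edges_from((u, v, w) for (u, v), w in weights.items())
weights = Counter(edges)
print(weights[(97, 97)])  # → 2
```

High-weight edges then correspond to the "dense movements" between clusters.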

This is the code I've used to create the trajectories initially:

# Import Data
data = pd.read_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_outputs.csv', delimiter=',', engine='python')
#print len(data),"rows"

# Create Data Frame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude','cluster_labels'])

# Filter Data Frame by count of user_id
filtered = df.groupby('user_id').filter(lambda x: x['user_id'].count()>1)
#filtered.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_final_filtered.csv', index=False, header=True)

# Get a list of unique user_id values
uniqueIds = np.unique(filtered['user_id'].values)

# Get the ordered (by timestamp) coordinates for each user_id
output = [[id,filtered.loc[filtered['user_id']==id].sort_values(by='timestamp')[['cluster_labels']].values.tolist()] for id in uniqueIds]

# Save outputs as csv
outputs = pd.DataFrame(output)
#print outputs
headers = ['user_id','trajectory']
outputs.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_moves.csv', index=False, header=headers)

If splitting this way is possible, can it be completed during the processing, as opposed to after the fact? I'd like to perform it while creating the trajectories, to eliminate any postprocessing.

My solution uses the magic of pandas' .apply() function. I believe this should work (I tested it on your sample data). Notice that I also added extra data points on the end for the cases where there is only a single move, and where there is no move at all.

# Python3.5
import pandas as pd 


# Sample data from post
ids = [11011,2139671,3945641,10024312,14270422,14283758,14317445,14331818,14334591,14373703,10000,100001]
traj = [[[86], [110], [110]],[[89], [125]],[[36], [73], [110], [110]],[[123], [27], [97], [97], [97], [110]],[[0], [110], [174]],[[110], [184]],[[50], [88]],[[0], [22], [36], [131], [131]],[[107], [19]],[[35], [97], [97], [97], [17], [58]],[10],[]]

# Sample frame
df = pd.DataFrame({'user_ids':ids, 'trajectory':traj})

def f(x):
    # Creates edges given list of moves
    if len(x) <= 1: return x
    s = [x[i]+x[i+1] for i in range(len(x)-1)]
    return s

df['edges'] = df['trajectory'].apply(f)

Output:

print(df['edges'])

                                                edges  
0                             [[86, 110], [110, 110]]  
1                                         [[89, 125]]  
2                   [[36, 73], [73, 110], [110, 110]]  
3   [[123, 27], [27, 97], [97, 97], [97, 97], [97,...  
4                              [[0, 110], [110, 174]]  
5                                        [[110, 184]]  
6                                          [[50, 88]]  
7          [[0, 22], [22, 36], [36, 131], [131, 131]]  
8                                         [[107, 19]]  
9   [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...  
10                                               [10]  
11                                                 []

As far as where you can put this in your pipeline: just put it right after you get your trajectory column (whether that's after you load the data, or after you do whatever filtering you require).
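Concretely, slotting this into the question's pipeline might look like the sketch below (the single-row `output` list is a hypothetical stand-in for the real pipeline output; column names are taken from the question):

```python
import pandas as pd

def f(x):
    # Build consecutive edges from a list of single-element moves
    if len(x) <= 1:
        return x
    return [x[i] + x[i + 1] for i in range(len(x) - 1)]

# Hypothetical pipeline output, shaped like the question's `output` list
output = [[14373703, [[35], [97], [97], [97], [17], [58]]]]

outputs = pd.DataFrame(output, columns=['user_id', 'trajectory'])
# Apply the edge-building step immediately after the trajectory column exists,
# so no separate postprocessing pass is needed
outputs['edges'] = outputs['trajectory'].apply(f)
print(outputs['edges'][0])  # → [[35, 97], [97, 97], [97, 97], [97, 17], [17, 58]]
```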

If you zip the trajectory with itself, offset by one, you get the desired result.

Code:

for id, traj in data.items():
    print(id, list([i[0], j[0]] for i, j in zip(traj[:-1], traj[1:])))

Test Data:

data = {
    11011: [[86], [110], [110]],
    2139671: [[89], [125]],
    3945641: [[36], [73], [110], [110]],
    10024312: [[123], [27], [97], [97], [97], [110]],
    14270422: [[0], [110], [174]],
    14283758: [[110], [184]],
    14373703: [[35], [97], [97], [97], [17], [58]],
}

Results:

11011 [[86, 110], [110, 110]]
14373703 [[35, 97], [97, 97], [97, 97], [97, 17], [17, 58]]
3945641 [[36, 73], [73, 110], [110, 110]]
14283758 [[110, 184]]
14270422 [[0, 110], [110, 174]]
2139671 [[89, 125]]
10024312 [[123, 27], [27, 97], [97, 97], [97, 97], [97, 110]]

I think you can use groupby with apply and a custom function based on zip, building the output list of lists with a list comprehension:

Note:

The count function returns the number of non-NaN values only; if you are filtering by length and NaN values should still count, len is the better choice.
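A quick illustration of that difference, using a small hypothetical Series containing a NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.count())  # → 2, NaN is excluded
print(len(s))     # → 3, every row counts
```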

#filtering and sorting     
filtered = df.groupby('user_id').filter(lambda x: len(x['user_id'])>1)
filtered = filtered.sort_values(by='timestamp')

f = lambda x: [list(a) for a in zip(x[:-1], x[1:])]
df2 = filtered.groupby('user_id')['cluster_labels'].apply(f).reset_index()
print (df2)
    user_id                                     cluster_labels
0     11011                            [[86, 110], [110, 110]]
1   2139671                                        [[89, 125]]
2   3945641                  [[36, 73], [73, 110], [110, 110]]
3  10024312  [[123, 27], [27, 97], [97, 97], [97, 97], [97,...
4  14270422                             [[0, 110], [110, 174]]
5  14283758                                       [[110, 184]]
6  14373703  [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...

A similar solution, with the filtering done as a last step by boolean indexing:

filtered = filtered.sort_values(by='timestamp')

f = lambda x: [list(a) for a in zip(x[:-1], x[1:])]
df2 = filtered.groupby('user_id')['cluster_labels'].apply(f).reset_index()
df2 = df2[df2['cluster_labels'].str.len() > 0]
print (df2)
    user_id                                     cluster_labels
1     11011                            [[86, 110], [110, 110]]
2   2139671                                        [[89, 125]]
3   3945641                  [[36, 73], [73, 110], [110, 110]]
4  10024312  [[123, 27], [27, 97], [97, 97], [97, 97], [97,...
5  14270422                             [[0, 110], [110, 174]]
6  14283758                                       [[110, 184]]
7  14373703  [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...
