Pandas DataFrame: complex problem with selecting rows linked to each other
I have a dataframe df:
Col1 Col2
A B
C A
B D
E F
G D
G H
K J
and a Series id of IDs:
ID
A
F
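For anyone who wants to reproduce this, the sample data above can be rebuilt roughly as follows. This is only a sketch: the names df and id come from the question, and naming the Series ID is an assumption based on the listing above.

```python
import pandas as pd

# Edge list from the question: each row links the letter in Col1 to the one in Col2
df = pd.DataFrame({
    "Col1": ["A", "C", "B", "E", "G", "G", "K"],
    "Col2": ["B", "A", "D", "F", "D", "H", "J"],
})

# IDs whose links should be collected (the Series name "ID" is an assumption)
id = pd.Series(["A", "F"], name="ID")
```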
What I want is, for every letter in id, to select the other letters that are linked to it through at most 2 intermediates. Let's work through the example for A (it is much easier to understand with an example):
There are 2 rows that include A, linking it to B and C, so the direct links to A are [B, C] (no matter whether A is in Col1 or Col2):
A B
C A
But B is also linked to D, and D is linked to G:
B D
G D
So the links to A are [B, C, D, G]. Even though G and H are linked, reaching H would take more than 2 intermediates from A (A > B > D > G > H, with B, D and G as intermediates), so I don't include H in A's links list:
G H
I'm looking for a way to find the links list for every ID in id and save it alongside id:
ID LinksList
A [B, C, D, G]
F [E]
I don't mind the type of LinksList (it can be a string), as long as I can get the info for a specific ID and work with it. I also don't mind the order of the IDs in LinksList, as long as it's complete.
I already found a way to solve the problem, but it uses 3 for loops, so it takes a really long time. (For k1 in ID, for k2 in range(0, 3), select the direct links of each element of LinksList plus the starting element, and put them in LinksList if they're not already in it.) Can someone please help me do it with Pandas only? Thanks a lot in advance!
==== EDIT: Here are the "3 loops", after Karl's comment: ====
import pandas as pd

rows = []
for k in id:
    # direct links of k, whichever column k appears in
    linklist = list(df[df['Col1'] == k]['Col2']) + list(df[df['Col2'] == k]['Col1'])
    new = linklist.copy()
    intermediate_count = 1
    while len(new) > 0 and intermediate_count <= 2:
        nn = new.copy()
        new = []
        for n in nn:
            toadd = list(df[df['Col1'] == n]['Col2']) + list(df[df['Col2'] == n]['Col1'])
            # keep only letters not seen yet (and never the starting ID itself)
            toadd = list(set(toadd).difference(linklist + [k]))
            linklist = linklist + toadd
            new = new + toadd
        intermediate_count += 1
    rows.append({'Id': k, 'Linked': linklist})
# build the result once instead of appending one dataframe per ID
df_result = pd.DataFrame(rows)
I would first append the reciprocal of the dataframe so that I can always go from Col1 to Col2. Then I would use merges to compute the possible results with 1 and 2 intermediate steps. Finally, I would aggregate all those values into sets. The code could be:
import pandas as pd

# append the symmetric edges (Col2 -> Col1) to the end of the dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df2 = pd.concat(
    [df, df.rename(columns={'Col1': 'Col2', 'Col2': 'Col1'})],
    ignore_index=True).drop_duplicates()
# add a first step in Col3
df3 = df2.merge(df2, how='left', left_on='Col2', right_on='Col1',
                suffixes=('', '_')).drop(columns='Col1_').rename(
    columns={'Col2_': 'Col3'})
# add a second step in Col4
df4 = df3.merge(df2, how='left', left_on='Col3', right_on='Col1',
                suffixes=('', '_')).drop(columns='Col1_').rename(
    columns={'Col2_': 'Col4'})
# aggregate Col2 to Col4 into a set
df4['Links'] = df4.iloc[:, 1:].agg(set, axis=1)
# aggregate that new column grouped by Col1
result = df4.groupby('Col1')['Links'].agg(lambda x: set.union(*x)).reset_index()
# remove the initial value if present in Links
# ({v} rather than set(v), which would split a multi-character ID into characters)
result['Links'] = result['Links'] - result['Col1'].apply(lambda v: {v})
# and display the result restricted to id
print(result[result['Col1'].isin(id)])
With the sample data, it gives the expected result:
Col1 Links
0 A {D, C, B, G}
5 F {E}
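If chaining one merge per step becomes unwieldy, the same idea can also be written as a depth-limited breadth-first search over an adjacency mapping. The sketch below is not the merge approach above but an alternative; the function name links_within and its max_intermediates parameter are made up for illustration, and it assumes the two columns are named Col1 and Col2 as in the question.

```python
from collections import defaultdict, deque

import pandas as pd


def links_within(df, ids, max_intermediates=2):
    """Map each ID to the sorted letters reachable through at most max_intermediates."""
    # Build an undirected adjacency mapping from the two columns
    neighbours = defaultdict(set)
    for a, b in zip(df["Col1"], df["Col2"]):
        neighbours[a].add(b)
        neighbours[b].add(a)

    # A path with 2 intermediates has 3 edges
    max_hops = max_intermediates + 1
    result = {}
    for start in ids:
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the allowed distance
            for nxt in neighbours.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
        result[start] = sorted(seen - {start})
    return result


# Sample data from the question
df = pd.DataFrame({
    "Col1": ["A", "C", "B", "E", "G", "G", "K"],
    "Col2": ["B", "A", "D", "F", "D", "H", "J"],
})
links = links_within(df, ["A", "F"])
links_series = pd.Series(links, name="LinksList")
```

Each ID is searched independently, so changing max_intermediates only changes how far each search expands, with no extra merge to write.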
We can use the NetworkX library:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
# Read in pandas dataframe using copy and paste
df = pd.read_clipboard()
# Create graph network from pandas dataframe
G = nx.from_pandas_edgelist(df, 'Col1', 'Col2')
# Create id, Series
id = pd.Series(['A', 'F'])
# Use the IDs themselves as the index of the Series
id.index=id
# For each ID, collect every node reachable within 3 edges (2 intermediates);
# the first key is the source itself, so drop it
id.apply(lambda x: list(nx.single_source_shortest_path(G, x, 3).keys())[1:])
Output:
A [B, C, D, G]
F [E]
dtype: object
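To see why a cutoff of 3 matches "at most 2 intermediates", it can help to look at the distances directly. The sketch below rebuilds the sample data in code instead of reading the clipboard (an assumption made so it runs on its own) and uses nx.single_source_shortest_path_length, which returns the number of edges from the source to each node within the cutoff.

```python
import networkx as nx
import pandas as pd

# Sample data from the question, built explicitly instead of via read_clipboard
df = pd.DataFrame({
    "Col1": ["A", "C", "B", "E", "G", "G", "K"],
    "Col2": ["B", "A", "D", "F", "D", "H", "J"],
})
G = nx.from_pandas_edgelist(df, "Col1", "Col2")

# Number of edges from A to every node at most 3 edges away
dist = nx.single_source_shortest_path_length(G, "A", cutoff=3)

# Drop the source itself (distance 0) to get the links list
links = sorted(node for node, d in dist.items() if d > 0)
```

G sits at distance 3 (A > B > D > G, so B and D are the 2 intermediates), while H would be at distance 4 and is therefore left out.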
Print graph representation: