Multiple values in a single column of a pandas DataFrame

Question

I have some data that I'm parsing from XML into a pandas DataFrame. The XML data roughly looks like this:

<tracks>
  <track name="trackname1" variants="1,2,3,4,5">
    <variant var="1,2,3">
      <leg time="21:23" route_id="5" stop_id="103" serial="1"/>
      <leg time="21:26" route_id="5" stop_id="17" serial="2"/>
      <leg time="21:30" route_id="5" stop_id="38" serial="3"/>
      <leg time="20:57" route_id="8" stop_id="101" serial="1"/>
      <leg time="21:01" route_id="8" stop_id="59" serial="2"/>
      ...
    </variant>
    <variant var="4,5">
      ... more leg elements
    </variant>
  </track>
  <track name="trackname2" variants="1,2,3,4,5,6,7">
    <variant var="1">
      ... more leg elements
    </variant>
    <variant var="2,3,4,5,7">
      ... more leg elements
    </variant>
  </track>
</tracks>

I'm importing this into pandas because I need to be able to join this data with other DataFrames, and I need to be able to query for things like: "get all legs of variant 1 for route_id 5".

I'm trying to figure out how I would do this in a pandas DataFrame. Should I make a DataFrame that looks something like this:

track_name     variants  time     route_id  stop_id  serial
"trackname1"   "1,2,3"   "21:23"  "5"       "103"    "1"
"trackname1"   "1,2,3"   "21:26"  "5"       "17"     "2"
...
"trackname1"   "4,5"     "21:20"  "5"       "103"    "1"
...
"trackname2"   "1"       "20:59"  "3"       "45"     "1"
... you get the point

If this is the way to go, how would I (efficiently) extract, for example, "all rows for variant 3 on route_id 5"? Note that this should give me all the rows that have 3 anywhere in the variants column list, not just the rows where the variants column is exactly "3".

Is there a different way of constructing the DataFrame that would make this easier? Should I be using something other than pandas?

Answer 1

Assuming you have enough memory, your task will be more easily accomplished if your DataFrame holds one variant per row:

track_name     variants  time     route_id  stop_id  serial
"trackname1"   1         "21:23"         5      103       1
"trackname1"   2         "21:23"         5      103       1
"trackname1"   3         "21:23"         5      103       1
"trackname1"   1         "21:26"         5       17       2
"trackname1"   2         "21:26"         5       17       2
"trackname1"   3         "21:26"         5       17       2
...
"trackname1"   4         "21:20"         5      103       1
"trackname1"   5         "21:20"         5      103       1
...
"trackname2"   1         "20:59"         3       45       1

Then you could find "all rows for variant 3 on route_id 5" with

df.loc[(df['variants']==3) & (df['route_id']==5)]
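
If the data already sits in the packed layout from the question, one possible way to reach the one-variant-per-row layout is to split the string and use DataFrame.explode (available in pandas 0.25 and later). This is only a sketch: the sample rows are made up to mirror the question's table, and the names `packed` and `int_cols` are illustrative.

```python
import pandas as pd

# Hypothetical sample in the packed layout from the question (all strings).
packed = pd.DataFrame({
    "track_name": ["trackname1", "trackname1", "trackname1"],
    "variants": ["1,2,3", "1,2,3", "4,5"],
    "time": ["21:23", "21:26", "21:20"],
    "route_id": ["5", "5", "5"],
    "stop_id": ["103", "17", "103"],
    "serial": ["1", "2", "1"],
})

# Turn "1,2,3" into ["1", "2", "3"], then emit one row per list element.
df = packed.assign(variants=packed["variants"].str.split(",")).explode("variants")

# Cast the numeric columns to integers so filters are vectorized comparisons.
int_cols = ["variants", "route_id", "stop_id", "serial"]
df = df.astype({col: int for col in int_cols}).reset_index(drop=True)

# "All rows for variant 3 on route_id 5"
result = df.loc[(df["variants"] == 3) & (df["route_id"] == 5)]
print(result)
```

The trade-off is the one mentioned above: each leg is repeated once per variant, so the frame grows, but every query becomes a plain integer comparison.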

If you pack many variants into one row, such as

"trackname1"   "1,2,3"   "21:23"  "5"       "103"    "1"

then you could find such rows using

df.loc[(df['variants'].str.contains("3")) & (df['route_id']=="5")]

assuming that the variants are always single digits. If there are also 2-digit variants like "13" or "30", then you would need to pass a more complicated regex pattern to str.contains.
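
As a rough illustration of what such a pattern might look like, the sketch below only matches a variant when it is bounded by a comma or by the start or end of the string. It assumes the values are comma-separated with no surrounding whitespace; the helper name `contains_variant` and the sample Series are hypothetical.

```python
import pandas as pd

# Hypothetical packed variant strings, including multi-digit variants.
variants = pd.Series(["1,2,3", "13,30", "3", "4,5,13", "23,31"])

def contains_variant(series, variant):
    """Return a boolean mask of rows whose comma-separated list has `variant`."""
    # Non-capturing groups: the variant must be preceded by start-of-string
    # or a comma, and followed by a comma or end-of-string.
    pattern = rf"(?:^|,){variant}(?:,|$)"
    return series.str.contains(pattern, regex=True)

mask_3 = contains_variant(variants, 3)    # does not match "13", "30", "23" or "31"
mask_13 = contains_variant(variants, 13)  # matches "13,30" and "4,5,13"
print(variants[mask_3].tolist())
print(variants[mask_13].tolist())
```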

Alternatively, you could use apply to split each variants string on commas:

df['variants'].apply(lambda x: "3" in x.split(','))

but this is very inefficient, since you would now be calling a Python function once for every row, doing string splitting and a test for membership in a list instead of a vectorized integer comparison.

Thus, to avoid a possibly complicated regex or a relatively slow call to apply, I think your best bet is to build the DataFrame with one integer variant per row.
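
For what it's worth, here is a minimal sketch of building that DataFrame straight from the XML with the standard library's xml.etree.ElementTree. The XML below is a shortened, made-up version of the sample in the question (the elided legs are not filled in), and the column names simply follow the tables above.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened, hypothetical version of the XML from the question.
xml_text = """
<tracks>
  <track name="trackname1" variants="1,2,3,4,5">
    <variant var="1,2,3">
      <leg time="21:23" route_id="5" stop_id="103" serial="1"/>
      <leg time="21:26" route_id="5" stop_id="17" serial="2"/>
      <leg time="20:57" route_id="8" stop_id="101" serial="1"/>
    </variant>
    <variant var="4,5">
      <leg time="21:20" route_id="5" stop_id="103" serial="1"/>
    </variant>
  </track>
</tracks>
"""

root = ET.fromstring(xml_text.strip())

rows = []
for track in root.iter("track"):
    for variant in track.iter("variant"):
        # "1,2,3" -> [1, 2, 3]
        variant_ids = [int(v) for v in variant.get("var").split(",")]
        for leg in variant.iter("leg"):
            # Repeat each leg once per variant it belongs to.
            for variant_id in variant_ids:
                rows.append({
                    "track_name": track.get("name"),
                    "variants": variant_id,
                    "time": leg.get("time"),
                    "route_id": int(leg.get("route_id")),
                    "stop_id": int(leg.get("stop_id")),
                    "serial": int(leg.get("serial")),
                })

df = pd.DataFrame(rows)

# "Get all legs of variant 1 for route_id 5"
legs = df.loc[(df["variants"] == 1) & (df["route_id"] == 5)]
print(legs)
```

Because the integer columns are converted while parsing, the resulting frame can be filtered or merged with other DataFrames without any string handling.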

Answer 1
3 2014-11-07 02:18:14
