How to invoke pandas str.split() and not python's built in str.split()

I am trying to locate all rows of a dataframe where the value of one attribute (say id_proof) matches the second part of another column (say adr_proof) that starts with a fixed word (say PARENT), and where the corresponding values (id_value and adr_value) also match. Both rows are part of the same dataframe.

For example, in the dataframe:

import pandas as pd

main = {'account_number' : [1,2,3,4,5,6,7,8,9,10,11,12],
    'id_proof' : ['A','B','B','A','C','C','X','Y','X','Y','Y','X'],
    'id_value' : [101,201,301,401,501,601,111,222,333,444,555,666],
    'adr_proof' : ['Z','E','E','G','G','I','PARENT A','PARENT B','PARENT   B','PARENT C','PARENT C','PARENT A'],
    'adr_value' : [11,22,33,44,55,66,101,201,301,501,601,401]}
main = pd.DataFrame(main)

I am trying to achieve:

node1    node2    relation
  1        7      parent-child
  2        8      parent-child
  3        9      parent-child
  4       12      parent-child
  5       10      parent-child
  6       11      parent-child

Below is my code. I am aware that it is incomplete. I am stuck on the split() function. I am new to Python and pandas and am not sure how to invoke pandas' split() function rather than Python's built-in str.split() function. I have gone through this question

import pandas as pd

main = {'account_number' : [1,2,3,4,5,6,7,8,9,10,11,12],
    'id_proof' : ['A','B','B','A','C','C','X','Y','X','Y','Y','X'],
    'id_value' : [101,201,301,401,501,601,111,222,333,444,555,666],
    'adr_proof' : ['Z','E','E','G','G','I','PARENT A','PARENT B','PARENT B','PARENT C','PARENT C','PARENT A'],
    'adr_value' : [11,22,33,44,55,66,101,201,301,501,601,401]}
main = pd.DataFrame(main)

df_group_count = pd.DataFrame({'count' : main.groupby(['adr_proof']).size()}).reset_index()
adr_type = df_group_count['adr_proof']
adr_type_parent = adr_type.loc[adr_type.str.startswith('PARENT',na=False)]

df_j_ = pd.DataFrame()
for j in adr_type_parent:
    dfn_j = main.loc[(main['adr_proof'] == j)]
    adr_type_parent_type = j.split(' ',expand=True,n=1)
    res = main.loc[(main['id_proof'] == adr_type_parent_type[1]) & (main['id_value'] == dfn_j['adr_value'])]

res

Please suggest a way to achieve my goal. The output should be another dataframe. Please excuse any bad code or violations. A completely different approach is also appreciated. Thank you.

You can't invoke the str.split() of the pandas library in your particular case because you are using the DataFrame object, and this particular object does not implement str.split(). Only the Series object implements str.split().
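To see which split() a given call resolves to, it can help to check the type of the object it is called on. The following is a minimal sketch, assuming a made-up three-row Series named adr (not the question's full data): iterating over a Series yields plain Python strings, so a call on the loop variable goes to the built-in method, while the pandas method is reached through the .str accessor of the whole Series.

```python
import pandas as pd

# Hypothetical miniature of the adr_proof column, for illustration only.
adr = pd.Series(['PARENT A', 'PARENT B', 'Z'], name='adr_proof')

# Iterating over a Series hands back plain Python strings.
first = next(iter(adr))
print(type(first))

# first.split(...) is therefore the built-in str.split,
# which does not know the pandas-only `expand` keyword.
try:
    first.split(' ', expand=True)
    builtin_accepts_expand = True
except TypeError:
    builtin_accepts_expand = False
print(builtin_accepts_expand)

# The pandas method is called on the Series itself via .str
# and returns one column per piece when expand=True.
parts = adr.str.split(' ', n=1, expand=True)
print(parts)
```

Here parts is a DataFrame whose column 1 holds the text after the first space, with a missing value for rows that contain no space.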

Since your main question seems to be how to incorporate the pandas split function:

You can isolate the rows containing the keyword 'PARENT' using this:

parent_main = main[main.adr_proof.str.split(' ').str[0] == 'PARENT']

Now, you can easily extract the second value:

parent_main.adr_proof.str.split(' ').str[-1]
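Building on the two expressions above, one possible way to reach the requested node1/node2 table is to merge the dataframe with its own PARENT rows. This is only a sketch under assumptions, not the answerer's method: the column parent_proof, the suffixes _parent and _child, and the use of merge() are choices made here for illustration, and the data is copied from the question's first listing (including the repeated spaces in one adr_proof value).

```python
import pandas as pd

main = pd.DataFrame({
    'account_number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'id_proof': ['A', 'B', 'B', 'A', 'C', 'C', 'X', 'Y', 'X', 'Y', 'Y', 'X'],
    'id_value': [101, 201, 301, 401, 501, 601, 111, 222, 333, 444, 555, 666],
    'adr_proof': ['Z', 'E', 'E', 'G', 'G', 'I', 'PARENT A', 'PARENT B',
                  'PARENT   B', 'PARENT C', 'PARENT C', 'PARENT A'],
    'adr_value': [11, 22, 33, 44, 55, 66, 101, 201, 301, 501, 601, 401],
})

# Child rows: adr_proof starts with the fixed word PARENT.
children = main[main['adr_proof'].str.startswith('PARENT', na=False)].copy()

# Series.str.split() with no pattern splits on any run of whitespace,
# so 'PARENT   B' still yields 'B' as its last piece.
# 'parent_proof' is a helper column introduced for this sketch.
children['parent_proof'] = children['adr_proof'].str.split().str[-1]

# Pair each parent row (id_proof, id_value) with the child rows whose
# (parent_proof, adr_value) carry the same two values.
pairs = main.merge(
    children,
    left_on=['id_proof', 'id_value'],
    right_on=['parent_proof', 'adr_value'],
    suffixes=('_parent', '_child'),
)

result = pd.DataFrame({
    'node1': pairs['account_number_parent'],
    'node2': pairs['account_number_child'],
    'relation': 'parent-child',
}).sort_values('node1').reset_index(drop=True)

print(result)
```

The merge replaces the explicit loop over adr_proof values, so no call to the built-in str.split() on individual strings is needed.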

After investigating this and discussing it in the #python channel on the IRC network freenode.net, I have an answer for you. You can't shadow Python's str.split() with the str.split() of the pandas library.

Also, the DataFrame object has no str.split(). I have read the whole API and also played with from ... import ... to somehow import str.split() from pandas and shadow Python's str.split().

The DataFrame object you are using in your code has no str.split(). The split() call in your code does not reach pandas at all: the loop variable is a plain Python string, so Python's built-in str.split() is used, and that method has no expand keyword.

The only pandas object I could find that has str.split() is the Series object, pandas.Series.str.split(). But you're not using the Series object, you are using the DataFrame object. I'm sorry, there's nothing to be done.

If you ask me, the structure of pandas is broken. You can't just import str.split(), because str is basically a StringMethods object and this object lives under the strings package, which lives in the core package, which lives in the pandas top-level package. It's a mess! I wasted 2 hours of my life trying to understand its package/module/object structure.

Also, pandas.Series.str.split() is basically pandas.core.series.Series.str.split(). I just gave up!

Try to import str.split() from pandas and you'll get a Nobel prize!
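For what it's worth, a single string can still be passed through the pandas method without importing anything special, by wrapping it in a one-element Series. The sketch below assumes a helper named pandas_split, which is hypothetical and not part of pandas; it only forwards to Series.str.split().

```python
import pandas as pd

# The .str accessor of a Series is an instance of the StringMethods class.
s = pd.Series(['PARENT A'])
accessor_name = type(s.str).__name__
print(accessor_name)


def pandas_split(text, pat=None, n=-1):
    """Hypothetical helper: apply Series.str.split to one plain string."""
    # Wrap the string in a one-element Series so the .str accessor exists.
    return pd.Series([text]).str.split(pat, n=n, expand=True)


row = pandas_split('PARENT A', ' ', n=1)
print(row)
```

This does not replace the built-in str.split(); it just shows that the pandas behaviour, including expand=True, is available for any string once it sits inside a Series.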
