
Split string values of pandas.DataFrame column to array

I ran a SQL query against PostgreSQL and loaded the result into a pandas.DataFrame. Every row looks like '8B1LP1D', where the letters ('B', 'LP', etc.) are delimiters. This approach:

#formula is a pd.DataFrame with 1 column
for x in formula:
    print(re.split('B|LP|D|E|OS|DN',x))

The output looks fine:

['8', '1', '1']
...
['5', '3', '2']
#etc

But I need to append the results to an array:

def move_parts(a):
    split = []
    for x in a:
        split.append(re.split('B|LP|D|E|OS|DN',x))
move_parts(formula)

and this raised an error:

/usr/lib/python3.7/re.py in split(pattern, string, maxsplit, flags)
    211     and the remainder of the string is returned as the final element
    212     of the list."""
--> 213     return _compile(pattern, flags).split(string, maxsplit)
    214 
    215 def findall(pattern, string, flags=0):

TypeError: expected string or bytes-like object

What is wrong, and how can I save all the split values to an array?

If formula is a pd.DataFrame with 1 column, as you said, your first expression gives the same error. Use the pandas split instead:

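import pandas as pd
df = pd.DataFrame({'col1': ['8B1LP1','5E3DN2']})
# split the first column on any delimiter; expand=True gives one column per part
df.iloc[:,0].str.split('B|LP|DN|E|OS|D',expand=True).values.tolist()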

Output:

[['8', '1', '1'], ['5', '3', '2']]

PS: you should reorder your delimiters (as shown in my example): the longer 'DN' must come before the single 'D', otherwise it will never match.
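The ordering point in the PS can be checked with plain re.split. The following is a minimal sketch using the sample values from this thread; the variable names are only illustrative. It also shows that a value ending in a delimiter, such as '8B1LP1D', produces a trailing empty string, which you may want to filter out.

```python
import re

# Same alternatives as in the question: 'D' is listed before 'DN'.
original = 'B|LP|D|E|OS|DN'
# Reordered so that the longer 'DN' is tried before the single 'D'.
reordered = 'B|LP|DN|E|OS|D'

value = '5E3DN2'

# Alternation tries its branches left to right at each position,
# so with the original order 'D' matches first and 'N' is left behind.
print(re.split(original, value))   # ['5', '3', 'N2']
print(re.split(reordered, value))  # ['5', '3', '2']

# A value that ends with a delimiter yields a trailing empty string;
# the list comprehension drops empty parts.
parts = [p for p in re.split(reordered, '8B1LP1D') if p]
print(parts)  # ['8', '1', '1']
```

Whether empty parts should be dropped depends on the data; if an empty field is meaningful, keep the raw result of re.split.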

The error here is not caused by appending to a list; it comes from the values passed to re.split. The only way I could reproduce the error was when formula was a pandas.DataFrame. When I set formula to a flat list or a pandas.Series, everything works fine. Is it possible that in your code formula was first a list (or a pandas.Series) and was later changed to a pandas.DataFrame? The fix could be as simple as referring to the actual column you want to process in the pandas.DataFrame. Let's presume it is called 'request_results'; then we change the code to the following and it should run:

import re

def move_parts(a):
    split = []
    for x in a:
        split.append(re.split('B|LP|D|E|OS|DN',x))
    return split  # return the list so the caller can keep the result
parts = move_parts(formula['request_results'].astype(str))

Note that I've also added .astype(str) at the end. The other possibility is that some of the items in the list are not of type str. The error arises because the second parameter of re.split() expects a str (or a bytes-like object, but I won't go into that) and is instead getting something else, possibly None or a float.
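Both causes described above can be seen in a small sketch. It assumes a hypothetical DataFrame with a column named 'request_results' (the name presumed in this answer) and one missing value. Iterating over the DataFrame itself yields column labels rather than cell values, and this variant skips non-string items explicitly instead of converting them with .astype(str).

```python
import re
import pandas as pd

PATTERN = 'B|LP|DN|E|OS|D'

# Hypothetical stand-in for the query result; the column name is assumed.
formula = pd.DataFrame({'request_results': ['8B1LP1', None, '5E3DN2']})

# Iterating over a DataFrame yields its column labels, not the cell values.
print(list(formula))  # ['request_results']

def move_parts(values):
    """Split every string in `values`; store None for anything that is not a str."""
    split = []
    for x in values:
        if isinstance(x, str):
            split.append(re.split(PATTERN, x))
        else:
            # re.split would raise TypeError for this item
            split.append(None)
    return split

# Pass the column (a Series), not the whole DataFrame.
result = move_parts(formula['request_results'])
print(result)  # [['8', '1', '1'], None, ['5', '3', '2']]
```

Keeping a None placeholder preserves the alignment between input rows and output lists; dropping the item instead is equally possible if row positions do not matter.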
