
How can I split the output of a str.split() column in pandas?

Here's the thing: I have this sort of dataset (let's call it df):

id       text
A1       How was your experience?: Great\nWhat did you buy?: A book\n
B1       How was your experience?: Good\nWhat did you buy?: A pen\n
C2       How was your experience?: Awful\nWhat did you buy?: A pencil\n

As you can see, this is a table containing a survey, and I'm trying to get only the answers from the column text. My first thought was to split the text, like this:

df['text_splitted'] = df.text.str.split('\n')

And then I would do something like this:

df['final_text'] = df.text_splitted.str.split(':')

However, final_text is returning NaN. What just happened? Why is the new column returning null? Is there any way I can fix this (or a better way to do what I'm trying to do here)?

As you wrote, you need to split your text column twice. Afterward, you can create a dataframe with 3 columns:

  • id from your original dataframe
  • question (even rows) from the previous split
  • answer (odd rows) from the previous split
text = df["text"].str.strip().str.split("\n").explode().str.split(": ").explode()

out = pd.merge(df["id"], pd.DataFrame({"question": text[0::2], "answer": text[1::2]}),
               left_index=True, right_index=True).reset_index(drop=True)

What do you think about this format?

>>> out
   id                  question    answer
0  A1  How was your experience?     Great
1  A1         What did you buy?    A book
2  B1  How was your experience?      Good
3  B1         What did you buy?     A pen
4  C2  How was your experience?     Awful
5  C2         What did you buy?  A pencil
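
As for why final_text came back as NaN: after the first split every cell holds a Python list rather than a string, and the .str methods return NaN for elements that are not strings. The sketch below reproduces that and then shows one way to get the same question/answer table by exploding before the second split. It rebuilds the sample data from the question, assumes the text contains real newline characters, and the names lists, second, lines and pairs are only illustrative.

```python
import pandas as pd

# Sample data from the question, assuming real newline characters in the text.
df = pd.DataFrame({
    "id": ["A1", "B1", "C2"],
    "text": [
        "How was your experience?: Great\nWhat did you buy?: A book\n",
        "How was your experience?: Good\nWhat did you buy?: A pen\n",
        "How was your experience?: Awful\nWhat did you buy?: A pencil\n",
    ],
})

# After the first split, each cell is a list of strings, not a string.
lists = df["text"].str.split("\n")

# .str.split on list elements cannot split anything, so every cell becomes NaN.
second = lists.str.split(":")
print(second.isna().all())

# Exploding first turns each list element into its own row (still a string),
# so the second split has strings to work on.
lines = df["text"].str.strip().str.split("\n").explode()
pairs = lines.str.split(": ", expand=True)
pairs.columns = ["question", "answer"]

# The exploded rows keep the original row index, so a join restores the id.
out = df[["id"]].join(pairs).reset_index(drop=True)
print(out)
```

Using expand=True gives the question and answer columns directly, which avoids having to pick even and odd rows afterwards.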

You can use a combination of .apply() and .split() to get the answers:

import pandas as pd

df = pd.DataFrame({'text': ['How was your experience?: Great\nWhat did you buy?: A book\n']})

Input DF

    text
0   How was your experience?: Great\nWhat did you ..

Split into questions and answers

# Skip the empty string left by the trailing "\n", and split on ": " so answers have no leading space
df['questions'] = df['text'].apply(lambda x: [y.split(": ")[0] for y in x.split("\n") if y])
df['answers'] = df['text'].apply(lambda x: [y.split(": ")[1] for y in x.split("\n") if y])

Output DF

    answers            questions
0   [Great, A book]    [How was your experience?, What did you buy?]
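
If an answer could itself contain a colon, indexing the result of split() would cut it short. A possible variant of the same apply-based idea, sketched here with a hypothetical helper named parse_pairs, uses str.partition so that only the first ": " on each line separates question from answer. It assumes every non-empty line has the form "question: answer".

```python
import pandas as pd

def parse_pairs(text):
    """Return a list of (question, answer) tuples, one per non-empty line.

    Hypothetical helper for illustration; assumes 'question: answer' lines.
    """
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines such as the one after a trailing newline
        # partition splits on the first ": " only, so later colons stay in the answer
        question, _, answer = line.partition(": ")
        pairs.append((question, answer))
    return pairs

df = pd.DataFrame({'text': ['How was your experience?: Great\nWhat did you buy?: A book\n']})

parsed = df['text'].apply(parse_pairs)
df['questions'] = parsed.apply(lambda pairs: [q for q, _ in pairs])
df['answers'] = parsed.apply(lambda pairs: [a for _, a in pairs])
print(df[['questions', 'answers']])
```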

You can try this:

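# The patterns match a literal backslash followed by "n" (not a real newline character);
# regex=True is required because recent pandas versions treat the str.replace pattern literally by default.
df.set_index('id')['text'].str.replace(r'\\n$', '', regex=True).str.split(r'\\n').explode().str.split(': ', expand=True)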

                           0         1
id                                    
A1  How was your experience?     Great
A1         What did you buy?    A book
B1  How was your experience?      Good
B1         What did you buy?     A pen
C2  How was your experience?     Awful
C2         What did you buy?  A pencil
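
A different technique that skips the explicit splits is str.extractall with named groups, which yields one row per question/answer line and names the columns directly. This is only a sketch: it assumes the text holds real newline characters (unlike the escaped pattern above), that each line looks like "question: answer", and the names pattern and pairs are illustrative.

```python
import pandas as pd

# Sample data from the question, assuming real newline characters in the text.
df = pd.DataFrame({
    "id": ["A1", "B1", "C2"],
    "text": [
        "How was your experience?: Great\nWhat did you buy?: A book\n",
        "How was your experience?: Good\nWhat did you buy?: A pen\n",
        "How was your experience?: Awful\nWhat did you buy?: A pencil\n",
    ],
})

# One match per line: everything up to the first ": " is the question,
# the rest of the line is the answer. [^\n] keeps each match on a single line.
pattern = r"(?P<question>[^\n]+?): (?P<answer>[^\n]+)"

# extractall returns one row per match, indexed by (id, match number).
pairs = df.set_index("id")["text"].str.extractall(pattern)

# Drop the match counter and turn id back into a regular column.
out = pairs.reset_index(level="match", drop=True).reset_index()
print(out)
```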

