
Is there a better way to concatenate continuous string elements in Python?

Problem Context

I am trying to create a chat log dataset from WhatsApp chats. To give some context for the problem: let a message be M and a response be R. Chats do not always alternate strictly; they tend to look like this:

[ M, M, M, R, R, M, M, R, R, M ... and so on]

I am trying to concatenate consecutive runs of M's and R's. For the example above, I want an output like this:

Desired Output

[ "MMM", "RR", "MM", "RR", "M", ... and so on ]

An Example of Realistic Data:

Input  --> ["M: Hi", "M: How are you?", "R: Heyy", "R: Im cool", "R: Wbu?"] (length = 5)
Output --> ["M: Hi M: How are you?", "R: Heyy R: Im cool R: Wbu?"] (length = 2)

Is there a faster and more efficient way of doing this? I have already read this Stack Overflow link, but I didn't find a solution there.

So far, this is what I have tried:

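# chats is the input list of "M: ..." / "R: ..." strings
final = []
continuous_string = ''
for i, ele in enumerate(chats):
    if i > 0:
        prev = chats[i-1][0]
        current = ele[0]
        if current != prev:
            # Transition (M -> R or R -> M): flush the collected run
            final.append(continuous_string)
            continuous_string = ''
    # Add the current element to the run, separated by a space
    continuous_string += (' ' if continuous_string else '') + ele
# Flush the last run, which is never followed by a transition
if continuous_string:
    final.append(continuous_string)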

Explanation of my code: I have a chats list in which every message starts with 'M' and every response starts with 'R'. I keep track of the prev and current values in the list, and when there is a change (a transition from M -> R or R -> M), I append everything collected in continuous_string to the final list.

Again, my question is: is there a shortcut or a function in Python that does the same thing effectively in fewer lines?

You can use the groupby() function from itertools:

from itertools import groupby

l = ['A', 'A', 'B', 'B']

[' '.join(g) for _, g in groupby(l)]
# ['A A', 'B B']
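
Note that groupby() only merges adjacent equal elements, which matches the behaviour the question asks for. A small sketch using the M/R pattern from the question (the names pattern and runs are only for illustration):

```python
from itertools import groupby

# Illustrative input following the M/R pattern from the question
pattern = ['M', 'M', 'M', 'R', 'R', 'M', 'M', 'R', 'R', 'M']

# groupby() starts a new group every time the value changes,
# so later runs of 'M' are not merged with the first one
runs = [''.join(g) for _, g in groupby(pattern)]
print(runs)
# ['MMM', 'RR', 'MM', 'RR', 'M']
```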

To group the data from your example, you need to pass a key to the groupby() function:

l = ["M: Hi", "M: How are you?", "R: Heyy", "R: Im cool", "R: Wbu?"]

[' '.join(g) for _, g in groupby(l, key=lambda x: x[0])]
# ['M: Hi M: How are you?', 'R: Heyy R: Im cool R: Wbu?']

As @TrebuchetMS mentioned in the comments, the key lambda x: x.split(':')[0] might be more reliable. It depends on your data.
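
To illustrate when the two keys differ, here is a sketch that assumes speaker names longer than one character that happen to share a first letter. The data and the variable names are made up for illustration:

```python
from itertools import groupby

# Hypothetical data: two speakers whose names start with the same letter
chats = ["Mike: Hi", "Mike: How are you?", "Mary: Heyy", "Mary: Im cool"]

# First-character key: both speakers map to 'M', so everything lands in one group
by_first_char = [' '.join(g) for _, g in groupby(chats, key=lambda x: x[0])]

# Name key: the text before the first ':' identifies the speaker
by_name = [' '.join(g) for _, g in groupby(chats, key=lambda x: x.split(':')[0])]

print(len(by_first_char))  # 1
print(by_name)
# ['Mike: Hi Mike: How are you?', 'Mary: Heyy Mary: Im cool']
```

With single-letter prefixes such as M and R, both keys give the same grouping, so x[0] is enough for the data in the question.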

Algorithm

  • Initialize a temporary item. This will help determine whether the speaker has changed.
  • For each item:
    • Extract the speaker.
    • If it is the same as the previous speaker, append the text to the last item of the array.
    • Otherwise, append a new item containing the speaker and text to the list.

Implementation

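def parse(x):
    # Split on the first ':' only, so colons inside the message text are kept
    speaker, _, text = x.partition(':')
    return speaker, text.strip()


def compress(l):
    ans = []
    prev = ''
    for x in l:
        curr, text = parse(x)
        if curr != prev:
            # New speaker: start a new item with the full "speaker: text" line
            prev = curr
            ans.append(x)
        else:
            # Same speaker: append only the text to the last item
            ans[-1] += f' {text}'
    return ans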

Character names

IN:  ["M: Hi", "M: How are you?", "R: Heyy", "R: Im cool", "R: Wbu?"]
OUT: ['M: Hi How are you?', 'R: Heyy Im cool Wbu?']

String names

IN:  ["Mike: Hi", "Mike: How are you?", "Mary: Heyy", "Mary: Im cool", "Mary: Wbu?"]
OUT: ['Mike: Hi How are you?', 'Mary: Heyy Im cool Wbu?']
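
For comparison, the same output format can probably also be produced with groupby(). This is only a sketch, not part of the original answer: the helper names speaker and compress_groupby are hypothetical, and it assumes every line contains a ':' after the speaker name.

```python
from itertools import groupby


def speaker(line):
    # Assumption: the text before the first ':' is the speaker name
    return line.partition(':')[0]


def compress_groupby(lines):
    merged = []
    for name, group in groupby(lines, key=speaker):
        # Keep only the message text of each line in the run
        texts = [line.partition(':')[2].strip() for line in group]
        merged.append(f"{name}: {' '.join(texts)}")
    return merged


result = compress_groupby(["M: Hi", "M: How are you?", "R: Heyy", "R: Im cool", "R: Wbu?"])
print(result)
# ['M: Hi How are you?', 'R: Heyy Im cool Wbu?']
```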

