
Split column of text into column of lists in Pandas dataframe with no unambiguous split sequence

I have a dataframe that contains a column of text giving a numeric code followed by a colon and a text description. The text may include one or many code descriptors, each separated by a comma and a space.

import pandas as pd

myDF = pd.DataFrame({'origtext': ['012: some text', '012: some text, 123: other text', '012: some text, 234: text, strings and numbers']})

The dataframe looks like:

                                         origtext
0                                  012: some text
1                 012: some text, 123: other text
2  012: some text, 234: text, strings and numbers

I need to convert the text in the 'origtext' column to lists where each element of the list consists of the numeric code, colon and text descriptor.

My first approach was to use .split() to split the text at ', ', such as:

myDF['textlist'] = myDF['origtext'].str.split(', ')

to produce...

                                           textlist  
0                                  [012: some text]  
1                 [012: some text, 123: other text]  
2  [012: some text, 234: text, strings and numbers]  

In my real-world dataframe, that worked well for the majority of rows, but there were a few cases where the text description itself contained ', '. This meant that the bottom list in the above example contained 3 elements (rather than 2) and the final element did not begin with 'nnn: '. This made the .split() method unsuitable.

Is there a way to use a matched group in a regular expression to identify something like ', 123:', replace it with 'xxxxx123:', and then split on 'xxxxx'? I've been able to replace the matched group with a string, but I haven't been able to work out how to add some text to the matched group whilst keeping the matched text intact.

Or is there another method to achieve the desired outcome?

You can use

myDF['textlist'] = myDF['origtext'].str.split(r',\s+(?=\d+:)')
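As a quick check, here is a minimal end-to-end sketch that applies this split to the sample frame from the question. The `result` variable is only there for illustration, and the `regex` keyword is left at its default on the assumption that pandas treats a multi-character pattern as a regular expression.

```python
import pandas as pd

# Sample data copied from the question.
myDF = pd.DataFrame({
    'origtext': [
        '012: some text',
        '012: some text, 123: other text',
        '012: some text, 234: text, strings and numbers',
    ]
})

# Split only at a comma + whitespace that is directly followed by "digits:".
# A ", " inside a description (e.g. ", strings") is left untouched.
myDF['textlist'] = myDF['origtext'].str.split(r',\s+(?=\d+:)')

# Plain Python lists, one per row, for easy inspection.
result = myDF['textlist'].tolist()
print(result)
```

With this pattern the third row should come out as two elements, with the comma inside the description preserved.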

Regex details:

  • , - a comma
  • \s+ - one or more whitespace chars
  • (?=\d+:) - a positive lookahead that requires one or more digits and then a : immediately to the right of the current location. Because a lookahead consumes no characters, the digits and colon stay at the start of the next list element.
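For the approach described in the question (insert a marker while keeping the matched text), a backreference does the job: capture the "digits:" part in a group and put `\1` in the replacement. The sketch below uses the standard-library `re` module on plain strings; `SEP`, `split_with_placeholder`, `rows` and `split_rows` are illustrative names, and it assumes the marker 'xxxxx' never occurs in the real data. The same pattern and replacement should also work with `Series.str.replace(..., regex=True)` followed by `.str.split('xxxxx')`.

```python
import re

# Placeholder taken from the question; assumed never to appear in the data.
SEP = 'xxxxx'

def split_with_placeholder(text):
    # (\d+:) captures the code and colon; \1 writes it back unchanged,
    # so only the leading ", " is swapped for the placeholder.
    marked = re.sub(r', (\d+:)', SEP + r'\1', text)
    return marked.split(SEP)

rows = [
    '012: some text',
    '012: some text, 123: other text',
    '012: some text, 234: text, strings and numbers',
]

split_rows = [split_with_placeholder(t) for t in rows]
print(split_rows)
```

The lookahead version is usually preferable because it needs no placeholder at all, but the backreference form is handy when the text has to be rewritten rather than just split.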
