Python Pandas: How to format and split text in a column?

Question

I have a set of strings in a DataFrame like the one below:

ID TextColumn
1 This is line number one
2 I love pandas, they are so puffy
3 [This $tring is with specia| characters, yes it is!]

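A. I want to format these strings to remove all special characters.
B. After formatting, I want to get a list of unique words (whitespace is the only delimiter).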

Here is the code I wrote.

The get_df_by_id DataFrame holds the selected rows, for example ID 3.

#replace all special characters
formatted_title = get_df_by_id['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '')
# then split the words
results = set()
get_df_by_id['title'].str.lower().str.split().apply(results.update)
print results

But when I check the output, I can see that the special characters are still in the set.

Output

set([u'[this', u'is', u'it', u'specia|', u'$tring', u'is!]', u'characters,', u'yes', u'with'])

The expected output should look like this:

set([u'this', u'is', u'it', u'specia', u'tring', u'is', u'characters,', u'yes', u'with'])

Why does the formatted DataFrame still contain the special characters?

Answer 1

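I think you can first replace the special characters (I added \| to the end of the character class), then lower the text and split on \s+ (any run of whitespace). With expand=True the output is a DataFrame, so you can stack it into a Series, drop_duplicates, and finally call tolist: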

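# regex=True is passed explicitly: since pandas 2.0, str.replace treats the pattern as a literal string by default
print(df['title'].str
                 .replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]', '', regex=True)
                 .str
                 .lower()
                 .str
                 .split(r'\s+', expand=True)
                 .stack()
                 .drop_duplicates()
                 .tolist())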

['this', 'is', 'line', 'number', 'one', 'i', 'love', 'pandas', 'they', 'are', 
'so', 'puffy', 'tring', 'with', 'specia', 'characters', 'yes', 'it']
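For readers who want to run this end to end, here is a minimal self-contained sketch. The sample DataFrame is rebuilt from the question's table, with the text column named title as in the question's code. It swaps stack() for Series.explode (available since pandas 0.25), which is an alternative to the chain above rather than the answerer's exact method; the names SPECIAL_CHARS and unique_words are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (the text column is assumed to be 'title').
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "title": [
        "This is line number one",
        "I love pandas, they are so puffy",
        "[This $tring is with specia| characters, yes it is!]",
    ],
})

# Same character class as in the answer above, including the trailing \|.
SPECIAL_CHARS = r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]'

unique_words = (
    df["title"]
    .str.replace(SPECIAL_CHARS, "", regex=True)  # strip the special characters
    .str.lower()                                 # normalize case
    .str.split()                                 # split on any run of whitespace
    .explode()                                   # one word per row
    .drop_duplicates()                           # keep the first occurrence of each word
    .tolist()
)

print(unique_words)
```

Calling str.split() with no pattern splits on whitespace and ignores leading or trailing spaces, so no empty strings end up in the result.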

Answer 2

If you want a list of unique words for each row:

>>> get_df_by_id['title'].str.replace(r'[^a-zA-Z\s]', '', regex=True).str.lower().str.split(r'\s+').apply(lambda x: list(set(x)))

0                           [this, is, one, line, number]
1                 [love, i, puffy, so, are, they, pandas]
2    [specia, this, is, it, characters, tring, yes, with]
Name: title, dtype: object
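Because a set has no defined order, the word order inside each list above can differ between runs. If a stable order matters, one possible variation (a sketch, not part of the original answer) is to deduplicate with dict.fromkeys, which keeps the first occurrence of each word. The helper name unique_in_order and the sample Series are assumptions for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question.
titles = pd.Series(
    [
        "This is line number one",
        "I love pandas, they are so puffy",
        "[This $tring is with specia| characters, yes it is!]",
    ],
    name="title",
)


def unique_in_order(text):
    """Return the words of `text` without duplicates, in first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(text.split()))


# Keep only ASCII letters and whitespace, then lowercase.
cleaned = titles.str.replace(r"[^a-zA-Z\s]", "", regex=True).str.lower()
unique_per_row = cleaned.apply(unique_in_order)

print(unique_per_row.tolist())
```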

Answer 3

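You have to assign the formatted values back to the same DataFrame column. str.replace returns a new Series and does not modify the column in place, so your code splits the original, unformatted text:

get_df_by_id['title'] = get_df_by_id['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '', regex=True)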
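The following sketch illustrates the point on the row with ID 3: the result of str.replace is a separate Series, and the original column stays untouched until it is reassigned. The one-row DataFrame is rebuilt from the question; the pattern includes \| (as in Answer 1) so that the pipe character is removed too, and the variable names formatted and results are illustrative.

```python
import pandas as pd

# One-row frame standing in for the question's get_df_by_id (ID 3).
get_df_by_id = pd.DataFrame({
    "ID": [3],
    "title": ["[This $tring is with specia| characters, yes it is!]"],
})

pattern = r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]'

# str.replace returns a new Series ...
formatted = get_df_by_id["title"].str.replace(pattern, "", regex=True)

# ... while the original column still holds the special characters.
print(get_df_by_id["title"].iloc[0])
print(formatted.iloc[0])

# Split the formatted Series (not the original column) and collect unique words.
results = set(formatted.str.lower().str.split().explode())
print(results)
```

Either reassigning the column, as in the answer, or continuing the chain from formatted fixes the original problem.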


Answer 1
2 (accepted) 2016-05-25 06:41:41

Answer 2
1 2016-05-25 06:47:19

Answer 3
0 2016-05-25 06:30:38
