Removing spaces inside words with regex - preprocessing data for text mining

Question

For a school project I'm working on the Kickstarter dataset on Kaggle: https://www.kaggle.com/kemical/kickstarter-projects

In the "name" variable, a few titles have spaces between their letters, e.g. instance 373, "C R OSSTOWN".

I've been working on some regex all day, using re.sub to remove the extra spaces so that the title reads as a normal word. Although I think this problem comes up fairly often, most regex material I found is about adding spaces or adding double spaces, never this specific task.

I've tried a couple of ways to describe the exact kind of space that needs to be deleted: capturing the characters to keep as groups and using those groups as the replacement string. Though it looks like it should work, my data doesn't change.

A long regex is written to identify words that appear as spaces + single capitals (I tried a couple of different ones for this).
r'\2\4' refers to the second and fourth groups (the first and second alphabetic characters).

 Names_fixed = []
 for i in Name_New:
     Names_fixed.append(re.sub(r'(\s|^)([AZ])(\s)(AZ)\s/g', r'\2\4', i))

As I'm still pretty new to regex, I'm turning to the community for help. Thanks a lot in advance.

Answer 1

If your goal is only to remove spaces from words, I'm not sure you really need regex.

You can use the simple replace() method like this:

x = "C R O S S T O W N"
x = x.replace(' ','')

You can loop over your list and apply this to every such word.
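
As a minimal sketch of that loop, assuming the titles are held in a plain Python list (the sample titles and the names `names` and `names_no_spaces` are made up for illustration):

```python
# Hypothetical sample titles; the real ones would come from the dataset's "name" column.
names = ["C R O S S T O W N", "C R OSSTOWN", "My Project"]

# str.replace removes every space in the string, not only those inside a spaced-out word.
names_no_spaces = [name.replace(" ", "") for name in names]

print(names_no_spaces)  # ['CROSSTOWN', 'CROSSTOWN', 'MyProject']
```

Note that the last title also loses the space between its two words, so this approach only fits entries that are known to be a single spaced-out word.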

Answer 2

Use this:

re.sub(r'(?<![ \t])[A-Z](?:[ \t][A-Z])+(?![ \t])', lambda x: x.group().replace(' ','').replace('\t',''), i)

This finds runs of space/tab-separated capital letters and removes the spaces/tabs from the matched text.
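
A sketch of how this expression could be dropped into the loop from the question. `Name_New` and `Names_fixed` are the question's names; the sample titles, the constant `SPACED_CAPITALS` and the helper `collapse_spaced_capitals` are assumptions added for illustration:

```python
import re

# The pattern from this answer, compiled once so it can be reused for every title.
SPACED_CAPITALS = re.compile(r'(?<![ \t])[A-Z](?:[ \t][A-Z])+(?![ \t])')

def collapse_spaced_capitals(title):
    """Remove spaces/tabs inside each run of capitals matched by the pattern."""
    return SPACED_CAPITALS.sub(
        lambda m: m.group().replace(' ', '').replace('\t', ''),
        title,
    )

# Hypothetical sample data standing in for the "name" column.
Name_New = ["C R OSSTOWN", "C R O S S T O W N", "Crosstown Traffic"]
Names_fixed = [collapse_spaced_capitals(title) for title in Name_New]

print(Names_fixed)  # ['CROSSTOWN', 'CROSSTOWN', 'Crosstown Traffic']
```

The replacement is a function rather than a string such as r'\2\4' because the number of letters in a run varies, so a fixed set of group references cannot cover every case.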

EXPLANATION

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    [ \t]                    any character of: ' ', '\t' (tab)
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    [ \t]                    any character of: ' ', '\t' (tab)
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  )+                       end of grouping
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    [ \t]                    any character of: ' ', '\t' (tab)
--------------------------------------------------------------------------------
  )                        end of look-ahead
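
The look-behind and look-ahead decide where a match may start and end, which has some consequences worth checking against real data. The following is a small sketch on made-up titles (the `pattern` and `fix` names are hypothetical helpers, not part of the answer):

```python
import re

pattern = re.compile(r'(?<![ \t])[A-Z](?:[ \t][A-Z])+(?![ \t])')

def fix(title):
    # Strip spaces and tabs from every matched run of capitals.
    return pattern.sub(lambda m: m.group().replace(' ', '').replace('\t', ''), title)

# A run at the start of the string is collapsed.
print(fix("C R OSSTOWN"))            # CROSSTOWN

# A run that follows a space is left alone: the look-behind rejects
# every capital that is preceded by a space or tab.
print(fix("The C R O S S T O W N"))  # The C R O S S T O W N

# A one-letter capitalised word followed by a capitalised word is also joined.
print(fix("A Tale of Two"))          # ATale of Two
```

Depending on how the titles in the dataset are written, the expression may therefore need adjusting, or the results may need a manual check.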
