Remove empty words from a column of tokenized sentences

Question

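I have a dataframe in which every row of one column holds a list of words. I want to remove what I guess are blank strings. I managed to get rid of some of them by doing this: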

for i in processed.text:
    for x in i:
        if x == '' or x==" ":
            i.remove(x)

But some of them are still there.

>processed['text']

0         [have, month, #postdoc, within, on, chemical, ...
1         [hardworking, producers, iowa, so, for, state,...
2         [hardworking, producers, iowa, so, for, state,...
3         [today, time, is, to, sources, energy, much, p...
4         [thanks, gaetanos, club, c, oh, choosing, #rec...
                                ...                        
130736    [gw, fossil, renewable, import, , , , , , , , ...
130737                                     [s, not, , go, ]
130738                        [answer, deforestation, in, ]
130739    [plastic, regrind, any, and, grades, we, make,...
130740                     [grid, generating, of, , , , gw]
Name: text, Length: 130741, dtype: object

>type(processed)
<class 'pandas.core.frame.DataFrame'>

Thank you very much.

Answer 1

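If the value is a string, split it on commas, drop the empty pieces, and join it back together with commas. If it is already a list, just filter out the empty items.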

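def remove_empty(x):
    # Comma-separated string: split, drop blank pieces, join again.
    if isinstance(x, str):
        return ",".join(y for y in x.split(",") if y.strip())
    # List of tokens: keep only tokens that contain non-whitespace characters.
    if isinstance(x, list):
        return [y for y in x if y.strip()]
    # Anything else (for example NaN) is returned unchanged rather than as None.
    return x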

processed['text'] = processed['text'].apply(remove_empty)
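
As a side note on why the loop in the question leaves some blanks behind: it removes items from a list while iterating over that same list, so the element that slides into the freed position is skipped. The sketch below reproduces this on a plain Python list taken from the last row of the question's output; the function names `remove_in_place` and `drop_blank` are made up for illustration.

```python
def remove_in_place(tokens):
    """The loop from the question: mutates the list while iterating over it."""
    for tok in tokens:
        if tok == '' or tok == ' ':
            tokens.remove(tok)  # shifts later items left, so the next one is skipped
    return tokens


def drop_blank(tokens):
    """Build a new list that keeps only tokens with non-whitespace characters."""
    return [tok for tok in tokens if tok.strip()]


sample = ['grid', 'generating', 'of', '', '', '', 'gw']

looped = remove_in_place(list(sample))  # work on a copy so sample stays intact
filtered = drop_blank(sample)

print(looped)    # one blank survives
print(filtered)  # no blanks left
```

Building a new list, as `remove_empty` does, avoids the problem because the list being iterated is never modified.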

Answer 2

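You can do this with split(expand=True). Note that you do not have to pass ' ' explicitly: with no separator, split breaks the string on runs of whitespace, whereas split(' ', expand=True) breaks on every single space, so two consecutive spaces produce an empty string. You can replace ' ' with any other separator. For example, if your words are separated by , or -, you can split the column on that character instead.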

import pandas as pd
df = pd.DataFrame({'Col1':['This is a long sentence',
                           'This is another long sentence',
                           'This is short',
                           'This is medium  length',
                           'Wow. Tiny',
                           'Petite',
                           'Ok']})

print(df)
# Keep the result in a new variable so df.Col1 is still available afterwards.
split_df = df.Col1.str.split(' ', expand=True)
print(split_df)

The output will be:

Original dataframe:

                            Col1
0        This is a long sentence
1  This is another long sentence
2                  This is short
3         This is medium  length
4                      Wow. Tiny
5                         Petite
6                             Ok

Dataframe split into columns:

        0     1        2     3         4
0    This    is        a  long  sentence
1    This    is  another  long  sentence
2    This    is    short  None      None
3    This    is   medium          length
4    Wow.  Tiny     None  None      None
5  Petite  None     None  None      None
6      Ok  None     None  None      None
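
The blank cell in row 3 of the table above comes from the double space in 'This is medium  length'. Plain Python's `str.split` shows the same difference between an explicit ' ' separator and no separator, so a minimal sketch without pandas is enough to see it (the variable names are illustrative only):

```python
sentence = 'This is medium  length'  # note the two spaces before "length"

# Explicit single-space separator: every space is a split point,
# so the two adjacent spaces leave an empty string between them.
by_space = sentence.split(' ')

# No separator: runs of whitespace count as one split point.
by_whitespace = sentence.split()

print(by_space)
print(by_whitespace)
```

If the goal is to avoid empty tokens in the first place, splitting without a separator is usually the simpler choice.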

If you want to limit the result to 3 columns, use n=2 (at most two splits per row):

split_df = df.Col1.str.split(' ', n=2, expand=True)

The output will be:

        0     1                      2
0    This    is        a long sentence
1    This    is  another long sentence
2    This    is                  short
3    This    is         medium  length
4    Wow.  Tiny                   None
5  Petite  None                   None
6      Ok  None                   None
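
To make the effect of n=2 concrete, here is a rough pure-Python sketch of what happens to each row: at most two splits are made, and shorter rows are padded with None, mirroring the table above. It is only an approximation of the behavior for illustration, not how pandas implements it.

```python
sentences = ['This is a long sentence', 'Wow. Tiny', 'Petite']
n = 2  # maximum number of splits, so at most n + 1 pieces per row

rows = []
for s in sentences:
    parts = s.split(' ', n)                     # split on at most n spaces
    parts += [None] * (n + 1 - len(parts))      # pad short rows with None
    rows.append(parts)

for row in rows:
    print(row)
```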

If you want to give the columns more specific names, you can append rename to the end like this:

split_df = df.Col1.str.split(' ', n=2, expand=True).rename({0: 'A', 1: 'B', 2: 'C'}, axis=1)

        A     B                      C
0    This    is        a long sentence
1    This    is  another long sentence
2    This    is                  short
3    This    is         medium  length
4    Wow.  Tiny                   None
5  Petite  None                   None
6      Ok  None                   None

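If you want to replace every None with '' and add a prefix to the column names, you can do it as follows. Because split is called without a separator here, the double space in row 3 no longer produces an empty cell:

split_df = df.Col1.str.split(expand=True).add_prefix('Col').fillna('')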

     Col0  Col1     Col2    Col3      Col4
0    This    is        a    long  sentence
1    This    is  another    long  sentence
2    This    is    short                  
3    This    is   medium  length          
4    Wow.  Tiny                           
5  Petite                                 
6      Ok
