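Removing empty words from column of tokenized sentences
I have a dataframe in which one column holds a list of words in every row. I want to remove what I guess are whitespace entries. I managed to get rid of some of them by doing this: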
for i in processed.text:
    for x in i:
        if x == '' or x == " ":
            i.remove(x)
But some of them are still there.
>processed['text']
0 [have, month, #postdoc, within, on, chemical, ...
1 [hardworking, producers, iowa, so, for, state,...
2 [hardworking, producers, iowa, so, for, state,...
3 [today, time, is, to, sources, energy, much, p...
4 [thanks, gaetanos, club, c, oh, choosing, #rec...
...
130736 [gw, fossil, renewable, import, , , , , , , , ...
130737 [s, not, , go, ]
130738 [answer, deforestation, in, ]
130739 [plastic, regrind, any, and, grades, we, make,...
130740 [grid, generating, of, , , , gw]
Name: text, Length: 130741, dtype: object
>type(processed)
<class 'pandas.core.frame.DataFrame'>
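Thanks a lot.
Split on commas, drop the empty values, then join with commas again. If the cell is already a list, just filter it: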
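def remove_empty(x):
    # Comma-separated string: split, drop blank pieces, join back.
    if isinstance(x, str):
        parts = [y for y in x.split(",") if y.strip()]
        return ",".join(parts)
    # List of tokens: keep only entries with non-whitespace content.
    if isinstance(x, list):
        return [y for y in x if y.strip()]
    # Anything else (e.g. NaN) is returned unchanged instead of None.
    return x

processed['text'] = processed['text'].apply(remove_empty)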
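As for why the original loop leaves some empty strings behind: calling remove() on a list while iterating over it shifts the remaining items left, so the element right after each removed one is skipped. The sketch below reproduces that on a small list and then applies the list-comprehension filter to a list column. The sample data is made up for illustration; only the names processed and text come from the question.

```python
import pandas as pd

# Removing while iterating skips the item that follows each removal.
tokens = ["a", "", "", "b"]
for tok in tokens:
    if tok == "":
        tokens.remove(tok)
print(tokens)  # ['a', '', 'b'] -> one empty string survived

# Hypothetical stand-in for the question's dataframe.
processed = pd.DataFrame({
    "text": [
        ["gw", "fossil", "", "", " ", "import"],
        ["s", "not", "", "go", ""],
    ]
})

# Build a new list per row instead of mutating the one being iterated.
processed["text"] = processed["text"].apply(
    lambda words: [w for w in words if w.strip()]
)
print(processed["text"].tolist())
```

Building a new list avoids the skipping problem entirely, and w.strip() also catches tokens made of several spaces or tabs, not just '' and ' '.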
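You can use split(expand=True) to do this. Note: you don't have to write split(' ', expand=True) explicitly. With no separator given, it splits on whitespace. You can also replace ' ' with any other separator. For example, if your words are separated by , or -, you can split the column on that separator instead.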
import pandas as pd
df = pd.DataFrame({'Col1': ['This is a long sentence',
                            'This is another long sentence',
                            'This is short',
                            'This is medium length',
                            'Wow. Tiny',
                            'Petite',
                            'Ok']})
print(df)
df = df.Col1.str.split(' ', expand=True)
print(df)
The output will be:
Original dataframe:
                            Col1
0        This is a long sentence
1  This is another long sentence
2                  This is short
3          This is medium length
4                      Wow. Tiny
5                         Petite
6                             Ok
Dataframe split into columns:
        0     1        2       3         4
0    This    is        a    long  sentence
1    This    is  another    long  sentence
2    This    is    short    None      None
3    This    is   medium  length      None
4    Wow.  Tiny     None    None      None
5  Petite  None     None    None      None
6      Ok  None     None    None      None
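To illustrate the point about separators, here is a small sketch with made-up data. It splits a comma-separated column, and it also shows one difference that matters for the empty-token problem: splitting on an explicit ' ' keeps an empty string between two consecutive spaces, while calling split() with no separator collapses runs of whitespace.

```python
import pandas as pd

# Hypothetical comma-separated values; 'Col1' just mirrors the example above.
colors = pd.DataFrame({"Col1": ["red,green,blue", "cyan,magenta", "black"]})
parts = colors["Col1"].str.split(",", expand=True)
print(parts)

# Two spaces between the words.
s = pd.Series(["a  b"])
explicit = s.str.split(" ").tolist()  # [['a', '', 'b']]
default = s.str.split().tolist()      # [['a', 'b']]
print(explicit, default)
```

So if the tokens were produced by splitting on ' ', switching to the default separator is one way to avoid creating the empty entries in the first place.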
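If you only want to limit the result to 3 columns, use n=2 (applied to the original dataframe, before it was overwritten above):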
df = df.Col1.str.split(' ',n = 2, expand=True)
The output will be:
        0     1                      2
0    This    is        a long sentence
1    This    is  another long sentence
2    This    is                  short
3    This    is          medium length
4    Wow.  Tiny                   None
5  Petite  None                   None
6      Ok  None                   None
If you want to rename the columns to something more specific, you can chain rename at the end like this:
df = df.Col1.str.split(' ',n = 2, expand=True).rename({0:'A',1:'B',2:'C'},axis=1)
        A     B                      C
0    This    is        a long sentence
1    This    is  another long sentence
2    This    is                  short
3    This    is          medium length
4    Wow.  Tiny                   None
5  Petite  None                   None
6      Ok  None                   None
If you want to replace all the None values with '' and add a prefix to the column names, you can do it as follows:
df = df.Col1.str.split(expand=True).add_prefix('Col').fillna('')
     Col0  Col1     Col2    Col3      Col4
0    This    is        a    long  sentence
1    This    is  another    long  sentence
2    This    is    short
3    This    is   medium  length
4    Wow.  Tiny
5  Petite
6      Ok