Flattening 3D list of words to 2D
I have a pandas column with text strings. For simplicity, let's assume I have a column with two strings.
s=["How are you. Don't wait for me", "this is all fine"]
I want to get something like this:
[["How", "are","you"],["Don't", "wait", "for", "me"],["this","is","all","fine"]]
Basically, take each sentence of a document and tokenize it into a list of words. So in the end I need a list of lists of strings.
I tried using a map like below:
import re
import spacy

nlp = spacy.load('en')

def text_to_words(x):
    """Convert the sentences in a text to a list of word lists."""
    # collapse runs of whitespace into a single space
    x = re.sub(r"\s\s+", " ", x.strip())
    # one list of words per sentence found by spaCy
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in nlp(x).sents]
    return txt_to_words
The nlp object from spaCy is used to split a string of text into a list of sentences.
log_txt=list(map(text_to_words,s))
log_txt
But, as you know, this wraps the sentences of each document in another list:
[[['How', 'are', 'you'], ["Don't", 'wait', 'for', 'me']],
[['this', 'is', 'all', 'fine']]]
You'll need a nested list comprehension. Additionally, you can get rid of punctuation using re.sub.
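import re

data = ["How are you. Don't wait for me", "this is all fine"]
# note: this lowercases the text and drops every character that is not
# a-z or whitespace, so apostrophes are removed too ("Don't" -> "dont")
words = [
    re.sub(r'[^a-z\s]', '', j.text.lower()).split() for i in data for j in nlp(i).sents
]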
Or,
words = []
for i in data:
    ...  # do something here
    for j in nlp(i).sents:
        # j is a spaCy Span, so take its text before lowercasing
        words.append(re.sub(r'[^a-z\s]', '', j.text.lower()).split())
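To see why the order of the two "for" clauses matters, here is a minimal sketch that runs without spaCy. The helper split_sentences is a hypothetical stand-in for nlp(text).sents and only splits on '.', '!' and '?', which is much cruder than spaCy's sentence segmentation.

```python
import re

data = ["How are you. Don't wait for me", "this is all fine"]

def split_sentences(text):
    # Hypothetical stand-in for nlp(text).sents: split on '.', '!' or '?'
    return [part.strip() for part in re.split(r"[.!?]+", text) if part.strip()]

# Inner comprehension per document: keeps one extra level of nesting (3D)
nested = [[sent.split() for sent in split_sentences(doc)] for doc in data]

# Two "for" clauses in one comprehension: documents first, then sentences (2D)
flat = [sent.split() for doc in data for sent in split_sentences(doc)]

print(nested)
print(flat)
```

The clauses read left to right like nested loops, so "for doc in data" is the outer loop and "for sent in ..." is the inner one.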
There is a much simpler way using a single list comprehension. You can first join the strings with a period '.' and then split them again.
[x.split() for x in '.'.join(s).split('.')]
It will give the desired result.
[["How", "are","you"],["Don't", "wait", "for", "me"],["this","is","all","fine"]]
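One thing to watch with the join/split trick: it treats every period as a sentence boundary, so a string that already ends with '.' produces an empty list in the result (and abbreviations or decimals would be split as well). A small sketch of that case, with an extra filtering step that is my own addition rather than part of the answer above:

```python
# Same data as in the question, except the first string now ends with a period
s = ["How are you. Don't wait for me.", "this is all fine"]

# The join puts a second '.' right after the trailing one, so split('.')
# yields an empty piece between the two documents
naive = [x.split() for x in ".".join(s).split(".")]

# Drop the empty lists left behind by the doubled period
cleaned = [words for words in naive if words]

print(naive)
print(cleaned)
```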
For pandas DataFrames, apply may return an object, so calling tolist on it gives a list of lists nested one level deeper than needed. Just extract the first element.
For example,
import pandas as pd

def splitwords(s):
    s1 = [x.split() for x in '.'.join(s).split('.')]
    return s1

df = pd.DataFrame(s)
result = df.apply(splitwords).tolist()[0]
Again, it will give you the preferred result.
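If apply does not return what you expect in your pandas version, one alternative is to skip it and pass the column's values to the function directly. This is only a sketch: the column name "text" is an illustrative assumption, not something from the question.

```python
import pandas as pd

s = ["How are you. Don't wait for me", "this is all fine"]
df = pd.DataFrame({"text": s})  # "text" is a made-up column name

def splitwords(strings):
    # Join the strings with '.', split on '.', then split each piece into words
    return [x.split() for x in ".".join(strings).split(".")]

# Convert the column to a plain list of strings and call the function once
result = splitwords(df["text"].tolist())

print(result)
```

Since the function only needs an iterable of strings, calling it once on the whole column avoids the extra level of nesting that has to be unwrapped afterwards.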
Hope it helps ;)