Search in a dataframe by splitting the text

I have a dataframe that consists of two columns, id and text.

I want to retrieve the rows whose text length is larger than 2, for example.

The text length is the number of words in the text rather than the number of characters.

I did the following:

import pandas as pd

df = pd.DataFrame([{'id': 1, 'text': 'Connected to hgfxg debugger'},
                   {'id': 2, 'text': 'fdss debugger - process 6384 is connecting'},
                   {'id': 3, 'text': 'we are'},
                   ])
# str.len() counts characters, not words
df = df[df['text'].str.len() > 2]
print(df)  # <-- prints all the sentences above

But this retrieves the sentences that have more than 2 characters (in our case, all the sentences above).

Is it possible to achieve what I want in one line of code?

I can do it with more than one line, like:

df['text_len'] = df['text'].map(lambda x: len(str(x).split()))
df = df[df['text_len'] > 2]
print(df) #<-- will print the first two sentences

Think about it another way: you want more than 2 words, so the string needs at least two ' ' separators. Assuming words are separated by single spaces, we can just count the ' ' characters and keep the rows with at least 2:

df[df['text'].str.count(' ') >= 2]
Out[230]: 
   id                                        text
0   1                 Connected to hgfxg debugger
1   2  fdss debugger - process 6384 is connecting
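
One caveat, sketched below under the assumption that texts may contain repeated or leading spaces (the sample data above does not): counting ' ' counts separators rather than words, so stray spaces inflate the result. The `texts` Series here is made up for illustration and is not part of the question.

```python
import pandas as pd

# Hypothetical texts (not from the question) with extra spaces
# between and around the words.
texts = pd.Series(['we  are', ' we are ', 'we are here'])

# Counts every single space character in each string.
space_counts = texts.str.count(' ')

# Splits on runs of whitespace, then counts the resulting words.
word_counts = texts.str.split().str.len()

print(space_counts.tolist())
print(word_counts.tolist())
```

If the data may be messy like this, counting words after a whitespace split is probably the safer choice.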

You can also use:

df[df.text.str.split(r'\s+').str.len().gt(2)]
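
For reference, here is a self-contained sketch of the one-line filter on the question's sample data. It uses `str.split()` with no pattern, which splits on any run of whitespace, in place of an explicit regex; the variable name `result` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {'id': 1, 'text': 'Connected to hgfxg debugger'},
    {'id': 2, 'text': 'fdss debugger - process 6384 is connecting'},
    {'id': 3, 'text': 'we are'},
])

# Keep the rows whose text has more than 2 whitespace-separated words.
result = df[df['text'].str.split().str.len() > 2]
print(result)
```

This keeps the rows with id 1 and 2 and drops 'we are', without adding a helper column to the dataframe.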
