How do I find the maximum number of words in a pandas DataFrame string column?

Question

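I have a dataframe with a column of strings. I am trying to find (a) the maximum number of words in the column and (b) the row containing the string with the maximum number of words.

I do the following: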

import pandas as pd

something = ["Hello how are you", "I am doing great", "Lets go camping"]

test = pd.DataFrame(something)
test.columns = ["Response"]

length_of_the_messages = test["Response"].str.split("\\s+")
print(length_of_the_messages)
print(length_of_the_messages.len().max())

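But this produces an error saying that Series has no attribute len. How can I get the maximum number of words among the strings in the column, along with its row index?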

Answer 1

You can use the .str accessor for the lengths, and .idxmax for the index:

import pandas as pd

something = ["Hello how are you", "I am doing great", "Lets go camping"]

test = pd.DataFrame(something)
test.columns = ["Response"]

length_of_the_messages = test["Response"].str.split("\\s+")

print(length_of_the_messages)
print("Max number of words = ", length_of_the_messages.str.len().max())
print("Index = ", length_of_the_messages.str.len().idxmax())

Prints:

0    [Hello, how, are, you]
1     [I, am, doing, great]
2       [Lets, go, camping]
Name: Response, dtype: object

Max number of words =  4
Index =  0
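The original attempt fails because .str.split() returns a Series whose elements are lists, and a Series has no .len() method of its own; the .str accessor is what applies len to each element. A minimal sketch of the failure and the fix, assuming the same data as the question (the names `words` and `word_counts` are illustrative, not from the thread):

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)
words = test["Response"].str.split()  # Series whose elements are lists of words

# A Series has no .len() method, which is the error from the question.
try:
    words.len()
except AttributeError as exc:
    print("AttributeError:", exc)

# The .str accessor applies len() to each element, lists included.
word_counts = words.str.len()
print(word_counts.tolist())
```

Storing the counts in one variable also avoids calling .str.len() twice when both the maximum and its index are needed.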

Answer 2

Another approach is to create a column of word counts, sort by it with sort_values(), and then retrieve the first row:

#!/usr/bin/env python

import io
import pandas as pd

table_str = '''Sentence
Hello how are you
I am doing great
Lets go camping
'''

def main():
    df = pd.read_csv(io.StringIO(table_str), header=0, skipinitialspace=True)
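    # Count the words in each row of the Sentence column; applying over the
    # whole frame only works while the frame has a single column.
    df['Count'] = df['Sentence'].str.split().str.len()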
    df = df.sort_values(['Count'], ascending=False)
    print(df.iloc[0])

if __name__ == '__main__':
    main()

Output:

$ ./67927014.py
Sentence    Hello how are you
Count                       4
Name: 0, dtype: object
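If only the top row is needed, sorting the whole frame does more work than necessary. One possible alternative, sketched here under the assumption of the same Sentence column, is DataFrame.nlargest, whose keep="all" option also returns every row tied for the maximum (the name `top_rows` is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Sentence": ["Hello how are you", "I am doing great", "Lets go camping"]}
)
df["Count"] = df["Sentence"].str.split().str.len()

# nlargest(1, ...) selects the row with the largest Count;
# keep="all" keeps every row tied for that value instead of only the first.
top_rows = df.nlargest(1, "Count", keep="all")
print(top_rows)
```

With the default keep="first", only a single row would come back even when several sentences share the maximum word count.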

Answer 3

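You can get the word counts with .str.len() and their maximum with .max().

As for the index of the entries with the maximum length: since the Series has 2 rows with the maximum length, you can get the full list of their indices as follows: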

import pandas as pd

something = ["Hello how are you", "I am doing great", "Lets go camping"]

test = pd.DataFrame(something)
test.columns = ["Response"]

length_of_the_messages = test["Response"].str.split("\\s+")

print(length_of_the_messages)
print("Max number of words = ", length_of_the_messages.str.len().max())
print("Index = ", length_of_the_messages.loc[length_of_the_messages.str.len() == length_of_the_messages.str.len().max()].index)

Output:

0    [Hello, how, are, you]
1     [I, am, doing, great]
2       [Lets, go, camping]
Name: Response, dtype: object
Max number of words =  4
Index =  Int64Index([0, 1], dtype='int64')

Here, the 2 indices that share the maximum length are printed as:

Index =  Int64Index([0, 1], dtype='int64')
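The printed form of the index depends on the pandas version: older releases show Int64Index, while pandas 2.x shows a plain Index. A small sketch that computes the counts once and converts the tied labels to a Python list, so the output is the same across versions (the names `counts`, `is_longest`, and `longest_idx` are illustrative):

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)

# Word count per row, computed once and reused below.
counts = test["Response"].str.split().str.len()

# Boolean mask: True for each row whose word count equals the maximum.
is_longest = counts == counts.max()

# Index labels of all tied rows, as a plain Python list.
longest_idx = counts[is_longest].index.tolist()
print(longest_idx)
```

The same mask can be passed to test.loc[is_longest] to retrieve the tied rows themselves.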

Answer 4

I think idxmax alone is enough; we can use loc to look up the entry with the most words from it:

# we need `.str` accessor for `len`, too
idx_max = test.Response.str.split().str.len().idxmax()
# .loc with a single label already returns the string itself, so no .item() call
val_max = test.Response.loc[idx_max]

which gives

>>> idx_max
0

>>> val_max
'Hello how are you'

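Building on the idxmax idea, the label it returns can be used to recover both the longest string and its word count. A minimal sketch, assuming the `test` frame from the question (the names `counts`, `longest_text`, and `max_words` are illustrative):

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)

counts = test["Response"].str.split().str.len()
idx_max = counts.idxmax()  # label of the first row with the most words

longest_text = test.loc[idx_max, "Response"]  # the string at that row
max_words = counts.loc[idx_max]               # its word count

print(idx_max, max_words, longest_text)
```

Note that idxmax reports only the first row when several rows tie for the maximum.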

Answer 1
3 accepted 2021-06-10 18:58:31

Answer 2
1 2021-06-10 19:08:12

Answer 3
1 2021-06-10 19:12:19

Answer 4
0 2021-06-10 19:28:42
