
How to find the maximum number of words in a pandas dataframe column of strings?

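I have a dataframe with a column of strings. I am trying to find (a) the maximum number of words in the column and (b) the row containing the string with the maximum number of words.

I do the following: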

import pandas as pd

something = ["Hello how are you", "I am doing great", "Lets go camping"]

test = pd.DataFrame(something)
test.columns = ["Response"]

length_of_the_messages = test["Response"].str.split("\\s+")
print(length_of_the_messages)
print(length_of_the_messages.len().max())

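But this raises an error saying that Series has no attribute len. How can I get the maximum number of words among the strings in the column, along with its row index?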

You can use .str to get the lengths and .idxmax for the index:

import pandas as pd

something = ["Hello how are you", "I am doing great", "Lets go camping"]

test = pd.DataFrame(something)
test.columns = ["Response"]

length_of_the_messages = test["Response"].str.split("\\s+")

print(length_of_the_messages)
print("Max number of words = ", length_of_the_messages.str.len().max())
print("Index = ", length_of_the_messages.str.len().idxmax())

Prints:

0    [Hello, how, are, you]
1     [I, am, doing, great]
2       [Lets, go, camping]
Name: Response, dtype: object

Max number of words =  4
Index =  0
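
The original attempt failed because str.split returns a Series of lists, and len is only reachable on such a Series through the .str accessor. A minimal sketch on the same sample data that also pulls out the matching row with loc (the names word_counts and longest_row are illustrative, not from the answer above):

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)

# split() with no arguments splits on any run of whitespace
word_counts = test["Response"].str.split().str.len()

max_words = word_counts.max()   # largest word count in the column
idx = word_counts.idxmax()      # index label of the first row with that count
longest_row = test.loc[idx]     # the row itself, as a Series

print(max_words, idx, longest_row["Response"])
```

Note that idxmax reports only the first row when several rows share the maximum.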

Another approach is to create a column of word counts, sort by that count with sort_values(), and then retrieve the first row:

#!/usr/bin/env python

import io
import pandas as pd

table_str = '''Sentence
Hello how are you
I am doing great
Lets go camping
'''

def main():
    df = pd.read_csv(io.StringIO(table_str), header=0, skipinitialspace=True)
    # count words per sentence; split() with no arguments handles repeated whitespace
    df['Count'] = df['Sentence'].str.split().str.len()
    df = df.sort_values(['Count'], ascending=False)
    print(df.iloc[0])

if __name__ == '__main__':
    main()

Output:

$ ./67927014.py
Sentence    Hello how are you
Count                       4
Name: 0, dtype: object
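
If only the top row is needed, sorting the whole frame can likely be avoided. The sketch below swaps in DataFrame.nlargest, which is not what the answer above uses; the variable names top and tied are illustrative:

```python
import io
import pandas as pd

table_str = """Sentence
Hello how are you
I am doing great
Lets go camping
"""

df = pd.read_csv(io.StringIO(table_str))
df["Count"] = df["Sentence"].str.split().str.len()

# nlargest(1, ...) keeps only the first row with the largest Count
top = df.nlargest(1, "Count")

# keep="all" keeps every row tied for the largest Count
tied = df.nlargest(1, "Count", keep="all")

print(top)
print(tied)
```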

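You can get the word counts with .str.len() and their maximum with .max().

As for the index of the entries with the maximum length: since the Series has 2 rows with the maximum length, you can get the full list of indices with the maximum length as follows: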

import pandas as pd

something = ["Hello how are you", "I am doing great", "Lets go camping"]

test = pd.DataFrame(something)
test.columns = ["Response"]

length_of_the_messages = test["Response"].str.split("\\s+")

print(length_of_the_messages)
print("Max number of words = ", length_of_the_messages.str.len().max())
print("Index = ", length_of_the_messages.loc[length_of_the_messages.str.len() == length_of_the_messages.str.len().max()].index)

Output:

0    [Hello, how, are, you]
1     [I, am, doing, great]
2       [Lets, go, camping]
Name: Response, dtype: object
Max number of words =  4
Index =  Int64Index([0, 1], dtype='int64')

Here, the 2 indices that share the maximum length are printed as:

Index =  Int64Index([0, 1], dtype='int64')
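
The same boolean mask can also return the tied rows themselves rather than only their index. A minimal sketch, with is_longest and longest_rows as illustrative names; it prints the labels via tolist() because the Index repr differs across pandas versions (Int64Index was removed in pandas 2.0, where the same result is shown as a plain Index):

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)

word_counts = test["Response"].str.split().str.len()

# True for every row whose word count equals the column maximum
is_longest = word_counts == word_counts.max()
longest_rows = test[is_longest]

print(longest_rows.index.tolist())
print(longest_rows)
```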

I think idxmax alone is enough; we can use loc to look up the longest entry from it:

# we need `.str` accessor for `len`, too
idx_max = test.Response.str.split().str.len().idxmax()
# .loc with a single label already returns the string, so no .item() call is needed
val_max = test.Response.loc[idx_max]

to get

>>> idx_max
0

>>> val_max
'Hello how are you'

