How to find the maximum number of words in a pandas dataframe column of strings?
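I have a dataframe with a single column of strings. I am trying to find (a) the maximum number of words in the column and (b) the row that contains the string with the maximum number of words.
I do the following: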
import pandas as pd
something = ["Hello how are you", "I am doing great", "Lets go camping"]
test = pd.DataFrame(something)
test.columns = ["Response"]
length_of_the_messages = test["Response"].str.split("\\s+")
print(length_of_the_messages)
print(length_of_the_messages.len().max())
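But this raises an error saying that Series has no attribute len. How can I get the maximum number of words among the strings in the column, together with its row index?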
You can use the .str accessor, and .idxmax for the index:
import pandas as pd
something = ["Hello how are you", "I am doing great", "Lets go camping"]
test = pd.DataFrame(something)
test.columns = ["Response"]
length_of_the_messages = test["Response"].str.split("\\s+")
print(length_of_the_messages)
print("Max number of words = ", length_of_the_messages.str.len().max())
print("Index = ", length_of_the_messages.str.len().idxmax())
Prints:
0 [Hello, how, are, you]
1 [I, am, doing, great]
2 [Lets, go, camping]
Name: Response, dtype: object
Max number of words = 4
Index = 0
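The original attempt fails because str.split returns a Series whose elements are Python lists, and a Series itself has no len method; the element-wise length lives on the .str accessor. Here is a minimal sketch that reuses the question's data and shows both the failure and the fix (the names `words`, `counts` and `failed` are only illustrative):

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)

# Each element of `words` is a list of whitespace-separated tokens
words = test["Response"].str.split()

# Calling .len() directly on the Series raises AttributeError
try:
    words.len()
    failed = False
except AttributeError:
    failed = True

# The .str accessor applies len() element-wise, which also works on lists
counts = words.str.len()
max_words = int(counts.max())
first_idx = counts.idxmax()  # label of the first row that reaches the maximum

print(failed, max_words, first_idx)
```

Computing `counts` once and reusing it avoids splitting the column twice.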
Another approach is to create a column of word counts, sort by that column with sort_values(), and then retrieve the first row:
#!/usr/bin/env python
import io
import pandas as pd
table_str = '''Sentence
Hello how are you
I am doing great
Lets go camping
'''
def main():
    df = pd.read_csv(io.StringIO(table_str), header=0, skipinitialspace=True)
    # Count the whitespace-separated words in each sentence
    df['Count'] = df['Sentence'].str.split().str.len()
    # Sort by word count, largest first, and show the top row
    df = df.sort_values(['Count'], ascending=False)
    print(df.iloc[0])


if __name__ == '__main__':
    main()
Output:
$ ./67927014.py
Sentence Hello how are you
Count 4
Name: 0, dtype: object
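Sorting the whole frame just to read one row can be avoided. As an alternative sketch (not part of the answer above), DataFrame.nlargest selects the rows with the largest Count directly, and keep="all" also returns ties. The column names follow the code above; the variable names `top` and `top_with_ties` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Sentence": ["Hello how are you", "I am doing great", "Lets go camping"]}
)
df["Count"] = df["Sentence"].str.split().str.len()

# Row with the highest word count (the first one if several rows tie)
top = df.nlargest(1, "Count")

# keep="all" returns every row tied for the highest count
top_with_ties = df.nlargest(1, "Count", keep="all")

print(top)
print(top_with_ties)
```

With this data, two sentences have four words, so the second call returns two rows even though n is 1.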
You can get the word counts with .str.len() and their maximum with .max().
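As for the index of the entries with that maximum length: since the Series has 2 rows with the maximum length, you can get the complete list of their indices as follows: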
import pandas as pd
something = ["Hello how are you", "I am doing great", "Lets go camping"]
test = pd.DataFrame(something)
test.columns = ["Response"]
length_of_the_messages = test["Response"].str.split("\\s+")
print(length_of_the_messages)
print("Max number of words = ", length_of_the_messages.str.len().max())
print("Index = ", length_of_the_messages.loc[length_of_the_messages.str.len() == length_of_the_messages.str.len().max()].index)
Output:
0 [Hello, how, are, you]
1 [I, am, doing, great]
2 [Lets, go, camping]
Name: Response, dtype: object
Max number of words = 4
Index = Int64Index([0, 1], dtype='int64')
Here, the 2 indices that share the maximum length are printed as:
Index = Int64Index([0, 1], dtype='int64')
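The same idea can be wrapped in a small helper that computes the word counts only once. This is just a sketch; the function name `rows_with_max_words` is hypothetical and does not come from the answer above.

```python
import pandas as pd


def rows_with_max_words(series):
    """Return the maximum word count and the labels of all rows that reach it."""
    counts = series.str.split().str.len()
    max_count = counts.max()
    # Boolean mask on the index keeps every row tied for the maximum
    labels = counts.index[counts == max_count].tolist()
    return int(max_count), labels


test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)
max_count, labels = rows_with_max_words(test["Response"])
print(max_count, labels)
```

Converting the labels with .tolist() avoids depending on how a particular pandas version prints the index object.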
I think idxmax alone is enough; we can then use loc to look up the corresponding entry from it:
# we need `.str` accessor for `len`, too
idx_max = test.Response.str.split().str.len().idxmax()
# .loc with a single label returns the sentence itself
val_max = test.Response.loc[idx_max]
which gives
>>> idx_max
0
>>> val_max
"Hello how are you"
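One caveat with the idxmax approach: when several rows tie, idxmax returns only the first label. Keeping that in mind, here is a small sketch that retrieves the index, the word count and the sentence together (the name `count_max` is illustrative):

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)

counts = test["Response"].str.split().str.len()
idx_max = counts.idxmax()                 # first label with the highest count
count_max = int(counts.loc[idx_max])      # the word count at that label
val_max = test["Response"].loc[idx_max]   # the sentence itself, a plain str

print(idx_max, count_max, val_max)
```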