Word frequency with pandas and matplotlib

Question

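How can I plot a word-frequency histogram (of the author column) from a csv file using pandas and matplotlib? My csv looks like: id, author, title, language. Sometimes the author column contains multiple authors separated by spaces.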

import pandas as pd

file = 'c:/books.csv'
# read_csv accepts a path directly, so no explicit open() is needed.
df = pd.read_csv(file)
print(df['author'])

Answer 1

Use collections.Counter to build the histogram data, then follow the example given here, i.e.:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Read CSV file, get author names and counts.
df = pd.read_csv("books.csv", index_col="id")
counter = Counter(df['author'])
author_names = list(counter.keys())
author_counts = list(counter.values())

# Plot histogram using matplotlib bar().
indexes = np.arange(len(author_names))
width = 0.7
plt.bar(indexes, author_counts, width)
# Bars are centered on their x positions by default (matplotlib >= 2.0),
# so the tick labels go directly at the indexes.
plt.xticks(indexes, author_names)
plt.show()

With this test file:

$ cat books.csv 
id,author,title,language
1,peter,t1,de
2,peter,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp

The code above produces a bar chart with one bar per author: peter (3), bob (2), and marianne (1).

Edit:

You added a secondary condition: the author column may contain multiple names separated by spaces. The following code handles this:

from collections import Counter
from itertools import chain

import pandas as pd

# Read CSV file, split each author cell on whitespace, and count the names.
df = pd.read_csv("books2.csv", index_col="id")
authors_notflat = [a.split() for a in df['author']]
counter = Counter(chain.from_iterable(authors_notflat))
print(counter)

For this example:

$ cat books2.csv 
id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp

it prints (the order of names with equal counts may differ between Python versions):

$ python test.py 
Counter({'peter': 3, 'bob': 2, 'harald': 2, 'marianne': 1})

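Note that this code only works because every value in the author column is a string: split() then always returns a list, with one name or several, which chain.from_iterable can flatten. A missing author (read by pandas as NaN) would raise an AttributeError.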
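The same count can also be sketched with pandas string methods instead of a list comprehension. This is only an illustration, not part of the original answer: it assumes pandas 0.25 or newer (for Series.explode), the helper name count_authors is made up for this example, and the CSV text is an inline copy of books2.csv so that the snippet runs on its own.

```python
import io

import pandas as pd

# Inline copy of the books2.csv example, so the sketch is self-contained.
CSV_TEXT = """id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
"""


def count_authors(df):
    """Count each name in a space-separated 'author' column (hypothetical helper)."""
    names = (
        df["author"]
        .dropna()      # skip rows that have no author at all
        .str.split()   # "peter harald" -> ["peter", "harald"]
        .explode()     # one row per individual name
    )
    return names.value_counts()


df = pd.read_csv(io.StringIO(CSV_TEXT), index_col="id")
author_counts = count_authors(df)
print(author_counts)
```

Dropping missing values first avoids the AttributeError that a plain a.split() raises on NaN.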

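This code uses almost no pandas apart from the CSV parsing that produces the DataFrame df. If you need pandas' default plot style, the thread mentioned above also has a suggestion for that.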
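One possible way to get a pandas-styled plot from the Counter, which is not necessarily the suggestion made in the referenced thread, is to wrap the counts in a Series and use its plot method. The sketch below hard-codes the counts from the books2.csv example; the output file name author_counts.png is an arbitrary choice for illustration.

```python
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd

# Counts taken from the books2.csv example.
counter = Counter({"peter": 3, "bob": 2, "harald": 2, "marianne": 1})

# Wrap the Counter in a Series so that pandas' plotting wrapper can be used.
author_counts = pd.Series(dict(counter)).sort_values(ascending=False)

# Draw on an explicit Axes so the plot does not land on an existing figure.
fig, ax = plt.subplots()
author_counts.plot(kind="bar", rot=0, ax=ax)
ax.set_xlabel("author")
ax.set_ylabel("count")

# Save to a file; call plt.show() instead for an interactive window.
fig.savefig("author_counts.png")
```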

Answer 2

You can use value_counts to count the occurrences of each name:

In [11]: df['author'].value_counts()
Out[11]: 
peter       3
bob         2
marianne    1
dtype: int64

Series (and DataFrames) have a hist method for plotting histograms:

In [12]: df['author'].value_counts().hist()
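Note that hist() bins the count values themselves, so it shows how many authors share each count. If one bar per author is what is wanted, a bar plot of the value_counts result is a possible alternative. The following is a minimal sketch under that assumption: the CSV text is an inline copy of the books.csv test file from Answer 1, and the output file name author_bar.png is arbitrary.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Inline copy of the books.csv test file, so the sketch is self-contained.
CSV_TEXT = """id,author,title,language
1,peter,t1,de
2,peter,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), index_col="id")

# value_counts returns the names sorted by descending frequency.
counts = df["author"].value_counts()

# One bar per author, drawn on an explicit Axes.
fig, ax = plt.subplots()
counts.plot(kind="bar", rot=0, ax=ax)
ax.set_ylabel("number of books")

# Save to a file; call plt.show() instead for an interactive window.
fig.savefig("author_bar.png")
```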


Answer 1
5 accepted 2014-03-10 15:00:47

Answer 2
4 2014-03-10 16:54:48
