[英]Building a Transition Matrix using words in Python/Numpy
我試圖用這些數據構建一個 3x3 轉換矩陣
days=['rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
目前,我使用一些臨時字典和一些列表來分別計算每種天氣的概率。 這不是一個很好的解決方案。 有人可以指導我更合理地解決這個問題嗎?
self.transitionMatrix=np.zeros((3,3))
#the columns are today
sun_total_count = 0
temp_dict={'sun':0, 'clouds':0, 'rain':0}
total_runs = 0
for (x, y), c in Counter(zip(data, data[1:])).items():
#if column 0 is sun
if x is 'sun':
#find the sum of all the numbers in this column
sun_total_count += c
total_runs += 1
if y is 'sun':
temp_dict['sun'] = c
if y is 'clouds':
temp_dict['clouds'] = c
if y is 'rain':
temp_dict['rain'] = c
if total_runs is 3:
self.transitionMatrix[0][0] = temp_dict['sun']/sun_total_count
self.transitionMatrix[1][0] = temp_dict['clouds']/sun_total_count
self.transitionMatrix[2][0] = temp_dict['rain']/sun_total_count
return self.transitionMatrix
對於每種類型的天氣,我都需要計算第二天的概率
如果您不介意使用pandas
,則有一個用於提取轉換概率的單行:
pd.crosstab(pd.Series(days[1:],name='Tomorrow'),
pd.Series(days[:-1],name='Today'),normalize=1)
輸出:
Today clouds rain sun
Tomorrow
clouds 0.40625 0.230769 0.309524
rain 0.28125 0.423077 0.142857
sun 0.31250 0.346154 0.547619
在這里,假設今天下雨,明天將是晴天的(向前)概率在“rain”列,“sun”行中找到。 如果您想要反向概率(根據今天的天氣可能是昨天的天氣),請切換前兩個參數。
如果您希望將概率存儲在行而不是列中,則設置normalize=0
但請注意,如果您直接在此示例中執行此操作,則會獲得存儲為行的向后概率。 如果您想獲得與上述相同但轉置的結果,您可以 a) 是,轉置或 b) 切換前兩個參數的順序並將normalize
設置為 0。
如果您只想將結果保留為numpy
.values
數組(而不是.values
數據幀), .values
在最后一個括號后鍵入.values
。
我喜歡pandas
和itertools
的組合。 代碼塊比上面的要長一些,但不要將冗長與速度混為一談。 ( window
函數應該非常快;誠然,pandas 部分會更慢。)
首先,制作一個“窗口”功能。 這是 itertools 食譜中的一個。 這會讓你得到一個轉換元組列表(state1 到 state2)。
from itertools import islice
def window(seq, n=2):
"""Sliding window width n from seq. From old itertools recipes."""
it = iter(seq)
result = tuple(islice(it, n))
if len(result) == n:
yield result
for elem in it:
result = result[1:] + (elem,)
yield result
# list(window(days))
# [('rain', 'rain'),
# ('rain', 'rain'),
# ('rain', 'clouds'),
# ('clouds', 'rain'),
# ('rain', 'sun'),
# ...
然后使用pandas groupby + value counts 操作得到從每個狀態1到每個狀態2的轉換矩陣:
import pandas as pd
pairs = pd.DataFrame(window(days), columns=['state1', 'state2'])
counts = pairs.groupby('state1')['state2'].value_counts()
probs = (counts / counts.sum()).unstack()
您的結果如下所示:
print(probs)
state2 clouds rain sun
state1
clouds 0.13 0.09 0.10
rain 0.06 0.11 0.09
sun 0.13 0.06 0.23
這是一個“純”的 numpy 解決方案,它創建 3x3 表,其中第零個暗淡(行號)對應於今天,最后一個暗淡(列號)對應於明天。
從單詞到索引的轉換是通過在第一個字母之后截斷然后使用查找表來完成的。
用於計數numpy.add.at
被使用。
這是在考慮效率的情況下編寫的。 它在不到一秒鍾的時間內完成一百萬個單詞。
import numpy as np
report = [
'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
# create np array, keep only first letter (by forcing dtype)
# obviously, this only works because rain, sun, clouds start with different
# letters
# cast to int type so we can use for indexing
ri = np.array(report, dtype='|S1').view(np.uint8)
# create lookup
c, r, s = 99, 114, 115 # you can verify this using chr and ord
lookup = np.empty((s+1,), dtype=int)
lookup[[c, r, s]] = np.arange(3)
# translate c, r, s to 0, 1, 2
rc = lookup[ri]
# get counts (of pairs (today, tomorrow))
cnts = np.zeros((3, 3), dtype=int)
np.add.at(cnts, (rc[:-1], rc[1:]), 1)
# or as probs
probs = cnts / cnts.sum()
# or as condional probs (if today is sun how probable is rain tomorrow etc.)
cond = cnts / cnts.sum(axis=-1, keepdims=True)
print(cnts)
print(probs)
print(cond)
# [13 9 10]
# [ 6 11 9]
# [13 6 23]]
# [[ 0.13 0.09 0.1 ]
# [ 0.06 0.11 0.09]
# [ 0.13 0.06 0.23]]
# [[ 0.40625 0.28125 0.3125 ]
# [ 0.23076923 0.42307692 0.34615385]
# [ 0.30952381 0.14285714 0.54761905]]
這是幫助您入門的編碼設置。
report = [
'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
weather_dict = {"sun":0, "clouds":1, "rain": 2}
weather_code = [weather_dict[day] for day in report]
print weather_code
for n in range(1, len(weather_code)):
yesterday_code = weather_code[n-1]
today_code = weather_code[n]
# You now have the indicies you need for your 3x3 matrix.
似乎您想創建一個太陽后雨或雲后太陽(或等)的概率矩陣。 您可以像這樣吐出概率矩陣(不是數學術語):
def probabilityMatrix():
tomorrowsProbability=np.zeros((3,3))
occurancesOfEach = Counter(data)
myMatrix = Counter(zip(data, data[1:]))
probabilityMatrix = {key : myMatrix[key] / occurancesOfEach[key[0]] for key in myMatrix}
return probabilityMatrix
print(probabilityMatrix())
但是,您可能想根據今天的天氣吐出每種天氣的概率:
def getTomorrowsProbability(weather):
probMatrix = probabilityMatrix()
return {key[1] : probMatrix[key] for key in probMatrix if key[0] == weather}
print(getTomorrowsProbability('sun'))
下面是使用熊貓的另一種選擇。 轉換列表可以替換為“rain”、“clouds”等。
import pandas as pd
transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D'] * 2
df = pd.DataFrame(columns = ['state', 'next_state'])
for i, val in enumerate(transitions[:-1]): # We don't care about last state
df_stg = pd.DataFrame(index=[0])
df_stg['state'], df_stg['next_state'] = transitions[i], transitions[i+1]
df = pd.concat([df, df_stg], axis = 0)
cross_tab = pd.crosstab(df['state'], df['next_state'])
cross_tab.div(cross_tab.sum(axis=1), axis=0)
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.