What is the equivalent to R's match() for python Pandas/numpy?
I'm an R user and I can't figure out the pandas equivalent of match(). I need to use this function to iterate over a bunch of files, grab a key piece of information, and then merge it back into the current data structure on 'url'. In R I would do something like this:
logActions <- read.csv("data/logactions.csv")
logActions$class <- NA

files = dir("data/textContentClassified/")
for (i in 1:length(files)) {
  tmp <- read.csv(files[i])
  logActions$class[match(logActions$url, tmp$url)] <-
    tmp$class[match(tmp$url, logActions$url)]
}
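For comparison, here is a sketch of what that R loop does for a single file, written in pandas. `r_match` is a hypothetical helper (not a pandas function) built on `Index.get_indexer`, and it assumes the urls in `tmp` are unique; the two small frames are stand-ins for one logActions / tmp pair:

```python
import numpy as np
import pandas as pd

def r_match(x, table):
    """Mimic R's match(): position of each element of x in table, -1 for misses."""
    return pd.Index(table).get_indexer(x)  # requires unique values in table

# hypothetical stand-ins for one logActions / tmp pair
logActions = pd.DataFrame({'url': ['foo.com', 'bar.com'], 'klass': [np.nan, np.nan]})
tmp = pd.DataFrame({'url': ['bar.com'], 'klass': [1]})

pos = r_match(logActions['url'], tmp['url'])   # -1 where the url is not in tmp
hit = pos >= 0
# fill in classes only for the rows that matched, like the R assignment
logActions.loc[hit, 'klass'] = tmp['klass'].to_numpy()[pos[hit]]
print(logActions)
```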
I don't think I can use merge() or join(), because each one would overwrite logActions$class every time. I also can't use update() or combine_first(), since neither has the necessary indexing behaviour. I also tried creating a match() function based on this SO post, but couldn't figure out how to make it work with DataFrame objects. Apologies if I'm missing something obvious.
Here is some Python code summarizing my ineffective attempts at doing something like match() in pandas:
from pandas import *
from numpy import nan as NaN

left = DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = NaN
right1 = DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = DataFrame({'url': ['bar.com'], 'class': [1]})

# Doesn't work:
left.join(right1, on='url')
merge(left, right1, on='url')

# Also doesn't work the way I need it to:
left = left.combine_first(right1)
left = left.combine_first(right2)
left

# Also does something funky and doesn't really work the way match() does:
left = left.set_index('url', drop=False)
right1 = right1.set_index('url', drop=False)
right2 = right2.set_index('url', drop=False)
left = left.combine_first(right1)
left = left.combine_first(right2)
left
The desired output is:
url action class
0 foo.com 0 0
1 foo.com 1 0
2 bar.com 0 1
However, I need to be able to call this over and over so that I can iterate over each file.
Note the existence of pandas.match, which does precisely what R's match does.
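pandas.match was later removed from the public pandas API; in current pandas, `Index.get_indexer` provides the same positional lookup. A minimal sketch, where -1 plays the role of R's NA:

```python
import pandas as pd

table = pd.Index(['bar.com', 'foo.com'])          # values must be unique
positions = table.get_indexer(['foo.com', 'bar.com', 'baz.com'])
print(positions)  # [ 1  0 -1]  -- -1 marks "no match", like R's NA
```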
EDIT:
If the urls in all of the dataframes are unique, then you can make each right dataframe a Series of class indexed by url, and then you can get the class of every url in left by indexing into it.
import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'bar.com', 'foo.com', 'tmp', 'foo.com'],
                     'action': [0, 1, 0, 2, 4]})
left["klass"] = np.nan
right1 = pd.DataFrame({'url': ['foo.com', 'tmp'], 'klass': [10, 20]})
right2 = pd.DataFrame({'url': ['bar.com'], 'klass': [30]})

# reindex() returns NaN for urls missing from the right frame,
# so combine_first only fills the rows that are still unclassified
left["klass"] = left.klass.combine_first(
    right1.set_index('url').klass.reindex(left.url).reset_index(drop=True))
left["klass"] = left.klass.combine_first(
    right2.set_index('url').klass.reindex(left.url).reset_index(drop=True))
print(left)
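An alternative sketch of the same lookup uses `Series.map`, which also returns NaN for urls missing from the lookup table and so composes naturally with combine_first (same hypothetical left/right1/right2 frames as above):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'bar.com', 'foo.com', 'tmp', 'foo.com'],
                     'action': [0, 1, 0, 2, 4]})
left['klass'] = np.nan
right1 = pd.DataFrame({'url': ['foo.com', 'tmp'], 'klass': [10, 20]})
right2 = pd.DataFrame({'url': ['bar.com'], 'klass': [30]})

for right in (right1, right2):
    lookup = right.set_index('url')['klass']          # urls must be unique here
    # map() yields NaN for unmatched urls, so only the gaps get filled
    left['klass'] = left['klass'].combine_first(left['url'].map(lookup))

print(left)
```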
Is this what you want?
import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = np.nan
right1 = pd.DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = pd.DataFrame({'url': ['bar.com'], 'class': [1]})

pd.merge(left.drop("class", axis=1), pd.concat([right1, right2]), on="url")
Output:
action url class
0 0 foo.com 0
1 1 foo.com 0
2 0 bar.com 1
If the class column in left is not entirely NaN, you can combine it with the result of the merge.
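For instance, a sketch of keeping left's pre-existing values while filling the gaps from the merge result (the pre-set 5.0 is a hypothetical hand-assigned class, not from the original post):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'],
                     'action': [0, 1, 0],
                     'class': [5.0, np.nan, np.nan]})   # one pre-existing class
right = pd.concat([pd.DataFrame({'url': ['foo.com'], 'class': [0]}),
                   pd.DataFrame({'url': ['bar.com'], 'class': [1]})])

# inner merge preserves the order of the left keys, so the rows line up
merged = pd.merge(left.drop('class', axis=1), right, on='url')
# keep left's existing values; fill only the NaN gaps from the merge
left['class'] = left['class'].combine_first(merged['class'])
print(left)
```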
Here is the complete code I ended up with:
import csv
from datetime import date, timedelta

import numpy as np
import pandas as pd

# read in df containing actions in chunks:
tp = pd.read_csv('/data/logactions.csv',
                 quoting=csv.QUOTE_NONNUMERIC,
                 iterator=True, chunksize=1000,
                 encoding='utf-8', skipinitialspace=True,
                 error_bad_lines=False)
df = pd.concat([chunk for chunk in tp], ignore_index=True)

# set classes to NaN and drop rows without a url
df["klass"] = np.nan
df = df[pd.notnull(df['url'])]
df = df.reset_index(drop=True)

# iterate over text files, match, grab klass
startdate = date(2013, 1, 1)
enddate = date(2013, 1, 26)
d = startdate
while d <= enddate:
    dstring = d.isoformat()
    print(dstring)

    # Read in each file w/ classifications in chunks
    tp = pd.read_csv('/data/textContentClassified/content{dstring}classfied.tsv'.format(**locals()),
                     sep=',', quoting=csv.QUOTE_NONNUMERIC,
                     iterator=True, chunksize=1000,
                     encoding='utf-8', skipinitialspace=True,
                     error_bad_lines=False)
    thisdatedf = pd.concat([chunk for chunk in tp], ignore_index=True)
    thisdatedf = thisdatedf.drop_duplicates(['url'])
    thisdatedf = thisdatedf.reset_index(drop=True)
    thisdatedf = thisdatedf[pd.notnull(thisdatedf['url'])]

    df["klass"] = df.klass.combine_first(
        thisdatedf.set_index('url').klass.reindex(df.url).reset_index(drop=True))

    # Now iterate
    d = d + timedelta(days=1)