How to use a random forest classifier in Python with 2 different data sets?
I have 2 data sets with different variables, but both include a variable, say NUM, that helps to identify the occurrence of an event. Using NUM, I was able to identify each event and label it. How can I run a random forest (RF) so that it takes both data sets into account? I cannot append them column-wise because the number of records for each NUM differs.
From the way your question is phrased, I'm guessing you have two pandas dataframes.
You can use pandas.merge to pull the two together. All you need is a join on NUM. A left join might be what you're looking for, but if you only want rows whose NUM value appears in both dataframes, use an inner join.
See the documentation here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
Here's how that might look:
pd.merge(df1, df2, how='left', on='NUM')
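To make the difference between the two join types concrete, here is a minimal sketch on toy data. The `speed` and `temp` columns and all values are invented for illustration; only NUM comes from the question.

```python
import pandas as pd

# Hypothetical toy data: every column except NUM is made up.
df1 = pd.DataFrame({"NUM": [1, 1, 2, 3], "speed": [10.0, 12.0, 9.0, 11.0]})
df2 = pd.DataFrame({"NUM": [1, 2, 2, 4], "temp": [70.0, 65.0, 66.0, 80.0]})

# Left join: keep every row of df1; NUM values absent from df2 get NaN.
left = pd.merge(df1, df2, how="left", on="NUM")

# Inner join: keep only NUM values present in both dataframes.
inner = pd.merge(df1, df2, how="inner", on="NUM")

print(left)
print(inner)
```

Note that when a NUM value appears more than once on either side, the merge produces one row per matching pair, so the row count can grow. If that is not what you want, one option is to aggregate each dataframe to one row per NUM before merging.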
You could try keeping NUM as a single shared column and giving the first and second datasets completely independent columns, with the non-matching cells left empty. Whether the results are any good will depend heavily on your data.
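A rough sketch of that idea, assuming both dataframes carry a `label` column derived from NUM. The `label`, `speed` and `temp` columns, the toy values, and the choice of -1 as a fill value are all assumptions for illustration, not something stated in the question.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: every column except NUM is invented for illustration.
df1 = pd.DataFrame({"NUM": [1, 1, 2, 3], "speed": [10.0, 12.0, 9.0, 11.0],
                    "label": [1, 1, 0, 0]})
df2 = pd.DataFrame({"NUM": [1, 2, 2, 3], "temp": [70.0, 65.0, 66.0, 80.0],
                    "label": [1, 0, 0, 0]})

# Stack the rows: NUM and label are shared, speed/temp stay independent,
# and cells with no counterpart in the other dataset become NaN.
stacked = pd.concat([df1, df2], ignore_index=True, sort=False)

# Assumption: fill the gaps with a sentinel value; whether this is
# sensible depends entirely on the data.
X = stacked.drop(columns="label").fillna(-1.0)
y = stacked["label"]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.predict(X))
```

The sentinel fill is only one way to handle the empty cells; imputing a mean or median, or adding an indicator column for which dataset a row came from, may work better depending on the data.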