I just learned pandas, and basically I want to select some rows of a DataFrame based on the IDs stored in another DataFrame. Let me show you the code:
import pandas as pd
from sklearn.model_selection import train_test_split
f_data="data.tsv"
all_data = pd.read_csv(f_data,delimiter='\t',encoding='utf-8',header=None)
x_data = all_data[[0,1,3]]
y_data = all_data[[2]]
# Split train and test sets
x_train,x_test,y_train,y_test = train_test_split(x_data,y_data,test_size=0.1)
all_data has 12 columns in total. I use three of the columns in x_data and one of them in y_data. Once I create x_train and x_test, I would like to write these instances into tsv files, but while doing that I want to write all 12 columns stored in all_data. To be able to do that, I need to match the instances in x_train and x_test with all_data. How can I do that?
EDIT
Here is how my data looks:
all_data
0 1 2 3 ... 8 9 10 11
0 35 Auch in Großbritannien, wo 19 Atomreaktoren in... Ausstieg -1.0 ... Sunday Times Sunday Times NaN 1
# continues like that
x_train
0 1 3
939 2074 Die CSU verlangt von der schwarz-gelben Koalit... 1.0
So, what I want to do is get the rows with indices 939, 710, 288, 854, 433 from all_data and write them into a file.
The index of the split data corresponds to the original, and can be used to look up the original rows (assuming the index is unique):
all_data.loc[x_train.index]
all_data.loc[x_test.index]
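Putting it together, here is a minimal sketch of the full flow: split, look up the full rows by index, then write every column to tsv with DataFrame.to_csv. The small four-row DataFrame and the file names train.tsv / test.tsv are made up for illustration; the real data has 12 columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for all_data (the real data has 12 columns, header=None style)
all_data = pd.DataFrame({0: [10, 20, 30, 40],
                         1: ['a', 'b', 'c', 'd'],
                         2: [-1.0, 1.0, -1.0, 1.0],
                         3: [0.5, 0.6, 0.7, 0.8]})

x_data = all_data[[0, 1, 3]]
y_data = all_data[[2]]

# train_test_split keeps each row's original index label
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.5, random_state=0)

# Use those index labels to pull the complete rows back out of all_data
train_full = all_data.loc[x_train.index]
test_full = all_data.loc[x_test.index]

# Write all columns, tab-separated, mirroring the header=None input format
train_full.to_csv('train.tsv', sep='\t', header=False, index=False)
test_full.to_csv('test.tsv', sep='\t', header=False, index=False)
```

Note that this relies on the index being unique; if you ever reset or duplicate the index before splitting, the lookup would no longer identify the original rows.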