
Machine Learning: Predict a second dataset with a classifier trained on the first dataset

I am new to machine learning and have been trying to implement this, but it is still unclear to me. I have been struggling with it for two months, so please help me resolve my error.

What I am trying to do:

  1. Train an SVM classifier on TRAIN_features and TRAIN_labels extracted from TRAIN_dataset, which has shape (98962,) and size 98962.
  2. Test the SVM classifier on TEST_features extracted from another dataset, TEST_dataset, which has the same shape (98962,) and size 98962 as TRAIN_dataset.

After preprocessing both TRAIN_features and TEST_features, I vectorized each of them with TfidfVectorizer and then computed the shape and size of both again:

vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)

The size of processed_TRAIN_features becomes 1032665 and its shape becomes (98962, 9434).

vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

The size of processed_TEST_features becomes 1457961 and its shape becomes (98962, 10782).

I know that if I train the SVM classifier on processed_TRAIN_features and then call predict on processed_TEST_features with the same classifier, it will raise an error, because the two feature matrices now have different shapes and sizes.
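The mismatch described above can be reproduced on a toy scale. The sketch below assumes two tiny made-up corpora in place of the real datasets, and drops min_df and max_df because they would filter out every word in such a small example. It only illustrates that two separately fitted TfidfVectorizer objects learn two different vocabularies, and therefore produce matrices with different numbers of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the cleaned TRAIN and TEST texts.
train_texts = ["good movie", "bad movie", "good plot"]
test_texts = ["good acting", "terrible plot twist", "bad acting"]

# Each vectorizer builds its own vocabulary from the texts it is fitted on.
X_train = TfidfVectorizer().fit_transform(train_texts)
X_test_refit = TfidfVectorizer().fit_transform(test_texts)

# Same number of rows (documents), different number of columns (terms).
print(X_train.shape)       # 4 distinct training terms -> 4 columns
print(X_test_refit.shape)  # 6 distinct test terms -> 6 columns
```

A classifier fitted on X_train expects 4 input features, so passing it a 6-column matrix fails in the same way as the 9434 versus 10782 columns in the question.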

I think the only solution is to reshape one of the sparse matrices (dtype numpy.float64), either processed_TEST_features or processed_TRAIN_features. I assume only processed_TRAIN_features can be reshaped, since its size is smaller than that of processed_TEST_features. Or is there another way to implement points 1 and 2 above? I have not been able to apply this to my problem, and I am still searching for a way to make processed_TRAIN_features equal to processed_TEST_features in shape and size.
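On the reshape idea: a short sketch, again on made-up toy texts, of why forcing the two matrices to the same shape would probably not help. Each column index stands for a specific term, and the term-to-column mapping (the vocabulary_ attribute) differs between two independently fitted vectorizers. The usual alternative is to call transform on the new texts with the vectorizer that was fitted on the training texts, which keeps the columns aligned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the cleaned TRAIN and TEST texts.
train_texts = ["good movie", "bad movie"]
test_texts = ["awful movie", "good acting"]

vec_train = TfidfVectorizer().fit(train_texts)
vec_test = TfidfVectorizer().fit(test_texts)

# vocabulary_ maps each term to its column index; the same term can land
# in a different column depending on which corpus the vectorizer saw.
print(vec_train.vocabulary_["good"], vec_test.vocabulary_["good"])

# Reusing the training vectorizer keeps the training columns; terms that
# never appeared in the training texts ("awful", "acting") are dropped.
X_train = vec_train.transform(train_texts)
X_test = vec_train.transform(test_texts)
print(X_train.shape, X_test.shape)
```

With this approach the number of rows may still differ between the two datasets, but that does not matter for predict; only the number and meaning of the columns has to match.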

If anyone can help me with this, thanks in advance.

Full code is below:

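import re

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

DataPath2     = ".../train.csv"
TRAIN_dataset =   pd.read_csv(DataPath2)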

DataPath1     = "..../completeDATAset.csv"
TEST_dataset  =   pd.read_csv(DataPath1)

TRAIN_features = TRAIN_dataset.iloc[:, 1 ].values
TRAIN_labels = TRAIN_dataset.iloc[:,0].values

TEST_features = TEST_dataset.iloc[:, 1 ].values
TEST_labeels = TEST_dataset.iloc[:,0].values
lab_enc = preprocessing.LabelEncoder()
TEST_labels = lab_enc.fit_transform(TEST_labeels)

processed_TRAIN_features = []

for sentence in range(0, len(TRAIN_features)):
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(TRAIN_features[sentence]))

    # remove all single characters
    processed_feature= re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

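    # Remove the literal tokens "xe2 x80 xa6" (likely a leftover escaped ellipsis);
    # the original character class [xe2 x80 xa6] matched single characters instead
    processed_feature = re.sub(r'\bxe2\s+x80\s+xa6\b', ' ', processed_feature)

    # Remove the literal tokens "xe2 x80 x98" (likely a leftover escaped left quote)
    processed_feature = re.sub(r'\bxe2\s+x80\s+x98\b', ' ', processed_feature)

    # Remove the literal tokens "xe2 x80 x99" (likely a leftover escaped right quote)
    processed_feature = re.sub(r'\bxe2\s+x80\s+x99\b', ' ', processed_feature)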

    # Remove single characters from the start (unescaped ^ anchors at the start of the string)
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Substituting multiple spaces with single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)

    #remove links
    processed_feature = re.sub(r"http\S+", "", processed_feature)

    # Removing prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    #removing rt
    processed_feature = re.sub(r'^rt\s+', '', processed_feature)

    # Converting to Lowercase
    processed_feature = processed_feature.lower()

    processed_TRAIN_features.append(processed_feature)

vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)


processed_TEST_features = []

for sentence in range(0, len(TEST_features)):
    # Remove all the special characters
    processed_feature1 = re.sub(r'\W', ' ', str(TEST_features[sentence]))

    # remove all single characters
    processed_feature1 = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature1)

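    # Remove the literal tokens "xe2 x80 xa6" (likely a leftover escaped ellipsis);
    # the original character class [xe2 x80 xa6] matched single characters instead
    processed_feature1 = re.sub(r'\bxe2\s+x80\s+xa6\b', ' ', processed_feature1)

    # Remove the literal tokens "xe2 x80 x98" (likely a leftover escaped left quote)
    processed_feature1 = re.sub(r'\bxe2\s+x80\s+x98\b', ' ', processed_feature1)

    # Remove the literal tokens "xe2 x80 x99" (likely a leftover escaped right quote)
    processed_feature1 = re.sub(r'\bxe2\s+x80\s+x99\b', ' ', processed_feature1)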

    # Remove single characters from the start (unescaped ^ anchors at the start of the string)
    processed_feature1 = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature1)

    # Substituting multiple spaces with single space
    processed_feature1 = re.sub(r'\s+', ' ', processed_feature1, flags=re.I)

    #remove links
    processed_feature1 = re.sub(r"http\S+", "", processed_feature1)

    # Removing prefixed 'b'
    processed_feature1 = re.sub(r'^b\s+', '', processed_feature1)

    #removing rt
    processed_feature1 = re.sub(r'^rt\s+', '', processed_feature1)

    # Converting to Lowercase
    processed_feature1 = processed_feature1.lower()

    processed_TEST_features.append(processed_feature1)

vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

X_train_data, X_test_data, y_train_data, y_test_data = train_test_split(processed_TRAIN_features, TRAIN_labels, test_size=0.3, random_state=0)

text_classifier = svm.SVC(kernel='linear', class_weight="balanced" ,probability=True ,C=1 , random_state=0)

text_classifier.fit(X_train_data, y_train_data)

text_classifier.predict(processed_TEST_features)
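One way to implement points 1 and 2 is sketched below, under stated assumptions: the texts and labels are invented toy data, min_df and max_df from the code above are left out because they would empty the vocabulary of such a small corpus, and probability=True is omitted to keep the example small. The sketch wraps the vectorizer and the SVC in a scikit-learn pipeline, so the vocabulary is learned once from the training texts and reused automatically at prediction time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy data standing in for the cleaned TRAIN texts and labels
# and for the cleaned TEST texts.
train_texts = ["good movie", "great plot", "bad movie", "awful plot"]
train_labels = [1, 1, 0, 0]
new_texts = ["good plot", "awful movie", "great acting"]

# The pipeline calls fit_transform on the vectorizer during fit and only
# transform during predict, so both matrices share the same columns.
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, use_idf=True),
    SVC(kernel="linear", class_weight="balanced", C=1, random_state=0),
)
model.fit(train_texts, train_labels)

# One predicted label per new text, regardless of how many new texts there are.
predictions = model.predict(new_texts)
print(predictions)
```

In the code above, the equivalent change would be to replace the second fit_transform call with vectorizer.transform(processed_TEST_features), using the vectorizer that was fitted on the training texts, so that vectorizer1 is no longer needed.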

Title EDIT: predict classification of dataset => predict dataset

processed_TRAIN_features = csr_matrix(processed_TRAIN_features, shape=(new_row_length, new_column_length))  # new_row_length and new_column_length are placeholders
