
Classify duplication using a decision tree

I'm working on a project where I build a decision tree in Python, and my data set is split into 80% train and 20% test. For the test data set, I want the decision tree to classify duplicates: it should read the test set and, whenever a comment is repeated, classify all the repetitions as No and only one of them as Yes. Is this possible? Can a decision tree be used to classify duplicates, and if so, how?

Note: the decision tree should also classify duplicates on the training data set.

Here is an example:

[example image]

Thank you

I see a couple of problems here:

  1. Decision trees are supervised learning methods, meaning they require labels. Your problem doesn't seem to have any labels (this isn't a classification problem: what would you classify your comments into? Uniqueness isn't a category). A sketch of what such a supervised setup needs follows this list.

  2. Decision trees need to find patterns in the data, but what is the pattern in your data? How should a tree distinguish between "It is crashing" and "Small icon"? What feature separates these two sentences? Again, uniqueness is too broad!
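
To make point 1 concrete, here is a minimal sketch (with made-up numeric features and labels, nothing from your data) of what scikit-learn's DecisionTreeClassifier actually needs: a numeric feature matrix and one label per row. Raw, unlabelled comments give you neither:

from sklearn.tree import DecisionTreeClassifier

# hypothetical numeric features and labels, only to show the required inputs
features = [[0, 1], [1, 0], [1, 1]]
labels = ['Yes', 'No', 'No']

clf = DecisionTreeClassifier().fit(features, labels)
print(clf.predict([[0, 1]]))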

To solve your problem, you need unsupervised learning methods. First, detect the duplicates in your data by finding similar comments. This can be done with various methods and libraries; a few are described here.

For instance, one way to do it is with cosine_similarity:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

comments_df = pd.DataFrame(data = {'Comment': ['Crash', 'It is crashing', 'Always crash', 'Small icon'], 'Unique': ['Y', 'N', 'N', 'Y']})

# Turn each comment into a TF-IDF vector
X = TfidfVectorizer().fit_transform(comments_df["Comment"].values)

threshold = 0.4

# Compare every pair of comments once and print the pairs whose
# cosine similarity exceeds the threshold
for i in range(X.shape[0]):
  for j in range(i + 1, X.shape[0]):
    sim_score = cosine_similarity(X[i], X[j])
    if sim_score > threshold:
      print(f"\"{comments_df.iloc[i]['Comment']}\", \"{comments_df.iloc[j]['Comment']}\"")
      print(f"Cos similarity: {sim_score}\n")

That outputs:

"Crash", "Always crash"
Cos similarity: [[0.6191303]]
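
If you then want the Yes/No behaviour from the question (keep the first occurrence, flag every later near-duplicate), you can build it on top of these pairwise scores. This is only a minimal sketch that reuses X, comments_df and threshold from the block above; the 'Predicted' column name is just for illustration:

# Mark the first occurrence of a group of similar comments as 'Y'
# and any later comment that resembles an earlier one as 'N'.
predicted = []
for i in range(X.shape[0]):
  is_duplicate = any(cosine_similarity(X[i], X[j]) > threshold for j in range(i))
  predicted.append('N' if is_duplicate else 'Y')

comments_df['Predicted'] = predicted
print(comments_df)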

You can also use spaCy:

import pandas as pd
import spacy

# the small English pipeline; see the note after the output about word vectors
nlp = spacy.load("en_core_web_sm")

comments_df = pd.DataFrame(data = {'Comment': ['Crash', 'It is crashing', 'Always crash', 'Small icon'], 'Unique': ['Y', 'N', 'N', 'Y']})

threshold = 0.4

# Compare every pair of comments once using spaCy's Doc.similarity
for i in range(comments_df.shape[0]):
  for j in range(i + 1, comments_df.shape[0]):
    doc1 = nlp(comments_df.iloc[i]['Comment'])
    doc2 = nlp(comments_df.iloc[j]['Comment'])
    similarity_score = doc1.similarity(doc2)

    if similarity_score > threshold:
      print(f"\"{comments_df.iloc[i]['Comment']}\", \"{comments_df.iloc[j]['Comment']}\"")
      print(f"Similarity: {similarity_score}\n")

That will output:

"Crash", "Always crash"
Similarity: 0.5155057238138886

"Crash", "Small icon"
Similarity: 0.5194295610092798
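
Note that en_core_web_sm does not ship static word vectors, so Doc.similarity falls back on context-sensitive tensors and spaCy will warn about it. If you can install a model with vectors (for example en_core_web_md, assuming you download it with python -m spacy download en_core_web_md), the similarity scores are usually more reliable:

import spacy

# en_core_web_md includes word vectors, so Doc.similarity is more meaningful
nlp = spacy.load("en_core_web_md")
print(nlp("Crash").similarity(nlp("Always crash")))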
