
Using BERT with Keras neural networks for text classification

I'm trying to run a binary supervised text classification task using BERT, but I'm not sure how to do that. I have tried to run BERT with the Hugging Face transformers library, but I have no idea what to do with the output of the process.

After a lot of searching online, I ended up with the following class (adapted from https://towardsdatascience.com/build-a-bert-sci-kit-transformer-59d60ddd54a5):

from typing import List

import pandas as pd
import torch
from sklearn.base import BaseEstimator, TransformerMixin
from transformers import BertModel, BertTokenizer


class BertTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")
        self.model.eval()
        # Keep only the [CLS] token's embedding from the last hidden state
        self.embedding_func = lambda x: x[0][:, 0, :].squeeze()

    def _tokenize(self, text: str):
        # Tokenize the text with the provided tokenizer
        tokenized_text = self.tokenizer.encode_plus(
            text, add_special_tokens=True, truncation=True
        )["input_ids"]

        # Create an attention mask telling BERT to use all words
        attention_mask = [1] * len(tokenized_text)

        # BERT takes in a batch, so we need to unsqueeze the rows
        return (
            torch.tensor(tokenized_text).unsqueeze(0),
            torch.tensor(attention_mask).unsqueeze(0),
        )

    def _tokenize_and_predict(self, text: str) -> torch.Tensor:
        tokenized, attention_mask = self._tokenize(text)

        embeddings = self.model(tokenized, attention_mask=attention_mask)
        return self.embedding_func(embeddings)

    def transform(self, text: List[str]):
        if isinstance(text, pd.Series):
            text = text.tolist()

        with torch.no_grad():
            return torch.stack([self._tokenize_and_predict(string) for string in text])

    def fit(self, X, y=None):
        return self
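
For reference, this is roughly how I use the class on the scikit-learn side (a minimal sketch; LogisticRegression and the toy texts/labels here are just placeholders for my real data):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder data -- substitute your own corpus and binary labels
texts = ["a positive example", "a negative example"]
labels = np.array([1, 0])

pipeline = Pipeline([
    ("vectorizer", BertTransformer()),   # texts -> (n_samples, 768) embeddings
    ("classifier", LogisticRegression()),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["another document"]))
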
  1. This class is suitable for use in scikit-learn, which is good for me, but I also want to use it with deep learning models in Keras. How can I make this work with Keras neural networks (such as RNNs and CNNs)? (A sketch of what I'm aiming for follows this list.)

  2. From what I understand, the code above takes only the [CLS] token's embedding rather than all of the tokens. I don't know whether that's fine, or whether I should use all of them. If so, how can I do that?
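
To make question 1 concrete, here is a minimal sketch of what I am aiming for (assuming the BertTransformer class above and TensorFlow installed; texts and labels are placeholders again). Since transform() returns one 768-dimensional [CLS] vector per document, a plain feed-forward Keras network can consume it directly, whereas an RNN or CNN would need the full per-token sequence, which is what question 2 is about:

import numpy as np
import tensorflow as tf

# Placeholder data -- substitute your own texts and binary labels
texts = ["first example document", "second example document"]
labels = np.array([0, 1])

bert = BertTransformer()
# transform() returns a torch.Tensor of shape (n_samples, 768)
X = bert.transform(texts).numpy()

model = tf.keras.Sequential([
    tf.keras.Input(shape=(768,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, labels, epochs=3, batch_size=8)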

Any help would be appreciated.

I'm not sure what you mean by "the output of the process". If you want to use the model to make predictions, you can do that with something like the code below. There may also be some tips on using predictions from a pretrained model in the lazy-text-predict library, which might help you with the implementation of your text classifier in general.

import transformers
from transformers import BertForSequenceClassification, BertTokenizerFast

text = 'my text to classify'
# Load your fine-tuned model (here, from a local checkpoint directory)
model = BertForSequenceClassification.from_pretrained('/content/bert-base-uncased_model')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
text_classification = transformers.pipeline('sentiment-analysis',
                                            model=model,
                                            tokenizer=tokenizer)
y = text_classification(text)[0]
print(y)
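
Regarding your second question: x[0][:, 0, :] indeed keeps only the [CLS] token's embedding, which is a common and usually reasonable choice for classification. If you want to use all tokens instead, one standard alternative (my suggestion, not something from the article you linked) is mean pooling over the last hidden state, using the attention mask to ignore padding:

import torch

def mean_pool(model_output, attention_mask):
    # model_output[0] is the last hidden state: (batch, seq_len, hidden)
    token_embeddings = model_output[0]
    # Expand the mask so padded positions contribute nothing to the sum
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts  # (batch, hidden), averaged over real tokens

In _tokenize_and_predict you would then return mean_pool(embeddings, attention_mask).squeeze() instead of self.embedding_func(embeddings). If you instead want to feed a Keras RNN or CNN the full (seq_len, 768) sequence per document, you would also have to pad all documents to the same length, e.g. with padding='max_length' in encode_plus.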
