
HuggingFace FinBERT model in Google Colab

When I run my FinBERT model in Google Colab, it always crashes the RAM at outputs = model(**inputs).

import glob

import numpy as np
import pandas as pd
import torch
import wandb
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Tokenizer and model are loaded here for completeness; "ProsusAI/finbert" is the
# assumed FinBERT checkpoint, substitute the one you actually use
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
model.eval()

# Reads all CSV files at once, but you will have to upload them again each session
all_files = glob.glob("*.csv")
tickerList = []
textList = []
for filename in all_files:
    # Get the ticker symbol from the filename
    ticker = filename.split('_', 1)[0].replace('.', '').upper()
    # Read the file into a DataFrame
    df = pd.read_csv(filename)
    headlines_array = np.array(df)
    # Turn the first column into a plain list of strings so the tokenizer can process it
    text = list(headlines_array[:,0])
    textList.append(text)
    # Check whether we have seen this ticker before
    if ticker not in tickerList:
        tickerList.append(ticker)

    # Get the data into an acceptable format for the model
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**inputs)  # time consuming and crashes the RAM, so it can't be run inside the for loop
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # Column order assumed to match the FinBERT label order: positive, negative, neutral
    positive = predictions[:, 0].tolist()
    negative = predictions[:, 1].tolist()
    neutral = predictions[:, 2].tolist()

    table = {'Headline': text,
             'Ticker': ticker,
             'Positive': positive,
             'Negative': negative,
             'Neutral': neutral}

    df = pd.DataFrame(table, columns=["Headline", "Ticker", "Positive", "Negative", "Neutral"])
    final_table = wandb.Table(columns=["Sentence", "Ticker", "Positive", "Negative", "Neutral"])

    for headline, pos, neg, neutr in zip(text, positive, negative, neutral):
        final_table.add_data(headline, ticker, pos, neg, neutr)

I'm not quite sure what is going wrong, as outputs = model(**inputs) runs fine outside the for loop but does not seem to run even once when I bring it inside the for loop.

You do

text = list(headlines_array[:,0])

and then later,

inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

Hence, you give the tokenizer a list of texts, and it returns a tensor row for every element in your headlines_array. Unless you feed the data in batches, the model computes predictions for all of them at once, which can cause a memory problem.
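For illustration, the tokenized batch has one row per headline, so a single forward pass allocates activations for every headline in the file at once (illustrative snippet; text, tokenizer, and model as in the question):

inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
print(inputs['input_ids'].shape)  # (number_of_headlines, padded_sequence_length): grows with the file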

You can do something like:

def chunks(lst, n):
    """Yield successive n-sized chunks from list."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

batch_size = 16
for batch in chunks(text, batch_size):
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')

And then continue with the rest of your code inside that batch loop; a fuller sketch follows after the note below.

Note: The chunks function is from the Stack Overflow question "How do you split a list into evenly sized chunks?".
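For completeness, here is a minimal sketch of how the rest of the original loop body could look once batching is in place. It assumes the same tokenizer, model, ticker, text, and wandb table as in the question, and that indices 0/1/2 of the model output correspond to positive/negative/neutral for the FinBERT checkpoint being used:

batch_size = 16
positive, negative, neutral = [], [], []

for batch in chunks(text, batch_size):
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    # Accumulate per-batch probabilities instead of keeping them all in one tensor
    positive.extend(predictions[:, 0].tolist())
    negative.extend(predictions[:, 1].tolist())
    neutral.extend(predictions[:, 2].tolist())

# Build the per-file tables once, after all batches for this file are processed
table = {'Headline': text,
         'Ticker': ticker,
         'Positive': positive,
         'Negative': negative,
         'Neutral': neutral}
df = pd.DataFrame(table, columns=["Headline", "Ticker", "Positive", "Negative", "Neutral"])
final_table = wandb.Table(columns=["Sentence", "Ticker", "Positive", "Negative", "Neutral"])
for headline, pos, neg, neutr in zip(text, positive, negative, neutral):
    final_table.add_data(headline, ticker, pos, neg, neutr)

This way the per-headline probabilities accumulate across batches, so the resulting DataFrame and wandb.Table cover the whole file rather than just the last batch.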
