When I run my FinBERT model, it always crashes the RAM in Google Colab at outputs = model(**inputs)
from transformers.utils.dummy_pt_objects import HubertModel
import textwrap
# Reads all files at once but you will have to upload it again
import pandas as pd
import glob
import numpy as np
import torch
all_files = glob.glob("*.csv")
tickerList = []
textList = []
model.eval()
for filename in all_files:
    # Get ticker symbol
    ticker = filename.split('_', 1)[0].replace('.', '').upper()
    # Read file into dataframe
    df = pd.read_csv(filename)
    headlines_array = np.array(df)
    # Data frame will now be a list of text for tokenizer to process
    text = list(headlines_array[:, 0])
    textList.append(text)
    # Checks if we have seen this ticker before
    if ticker not in tickerList:
        tickerList.append(ticker)
    # Gets data to be an acceptable format for our model
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**inputs)  # time consuming and crashes RAM so can't put in for loop
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    positive = predictions[:, 0].tolist()
    negative = predictions[:, 1].tolist()
    neutral = predictions[:, 2].tolist()
    table = {'Headline': text,
             'Ticker': ticker,
             "Positive": positive,
             "Negative": negative,
             "Neutral": neutral}
    df = pd.DataFrame(table, columns=["Headline", "Ticker", "Positive", "Negative", "Neutral"])
    final_table = wandb.Table(columns=["Sentence", "Ticker", "Positive", "Negative", "Neutral"])
    for headline, pos, neg, neutr in zip(text, predictions[:, 0].tolist(), predictions[:, 1].tolist(), predictions[:, 2].tolist()):
        final_table.add_data(headline, ticker, pos, neg, neutr)
I'm not quite sure what is going wrong, as outputs = model(**inputs) runs fine outside the for loop but does not seem to run even once when I bring it inside the for loop.
You do
text = list(headlines_array[:,0])
and then later,
inputs = tokenizer(text, padding = True, truncation = True, return_tensors='pt')
Hence, you give the tokenizer the full list of text. It returns one padded tensor with a row for every element of your headlines_array. Unless you feed that to the model in batches, the model computes the predictions for every headline in a single forward pass. That can cause a memory problem.
You can do something like:
def chunks(lst, n):
    """Yield successive n-sized chunks from list."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

batch_size = 16
for batch in chunks(text, batch_size):
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
And then continue with the rest of your code inside that loop, collecting the predictions from each batch.
Note: The chunks function is from "How do you split a list into evenly sized chunks?"
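To make "continue with the rest of your code" concrete, here is a minimal sketch of how the per-batch results could be collected. The helper names predict_in_batches and make_finbert_scorer are hypothetical and not part of transformers or the question's code; tokenizer and model are assumed to be the objects the question already created. Wrapping the forward pass in torch.no_grad() is an addition the original answer does not mention: model.eval() alone does not stop PyTorch from recording the autograd graph, so disabling gradients should lower memory use further during inference.

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from list."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def predict_in_batches(texts, score_batch, batch_size=16):
    """Score texts one batch at a time and return one result row per text.

    score_batch is any callable that maps a list of strings to a list of
    rows (hypothetical interface, chosen for this sketch).
    """
    rows = []
    for batch in chunks(texts, batch_size):
        rows.extend(score_batch(batch))
    return rows


def make_finbert_scorer(tokenizer, model):
    """Build a score_batch callable from the question's tokenizer and model."""
    # Imported here so the batching helpers above stay usable without PyTorch.
    import torch

    def score_batch(batch):
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
        # Assumption: inference only, so gradient tracking is switched off.
        with torch.no_grad():
            outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        # Convert to plain Python lists so no tensors are kept between batches.
        return probs.tolist()

    return score_batch
```

Inside the question's loop, the single model call could then become rows = predict_in_batches(text, make_finbert_scorer(tokenizer, model)), with positive = [r[0] for r in rows] and likewise for the other two columns, following the column order the question already assumes. If memory is still tight, a smaller batch_size trades speed for a lower peak.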