My code:
import numpy as np
import pandas
import codecs
import re
dataframe = pandas.read_csv("tmp.csv", delimiter=",")
dataset = dataframe.values
x = dataset[:,0:1]
y = dataset[:,1]
#j = 0
for data in x:
    text = str(data[0])
    tokenizer = re.compile('\W+')
    tokens = tokenizer.split(text)
    i = 0
    for token in tokens:
        tokens[i] = token.lower()
        i += 1
    data = tokens
    #x[j] = tokens
    #j += 1
    print(data)
print(x)
print(data) has the form ['token1', 'token2', ...], while print(x) still has the form [["text1"], ["text2"], ...].
I want x to have the form [['token1', 'token2', ...], ['token5', 'token6', ...], ...].
Using x[j] = tokens instead of data = tokens, with a counting index j, raises ValueError: cannot copy sequence with size 4 to array axis with dimension 1
tmp.csv has this form: (image), with about 3.5 million rows.
I'm relatively new to Python, so I hope someone can help me.
Your code does not modify x in any way: data = tokens only rebinds the loop variable, so print(x) shows the same array you had at the beginning.
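A small illustration of both effects, assuming a toy object array in place of the values loaded from tmp.csv (the sample strings are made up): assigning to the loop variable leaves x untouched, and assigning a longer list to a row that holds a single element raises ValueError.

```python
import numpy as np

# Hypothetical stand-in for dataset[:, 0:1]: an object array of shape (2, 1).
x = np.array([["Hello World"], ["Foo Bar Baz"]], dtype=object)

for data in x:
    # Rebinding the loop variable points the name `data` at a new list;
    # the row stored inside x is left untouched.
    data = str(data[0]).lower().split()

unchanged = x.tolist()  # still the original strings

# Each row of x has length 1, so a two-element list does not fit into it.
try:
    x[0] = ["hello", "world"]
    assignment_error = None
except ValueError as exc:
    assignment_error = exc
```

The exact wording of the error message may differ between NumPy versions; the point is only that a row of length 1 cannot receive a longer sequence.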
You need to create a new list to store the tokenised text (i.e. a list of lists). Add x_tokens = [] before the first for loop, then append each list of tokens with x_tokens.append(tokens).
import numpy as np
import pandas
import codecs
import re
dataframe = pandas.read_csv("tmp.csv", delimiter=",")
dataset = dataframe.values
x = dataset[:,0:1]
y = dataset[:,1]
# Collect one list of tokens per row here
x_tokens = []
for data in x:
    text = str(data[0])
    # Raw string, so \W is not treated as an invalid string escape
    tokenizer = re.compile(r'\W+')
    tokens = tokenizer.split(text)
    i = 0
    for token in tokens:
        tokens[i] = token.lower()
        i += 1
    x_tokens.append(tokens)
    print(tokens)
print(x_tokens)
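The same idea can be written more compactly. The following is only a sketch under a few assumptions: the helper name tokenize and the sample rows are hypothetical, the pattern is compiled once outside the loop, and empty strings left over by the split (for example after trailing punctuation) are dropped, which the code above does not do.

```python
import re

# Compile the pattern once; the raw string keeps \W from being read as a string escape.
TOKENIZER = re.compile(r"\W+")


def tokenize(text):
    """Split on runs of non-word characters and lowercase each token."""
    # The split can leave empty strings at either end, so filter them out.
    return [token.lower() for token in TOKENIZER.split(str(text)) if token]


# Hypothetical sample rows standing in for the first column of tmp.csv.
sample = ["Hello, World!", "Foo bar BAZ"]
x_tokens = [tokenize(text) for text in sample]
```

With the real data, the list comprehension could presumably iterate over the first column instead, for example dataframe.iloc[:, 0]. For around 3.5 million rows it may also be worth skipping the per-row print calls.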