
Tokenize - String to Array of Tokens

My code:

import numpy as np
import pandas
import codecs
import re

dataframe = pandas.read_csv("tmp.csv", delimiter=",")
dataset = dataframe.values
x = dataset[:,0:1]
y = dataset[:,1]

#j = 0
for data in x:
    text = str(data[0])
    tokenizer = re.compile(r'\W+')
    tokens = tokenizer.split(text)
    i = 0
    for token in tokens:
        tokens[i] = token.lower()
        i += 1
    data = tokens
    #x[j] = tokens
    #j += 1
    print(data)

print(x)

While print(data) has the form ['token1', 'token2', ...],
print(x) still has the form [["text1"], ["text2"], ...].

I want x to have the form [['token1', 'token2', ...], ['token5', 'token6', ...], ...]

Using x[j] = tokens instead of data = tokens, with a counting index j, results in: ValueError: cannot copy sequence with size 4 to array axis with dimension 1
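For context, that ValueError is NumPy's complaint that each row of x has fixed dimension 1, so a list of several tokens cannot be copied into one row. A minimal reproduction, with made-up data in place of the real CSV:

```python
import numpy as np

# x mimics dataset[:, 0:1]: a 2-D object array whose rows have shape (1,).
x = np.array([["some text here"], ["more text"]], dtype=object)

try:
    # Trying to put 3 tokens into a row of dimension 1 fails,
    # just like x[j] = tokens in the question.
    x[0] = ["some", "text", "here"]
except ValueError as e:
    print(e)
```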

tmp.csv has this form: (screenshot omitted) — roughly 3.5 million rows.

I'm relatively new to Python, so I hope someone can help me.

Your code does not modify x in any way, so you get the same list you started with when you print(x). The loop variable data is rebound to tokens on each iteration, but that never writes back into x.

You need to create a new list to store the tokenised text (i.e. a list of lists). Add x_tokens = [] before the first for loop, then append each list of tokens with x_tokens.append(tokens).

import pandas
import re

dataframe = pandas.read_csv("tmp.csv", delimiter=",")
dataset = dataframe.values
x = dataset[:, 0:1]
y = dataset[:, 1]

# Compile the pattern once, outside the loop, and use a raw string
# so the backslash in \W is not treated as an escape sequence.
tokenizer = re.compile(r'\W+')
x_tokens = []

for data in x:
    text = str(data[0])
    # Split on non-word characters and lowercase in one pass.
    tokens = [token.lower() for token in tokenizer.split(text)]
    x_tokens.append(tokens)

    print(tokens)

print(x_tokens)
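As a side note, pandas can do the same thing without an explicit loop. A hedged sketch, assuming the first column of tmp.csv is named "text" (adjust to your actual column name); note that re.split leaves empty strings at the boundaries, exactly like the loop version above:

```python
import re
import pandas as pd

# Stand-in for pd.read_csv("tmp.csv"): two example rows.
df = pd.DataFrame({"text": ["Hello, World!", "Foo-bar baz"]})

# Lowercase the whole column, then split each entry on runs of
# non-word characters. Trailing punctuation yields an empty token.
x_tokens = df["text"].str.lower().apply(lambda s: re.split(r"\W+", s)).tolist()
print(x_tokens)
```

For 3.5 million rows this avoids the per-row print and keeps the work inside pandas; filter out empty strings afterwards if you don't want them.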
