Python + Regex + CSV + Pandas : failed to produce numeric values from alpha-numeric values

Question

I am fetching data from a multi-sheet xlsx file and storing it in separate CSV files. The first rows of all the sheets in the xlsx are stored in the first CSV, the 2nd rows of all the sheets are stored in the 2nd CSV, and so on. Sometimes a cell in the 3rd to 10th columns contains an alphanumeric value such as '1 pkt'. I need to make these values purely numeric, like '1', so that I can feed them to an ML model. For that purpose I wrote this code:

import csv
import re

import pandas as pd
import xlrd

xls = xlrd.open_workbook(r'Smallys ORDER.xlsx', on_demand=True)
df_list = []

names = xls.sheet_names()
names.remove('EVENT')

for i in range(191):
    rows = []
    for name in names:
        count = 0
        prod = pd.read_excel('Smallys ORDER.xlsx', name, index_col=None, header=0)
        prod['date'] = name
        prod.fillna(0, inplace=True)
        try:
            item = prod.iloc[i]
            item[3] = re.split('[a-z]+', item[3])[0]
            print(item[3])
            '''item[4] = item[4].split(sep, 1)[0]
            item[5] = item[5].split(sep, 1)[0]
            item[6] = item[6].split(sep, 1)[0]
            item[7] = item[7].split(sep, 1)[0]
            item[8] = item[8].split(sep, 1)[0]
            item[9] = item[9].split(sep, 1)[0]
            item[10] = item[10].split(sep, 1)[0]'''


            rows.append(item)

        except:
            print('Row finished !!!')


    writer = csv.writer(open('/home/hp/products/' + 'prod['+str(i)+'].csv', 'w')) 
    writer.writerow(prod.columns.tolist())
    writer.writerows(rows)

The print(item[3]) statement prints nothing. Also, the generated CSVs contain only the header row; all the other cells are empty.

Edit:

Before applying any regex, this:

item = prod.iloc[i]
print(item[3])
print(type(item[3]))

prints this:

0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
1 btl
<class 'str'>
0
<class 'int'>

So the values are either ints or strings.

Sample data from a sheet of the original xlsx file:

Answer 1

Since you want to turn text like '1 pkt' into '1', it is better to substitute than to split on [a-z]+. Change this line:

item[3] = re.split('[a-z]+', item[3])[0]

to:

item[3] = re.sub(r'\D*', '', str(item[3]))

which replaces every non-digit character with an empty string. Wrapping the value in str() also lets the call handle cells that are already ints.
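
To make the substitution easier to try in isolation, here is a minimal sketch that wraps it in a helper. The name digits_only and the choice to return 0 when no digits remain are assumptions made for illustration, not part of the original code. The sketch also shows that calling re.split on an int raises a TypeError, which a bare except would hide.

```python
import re


def digits_only(value):
    """Strip every non-digit character and return an int (0 if none remain).

    Hypothetical helper for illustration; str() lets it accept ints as well.
    """
    digits = re.sub(r'\D+', '', str(value))
    return int(digits) if digits else 0


# re.split() only accepts strings, so an int cell raises TypeError.
try:
    re.split('[a-z]+', 0)
except TypeError as exc:
    print('re.split failed on an int:', exc)

# Mixed int/str cells, like the ones shown in the question's edit.
for raw in [0, '1 btl', '1 pkt']:
    print(repr(raw), '->', digits_only(raw))
```

Note that removing every non-digit also removes decimal points and signs, so a value like '1.5 kg' would come out as 15; if such values can occur, a pattern that keeps the decimal point would be needed instead.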

Let me know if this works. If not, can you print the value of item[3] and show what it prints?
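
If all of the quantity columns need the same treatment, the cleanup could also be done column-wise in pandas before picking rows out. The following is only a sketch under assumptions: the function name clean_quantity_columns is hypothetical, the positions 3 to 10 are taken from the item[3] to item[10] indices in the question, and it keeps the first run of digits in each cell rather than stripping all non-digits.

```python
import pandas as pd


def clean_quantity_columns(df, first=3, last=10):
    """Return a copy of df with columns at positions first..last made integer.

    Hypothetical helper: keeps the first run of digits in each cell and
    falls back to 0 when a cell contains no digits at all.
    """
    out = df.copy()
    for col in out.columns[first:last + 1]:
        # Cast to str first so int cells and '1 pkt' cells are handled alike.
        extracted = out[col].astype(str).str.extract(r'(\d+)', expand=False)
        out[col] = pd.to_numeric(extracted, errors='coerce').fillna(0).astype(int)
    return out


# Made-up sample row, only to exercise the function.
sample = pd.DataFrame(
    [['a', 'b', 'c', 0, '1 btl', '2 pkt', 3, 'x', 0, 0, '10 kg', 'keep']]
)
print(clean_quantity_columns(sample))
```

Calling this on prod right after read_excel would leave the later row selection with numeric cells only, so no per-cell regex is needed inside the loop.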

solution1
1 ACCEPTED 2019-03-26 06:38:12
