Python + Regex + CSV + Pandas: failed to produce numeric values from alphanumeric values

I am fetching data from a multi-sheet xlsx file and storing it in separate CSV files. The first rows of all the sheets in the xlsx are stored in the first CSV, the second rows of all the sheets in the second CSV, and so on. Sometimes cells in the 3rd to 10th columns contain alphanumeric values such as '1 pkt'. I need to make these values purely numeric, like '1', so that I can feed them to an ML model to predict something. For that purpose I wrote this code:

import csv
import re

import pandas as pd
import xlrd

xls = xlrd.open_workbook(r'Smallys ORDER.xlsx', on_demand=True)
df_list = []

names = xls.sheet_names()
names.remove('EVENT')

for i in range(191):
    rows = []
    for name in names:
        count = 0
        prod = pd.read_excel('Smallys ORDER.xlsx', name, index_col=None, header=0)
        prod['date'] = name
        prod.fillna(0, inplace=True)
        try:
            item = prod.iloc[i]
            item[3] = re.split('[a-z]+', item[3])[0]
            print(item[3])
            '''item[4] = item[4].split(sep, 1)[0]
            item[5] = item[5].split(sep, 1)[0]
            item[6] = item[6].split(sep, 1)[0]
            item[7] = item[7].split(sep, 1)[0]
            item[8] = item[8].split(sep, 1)[0]
            item[9] = item[9].split(sep, 1)[0]
            item[10] = item[10].split(sep, 1)[0]'''


            rows.append(item)

        except:
            print('Row finished !!!')


    writer = csv.writer(open('/home/hp/products/' + 'prod['+str(i)+'].csv', 'w')) 
    writer.writerow(prod.columns.tolist())
    writer.writerows(rows)    

The print(item[3]) statement prints nothing. Also, the generated CSVs contain only the header row; all the other cells are empty.

Edit:

Before applying any regex, this:

item = prod.iloc[i]
print(item[3])
print(type(item[3]))

prints this:

0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
1 btl
<class 'str'>
0
<class 'int'>

So the values are either ints or strings.

Sample data from a sheet of the original xlsx file:

[image: sample data from one sheet of the xlsx file]

As you want to change text like '1 pkt' to '1', it would be better to substitute than to split on [a-z]+. Change this line:

item[3] = re.split('[a-z]+', item[3])[0]

to:

item[3] = re.sub(r'\D*', '', str(item[3]))

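which replaces every non-digit character with an empty string. The str() call matters because some cells are already ints, and the re functions only accept strings.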
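A minimal sketch of the behaviour described above. It also shows a likely reason the original line produced no output: re.split raises a TypeError when it is given an int cell, and the bare except in the question would hide that error. The helper name digits_only and the sample cell values are made up for illustration. Note that stripping all non-digits also drops decimal points and minus signs, so this only suits whole, non-negative quantities.

```python
import re

def digits_only(value):
    """Return the digits found in a cell as an int (0 if there are none)."""
    # str() lets the same code handle cells that are already ints.
    text = re.sub(r'\D+', '', str(value))
    return int(text) if text else 0

# The original re.split call fails on an int cell; a bare `except:` hides this.
try:
    re.split('[a-z]+', 0)
    split_error = None
except TypeError as exc:
    split_error = exc

cells = [0, '1 btl', '1 pkt', 12]  # hypothetical mix of int and str cells
cleaned = [digits_only(c) for c in cells]
print(cleaned)
print(type(split_error).__name__)
```

Catching a specific exception such as IndexError instead of using a bare except would let errors like this one surface while still handling sheets that have fewer rows.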

Let me know if this works. If not, can you print the value of item[3] and show what it prints?
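If the same cleanup is needed for all of the 3rd to 10th columns, a vectorized pandas variant may be simpler than editing one cell at a time. This is only a sketch: the function name clean_columns, the column names, and the sample data are hypothetical. It extracts the first run of digits rather than deleting every non-digit, which is a slightly different rule from the re.sub line.

```python
import pandas as pd

def clean_columns(df, cols):
    """Return a copy of df with the given columns reduced to integers."""
    out = df.copy()
    for col in cols:
        # astype(str) covers mixed int/str cells; grab the first run of digits.
        digits = out[col].astype(str).str.extract(r'(\d+)', expand=False)
        # Cells with no digits become NaN, then 0.
        out[col] = pd.to_numeric(digits, errors='coerce').fillna(0).astype(int)
    return out

# Made-up sample resembling one sheet after fillna(0).
sample = pd.DataFrame({
    'item': ['cola', 'chips', 'water'],
    'qty_a': [0, '1 btl', 3],
    'qty_b': ['2 pkt', 0, 'n/a'],
})
cleaned = clean_columns(sample, ['qty_a', 'qty_b'])
print(cleaned)
```

With the real data, the column list could be taken from the DataFrame itself, for example prod.columns[3:11], assuming those positions match the columns described in the question.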

