Python + Regex + CSV + Pandas: failed to produce numeric values from alphanumeric values

I am reading data from a multi-sheet xlsx file and storing it in separate csv files. The first rows of all the sheets in the xlsx are stored in the first csv, the second rows of all the sheets in the second csv, and so on. Sometimes a cell in the 3rd to 10th columns contains an alphanumeric value such as '1 pkt'. I need to make these values numeric only, like '1', so that I can feed them to an ML model. For that purpose I wrote this code:

import csv
import re

import pandas as pd
import xlrd

xls = xlrd.open_workbook(r'Smallys ORDER.xlsx', on_demand=True)
df_list = []

names = xls.sheet_names()
names.remove('EVENT')

for i in range(191):
    rows = []
    for name in names:
        count = 0
        prod = pd.read_excel('Smallys ORDER.xlsx', name, index_col=None, header=0)
        prod['date'] = name
        prod.fillna(0, inplace=True)
        try:
            item = prod.iloc[i]
            item[3] = re.split('[a-z]+', item[3])[0]
            print(item[3])
            '''item[4] = item[4].split(sep, 1)[0]
            item[5] = item[5].split(sep, 1)[0]
            item[6] = item[6].split(sep, 1)[0]
            item[7] = item[7].split(sep, 1)[0]
            item[8] = item[8].split(sep, 1)[0]
            item[9] = item[9].split(sep, 1)[0]
            item[10] = item[10].split(sep, 1)[0]'''


            rows.append(item)

        except:
            print('Row finished !!!')


    writer = csv.writer(open('/home/hp/products/' + 'prod['+str(i)+'].csv', 'w')) 
    writer.writerow(prod.columns.tolist())
    writer.writerows(rows)    

The print(item[3]) statement prints nothing. Also, the generated CSVs contain only the headers; all the cells are empty.

Edit:

Before applying any regex, this:

item = prod.iloc[i]
print(item[3])
print(type(item[3]))

prints this:

0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
0
<class 'int'>
1 btl
<class 'str'>
0
<class 'int'>

So the values are either ints or strings.

Sample data from a sheet of the original xlsx file:

[image not available]

Since you want to turn text like '1 pkt' into '1', substituting is a better fit than splitting on [a-z]+. Change this line:

item[3] = re.split('[a-z]+', item[3])[0]

to:

item[3] = re.sub(r'\D*', '', str(item[3]))

which replaces every non-digit character with an empty string.
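The str() call in that line probably matters as much as the pattern. The question's output shows that most cells are ints, and the re functions raise TypeError on non-string input, which the bare except in the loop would silently swallow. A minimal sketch of the substitution as a reusable helper follows; the name to_number and the choice to return 0 when no digits are found are assumptions for illustration, not part of the original code.

```python
import re

def to_number(value):
    """Hypothetical helper: keep only the digits of value and return an int."""
    # str() lets ints such as 0 pass through the regex unchanged
    digits = re.sub(r'\D+', '', str(value))
    # assumption: a cell with no digits at all counts as 0
    return int(digits) if digits else 0

# Without str(), the re functions reject non-string input such as the int 0.
try:
    re.split('[a-z]+', 0)
    split_error = None
except TypeError as exc:
    split_error = exc

print(to_number('1 pkt'), to_number(0), to_number('btl'))
print(type(split_error).__name__)
```

Note that \D also removes decimal points and minus signs, so a value such as '1.5 kg' would come out as 15. If fractional quantities can occur, a different pattern would be needed.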

Let me know if this works. If not, can you print the value of item[3] and show what it prints?
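The question applies the conversion to positions 3 through 10 of each row, so the same substitution can be looped over those positions before the row is written. The sketch below uses plain lists and an in-memory buffer so that it runs without the workbook; clean_row, the column names, and the sample row are all invented for illustration.

```python
import csv
import io
import re

def clean_row(row, first=3, last=10):
    """Return a copy of row with positions first..last reduced to their digits."""
    cleaned = list(row)
    for pos in range(first, min(last, len(cleaned) - 1) + 1):
        digits = re.sub(r'\D+', '', str(cleaned[pos]))
        # assumption: a cell with no digits at all counts as 0
        cleaned[pos] = int(digits) if digits else 0
    return cleaned

# Invented header and row, shaped like the question's data (date column last).
header = ['code', 'name', 'unit', 'mon', 'tue', 'wed', 'thu',
          'fri', 'sat', 'sun', 'total', 'date']
rows = [['A1', 'cola', 'btl', 0, '1 btl', 0, 0, '2 pkt', 0, 0, 3, 'JAN 1']]

buffer = io.StringIO()  # stands in for open(path, 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(clean_row(r) for r in rows)
csv_text = buffer.getvalue()
print(csv_text)
```

In the real loop, opening the file in a with block and catching only IndexError for the "row finished" case (instead of a bare except) would let any remaining errors surface rather than leaving the CSV empty.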
