Python: extracting numbers from an Excel column and writing them as output

I am trying to extract the numbers from columns in an Excel file and write them into the next columns.

Matching criteria: any number of length five, whether or not it starts with "PB".

I've limited the length of the number match to five; however, a "16" is still extracted (row 2, column D).

How can I improve it? Thank you.

import xlwt, xlrd, re
from xlutils.copy import copy 

workbook = xlrd.open_workbook("C:\\Documents\\num.xlsx")
old_sheet = workbook.sheet_by_name("Sheet1")

wb = copy(workbook) 
sheet = wb.get_sheet(0)

number_of_ships = old_sheet.nrows

for row_index in range(0, old_sheet.nrows):

    Column_a = old_sheet.cell(row_index, 0).value   
    Column_b = old_sheet.cell(row_index, 1).value

    a_b = Column_a + Column_b

    found_PB = re.findall(r"[PB]+(\d{5})", a_b, re.I)
    list_of_numbers = re.findall(r'\d+', a_b)

    for f in found_PB:
        if len(f) == 5:
            sheet.write(row_index, 2, ";".join(found_PB))

    for l in list_of_numbers:
        if len(l) == 5:
            sheet.write(row_index, 3, ";".join(list_of_numbers))

wb.save("C:\\Documents\\num-1.xls")    

Your \d+ pattern matches any run of one or more digits, which is why the value 16 is matched. Your [PB]+ character class matches P or B one or more times, so it requires the digits to be preceded by P or B. Since you want to match the digits whether or not they are prefixed, you do not need that restriction: if the prefix is optional, requiring it no longer makes sense.
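A quick check of the two original patterns illustrates both points. The sample string below is made up for illustration, since the real spreadsheet contents are not shown in the question:

```python
import re

# Hypothetical cell text; the real data is not shown in the question.
sample = "Ship 16 PB12345 and 67890"

# \d+ matches every run of digits, whatever its length.
all_runs = re.findall(r"\d+", sample)
print(all_runs)

# [PB]+ is a character class: it accepts P, B, PP, BP, ...
# (and the lowercase forms too, because of re.I).
prefixed = re.findall(r"[PB]+(\d{5})", sample, re.I)
print(prefixed)

# The original loop joins the WHOLE list as soon as one item has length 5,
# so the short match "16" is written out along with the others.
joined = ";".join(all_runs)
print(joined)
```

Checking len(l) == 5 inside the loop does not filter anything, because the value written is the joined full list rather than the single item being tested.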

You also seem to need exactly five digits, with no other digit immediately before or after them. You may do that with (?<!\d)\d{5}(?!\d). The (?<!\d) negative lookbehind makes sure there is no digit immediately to the left of the current location, \d{5} consumes five digits, and the (?!\d) negative lookahead makes sure there is no digit immediately to the right of the current location. That makes the if len(l) == 5: check redundant, and you may omit all the code related to list_of_numbers.
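As a minimal sketch of how the lookarounds behave, the helper below (five_digit_numbers is a hypothetical name, and the inputs are invented samples) applies the pattern to a few strings:

```python
import re

# Exactly five digits, not adjacent to any other digit on either side.
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")

def five_digit_numbers(text):
    """Return every standalone five-digit run found in text."""
    return FIVE_DIGITS.findall(text)

mixed = five_digit_numbers("Ship 16 PB12345 and 67890")
wrong_lengths = five_digit_numbers("123456 and 1234")
prefixed = five_digit_numbers("PB54321x")

print(mixed)          # the short "16" is skipped
print(wrong_lengths)  # six-digit and four-digit runs are both rejected
print(prefixed)       # letters around the digits do not matter
```

Because the lookarounds only inspect the neighbouring characters without consuming them, a prefix such as PB is neither required nor included in the result.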

So, you may just use:

import xlwt, xlrd, re
from xlutils.copy import copy 

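# Note: xlrd 2.0 and later read only .xls files; opening an .xlsx file
# this way requires an older xlrd release.
workbook = xlrd.open_workbook("C:\\Documents\\num.xlsx")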
old_sheet = workbook.sheet_by_name("Sheet1")

wb = copy(workbook) 
sheet = wb.get_sheet(0)

number_of_ships = old_sheet.nrows

for row_index in range(0, old_sheet.nrows):

    Column_a = old_sheet.cell(row_index, 0).value   
    Column_b = old_sheet.cell(row_index, 1).value

    a_b = Column_a + Column_b

    found_PB = re.findall(r"(?<!\d)\d{5}(?!\d)", a_b)

    # Write once per row, and only when at least one number was found.
    if found_PB:
        sheet.write(row_index, 2, ";".join(found_PB))

wb.save("C:\\Documents\\num-1.xls")    

You may use this: ^(?:PB)?\d{5}$

Explained:

^           # Begin of line/string
  (?:       # Begin of group
     PB     #   Literal 'PB'
  )         # End of group
  ?         # Make the previous group optional (? means 0 or 1 times)
  \d{5}     # 5 digits
$           # End of line/string

The $ is important: if you wrote just ^(?:PB)?\d{5}, the pattern would also match six-digit numbers, because \d{5} would match the first five digits and stop there without checking whether more digits follow.
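The effect of the trailing $ can be checked with a short comparison. This is only a sketch on invented sample values, and it assumes each cell holds a single candidate value, since the anchored pattern validates a whole string rather than searching inside longer text:

```python
import re

anchored = re.compile(r"^(?:PB)?\d{5}$")    # with the end anchor
unanchored = re.compile(r"^(?:PB)?\d{5}")   # end anchor left out

# Hypothetical cell values.
samples = ["12345", "PB12345", "123456", "PB1234"]

# Map each value to (matches with $, matches without $).
results = {s: (bool(anchored.match(s)), bool(unanchored.match(s))) for s in samples}

for value, (with_end, without_end) in results.items():
    print(f"{value!r}: with $ -> {with_end}, without $ -> {without_end}")
```

Only the six-digit value is treated differently by the two patterns: without $ its first five digits are accepted, while with $ the whole value is rejected.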

If your data may start or end with whitespace, you may use this instead: ^\s*(?:PB)?\d{5}\s*$. It simply adds \s* at the beginning and the end of the regex; \s* means zero or more whitespace characters.
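A small sketch of the whitespace-tolerant variant follows. The capturing group around \d{5} and the extract helper are additions for illustration, not part of the pattern above; they return just the digits from a padded value:

```python
import re

# Same pattern, with a capturing group added around the five digits.
PADDED = re.compile(r"^\s*(?:PB)?(\d{5})\s*$")

def extract(value):
    """Return the five digits of a valid cell value, or None otherwise."""
    match = PADDED.match(value)
    return match.group(1) if match else None

padded_prefixed = extract("  PB12345 ")
trailing_tab = extract("12345\t")
inner_space = extract("PB 12345")

print(padded_prefixed)
print(trailing_tab)
print(inner_space)  # None: whitespace is allowed only at the two ends
```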
