Python, extracting numbers from Excel column and write as output
I am trying to extract the numbers from columns in an Excel file and write them into the next columns.
Matching criteria: any number of length five, whether or not it is preceded by "PB".
I've limited the length of the number match to five; however, a "16" is still extracted (row 2, column D).
How can I improve it? Thank you.
import xlwt, xlrd, re
from xlutils.copy import copy
workbook = xlrd.open_workbook("C:\\Documents\\num.xlsx")
old_sheet = workbook.sheet_by_name("Sheet1")
wb = copy(workbook)
sheet = wb.get_sheet(0)
number_of_ships = old_sheet.nrows
for row_index in range(0, old_sheet.nrows):
    Column_a = old_sheet.cell(row_index, 0).value
    Column_b = old_sheet.cell(row_index, 1).value
    a_b = Column_a + Column_b
    found_PB = re.findall(r"[PB]+(\d{5})", a_b, re.I)
    list_of_numbers = re.findall(r'\d+', a_b)
    for f in found_PB:
        if len(f) == 5:
            sheet.write(row_index, 2, ";".join(found_PB))
    for l in list_of_numbers:
        if len(l) == 5:
            sheet.write(row_index, 3, ";".join(list_of_numbers))
wb.save("C:\\Documents\\num-1.xls")
Your \d+ pattern matches any run of one or more digits, which is why the value 16 is matched. Your [PB]+ character class matches either P or B one or more times, so it requires the digits to be preceded by P or B. Since you want to match the digits whether or not they follow that prefix, you do not actually need the restriction (if something may only optionally precede the digits, requiring it no longer makes sense).
You also seem to need to extract strings of exactly 5 digits, with no other digit immediately before or after them. You may do that with (?<!\d)\d{5}(?!\d). The (?<!\d) negative lookbehind makes sure there is no digit immediately to the left of the current location, \d{5} consumes 5 digits, and the (?!\d) negative lookahead makes sure there is no digit immediately to the right of the current location. That makes the if len(l) == 5: line redundant, and you may omit the whole part of the code related to list_of_numbers.
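The difference between the two patterns can be checked without any Excel file. The following is a minimal sketch; the sample string is made up for illustration and is not taken from the question's spreadsheet.

```python
import re

# Hypothetical sample text, not data from the original workbook.
sample = "PB12345 row 16, 67890 and 123456"

# \d+ matches every run of digits, including the short "16".
all_runs = re.findall(r"\d+", sample)

# The lookarounds keep only runs of exactly five digits.
exact_five = re.findall(r"(?<!\d)\d{5}(?!\d)", sample)

print(all_runs)    # ['12345', '16', '67890', '123456']
print(exact_five)  # ['12345', '67890']
```

Note that the six-digit run is rejected as a whole: no five-digit slice of it can satisfy both lookarounds at once.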
So, you may just use:
import xlwt, xlrd, re
from xlutils.copy import copy
workbook = xlrd.open_workbook("C:\\Documents\\num.xlsx")
old_sheet = workbook.sheet_by_name("Sheet1")
wb = copy(workbook)
sheet = wb.get_sheet(0)
number_of_ships = old_sheet.nrows
for row_index in range(0, old_sheet.nrows):
    Column_a = old_sheet.cell(row_index, 0).value
    Column_b = old_sheet.cell(row_index, 1).value
    a_b = Column_a + Column_b
    # Match runs of exactly five digits (no digit directly before or after)
    found_PB = re.findall(r"(?<!\d)\d{5}(?!\d)", a_b)
    # Write the joined matches once per row, only when something was found
    if found_PB:
        sheet.write(row_index, 2, ";".join(found_PB))
wb.save("C:\\Documents\\num-1.xls")
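The extraction step can also be pulled out into a small function that is easy to test on its own. This is only a sketch under a few assumptions: the names cell_to_text and extract_five_digit_numbers are hypothetical, numeric cells are assumed to arrive as floats (as xlrd typically returns them), and the cell values are joined with a space rather than concatenated directly, so that digits from two columns cannot merge into a false five-digit match. That last point is a deliberate change from the a_b = Column_a + Column_b line above.

```python
import re

FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")

def cell_to_text(value):
    """Convert a cell value to text; whole-number floats drop the '.0'."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)

def extract_five_digit_numbers(*cell_values, sep=";"):
    """Return all exact five-digit runs found in the given cells, joined by sep."""
    # Join with a space so digits from different cells never run together.
    text = " ".join(cell_to_text(v) for v in cell_values)
    return sep.join(FIVE_DIGITS.findall(text))

print(extract_five_digit_numbers("PB12345 and 16", 67890.0))  # 12345;67890
```

Inside the loop, the result of extract_five_digit_numbers(Column_a, Column_b) could then be passed to sheet.write when it is not empty.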
You may use this: ^(?:PB)?\d{5}$
Explained:
^ # Begin of line/string
(?: # Begin of group
PB # Literal 'PB'
) # End of group
? # Make the previous group optional (? means 0 or 1 times)
\d{5} # 5 digits
$ # End of line/string
It is important to use the $: if you just wrote ^(?:PB)?\d{5}, you would also match 6-digit numbers despite the \d{5}, because the regex would match the first five digits and stop there without checking whether more digits follow.
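The effect of the $ anchor can be seen in a short sketch. The variable names and test strings are illustrative only.

```python
import re

anchored = re.compile(r"^(?:PB)?\d{5}$")
unanchored = re.compile(r"^(?:PB)?\d{5}")

# With $, a six-digit value is rejected outright.
six_digits_anchored = anchored.match("123456")

# Without $, the first five digits match and the sixth is silently ignored.
six_digits_unanchored = unanchored.match("123456")

print(six_digits_anchored)            # None
print(six_digits_unanchored.group())  # 12345
print(bool(anchored.match("PB12345")), bool(anchored.match("12345")))  # True True
```

Because this pattern is anchored at both ends, it validates a whole cell value; it will not find a five-digit number embedded in longer text.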
If your data may start or end with spaces, you may use this instead: ^\s*(?:PB)?\d{5}\s*$
It basically adds \s* at the beginning and the end of the regex. \s* means 0 or more whitespace characters.
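As a rough illustration of the whitespace-tolerant version, the sketch below wraps the digits in a capturing group so they can be returned without the optional prefix or the padding. The capturing group and the function name five_digits_from_cell are additions for this example, not part of the answer's pattern.

```python
import re

# Same pattern as above, with (\d{5}) captured so the digits can be returned.
PADDED = re.compile(r"^\s*(?:PB)?(\d{5})\s*$")

def five_digits_from_cell(value):
    """Return the five digits if the whole cell matches, otherwise None."""
    match = PADDED.match(value)
    return match.group(1) if match else None

print(five_digits_from_cell("  PB12345 "))  # 12345
print(five_digits_from_cell("PB 12345"))    # None (space between PB and the digits)
print(five_digits_from_cell("id 12345"))    # None (extra text before the digits)
```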