Syntax for Conditional Statement With Regex Function

I have written code to parse multiple PDF files and return a line of data from each page. I ran into the issue that some of the pages within my PDF files do not have this line. When this happens my code just omits the page entirely; however, I would like it to print a single 'none' for the pages where it cannot find the specified line. I thought this was a simple fix, but it's proving to be a little more complicated than I thought. Here is an example of the line I am pulling and what I have tried:

#pattern I told my code to look for within each page of pdf

sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')

#this is an example of what the line I want in each page looks like: 

'1600sqft $154.98 10/14' 

Basically I want the code to parse every PDF and return the line if it can find it. If it cannot, I want it to return a single 'none' for the page without that line. I collect the lines in a list like so:

lines = []

Here is how I set up my for loop to look through each page of my PDF files:

for files in os.listdir(directory):
  if files.endswith(".pdf"):
       with pdfplumber.open(files) as pdf:
         pages = pdf.pages
         for page in pdf.pages:
           text = page.extract_text()
           for line in text.split('\n'):
             
             line = sqft_re.search(line)
             if line:
                 line.group(1)
                 lines.append(line)

Example of output:

lines

'1600sqft $154.98 10/14' 
'1450sqft $113.02 07/05'
'90sqft $60.17 05/12' 
'3000sqft $500.98 09/20' 

This code successfully returns a list of data for pages with the line. However, pages without the line are omitted. Here is what I thought would fix the problem and simply print 'None' for pages without the line:

for files in os.listdir(directory):
  if files.endswith(".pdf"):
       with pdfplumber.open(files) as pdf:
         pages = pdf.pages
         for page in pdf.pages:
           text = page.extract_text()
           for line in text.split('\n'):
             
             line = sqft_re.search(line)
             if line:
                 line.group(1)
             else:
                 line = 'None'
             lines.append(line)

However, this did not work. Now, instead of just substituting 'None' for pages without the value, every single line within the PDF page is printed as 'None' except where it matches the pattern. So basically I now have a list that looks like this:

lines

'None'
'None'
'None'
'1600sqft $154.98 10/14' 
'None'
'None'
'None'
'1450sqft $113.02 07/05' #etc.....

I have tried some other things, like calling a different function when a line does not match what I am looking for, building my own string to substitute for the value, and a couple more. I am still getting the same problem. In my sample PDF there is only one page without this line, so my list should look like:

'1600sqft $154.98 10/14' 
'1450sqft $113.02 07/05'
'90sqft $60.17 05/12' 
'3000sqft $500.98 09/20' 
'None'

I am also pretty new to Python (R is what I primarily work with), so I am sure I am overlooking something here, but any guidance on what I am missing would be appreciated!

You should append the matched text, line.group(1), to the lines list rather than the line variable itself (which holds the match object after the search), unless that is your intention.
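To illustrate the difference, here is a small sketch using the pattern from the question. `re.search` returns a match object (or `None` when nothing matches), while `.group(1)` returns the matched text as a string. The `sample` string and the `match` variable name are made up for illustration only.

```python
import re

# Pattern from the question, written as a raw string
sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')

# Made-up line of page text, for illustration only
sample = 'Unit A 1600sqft $154.98 10/14'
match = sqft_re.search(sample)

lines = []
if match:
    # match is a match object; group(1) is the matched text itself
    lines.append(match.group(1))

print(lines)  # ['1600sqft $154.98 10/14']
```

Using a separate name such as `match` instead of reassigning `line` also keeps the original text of the line available if it is needed later.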

Besides, you need to set a flag to False before checking each page and, once there is a match, set it to True. If it is still False at the end of the page, add 'None' to lines.

See the sample Python code with the loop:

for page in pdf.pages:
  text = page.extract_text()
  found = False  # reset the flag for every page
  for line in text.split('\n'):
    line = sqft_re.search(line)
    if line:
      found = True
      lines.append(line.group(1))
  if not found:  # no line on this page matched
    lines.append('None')
