Avoiding special characters with regex

I'm writing a Python script to extract metadata from a PDF using pyPdf.

The output is something like this:

{'/Subject': u'Presentation from the 2011 Water Program Peer Review',
 '/Producer': u'Mac OS X 10.7.2 Quartz PDFContext', 
 '/Creator': u'PowerPoint', 
 '/ModDate': u"D:20120109085812-07'00'", 
 '/Keywords': u'', 
 '/Title': u'Wind Wave Float', 
 '/CreationDate': 'D:20111030043455Z'}

I only need the title and subject fields, so the printed output would ideally be:

Wind Wave Float, Presentation from...

That way I can easily enter the data into a spreadsheet.

Can anyone help me with some regex? I can't figure out how to do it with all of the odd characters in the output.

Thanks.

The output you're looking at is a dictionary, so the information you want is already available. The u prefix you see in the output indicates that the value is a Unicode string.

I think the easiest way to reach your goal of getting the information into a spreadsheet is to add the following to your script:

(in Python 2.x):

print outputdict['/Title'] + ", " + outputdict['/Subject']

This will give you the output:

Wind Wave Float, Presentation from...

(Replace outputdict above with whatever object holds the dictionary you pasted in your question.)
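Since the goal is a spreadsheet, the same dictionary lookup can feed the standard csv module instead of a hand-built string. The following is a minimal sketch, assuming Python 3 and a plain dictionary shaped like the one in the question; the names metadata and title_and_subject are hypothetical and stand in for whatever your script uses.

```python
import csv
import io

# Hypothetical stand-in for the dictionary shown in the question.
metadata = {
    '/Subject': 'Presentation from the 2011 Water Program Peer Review',
    '/Producer': 'Mac OS X 10.7.2 Quartz PDFContext',
    '/Creator': 'PowerPoint',
    '/ModDate': "D:20120109085812-07'00'",
    '/Keywords': '',
    '/Title': 'Wind Wave Float',
    '/CreationDate': 'D:20111030043455Z',
}


def title_and_subject(info):
    """Return (title, subject), using '' for any field that is absent."""
    return info.get('/Title', ''), info.get('/Subject', '')


# Write one CSV row to an in-memory buffer; csv.writer quotes any
# field that itself contains a comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(title_and_subject(metadata))
row = buffer.getvalue()
print(row, end='')
```

Using get with a default means a PDF that lacks a title or subject produces an empty cell rather than a KeyError.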

Try:

(?i)((?<=subject': u')[^']+|(?<=title': u')[^']+)

This regex will match Presentation from the 2011 Water Program Peer Review and Wind Wave Float in:

{'/Subject': u'Presentation from the 2011 Water Program Peer Review', '/Producer': u'Mac OS X 10.7.2 Quartz PDFContext', '/Creator': u'PowerPoint', '/ModDate': u"D:20120109085812-07'00'", '/Keywords': u'', '/Title': u'Wind Wave Float', '/CreationDate': u'D:20111030043455Z'}

It matches everything after subject': u' or title': u' (case-insensitively) up to, but not including, the next ' .
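To see how this pattern behaves in Python, here is a small sketch that applies it with re.findall to the printed form of the dictionary. It assumes Python 3 and that the dictionary text is available as a string; the variable names text, pattern and matches are illustrative only.

```python
import re

# Printed form of the metadata dictionary, copied from the example above.
text = (
    "{'/Subject': u'Presentation from the 2011 Water Program Peer Review', "
    "'/Producer': u'Mac OS X 10.7.2 Quartz PDFContext', "
    "'/Creator': u'PowerPoint', "
    "'/ModDate': u\"D:20120109085812-07'00'\", "
    "'/Keywords': u'', "
    "'/Title': u'Wind Wave Float', "
    "'/CreationDate': u'D:20111030043455Z'}"
)

# Each lookbehind has a fixed width, which Python's re module requires.
pattern = re.compile(r"(?i)((?<=subject': u')[^']+|(?<=title': u')[^']+)")

# With a single capturing group, findall returns the captured strings
# in the order they occur in the text.
matches = pattern.findall(text)
print(matches)
```

The matches come back in the order the fields appear in the text, so the subject precedes the title here. The pattern also stops at the first apostrophe, so a value containing one would be cut short.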

Try this regex:

'/(Subject|Title)':\s+u('[^']+'|"[^"]+")(?=, )

Description

[Image: regular expression visualization]

Demo

https://www.debuggex.com/r/k78fFPfoaOofugig
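As a rough illustration of this second pattern, the sketch below runs it over the single-line form of the dictionary and builds a small lookup from field name to value. It assumes Python 3; text, pattern and fields are hypothetical names chosen for the example.

```python
import re

# Single-line printed form of the metadata dictionary from the example above.
text = (
    "{'/Subject': u'Presentation from the 2011 Water Program Peer Review', "
    "'/Producer': u'Mac OS X 10.7.2 Quartz PDFContext', "
    "'/Creator': u'PowerPoint', "
    "'/ModDate': u\"D:20120109085812-07'00'\", "
    "'/Keywords': u'', "
    "'/Title': u'Wind Wave Float', "
    "'/CreationDate': u'D:20111030043455Z'}"
)

pattern = re.compile(r"""'/(Subject|Title)':\s+u('[^']+'|"[^"]+")(?=, )""")

# Group 1 is the field name; group 2 is the value including its quotes,
# so the first and last characters are sliced off.
fields = {name: quoted[1:-1] for name, quoted in pattern.findall(text)}
print(fields['Title'] + ", " + fields['Subject'])
```

Because of the (?=, ) lookahead, a field is matched only when a comma and a space follow its value, so a field at the very end of the dictionary, or one followed by a line break, would be skipped.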
