
Python (Regex): How do you get Python to ignore all the newlines in between the string pattern you are trying to match?

I am trying to create a list of personnel with the following regex code:

list_of_electricians = re.findall(r'\d*\.<(\d*)<([\w+ ]*)<"([^"]*)"<"([^"]*)"', csvFile1.read(), re.S)
csvFile2 = open(r'C:\Users\Admin\SkyDrive\eCommerce\Servi-fied\Raw Data\EMA - Electricians (ReProcessed).csv', 'w+')
writer2 = csv.writer(csvFile2, delimiter=';')

for item in list_of_electricians:
    writer2.writerow(item)

The data that I am trying to extract appears in the string as follows:

1.<7059184<ABDUL HALIM M<"ABDUL HALIM M
                                  639 #24-98
                                 ROWELL ROAD
                        200639"<"62971924(Tel)
                   93632009(Hp)"

2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD
                                  251
                                 NORTH BRIDGE ROAD
                        179102"<"65476617(Tel)
                   96814905(Hp)"

3.<7063254<ANG CHUI POH<"AKP INDUSTRIES PTE LTD
                                  8B #05-08
                                 ADMIRALTY STREET
                        757440"<"64811528(Tel)
                   93890779(Hp)"

Any suggestions on how I should change the regex code so that all the newlines are ignored? I understand that I could remove all the "\n" (newline) characters before running the regex. However, I need those line breaks later on, because they make it easier to process the addresses.

At the end of the day, I am looking to create a csv file with the data separated into license number, name, address and phone numbers.

Thanks!

Your regular expression is pretty hard for me to parse in my head, so bear with me. In this case I might even try splitting the string on the chosen delimiters instead, because the pattern is pretty complicated.

One tool that's pretty helpful for this sort of thing is http://pythex.org

Anyway, adding [] around the " that closes the address field magically fixes it. Don't ask me why.

\d*\.<(\d*)<([\w+ ]*)<"([^"]*)["]<"([^"]*)"
                              /\
                             here
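For what it's worth, in Python's re module a one-character class such as ["] matches exactly what a bare " does, so I would expect both patterns to behave the same. If the original failed for you, the cause may be elsewhere (for example, in how the file was read). A quick check, assuming a two-record SAMPLE string copied from the question with the indentation shortened:

```python
import re

# Two records copied from the question; the indentation is shortened (assumption).
SAMPLE = '''1.<7059184<ABDUL HALIM M<"ABDUL HALIM M
        639 #24-98
        ROWELL ROAD
        200639"<"62971924(Tel)
        93632009(Hp)"

2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD
        251
        NORTH BRIDGE ROAD
        179102"<"65476617(Tel)
        96814905(Hp)"
'''

ORIGINAL = r'\d*\.<(\d*)<([\w+ ]*)<"([^"]*)"<"([^"]*)"'
BRACKETED = r'\d*\.<(\d*)<([\w+ ]*)<"([^"]*)["]<"([^"]*)"'

original_rows = re.findall(ORIGINAL, SAMPLE)
bracketed_rows = re.findall(BRACKETED, SAMPLE)

# [^"] matches newlines on its own, so the quoted fields span lines without any flag.
for licence, name, address, phones in bracketed_rows:
    print(licence, name, address.splitlines(), phones.splitlines())
```

Each tuple has four elements because the pattern has four capture groups; the leading record number is matched but not captured.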

The code that you have should give you a list of tuples that you can iterate over.

That means that your variable list_of_electricians will hold something like this (one tuple per record, one element per capture group; the indentation inside the fields is shortened here):

[('7059184',
  'ABDUL HALIM M',
  'ABDUL HALIM M\n  639 #24-98\n  ROWELL ROAD\n  200639',
  '62971924(Tel)\n  93632009(Hp)'),
 ('7055147',
  'ABDULLAH SUNNY BIN ALI',
  'SINGAPORE MRT LTD\n  251\n  NORTH BRIDGE ROAD\n  179102',
  '65476617(Tel)\n  96814905(Hp)')]

You can iterate over it, typically with a for loop.
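Here is a minimal sketch of that loop feeding csv.writer, assuming each tuple holds the four captured groups. The one-record list and the in-memory buffer are stand-ins I made up so the snippet runs on its own; swap the buffer for a real file opened with newline=''.

```python
import csv
import io

# Hypothetical stand-in for the re.findall result: (licence, name, address, phones).
list_of_electricians = [
    ('7059184',
     'ABDUL HALIM M',
     'ABDUL HALIM M\n  639 #24-98\n  ROWELL ROAD\n  200639',
     '62971924(Tel)\n  93632009(Hp)'),
]

buffer = io.StringIO()  # stand-in for open(path, 'w', newline='')
writer = csv.writer(buffer, delimiter=';')
for licence, name, address, phones in list_of_electricians:
    writer.writerow([licence, name, address, phones])

# Reading the text back shows that newlines inside quoted fields survive the round trip.
round_trip = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter=';'))
print(round_trip[0][2].splitlines())
```

The writer quotes any field that contains a newline, so the address lines stay available for later processing.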

Hope that helps.

Why not just use csv.reader and avoid the regex altogether?

>>> import csv
>>> from io import StringIO
>>> infile = StringIO(data)  # data holds the sample text from the question
>>> rdr = csv.reader(infile, delimiter="<")
>>> for row in rdr: print(row)

['1.', '7059184', 'ABDUL HALIM M', 'ABDUL HALIM M\n                                  639 #24-98\n                                 ROWELL ROAD\n                        200639', '62971924(Tel)\n                   93632009(Hp)']
[]
['2.', '7055147', 'ABDULLAH SUNNY BIN ALI', 'SINGAPORE MRT LTD\n                                  251\n                                 NORTH BRIDGE ROAD\n                        179102', '65476617(Tel)\n                   96814905(Hp)']
[]
['3.', '7063254', 'ANG CHUI POH', 'AKP INDUSTRIES PTE LTD\n                                  8B #05-08\n                                 ADMIRALTY STREET\n                        757440', '64811528(Tel)\n                   93890779(Hp)']
>>> 
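As a script, the same idea might look like the sketch below. The data string is a shortened copy of the sample, and the records list and its dictionary keys are names I picked for illustration. Since the question wants the address lines kept, each multi-line field is split into a list of stripped lines instead of being flattened.

```python
import csv
from io import StringIO

# Shortened copy of the sample from the question (assumption: same layout as the real file).
data = '''1.<7059184<ABDUL HALIM M<"ABDUL HALIM M
        639 #24-98
        ROWELL ROAD
        200639"<"62971924(Tel)
        93632009(Hp)"

2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD
        251
        NORTH BRIDGE ROAD
        179102"<"65476617(Tel)
        96814905(Hp)"
'''

records = []
for row in csv.reader(StringIO(data), delimiter='<'):
    if not row:  # blank separator lines come back as empty lists
        continue
    index, licence, name, address, phones = row
    records.append({
        'licence': licence,
        'name': name,
        # One entry per original line, with the indentation stripped.
        'address': [line.strip() for line in address.splitlines()],
        'phones': [line.strip() for line in phones.splitlines()],
    })

for record in records:
    print(record)
```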

That regex is more complex than it needs to be. This version uses a simpler regex and keeps the lines under 80 characters long (PEP 8):

list_of_electricians = re.findall(
    r'.*?<(.*?)<(.*?)<"(.*?)"<"(.*?)"', csvFile1.read(), re.S)

The above will still capture the newlines and the runs of spaces. One way to get rid of them is to rebuild the list afterwards:

for i, x in enumerate(list_of_electricians):
    list_of_electricians[i] = [' '.join(y.split()) for y in x]
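To see what the join/split idiom does to a single field, here is a small sketch on a made-up address value with shortened indentation. The second variant is not part of the approach above; it is an alternative in case the line boundaries are still needed later, as the question mentions.

```python
field = 'ABDUL HALIM M\n        639 #24-98\n        ROWELL ROAD\n        200639'

# str.split() with no argument splits on any run of whitespace, newlines included,
# so joining with a single space flattens the field onto one line.
collapsed = ' '.join(field.split())

# Alternative (an assumption about what is wanted): keep one entry per original line.
lines = [line.strip() for line in field.splitlines()]

print(collapsed)
print(lines)
```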

Another way to get rid of them is to use list comprehensions, so that they are eliminated from the very start:

list_of_electricians = [
    [' '.join(x.split()) for x in y]
    for y in
    re.findall(r'.*?<(.*?)<(.*?)<"(.*?)"<"(.*?)"', csvFile1.read(), re.S)]
