Python regex to findall across multiple lines
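I have been trying to solve this for the past week without making any progress. Any help would be greatly appreciated.
I have 1000 files containing text like the following: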
,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
But some files look like this instead:
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
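I need to extract the uppercase address using a regular expression in Python.
Technically, it is a CSV file exported from a very old system. It cannot actually be used as a CSV, so I chose to extract the string by treating it as a plain text file.
My current code is below. I have tried many other combinations but have not found a working solution.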
location = re.findall(r'^Location:,,,(.*),,,,,,,,,,,,,\n$|^Location:,,,(.*)[\n.*]{1,2,3,4,5,6},,,,,,,,,,,,,', CSV, flags=re.DOTALL | re.MULTILINE)
Am I even close? Or is there a better way to approach this?
I appreciate any help here.
Here is an idea: you can use a simple loop to detect and extract the multi-line location data.
# Test data
TEXT=""",,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
"""
PREFIX = "Location:,,,"  # marks the first line of an address (12 characters)
SUFFIX = "," * 13        # 13 trailing commas mark the last line of an address

in_location = False
tmp_location = None

def extract_location(l):
    global in_location
    global tmp_location
    if l.startswith(PREFIX):
        in_location = True
        tmp_location = []
        # special case: the address starts and ends on the same line
        if l.endswith(SUFFIX):
            print(l[len(PREFIX):-len(SUFFIX)])
            in_location = False
        else:
            tmp_location.append(l[len(PREFIX):])  # Don't need 'Location:,,,'
    elif in_location:
        tmp_location.append(l)
        if l.endswith(SUFFIX):
            # The end
            in_location = False
            res = " ".join(tmp_location)
            print(res[:-len(SUFFIX)])  # Remove trailing commas

def main():
    for line in TEXT.split("\n"):
        extract_location(line)

if __name__ == "__main__":
    main()

Assuming it is saved to a file named concept.py:
$ python3 concept.py
ADDRESS_HERE_THAT I WANT BUT IT CAN ALSO BE ACROSS, MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES, AND IT ENDS AS ABRUPTLY
ADDRESS,IS,IN,ONE,LINE
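The same loop can also be written without module-level state, which makes it easier to reuse across many files. The following is only a minimal sketch: it assumes the `Location:,,,` prefix and the 13-comma terminator seen in the sample data, and `iter_locations` is a hypothetical helper name, not something from the answer above.

```python
PREFIX = "Location:,,,"   # assumed start marker, taken from the sample data
SUFFIX = "," * 13         # assumed end marker: 13 trailing commas


def iter_locations(lines):
    """Yield each address found in an iterable of lines, joined with spaces."""
    parts = None  # None means we are not inside a Location block
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(PREFIX):
            parts = []                    # a new address starts here
            line = line[len(PREFIX):]     # drop 'Location:,,,'
        if parts is None:
            continue                      # filler row outside any address
        if line.endswith(SUFFIX):
            parts.append(line[:-len(SUFFIX)])  # drop the trailing commas
            yield " ".join(parts)
            parts = None
        else:
            parts.append(line)


sample = [
    "," * 16,
    "Location:,,,FIRST LINE",
    "SECOND LINE,",
    "LAST LINE" + SUFFIX,
    "Location:,,,ONE,LINE" + SUFFIX,
]
print(list(iter_locations(sample)))
# ['FIRST LINE SECOND LINE, LAST LINE', 'ONE,LINE']
```

Because the function accepts any iterable of lines, an open file object could probably be passed in directly, for example `iter_locations(open(name))`.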
Given the dummy data you provided:
s = ''',,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,'''
you can use the following regular expression:
import re
matches = re.findall(r'Location:((?:[^,]*,){16})', s, flags=re.MULTILINE)
This is what the matches look like:
>>> print('\n\n'.join(matches))
,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,
,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,
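As far as I can tell, the pattern captures everything up to the 16th comma after `Location:`, and it crosses line breaks because the negated class `[^,]` also matches `\n`. In the sample, 16 appears to be enough because the 3 leading commas plus the 13 trailing ones add up to 16, so commas inside the address only reduce how many trailing commas are captured. An address with more than 13 commas of its own would presumably be cut short. Below is a minimal sketch of the same pattern written with `re.VERBOSE` and run on a shortened, made-up record; `LOCATION_RE` and `record` are illustrative names.

```python
import re

# Same pattern as above, spelled out with comments.
LOCATION_RE = re.compile(
    r"""
    Location:          # literal label that starts a record
    (                  # capture group returned by findall
      (?:[^,]*,){16}   # 16 runs of non-comma text, each followed by a comma
    )
    """,
    re.VERBOSE,
)

# Made-up record: 3 leading commas, 1 comma inside the address, 13 trailing.
record = "Location:,,,A B\nC,\nD" + "," * 13 + "\n" + "," * 16
matches = LOCATION_RE.findall(record)

print(matches[0].count(","))        # 16
print(repr(matches[0].strip(",")))  # 'A B\nC,\nD'
```

Stripping the leading and trailing commas with `strip(",")` keeps the commas that sit inside the address, which may or may not be what is wanted.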
What to do next depends on what the commas mean in the original file. For example, you might want to replace them with spaces:
addrs = [match.replace(',', ' ').strip() for match in matches]
which looks like this:
>>> print('\n\n'.join(addrs))
ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS
MULTIPLE LINES BUT NOT A SPECIFIC SET OF LINES
AND IT ENDS AS ABRUPTLY
ADDRESS IS IN ONE LINE
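An alternative that stays closer to the pattern attempted in the question is to anchor on the run of trailing commas instead of counting commas. This is a sketch under assumptions, not a tested solution for the real files: it assumes every address ends with at least 13 commas at the end of a line, that no line inside an address ends that way, and that the files use a `.csv` extension. `extract_addresses` and `extract_from_dir` are hypothetical helper names.

```python
import re
from pathlib import Path

# Lazy match from 'Location:,,,' up to the first run of 13+ commas that
# ends a line. DOTALL lets '.' cross newlines; MULTILINE makes ^ and $
# match at line boundaries.
LOCATION_RE = re.compile(
    r"^Location:,,,(.*?),{13,}$",
    re.DOTALL | re.MULTILINE,
)


def extract_addresses(text):
    """Return every address in `text`, with line breaks turned into spaces."""
    return [" ".join(m.splitlines()) for m in LOCATION_RE.findall(text)]


def extract_from_dir(folder, pattern="*.csv"):
    """Map each matching file name in `folder` to its list of addresses."""
    return {
        path.name: extract_addresses(path.read_text())
        for path in sorted(Path(folder).glob(pattern))
    }


sample = "\n".join([
    "," * 16,
    "Location:,,,FIRST LINE",
    "SECOND LINE,",
    "LAST LINE" + "," * 13,
    "Location:,,,ONE,LINE" + "," * 13,
    "," * 16,
])
print(extract_addresses(sample))
# ['FIRST LINE SECOND LINE, LAST LINE', 'ONE,LINE']
```

If a record is missing its terminator, the lazy match would run on into the next record, so spot-checking the results on a few files seems worthwhile. An export from an old system may also need an explicit `encoding` argument to `read_text`.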