
Python regex to findall across multiple lines

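For the past week I have been trying to solve this problem without making any progress. Any help would be greatly appreciated.

I have 1000 files containing text like the following: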

,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,

But some of the files look like this instead:

,,,,,,,,,,,,,,,,
Location:,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,

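I need to extract the uppercase address with a regular expression in Python.

Technically, these are CSV files exported from a very old system. They cannot really be used as CSV, so I chose to extract the string by treating each file as plain text.

My current code looks like this. I have tried many other combinations but have not found a working solution.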

location = re.findall(r'^Location:,,,(.*),,,,,,,,,,,,,\n$|^Location:,,,(.*)[\n.*]{1,2,3,4,5,6},,,,,,,,,,,,,', CSV, flags=re.DOTALL | re.MULTILINE)

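Am I even close? Or is there a better way to approach this?

I would appreciate any help here.

Here is an idea: you can use a simple loop to detect and extract the multi-line location data.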

# Test data
TEXT=""",,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
"""

in_location = False
tmp_location = None

def extract_location(l):
    global in_location
    global tmp_location
    if l.startswith("Location:"):
        in_location = True
        tmp_location = []
        # Special case: the whole address sits on one line
        if l.endswith(',,,,,,,,,,,,,'):
            # 'Location:,,,' is 12 characters long; the trailing run is 13 commas
            print(l[12:-13])
            in_location = False
        else:
            tmp_location.append(l[12:])  # Don't need 'Location:,,,'
    else:
        if in_location:
            tmp_location.append(l)
            if l.endswith(',,,,,,,,,,,,,'):
                # The end of a multi-line address
                in_location = False
                res = " ".join(tmp_location)
                print(res[0:-13])  # Remove trailing commas


def main():
    for line in TEXT.split("\n"):
        extract_location(line)


if __name__ == "__main__":
    main()

Assuming it is saved to a file named concept.py:

$ python3 concept.py
ADDRESS_HERE_THAT I WANT BUT IT CAN ALSO BE ACROSS, MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES, AND IT ENDS AS ABRUPTLY
ADDRESS,IS,IN,ONE,LINE
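
The same loop can also be written without global state, which may be easier to reuse across many files. The following is only a sketch: the names `iter_locations`, `PREFIX` and `SUFFIX` are made up for illustration, and it assumes every address ends with a run of exactly 13 commas, as in the sample data.

```python
from typing import Iterable, Iterator

PREFIX = "Location:,,,"  # assumed marker that opens an address record
SUFFIX = "," * 13        # assumed run of commas that closes an address


def iter_locations(lines: Iterable[str]) -> Iterator[str]:
    """Yield each address, joining multi-line addresses with spaces."""
    parts = None  # None means we are not inside a Location record
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(PREFIX):
            parts = []
            line = line[len(PREFIX):]
        if parts is None:
            continue  # filler line outside any record
        if line.endswith(SUFFIX):
            parts.append(line[:-len(SUFFIX)])
            yield " ".join(parts)
            parts = None
        else:
            parts.append(line)


# Sample lines built to mirror the layout shown in the question.
sample_lines = [
    "," * 16,
    "Location:,,,ADDRESS_HERE_THAT I WANT",
    "BUT IT CAN ALSO BE ACROSS,",
    "MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,",
    "AND IT ENDS AS ABRUPTLY" + "," * 13,
    "," * 16,
    "Location:,,,ADDRESS,IS,IN,ONE,LINE" + "," * 13,
    "," * 16,
]
results = list(iter_locations(sample_lines))
print(results)
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.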

Given the dummy data you provided:

s = ''',,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,'''

You can use the following regular expression:

import re

matches = re.findall(r'Location:((?:[^,]*,){16})', s, flags=re.MULTILINE)

Here is what the matches look like:

>>> print('\n\n'.join(matches))
,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,

,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,
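
The pattern spans lines because `[^,]` also matches newline characters, so each match is simply the next 16 comma-terminated fields after `Location:`. The check below is a small sketch of that behaviour. The `sample` string is rebuilt here so the block runs on its own, and it assumes 13 trailing commas per record, which is consistent with the output above.

```python
import re

# Sample text built to mirror the layout shown in the question.
sample = "\n".join([
    "," * 16,
    "Location:,,,ADDRESS_HERE_THAT I WANT",
    "BUT IT CAN ALSO BE ACROSS,",
    "MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,",
    "AND IT ENDS AS ABRUPTLY" + "," * 13,
    "," * 16,
    "Location:,,,ADDRESS,IS,IN,ONE,LINE" + "," * 13,
    "," * 16,
])

pattern = re.compile(r'Location:((?:[^,]*,){16})')
found = pattern.findall(sample)

# Every match holds exactly 16 commas, however many lines it spans.
comma_counts = [m.count(",") for m in found]
newline_counts = [m.count("\n") for m in found]
print(comma_counts, newline_counts)
```

This also shows the main limitation: the approach relies on every record containing the same number of commas, so an address with more or fewer embedded commas would shift where the match ends.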

What to do next depends on what the commas mean in the original file. For example, you might want to replace them with spaces:

addrs = [match.replace(',', ' ').strip() for match in matches]

Which looks like this:

>>> print('\n\n'.join(addrs))
ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS 
MULTIPLE LINES  BUT NOT A SPECIFIC SET OF LINES 
AND IT ENDS AS ABRUPTLY

ADDRESS IS IN ONE LINE
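
If the number of commas inside an address varies, a pattern closer to the one attempted in the question may be more suitable: anchor on the `Location:,,,` prefix and stop at the run of trailing commas at the end of a line. This is a sketch rather than a definitive solution. `LOCATION_RE` and `find_addresses` are hypothetical names, and it assumes each address is terminated by at least 13 commas, as in the sample data.

```python
import re

# DOTALL lets '.' cross newlines; MULTILINE makes '^' and '$' match per line.
# The lazy '(.*?)' stops at the first line that ends in 13 or more commas.
LOCATION_RE = re.compile(
    r'^Location:,,,(.*?),{13,}$',
    flags=re.DOTALL | re.MULTILINE,
)


def find_addresses(text):
    """Return every address with internal line breaks collapsed to spaces."""
    return [" ".join(m.split()) for m in LOCATION_RE.findall(text)]


# Sample text built to mirror the layout shown in the question.
sample = "\n".join([
    "," * 16,
    "Location:,,,ADDRESS_HERE_THAT I WANT",
    "BUT IT CAN ALSO BE ACROSS,",
    "MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,",
    "AND IT ENDS AS ABRUPTLY" + "," * 13,
    "," * 16,
    "Location:,,,ADDRESS,IS,IN,ONE,LINE" + "," * 13,
    "," * 16,
])
addresses = find_addresses(sample)
print(addresses)
```

Unlike the fixed count of 16 fields, this version keeps the commas that belong to the address and only drops the trailing run.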

