
Extract the rows from the text + python regex

I am trying to extract whole rows from a text file, but it is not working as expected.

Sample text file content:

data = """Add TTFF LEVERERGE 30 mp -5%
Some Text, Some Text
5882950 Abc Lahd
Pos Sequence Batch datax datay dataz dataa datab
1 00061680 904834 20.35 REV 177,650 5329,50
Bundled 2-rev 42al/xyz
Neon Classic Unit 1300 abc \ 1638\48
2 00012815 55244 815 FWD 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/xyz 20 abc/xyz
3 90072815 65944 212 KRT 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/bunt 20 bunt/bal
Some Valid Text
Some More Valid Text Some More Valid Text"""

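I want all three rows, with their specific values, extracted from it as a list.

The logic is:

  1. Stop extracting before the start of a new row
  2. Every row starts with a sequence number (1, 2, 3, ..., 99, etc.)
  3. Treat "Some Valid Text" as the end of the last row

(#3 is not yet handled in the regex in this re.findall attempt, because the first two steps are not working.)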

re.findall(r'(^\d{1,2}\s.*?\n^\d)', data, re.DOTALL|re.M)

['1 00061680 904834 20.35 REV 177,650 5329,50\nBundled 2-rev 42al/xyz\nNeon Classic Unit 1300 abc \\ 1638\x048\n2',
 '3 90072815 65944 212 KRT 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n1']

The expected result is:

['1 00061680 904834 20.35 REV 177,650 5329,50\nBundled 2-rev 42al/xyz\nNeon Classic Unit 1300 abc \\ 1638\x048\n',
'2 00012815 55244 815 FWD 164,720 18448,64\n    UnBundled 2-pag\n    Mathrine Classic straight Tilt 2 xyz / 23,2x23gb\n    150st/xyz 20 abc/xyz',
'3 90072815 65944 212 KRT 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n150st/bunt 20 bunt/bal']

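Any guidance or help on extracting the rows from the text?

If your regex has to "count up" as part of the pattern, I would not use a regex but a parser: regexes are for regular patterns, not for counting (although some people here create regexes I would have thought impossible).

Here is a simple, straightforward non-regex approach. Because you did not provide a non-trivial "STOP HERE" marker, the last item has to be cleaned up. I highly doubt that 'Some Valid Text Some More Valid Text Some More Valid Text' will be part of your real text, so it does not qualify as a stopping condition.

The output also does not contain the terminating '\n' characters; I used them to split the text into lines. If you really need them, you can add them back by joining each part with '\n'.join(part) instead of ' '.join(part):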

data = """Add TTFF LEVERERGE 30 mp -5%
Some Text, Some Text
5882950 Abc Lahd
Pos Sequence Batch datax datay dataz dataa datab
1 00061680 904834 20.35 REV 177,650 5329,50
Bundled 2-rev 42al/xyz
Neon Classic Unit 1300 abc \ 1638\48
2 00012815 55244 815 FWD 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/xyz 20 abc/xyz
3 90072815 65944 212 KRT 164,720 18448,64
UnBundled 2-pag
Mathrine Classic straight Tilt 2 xyz / 23,2x23gb
150st/bunt 20 bunt/bal
Some Valid Text
Some More Valid Text Some More Valid Text"""

rdata = data.split('\n')
skipprows = rdata.index('Pos Sequence Batch datax datay dataz dataa datab')
lines = rdata[skipprows + 1:]

i = 1       # row number to look for (followed by a space) at the start of a line
part = []   # collects parts that belong to one line
result = [] # holds the joined lines from part
for li in lines:
    if li.startswith(f'{i} '):            # look for linenr + space
        if part:                          # do not add empty parts
            result.append(' '.join(part)) # add joined if something in it
        part = [li]                       # start with current li for next parts
        i += 1                            # increase so we look for next one
    else:
        part.append(li)

if part:                                  # add last part if not empty
    result.append(' '.join(part))

print(result)                             # print all

Output:

['1 00061680 904834 20.35 REV 177,650 5329,50 Bundled 2-rev 42al/xyz Neon Classic Unit 1300 abc \\ 1638\x048', 
 '2 00012815 55244 815 FWD 164,720 18448,64 UnBundled 2-pag Mathrine Classic straight Tilt 2 xyz / 23,2x23gb 150st/xyz 20 abc/xyz', 
 '3 90072815 65944 212 KRT 164,720 18448,64 UnBundled 2-pag Mathrine Classic straight Tilt 2 xyz / 23,2x23gb 150st/bunt 20 bunt/bal Some Valid Text Some More Valid Text Some More Valid Text']
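If the newlines and a clean last row are needed, as in the question's expected result, the same loop can be adapted. The following is only a sketch: the helper name `group_rows`, its `stop_marker` and `sep` parameters, and the shortened sample lines are all hypothetical and chosen for illustration. It also assumes the stop marker appears as a complete line on its own.

```python
def group_rows(lines, stop_marker=None, sep="\n"):
    """Group continuation lines under rows numbered 1, 2, 3, ... (hypothetical helper)."""
    expected = 1              # next row number we are waiting for
    part, result = [], []
    for line in lines:
        if stop_marker is not None and line == stop_marker:
            break             # assumed end-of-table marker: ignore everything after it
        if line.startswith(f"{expected} "):
            if part:
                result.append(sep.join(part))
            part = [line]     # start collecting a new row
            expected += 1
        elif part:            # lines before the first row are skipped
            part.append(line)
    if part:
        result.append(sep.join(part))
    return result


# Shortened, made-up sample in the same shape as the question's data
sample = [
    "Pos Sequence Batch",
    "1 00061680 904834",
    "Bundled 2-rev",
    "2 00012815 55244",
    "UnBundled 2-pag",
    "150st/xyz 20 abc/xyz",
    "Some Valid Text",
    "Some More Valid Text",
]
rows = group_rows(sample, stop_marker="Some Valid Text")
print(rows)
```

Passing `sep=" "` would give single-line rows like the output above, just without the trailing text after the marker.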

Caveat: if your lines happen to look like this:

1 Some thing to eat
and some more data of it, containing
2 packs each
2 Some other thing to eat to get more muscles
and even more text containing 
3 things that make you BIGGGER
3 Last text ....

then parsing gets difficult and you will not get the correct data.

Using the re.findall() function with a specific regex pattern:
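To make the caveat concrete, here is a small sketch that runs the same counting loop over those lines. The variable names `tricky` and `grouped` are only for illustration. The loop still produces three items, but the boundaries fall on the continuation lines "2 packs each" and "3 things that make you BIGGGER" rather than on the real row heads.

```python
tricky = """1 Some thing to eat
and some more data of it, containing
2 packs each
2 Some other thing to eat to get more muscles
and even more text containing
3 things that make you BIGGGER
3 Last text ...."""

expected = 1                  # next row number we are waiting for
part, grouped = [], []
for line in tricky.split("\n"):
    if line.startswith(f"{expected} "):
        if part:
            grouped.append(" ".join(part))
        part = [line]
        expected += 1
    else:
        part.append(line)
if part:
    grouped.append(" ".join(part))

# Each item is printed on its own line so the wrong split points are visible
for row in grouped:
    print(repr(row))
```

If real row heads always have a fixed shape (for example, the 8-digit sequence number seen in the sample data), checking for that shape as well would be stricter, but that is an assumption about the data, not something the question states.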

import re

rows = re.findall(r'(^\d{1,2} .+?)(?=\n(?:\d+ |Some Valid Tex))', data, re.DOTALL | re.M)
print(rows)

Output:

['1 00061680 904834 20.35 REV 177,650 5329,50\nBundled 2-rev 42al/xyz\nNeon Classic Unit 1300 abc \\ 1638\x048', '2 00012815 55244 815 FWD 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n150st/xyz 20 abc/xyz', '3 90072815 65944 212 KRT 164,720 18448,64\nUnBundled 2-pag\nMathrine Classic straight Tilt 2 xyz / 23,2x23gb\n150st/bunt 20 bunt/bal']
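The difference from the pattern in the question is the lookahead. The question's pattern ends with `\n^\d`, which consumes the first digit of the next row, so that row can no longer be matched from its start; `(?=...)` only checks what follows without consuming it. A minimal sketch on a made-up sample, where `END` is a hypothetical stop marker standing in for "Some Valid Text":

```python
import re

# Tiny made-up sample: three numbered rows, then a stop marker
sample = "1 aaa\nmore a\n2 bbb\nmore b\n3 ccc\nEND"

# The question's pattern: the trailing \n^\d swallows the next row's first digit
consuming = re.findall(r'(^\d{1,2}\s.*?\n^\d)', sample, re.DOTALL | re.M)

# Lookahead variant: the next row head (or the marker) is checked but not consumed
lookahead = re.findall(r'(^\d{1,2} .+?)(?=\n(?:\d+ |END))', sample, re.DOTALL | re.M)

print(consuming)   # row 2 is lost and the match carries a stray trailing digit
print(lookahead)   # one entry per row
```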

