How to extract lines in text file and find duplicates
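I have a text file with many lines, and the word "@TestRun" occurs in it many times. Treat one "@TestRun" as the starting point and the next "@TestRun" as the end point; the lines between two "@TestRun" markers form one section of the text, and there can be more than 3-4 sections. My question is how to extract the lines of each section and find the duplicate lines within those sections.
My text file looks like this: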
@TestRun
And user validate message on screen "Switch to paperless"
And user click on "Manage accounts" label
And user click link with label "View all online services"
And user waits for 10 seconds
Then page is successfully launched
And user click link with label "Go paperless for complete convenience"
Then page is successfully launched
And user validate message on screen "#EmailAddress"
And user clicks on the button "Confirm"
Then page is successfully launched
And user validate message on screen "#MessageValidate"
Then page is successfully launched
And user click on "menu open user preferences" label
And user clicks on the link "Statement and letter preferences"
Then page is successfully launched
And user validate "Switch to paperless" button is disabled
And user validate message on screen "Online only"
When user click on "Log out" label
Then page is successfully launched
@TestRun
And user click on link "Mobile site"
And user set text "#Surname" on textbox name "surname"
Then page is successfully launched
And user click on link "#Account"
Then page is successfully launched
And user verify message on screen "#Account"
And user verify message on screen "Manage statements"
And user verify message on screen "Step 1 of 3"
Then page is successfully launched
And user verify message on screen "Current format type"
And user verify message on screen "Online"
When user selects the radio button "Paper"
@TestRun
Then user wait for page load
And user click on button "Continue to Online Banking"
Then user wait for page load
And user click on "menu open user preferences" label
And user clicks on the link "Statement and letter preferences"
Then page is successfully launched
And page is successfully launched
And user waits for 10 seconds
@TestRun
Then page is successfully launched
And user waits for 10 seconds
And user click checkbox "Telephone"
And user click checkbox "Post"
And user clicks on the button "Save"
Then page is successfully launched
I tried the following code, but it does not work:
with open('CustPref.txt') as input_data:
    for line in input_data:
        if line.strip() == '@TestRun ':
            break
    for line in input_data:
        if line.strip() == '@TestRun ':
            break
        print line
I get output, but it is completely wrong: I only get a single, unexpected line. How can I fix this?
You have 2 problems to solve:
Splitting:
First option
Parse the file line by line:
parts = []        # all lines between 2 @TestRun's
chunks = []       # all chunks of lines between 2 @TestRun's
startNow = False  # wait till first @TestRun before keeping anything
for line in Text():  # see definition for Text() below - it mimics your open('...')
    if line.strip() == '@TestRun':
        startNow = True
        if len(parts) > 0:  # found a TestRun, if parts contains lines append to chunks
            chunks.append(parts)
            parts = []
    elif startNow:  # first TestRun already hit, so append line to parts
        parts.append(line)
if parts:  # keep the lines after the last @TestRun as well
    chunks.append(parts)
print(chunks)  # done -> list of lists of lines, one list per chunk
Second option
Do not split the text by lines; read it in as one complete text and split it with list comprehensions:
biggerChunks = [x.strip() for x in TextTT().split("@TestRun") ]
chunkified = [x.splitlines() for x in biggerChunks if len(x.strip()) > 0 ]
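You first split on @TestRun to get a list of large text blocks, then split each block into lines. The result is roughly the same: a list of lists, each holding all lines between 2 @TestRun's.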
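Removing duplicates (while preserving order)
That has been answered here: "How do you remove duplicates from a list whilst preserving order?" It is already on SO, so it is not repeated here :)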
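For readers without the link at hand, here is a minimal sketch of one common way to do it, assuming Python 3.7+ where dict keeps insertion order. The helper name dedupe_keep_order and the sample_chunks data are made up for illustration; they only mimic the shape of the chunks produced by the splitting step.

```python
def dedupe_keep_order(lines):
    """Return lines without later repeats; first occurrences keep their order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(lines))


# hypothetical chunks, shaped like the result of the splitting step
sample_chunks = [
    ["Then page is successfully launched",
     "And user waits for 10 seconds",
     "Then page is successfully launched"],
    ['And user click checkbox "Post"',
     'And user click checkbox "Post"'],
]

deduped = [dedupe_keep_order(chunk) for chunk in sample_chunks]
print(deduped)
```

Duplicates are removed per chunk here, so a line that appears in two different chunks is kept in both.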
Helpers: Text() stands in for the opened file, TextTT() is the whole text block:
def Text():  # instead of file open, returns list of lines
    return TextTT().splitlines()

def TextTT():  # unsplit text
    return '''
@TestRun
And user validate message on screen "Switch to paperless"
And user click on "Manage accounts" label
And user click link with label "View all online services"
And user waits for 10 seconds
Then page is successfully launched
And user click link with label "Go paperless for complete convenience"
Then page is successfully launched
And user validate message on screen "#EmailAddress"
And user clicks on the button "Confirm"
Then page is successfully launched
And user validate message on screen "#MessageValidate"
Then page is successfully launched
And user click on "menu open user preferences" label
And user clicks on the link "Statement and letter preferences"
Then page is successfully launched
And user validate "Switch to paperless" button is disabled
And user validate message on screen "Online only"
When user click on "Log out" label
Then page is successfully launched
@TestRun
And user click on link "Mobile site"
And user set text "#Surname" on textbox name "surname"
Then page is successfully launched
And user click on link "#Account"
Then page is successfully launched
And user verify message on screen "#Account"
And user verify message on screen "Manage statements"
And user verify message on screen "Step 1 of 3"
Then page is successfully launched
And user verify message on screen "Current format type"
And user verify message on screen "Online"
When user selects the radio button "Paper"
@TestRun
Then user wait for page load
And user click on button "Continue to Online Banking"
Then user wait for page load
And user click on "menu open user preferences" label
And user clicks on the link "Statement and letter preferences"
Then page is successfully launched
And page is successfully launched
And user waits for 10 seconds
@TestRun
Then page is successfully launched
And user waits for 10 seconds
And user click checkbox "Telephone"
And user click checkbox "Post"
And user clicks on the button "Save"
Then page is successfully launched
'''
See the comments for explanations. If needed, you can recombine the inner lists with e.g. itertools.chain.
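As a small sketch of that recombination step: itertools.chain.from_iterable flattens a list of lists back into one flat list. The chunks value below is a hypothetical stand-in for the result of either splitting option.

```python
import itertools

# hypothetical chunks, standing in for the output of the splitting step
chunks = [
    ["And user click checkbox \"Telephone\"", "And user click checkbox \"Post\""],
    ["Then user wait for page load"],
]

# chain.from_iterable walks each inner list in turn
flat = list(itertools.chain.from_iterable(chunks))
print(flat)
```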
Using the third-party library more_itertools, we can split the text before each desired target.
Update: we can use itertools.dropwhile to drop the lines before the first target.
import itertools as it
import more_itertools as mit

with open("CustPref.txt", "r") as f:
    lines = f.readlines()

pred = lambda x: x.startswith("@TestRun")  # trailing-space protection
inv_pred = lambda x: not pred(x)
lines = it.dropwhile(inv_pred, lines)  # optional
chunks = list(mit.split_before(lines, pred))
print(chunks)
Output (abbreviated)
[['@TestRun\n',
' And user validate message on screen "Switch to paperless" \n',
...],
['@TestRun \n',
' And user click on link "Mobile site" \n',
...],
['@TestRun\n',
'Then user wait for page load\n',
...],
...]
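A simple approach is to remember the lines you have already seen. You could collect them in a list, but using a dictionary or a set is more efficient.
Read one line at a time. If the line is not a new TestRun header and has already been seen, don't print it. If it is a TestRun header, forget what you have seen. Print everything that reaches the end of the loop body. Start over with the next line.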
with open('CustPref.txt') as input_data:
    seen = set()
    for line in input_data:
        # trim trailing newline
        line = line.rstrip('\n')
        if line.strip() == '@TestRun':  # strip() tolerates a stray trailing space
            seen = set()  # who am I? what day is it? (forget everything seen so far)
        elif line in seen:
            # skip the rest of the loop body and move on to the next line
            continue
        else:
            seen.add(line)
        print(line)
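Programmatically, it makes sense to check in this order: "is it a @TestRun, has it already been seen, otherwise add it as seen", so that you do not have to check twice whether it is a @TestRun. In the explanation above I wanted to keep the more natural order to make it simpler.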
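The question also asks to find the repeated lines, not only to drop them. A minimal sketch of that, assuming the lines of one section are already collected in a list: collections.Counter counts each line, and anything counted more than once is a duplicate. The name find_duplicates and the sample_chunk data are made up for illustration.

```python
from collections import Counter


def find_duplicates(lines):
    """Map each line that occurs more than once in the section to its count."""
    # strip() so that trailing spaces or newlines do not hide a duplicate
    counts = Counter(line.strip() for line in lines)
    return {line: n for line, n in counts.items() if n > 1}


# hypothetical section, shaped like the lines between two @TestRun markers
sample_chunk = [
    "Then page is successfully launched\n",
    "And user waits for 10 seconds\n",
    "Then page is successfully launched\n",
    'And user clicks on the button "Save"\n',
    "Then page is successfully launched \n",
]

duplicates = find_duplicates(sample_chunk)
print(duplicates)
```

Calling find_duplicates once per section keeps the counts separate, which matches resetting the seen set at every @TestRun header.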