I have a text file with the following content:
0:00 txt txt e-mail1_to_extract txt_to_extract1 txt txt /data
0:00 txt txt e-mail2_to_extract txt_to_extract2 txt txt /data
0:00 txt txt txt e-mail3_to_extract txt_to_extract3 txt txt /var
0:00 txt txt txt txt e-mail4_to_extract txt_to_extract4 txt txt /var
0:00 txt txt e-mail5_to_extract txt_to_extract5 txt txt /data
First, I'd like to extract everything on each line between "0:00" and "/data" or "/var". Second, I'd like to process that data so that I keep only two parts of it. The text within the extracted range is not standardized, so I can't use something like startswith/endswith. However, the text I want is a single unbroken string (like one word), and it always comes right after the email. Is there any way to locate that part and extract the email plus the next string?
Txt = extra text that I don't want to extract.
I've already tried to start with the code below but didn't get any results:
with open('content.txt') as infile, open('extraction.txt', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "0:00":
            copy = True
            continue
        elif line.strip() == "/":
            copy = False
            continue
        elif copy:
            outfile.write(line)
Desired output:
e-mail1_to_extract txt_to_extract1
e-mail2_to_extract txt_to_extract2
e-mail3_to_extract txt_to_extract3
e-mail4_to_extract txt_to_extract4
e-mail5_to_extract txt_to_extract5
Thank you!
I used a sample file in the format you provided -
0:00 txt txt123 abc@abs.com txt_to_extract1 txt6456 txtssss /data
0:00 txt11 txt111 abd@rtx.vg txt_to_extract2 txtssss txtffff /data
0:00 txt111 txt123 txt tyrr@rgahb.com txt_to_extract3 txtosvbsvs txtkkkk /var
0:00 txt456 txt3663 srsr31415s@gagha.gha txt e-mail4_to_extract txt_to_extract4 txabjahsjat txtasba /var
0:00 txtGJK txtfggg gfa456vaj@aghaha.com txt_to_extract5 txtbxajla txtzbaza /data
I used the following code (the check function decides whether a token is an email; please change the regex to suit your data) -
import re

# Simple email pattern; adjust it to match the addresses in your data.
regex = r'^[a-z0-9]+[._]?[a-z0-9]+@\w+\.\w{2,3}$'

def check(email):
    """Return True if the token looks like an email address."""
    return re.search(regex, email) is not None

with open('TestData.txt') as infile, open('extraction.txt', 'w') as outfile:
    for line in infile:
        ls = line.split()
        # Stop one token early so that ls[i + 1] always exists.
        for i in range(len(ls) - 1):
            if check(ls[i]):
                # Write the email together with the token that follows it.
                outfile.write(ls[i] + " " + ls[i + 1] + "\n")
I get the following output -
abc@abs.com txt_to_extract1
abd@rtx.vg txt_to_extract2
tyrr@rgahb.com txt_to_extract3
srsr31415s@gagha.gha txt
gfa456vaj@aghaha.com txt_to_extract5
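The code above does not check the "0:00" and "/data" or "/var" markers mentioned in the question. If that check matters, one possible variation is sketched below. The names extract_pair, EMAIL_RE and END_MARKERS are made up for this example, and the email pattern is only an assumption (any token shaped like something@something.tld), so adjust it to your data.

```python
import re

# Assumption: an email is any single token shaped like something@something.tld.
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
END_MARKERS = {"/data", "/var"}

def extract_pair(line):
    """Return 'email next_token' for a matching line, or None otherwise."""
    tokens = line.split()
    # Keep only lines that start with 0:00 and end with /data or /var.
    if len(tokens) < 3 or tokens[0] != "0:00" or tokens[-1] not in END_MARKERS:
        return None
    # Look only at the tokens between the two markers.
    inner = tokens[1:-1]
    # Pair each token with the one after it, so a following token always exists.
    for email, following in zip(inner, inner[1:]):
        if EMAIL_RE.match(email):
            return f"{email} {following}"
    return None

if __name__ == "__main__":
    sample = [
        "0:00 txt txt123 abc@abs.com txt_to_extract1 txt6456 txtssss /data",
        "0:00 txt111 txt123 txt tyrr@rgahb.com txt_to_extract3 txtosvbsvs txtkkkk /var",
        "1:23 this line has the wrong start marker a@b.com x /data",
    ]
    for line in sample:
        pair = extract_pair(line)
        if pair is not None:
            print(pair)
```

Lines that lack either marker, or that have no email followed by another token before the end marker, are skipped instead of raising an error.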