I have a text file containing URLs that I need to make shortcuts from. The file also has other information that I don't need. For example:
event number - xyz
More text here
And here
ALL https://.....
Atendees URLs
1 -tab- https://.....
2 -tab- https://...
etc.
Right now I remove the extra text and empty lines and keep only the lines with the URLs (plus the \n and \t characters). I then put the URLs into a list with this Python code:
def fileOpen(self):
    self.skytap = []
    with open(self.file_1, 'r') as f:
        for line in f:
            self.skytap.append(line.strip('\t\r\n'))
I would like to know if there is a way in Python to remove all the text, numbering, etc. and keep only the https://........ URLs, in the order in which they appear in the file, and of course put them in the list so I can make the shortcuts (I have the making of the shortcuts solved). I've looked at some of the questions online, and some people suggested sed as a better tool for this. Would that be the case? I am new to programming and appreciate any insight on this.
You tagged this with sed, but the tool you are looking for is grep:
grep -o 'https\?://[^ ]\+' file.txt
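The -o option prints only the matched part of each line: http or https, then ://, then every following non-space character. Note that \? and \+ are GNU grep extensions.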
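Since the question asks for the URLs in a Python list, here is a rough Python counterpart of that grep command. It is only a sketch: the function name extract_urls and the example.com URLs are made up for illustration, and \S+ stops at any whitespace, which is slightly stricter than grep's [^ ].

```python
import re

# \S+ stops at any whitespace, slightly stricter than grep's [^ ] (space only).
URL_RE = re.compile(r"https?://\S+")


def extract_urls(lines):
    """Return every http(s) URL found in the given lines, in their original order."""
    urls = []
    for line in lines:
        urls.extend(URL_RE.findall(line))
    return urls


# Hypothetical sample data shaped like the file in the question.
sample = [
    "event number - xyz\n",
    "ALL https://example.com/all\n",
    "Attendees URLs\n",
    "1\thttps://example.com/a\n",
    "2\thttps://example.com/b\n",
]
print(extract_urls(sample))
```

Because a file object yields lines when iterated, the same function should also work when passed an open file, for example inside a with open(...) as f: block.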
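You can change the file in place using fileinput.input, finding the lines that contain https:// with re. With inplace=True, anything printed inside the loop is written back to the file instead of the console: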
import fileinput
import re
r = re.compile(r"https://.*")
urls = []
for line in fileinput.input("match.txt", inplace=True):
    s = r.search(line)
    if s:
        print(line, end="")
        urls.append(s.group())
If you are using Python 2, add from __future__ import print_function at the top of your code.
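If you also want to remove the text before https in the file, replace print(line,end="") with print(s.group()). The match does not include the line's trailing newline (. does not match \n by default), so print has to add it back.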
Or, as @Jon kindly pointed out, import sys and use sys.stdout.write:
import fileinput
import re
import sys
r = re.compile(r"https://.*")
urls = []
for line in fileinput.input("match.txt", inplace=True):
    s = r.search(line)
    if s:
        sys.stdout.write(line)
        urls.append(s.group())
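If the original file should stay untouched, the same regex can be used with a plain open() instead of fileinput. The following is a minimal sketch under that assumption: read_urls is a hypothetical helper name, and the temporary file with example.com URLs only stands in for the real input file.

```python
import os
import re
import tempfile

URL_RE = re.compile(r"https://.*")


def read_urls(path):
    """Collect the https:// part of each matching line, in file order.

    The file is only read, never rewritten.
    """
    urls = []
    with open(path, "r") as f:
        for line in f:
            match = URL_RE.search(line)
            if match:
                # rstrip() drops any trailing spaces or stray carriage returns.
                urls.append(match.group().rstrip())
    return urls


# Demo on a throwaway file (hypothetical content shaped like the question's file).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("event number - xyz\nALL https://example.com/all\n\n1\thttps://example.com/a\n")
    demo_path = tmp.name

demo_urls = read_urls(demo_path)
os.remove(demo_path)
print(demo_urls)
```

In the class from the question, the body of fileOpen could presumably call such a helper and assign the result to self.skytap.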
Maybe I can help you. How about searching for this regex: https?://[A-Za-z0-9-._~:/?#[\]@!$&\',*+,;=]*
It matches http:// or https:// followed by a run of characters that are allowed in a URL.
In fact, here's a program that should do the job (untested):
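import re
# Sample text standing in for the contents of the file
string = """
More text here
And here
ALL https://.....
Atendees URLs
1 -tab- https://.....
2 -tab- https://...
etc.
"""
# findall returns every match as a list, in the order they appear in the text
links = re.compile(r'https?://[A-Za-z0-9-._~:/?#[\]@!$&\',*+,;=]*').findall(string)
print(links)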
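The difference between this character-class regex and the simpler https://.* shows up when a line has text after the URL. The sketch below compares the two on a made-up line. The names GREEDY and STRICT are mine, and STRICT is a lightly adjusted copy of the class above (hyphen and brackets escaped, % added), so treat that adjustment as an assumption rather than the answer's exact pattern.

```python
import re

# Matches from https:// to the end of the line, whatever follows.
GREEDY = re.compile(r"https://.*")

# Stops at the first character that is not in the URL character class.
# Adjusted from the answer's class: '-', '[' and ']' escaped, '%' added (assumption).
STRICT = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'*+,;=%]*")

# Hypothetical line with trailing text after the URL.
line = "2\thttps://example.com/b?id=7 (backup link)"

greedy_match = GREEDY.search(line).group()
strict_match = STRICT.search(line).group()

print(greedy_match)  # includes the trailing " (backup link)"
print(strict_match)  # ends at the space after the URL
```

For a file like the one in the question, where each URL is the last thing on its line, both patterns should give the same result.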