Removing text from a text file

Question

I have a text file that contains URLs that I need to make shortcuts from. The file also has other information that I don't need. For example:

event number - xyz

More text here
And here

ALL https://.....

Atendees URLs 

1 -tab- https://.....
2 -tab- https://...
etc.

Right now I remove the extra text and empty lines and keep only the lines with the URLs (plus the \n and \t characters). I then put the URLs into a list with this Python code:

def fileOpen(self):
    self.skytap = []
    with open(self.file_1, 'r') as f:
        for line in f:
            self.skytap.append(line.strip('\t\r\n'))

I would like to know if there is a way in Python to remove all the text, numbering, etc. and keep only the https://........ URLs, in the order in which they appear in the file, and put them in the list so I can make the shortcuts (I have the shortcut creation solved). I've looked at some of the questions online, and some people suggested sed as a better tool for this. Is that the case? I am new to programming and appreciate any insight on this.

Answer 1

You tagged this with sed, but the tool you are looking for is grep:

grep -o 'https\?://[^ ]\+' file.txt

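With -o, grep prints only the matching part of each line: http or https, then ://, then the run of non-space characters that follows. (\? and \+ are GNU extensions to basic regular expressions.)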
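Since the question asks for a Python list, here is a minimal sketch of a rough Python equivalent of the grep one-liner. The function name extract_urls and the example.com URLs are made up for illustration. The pattern uses \S+, which stops at any whitespace, whereas grep's [^ ] stops only at spaces and at the end of the line.

```python
import re

# http or https, then ://, then everything up to the next whitespace character
URL_RE = re.compile(r"https?://\S+")


def extract_urls(text):
    """Return every http(s) URL found in text, in order of appearance."""
    return URL_RE.findall(text)


# Hypothetical sample shaped like the file described in the question
sample = (
    "More text here\n"
    "\n"
    "ALL https://example.com/all\n"
    "\n"
    "1\thttps://example.com/a\n"
    "2\thttps://example.com/b\n"
    "etc.\n"
)

print(extract_urls(sample))
```

To use it on a real file, read the file's contents with open(...).read() and pass the resulting string to extract_urls.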

Answer 2

You can change the file in place with fileinput.input, using re to find the lines that contain an https:// URL:

import fileinput
import re

r = re.compile(r"https://.*")
urls = []
# With inplace=True, anything printed inside the loop is written back to the file
for line in fileinput.input("match.txt", inplace=True):
    s = r.search(line)
    if s:
        print(line, end="")
        urls.append(s.group())

If you are using Python 2, add from __future__ import print_function at the top of your code.

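If you also want to remove everything before the https in the file, replace the print call with print(s.group()). The match does not include the line's trailing newline, so let print add one; otherwise all the URLs end up on a single line.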

Or, as @Jon kindly pointed out, import sys and use sys.stdout.write:

import fileinput
import re
import sys

r = re.compile(r"https://.*")
urls = []
for line in fileinput.input("match.txt", inplace=True):
    s = r.search(line)
    if s:
        sys.stdout.write(line)
        urls.append(s.group())
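Because an in-place edit overwrites the original file, it can help to keep a copy. The following is a sketch only, using fileinput's backup parameter; the function name keep_only_urls is hypothetical, and the pattern is narrowed to \S+ so that trailing whitespace is not captured.

```python
import fileinput
import re

URL_RE = re.compile(r"https://\S+")


def keep_only_urls(path):
    """Rewrite the file at path with one URL per line; return the URLs in order."""
    urls = []
    # backup=".bak" keeps the untouched original next to the file as path + ".bak"
    with fileinput.input(path, inplace=True, backup=".bak") as lines:
        for line in lines:
            match = URL_RE.search(line)
            if match:
                # stdout is redirected into the file while inplace=True is active
                print(match.group())
                urls.append(match.group())
    return urls
```

Calling keep_only_urls("match.txt") would leave match.txt with just the URLs and match.txt.bak with the original contents.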

Answer 3

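Maybe I can help you. How about searching for this regex: https?://[A-Za-z0-9-._~:/?#[\]@!$&\',*+,;=]* It matches http:// or https:// followed by a run of characters that are allowed in URLs. Note that the class as written does not include % or parentheses, so extend it if your URLs contain them.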

In fact, here's a program which should do the job (Untested):

import re

string="""
More text here
And here

ALL https://.....

Atendees URLs 

1 -tab- https://.....
2 -tab- https://...
etc.
"""

# findall returns every match, in the order it appears in the text
links = re.compile(r'https?://[A-Za-z0-9-._~:/?#[\]@!$&\',*+,;=]*').findall(string)
print(links)
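To connect this back to the question, here is a sketch of how the same pattern might slot into the asker's fileOpen method. The class name ShortcutMaker and its constructor are assumptions, since the question only shows the method; file_1 and skytap are the attribute names from the question.

```python
import re

# Same character class as above, written in a double-quoted raw string
# so the apostrophe needs no backslash
URL_RE = re.compile(r"https?://[A-Za-z0-9-._~:/?#[\]@!$&'*+,;=]*")


class ShortcutMaker:  # hypothetical class; only fileOpen appears in the question
    def __init__(self, file_1):
        self.file_1 = file_1
        self.skytap = []

    def fileOpen(self):
        """Fill self.skytap with the URLs found in self.file_1, in file order."""
        self.skytap = []
        with open(self.file_1, "r") as f:
            for line in f:
                # extend keeps file order; lines without a URL add nothing
                self.skytap.extend(URL_RE.findall(line))
        return self.skytap
```

With this approach the file no longer needs to be cleaned up first, because lines without a URL simply contribute nothing to the list.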

3 answers

solution1
1 ACCEPTED 2015-05-18 22:45:53

solution2
1 2015-05-18 22:48:29

solution3
0 2015-05-18 22:51:45
