
Removing text from a text file

I have a text file containing URLs that I need to turn into shortcuts. The file also has other information that I don't need. For example:

event number - xyz

More text here
And here

ALL https://.....

Atendees URLs 

1 -tab- https://.....
2 -tab- https://...
etc.

Right now I remove the extra text and empty lines and keep only the lines with the URLs (plus the \n and \t characters). I then use this Python code to put the URLs into a list:

def fileOpen(self):
    self.skytap = []
    with open(self.file_1, 'r') as f:
        for line in f:
            self.skytap.append(line.strip('\t\r\n'))

I would like to know if there is a way in Python to remove all the text, numbering, etc. and keep only the https://........ URLs, in the order in which they appear in the file, and put them in the list so I can make the shortcuts (I already have the shortcut creation solved). I've looked at some of the questions online, and some people suggested sed as a better tool for this. Is that the case? I am new to programming and appreciate any insight on this.

You tagged this with sed, but the tool you are looking for is grep:

grep -o 'https\?://[^ ]\+' file.txt

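With -o, grep prints only the matched text: http, an optional s, ://, and then the non-space characters that follow. Note that \? and \+ are GNU grep extensions to basic regular expressions.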
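If you would rather stay in Python, roughly the same extraction can be sketched with the re module. This is only a minimal sketch: the name extract_urls and the example.com addresses are made up for illustration, and \S+ stops at any whitespace, whereas the grep pattern above stops only at a space.

```python
import re

# Same idea as the grep pattern: "http", an optional "s", "://",
# then a run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(lines):
    """Return every URL found in an iterable of lines, in order of appearance."""
    urls = []
    for line in lines:
        urls.extend(URL_PATTERN.findall(line))
    return urls


# Hypothetical lines shaped like the file in the question.
sample = [
    "event number - xyz\n",
    "\n",
    "ALL https://example.com/all\n",
    "1\thttps://example.com/a\n",
    "2\thttps://example.com/b\n",
]
print(extract_urls(sample))
```

Because an open file is also an iterable of lines, the same function should work if you pass it the file object from your with open(...) block instead of the sample list.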

You can change the file in place with fileinput.input, using re to find the lines that contain https://:

import fileinput
import re

r = re.compile(r"https://.*")
urls = []
# With inplace=True, anything printed inside the loop is written back to the file.
for line in fileinput.input("match.txt", inplace=True):
    s = r.search(line)
    if s:
        print(line, end="")
        urls.append(s.group())

If you are using Python 2, add from __future__ import print_function at the top of your code.

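If you also want to remove the text before the https in the file, replace print(line,end="") with print(s.group()). The match does not include the trailing newline, so keeping end="" here would run all the URLs together on one line.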

Or, as @Jon kindly pointed out, import sys and use sys.stdout.write:

import fileinput
import re
import sys

r = re.compile(r"https://.*")
urls = []
for line in fileinput.input("match.txt", inplace=True):
    s = r.search(line)
    if s:
        sys.stdout.write(line)
        urls.append(s.group())
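To see the in-place approach end to end without touching a real file, here is a small sketch that runs it on a throwaway temporary file and writes only the URL part back. The function name keep_only_urls, the temp file, and the example.com addresses are assumptions for the demo, not part of the code above.

```python
import fileinput
import os
import re
import tempfile


def keep_only_urls(path):
    """Rewrite the file at `path` so each line holds just a URL; return the URLs."""
    pattern = re.compile(r"https://.*")
    urls = []
    with fileinput.input(path, inplace=True) as lines:
        for line in lines:
            match = pattern.search(line)
            if match:
                # Inside this loop print() writes to the file, not the console.
                # The match has no trailing newline, so let print() add one.
                print(match.group())
                urls.append(match.group())
    return urls


# Build a throwaway file shaped like the one in the question (hypothetical URLs).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("More text here\n\nALL https://example.com/all\n1\thttps://example.com/a\n")
    demo_path = tmp.name

found = keep_only_urls(demo_path)
with open(demo_path) as f:
    rewritten = f.read()
os.remove(demo_path)

print(found)
print(rewritten, end="")
```

Since inplace=True overwrites the original, it may be worth trying this on a copy first, or passing backup=".bak" to fileinput.input so the old contents are kept.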

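Maybe I can help you. How about searching for this regex: https?://[A-Za-z0-9-._~:/?#[\]@!$&\',*+,;=]* It matches http:// or https:// followed by characters that are valid in a URL (although the class as written leaves out a few valid ones, such as %, ( and )).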

In fact, here's a program which should do the job (untested):

import re

string="""
More text here
And here

ALL https://.....

Atendees URLs 

1 -tab- https://.....
2 -tab- https://...
etc.
"""

links = re.compile(r'https?://[A-Za-z0-9-._~:/?#[\]@!$&\',*+,;=]*').findall(string)
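As a rough illustration of how the character-class pattern differs from simply taking every non-space character, the sketch below runs both on one made-up line. The text and example.com URLs are invented for the demo, and the class is the same set of characters as above, just written with explicit escapes.

```python
import re

# Same characters as the class above, with the hyphen and brackets escaped.
url_chars = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'*+,;=]*")
# Simpler alternative: everything up to the next whitespace.
non_space = re.compile(r"https?://\S+")

text = 'see <https://example.com/a?x=1> and "https://example.com/b", then http://example.com/c\tend'

strict = url_chars.findall(text)
loose = non_space.findall(text)

print(strict)  # stops at characters outside the class, such as > and "
print(loose)   # keeps trailing punctuation that touches the URL
```

For a file like the one in the question, where each URL is followed by a newline, both patterns should give the same list; the difference only matters when URLs are wrapped in quotes or brackets.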
