How to read until a certain string and repeat in Python?

Given the input below, I would like to separate the URL lines (those that start with [URL, [LINK, or [WEBSITE) from the text. I would like to put every URL, in order, into one list and every block of text into another list.

I would also like to combine the text under each link into a single line, so that every link matches its corresponding text. Below is an example.

[URL - https://url1.com]
news_line1 word
news_line2 word word
news_line3 word word word

[LINK - https://url2.com]
headline_line1 letter
headline_line2 letter letter
headline_line3 letter letter letter

[WEBSITE - https://url3.com]
date_line1 sentence
date_line2 sentence sentence
date_line3 sentence sentence sentence

The output would be Links:

[URL - https://url1.com]
[LINK - https://url2.com]
[WEBSITE - https://url3.com]

and Text:

news_line1 word news_line2 word word news_line3 word word word
headline_line1 letter headline_line2 letter letter headline_line3 letter letter letter
date_line1 sentence date_line2 sentence sentence date_line3 sentence sentence sentence

The current code I have is

import sys

inFile = sys.argv[1]

with open(inFile) as f:
    content = f.readlines()

content = [x.strip() for x in content]
url_links = []
sentences = []

for entry in content:
    sentence = ""
    if entry.startswith(("[URL", "[LINK", "[WEBSITE")):
        url_links.append(entry)

    else:
        sentence = sentence + entry

    sentences.append(sentence)

for sentence in sentences:
    print(sentence)

And the current output I have is


news_line1 word
news_line2 word word
news_line3 word word word


headline_line1 letter
headline_line2 letter letter
headline_line3 letter letter letter


date_line1 sentence
date_line2 sentence sentence
date_line3 sentence sentence sentence

How can I tweak it such that it gives me the correct output?

Again, the desired output is

news_line1 word news_line2 word word news_line3 word word word
headline_line1 letter headline_line2 letter letter headline_line3 letter letter letter
date_line1 sentence date_line2 sentence sentence date_line3 sentence sentence sentence

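In your code, sentence is reset to "" on every iteration and appended to sentences after every line, so nothing is ever combined. Instead of concatenating into a temporary variable, append an empty string to sentences every time you encounter a line starting with "[URL", "[LINK", or "[WEBSITE", and add each text line to the last element of sentences.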

import sys

inFile = sys.argv[1]

with open(inFile) as f:
    content = f.readlines()

content = [x.strip() for x in content]
url_links = []
sentences = []

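for entry in content:
    if entry.startswith(("[URL", "[LINK", "[WEBSITE")):
        # A new link starts a new, initially empty, sentence.
        url_links.append(entry)
        sentences.append("")

    elif entry and sentences:
        # Join text lines with a single space; skip blank lines and
        # any text that appears before the first link.
        sentences[-1] = (sentences[-1] + " " + entry).strip()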


for sentence in sentences:
    print(sentence)

Here I am concatenating strings with "+"; depending on your requirements and Python version, there may be faster alternatives.
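One such alternative, as a rough sketch: collect the text lines for each link in a list and join them once with str.join. The function name split_links_and_text and the shortened sample input are made up for illustration; the prefixes are the ones from the question.

```python
def split_links_and_text(lines):
    """Return (links, texts); texts[i] is the joined text that follows links[i]."""
    prefixes = ("[URL", "[LINK", "[WEBSITE")
    links = []
    chunks = []  # one list of text lines per link
    for line in lines:
        entry = line.strip()
        if entry.startswith(prefixes):
            links.append(entry)
            chunks.append([])
        elif entry and chunks:
            # Skip blank lines and any text before the first link.
            chunks[-1].append(entry)
    # Join each group once instead of growing a string with "+".
    texts = [" ".join(parts) for parts in chunks]
    return links, texts


# Shortened, hypothetical sample in the same shape as the question's input.
sample = """[URL - https://url1.com]
news_line1 word
news_line2 word word

[LINK - https://url2.com]
headline_line1 letter
"""

links, texts = split_links_and_text(sample.splitlines())
for link, text in zip(links, texts):
    print(link, "->", text)
```

Since links and texts have the same length and order, zip pairs each link with its text. To read from a file instead, pass the open file object (or f.readlines()) as lines.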

Which is the preferred way to concatenate a string in Python?
