Extract text present in between two strings in a text file using Python

Let's say I have a text file with the content below (contents added after the original answer):

    Quetiapine fumarate Drug substance  This document
    Povidone    Binder  USP
    This line doesn't contain any medicine name.
    This line contains Quetiapine fumarate which shouldn't be extracted as it not present at the 
    beginning of the line.
    Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv
    Lactose monohydrate Diluent USNF
    Magnesium stearate  Lubricant   USNF


    Lactose monohydrate, CI 77491   
    0.6
    Colourant
    E 172

    Some lines to break the group.
    Silicon dioxide colloidal anhydrous
    (0.004
    Gliding agent
    Ph Eur

    Adding some random lines.

    Povidone
    (0.2
    Lubricant
    Ph Eur

I have a CSV file containing a list of medicine names. I want to match these names in the .txt file and extract all the text between one medicine name and the next, but only when the medicine name is at the beginning of a line. Examples of medicines from the CSV file are 'Quetiapine fumarate', 'Povidone', 'Magnesium stearate', and 'Lactose monohydrate'.

I want to iterate over each line of the text file and create groups that run from one medicine to the next.

A group should start only when the medicine name appears at the start of a line, not when it appears in the middle of a line.

Expected output:

['Quetiapine fumarate   Drug substance  This document'],
['Povidone  Binder  USP'],
['Lactose monohydrate   Diluent USNF'],
['Magnesium stearate    Lubricant   USNF'],
[Lactose monohydrate, CI 77491  
    0.6
    Colourant
    E 172],

[Povidone
    (0.2
    Lubricant
    Ph Eur]

Can someone please help me do this in Python?

My attempt so far:

with open('C:/Users/test1.txt', 'r', encoding='utf8') as file:
    data = file.read()

medicines = ('Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate')

result = []
#with open('C:\Users\substancecopy.csv') as f:
for line in data.splitlines():  # iterate over lines, not over single characters
    if any(line.startswith(med) for med in medicines):
        result.append(line.strip())

I need to capture all the text from one medicine to the next, as shown in the expected output, which this code does not do.

You can do it without regular expressions, using str.startswith():

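medicines = ('Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate')

result = []
# Open the text file to search (path taken from the question). Forward slashes
# avoid the invalid '\U' escape that 'C:\Users\...' causes in a normal string.
with open('C:/Users/test1.txt', encoding='utf8') as f:
    for line in f:
        # Keep only lines that begin with one of the medicine names.
        if any(line.startswith(med) for med in medicines):
            result.append(line.strip())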

I'm not sure why your expected output is a list of lists that each hold a single string, but if you really need that, use result.append([line.strip()]) instead.
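
The startswith() approach keeps only the matching lines themselves, so multi-line entries such as the 'Povidone' / '(0.2' / 'Lubricant' / 'Ph Eur' block would lose their continuation lines. Below is a minimal sketch of one way to collect them. It rests on assumptions that are not stated in the question: a group is taken to end at the first blank line or at the next line that starts with a medicine name, and the CSV is assumed to hold the medicine name in its first column with no header row. The function names load_medicines and group_by_medicine are made up for illustration.

```python
import csv


def load_medicines(csv_path):
    """Read medicine names from the first column of a CSV file.

    Assumption: one name per row in column 0, no header row.
    """
    with open(csv_path, newline='', encoding='utf8') as f:
        return tuple(
            row[0].strip()
            for row in csv.reader(f)
            if row and row[0].strip()  # skip empty rows and empty first cells
        )


def group_by_medicine(lines, medicines):
    """Group lines so that each group starts at a line beginning with a medicine name.

    Assumption: a group ends at a blank line or at the next medicine line.
    Lines outside any group are ignored.
    """
    medicines = tuple(medicines)  # str.startswith accepts a tuple of prefixes
    groups = []
    current = None  # the group being filled, or None when outside a group
    for raw in lines:
        if raw.startswith(medicines):
            # A medicine name at the very start of the line opens a new group.
            current = [raw.strip()]
            groups.append(current)
        elif not raw.strip():
            # A blank line closes the current group.
            current = None
        elif current is not None:
            # Any other line is a continuation of the open group.
            current.append(raw.strip())
    return groups
```

It could be used as group_by_medicine(open('C:/Users/test1.txt', encoding='utf8'), load_medicines(csv_path)), where csv_path points at your CSV file. Note that with this rule, non-medicine lines that directly follow a single-line entry (for example the lines after 'Povidone  Binder  USP' in the sample file) are attached to that entry, which differs from the expected output in the question; reproducing it exactly would need an extra rule for telling complete one-line records apart from multi-line ones.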

