
Python parsing txt, cut specific part of the string between two characters

I am trying to write a script that cuts a specific part out of each string in a loaded file.

For example, a string in the file looks like this (there are multiple lines like it, and the same should be done to all of them):

C:\d\projects\project1\folder1\folder2\folder3\folder4\file.h

Wanted output would be:

C:\d\projects\project1\folder1\folder2\folder3\folder4

So in each line only the path to the folder should remain, without the file itself.

What would be the best way to do this?

Why not use split() on an escaped backslash and join all but the final filename item? If you need the filename for other purposes, split() and use index -1 to get that part.

Note that I have added an r in front of the string, making it a raw string, so all backslashes are left unchanged.

my_file_location = r"C:\d\projects\project1\folder1\folder2\folder3\folder4\file.h"

print('\\'.join(my_file_location.split('\\')[0:-1])) # path
>> C:\d\projects\project1\folder1\folder2\folder3\folder4

print(my_file_location.split('\\')[-1]) # filename
>> file.h
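
The same cut can also be made in one step with str.rsplit and maxsplit=1, which splits only at the last backslash instead of at every one. A minimal sketch; the helper name split_last is made up for illustration and is not part of the answer above:

```python
def split_last(path, sep="\\"):
    """Split a path string at its last separator into (folder, filename)."""
    # rsplit scans from the right; maxsplit=1 stops after the first cut
    parts = path.rsplit(sep, 1)
    if len(parts) == 1:
        # no separator at all: treat the whole string as a file name
        return "", parts[0]
    return parts[0], parts[1]


folder, filename = split_last(r"C:\d\projects\project1\folder1\folder2\folder3\folder4\file.h")
print(folder)    # C:\d\projects\project1\folder1\folder2\folder3\folder4
print(filename)  # file.h
```

This avoids building and re-joining a list of every path component when only the last one matters.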

If you want to iterate over a list of these filenames then you could do something like this:

import csv

output_list = []
with open('my_csv_example.csv', 'r', newline='') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        output_list.append('\\'.join(row[0].split('\\')[0:-1]))

with open('my_csv_output_example.csv', mode='w', newline='') as f2:
    csv_writer = csv.writer(f2, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in output_list:
        csv_writer.writerow([row])

Input File (my_csv_example.csv):

C:\d\projects\project1\folder1\folder2\folder5\folder1\file3.h
C:\d\projects\project1\folder1\folder2\folder4\folder2\file5.h
C:\d\projects\project1\folder1\folder2\folder3\folder3\file3.h
C:\d\projects\project1\folder1\folder2\folder2\folder4\file4.h
C:\d\projects\project1\folder1\folder2\folder1\folder5\file2.h

Output File (my_csv_output_example.csv):

C:\d\projects\project1\folder1\folder2\folder5\folder1
C:\d\projects\project1\folder1\folder2\folder4\folder2
C:\d\projects\project1\folder1\folder2\folder3\folder3
C:\d\projects\project1\folder1\folder2\folder2\folder4
C:\d\projects\project1\folder1\folder2\folder1\folder5
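
Since the question mentions a plain txt file rather than a CSV, the same loop can be sketched without the csv module. The example below assumes one Windows path per line and uses ntpath.dirname from the standard library, which applies Windows path rules on any operating system; the function name folders_only and the sample list are illustrative only:

```python
import ntpath  # Windows path rules, available on every platform


def folders_only(lines):
    """Return the folder part of each non-empty Windows path line."""
    result = []
    for line in lines:
        line = line.strip()  # drop the trailing newline and stray spaces
        if line:
            result.append(ntpath.dirname(line))
    return result


sample = [
    r"C:\d\projects\project1\folder1\folder2\folder5\folder1\file3.h" + "\n",
    r"C:\d\projects\project1\folder1\folder2\folder4\folder2\file5.h" + "\n",
]
for folder in folders_only(sample):
    print(folder)
```

An open file object can be passed in place of the sample list, because iterating over a file yields its lines.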

Update in response to a comment: I think the bit you are missing is that you are trying to run a string function on a list. You likely need the first element of the list, i.e. index 0, so this is the key bit for you:

row[0].split('\\')[0:-1]

You can use pathlib for extensive and cross platform path support.

In your particular example:

from pathlib import PureWindowsPath

p=PureWindowsPath(r'C:\d\projects\project1\folder1\folder2\folder3\folder4\file.h')

Then you can access the parts at will:

>>> print(p.name)
file.h
>>> print(p.parents[0])
C:\d\projects\project1\folder1\folder2\folder3\folder4
>>> print(p.parents[1])
C:\d\projects\project1\folder1\folder2\folder3
# etc
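
For the folder that directly contains the file, the parent attribute is a shorter spelling of parents[0]. A small sketch using the same example path; the variable name folder is just for illustration:

```python
from pathlib import PureWindowsPath

p = PureWindowsPath(r"C:\d\projects\project1\folder1\folder2\folder3\folder4\file.h")

folder = p.parent        # equal to p.parents[0]
print(folder)            # C:\d\projects\project1\folder1\folder2\folder3\folder4
print(p.stem, p.suffix)  # file .h
```

Because PureWindowsPath never touches the file system, this works on any operating system and the file does not need to exist.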

You can also convert the path to other representations:

>>> print(p.as_uri())
file:///C:/d/projects/project1/folder1/folder2/folder3/folder4/file.h
>>> print(p.as_posix())
C:/d/projects/project1/folder1/folder2/folder3/folder4/file.h

Pathlib also has built-in support for globbing.

Given a file tree like this:

.
├── a
│   └── sub_a
│       └── sub_sub_a
│           └── file.txt
├── b
│   └── sub_b
│       └── file2.txt
└── c
    └── file3.txt

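You can do the following (here p has to be a concrete Path, because a PureWindowsPath cannot access the file system):

from pathlib import Path
p = Path('/tmp/test')
for pn in (n for n in p.glob('**/*') if n.is_file()):
    print(pn)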

Prints:

/tmp/test/a/sub_a/sub_sub_a/file.txt
/tmp/test/c/file3.txt
/tmp/test/b/sub_b/file2.txt

Taking the parent of each match gives every folder that contains a file:

for pn in (n for n in p.glob('**/*') if n.is_file()):
    print(pn.parents[0])

/tmp/test/a/sub_a/sub_sub_a
/tmp/test/c
/tmp/test/b/sub_b
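
When several files share a folder, the loop above prints that folder once per file. A hedged sketch of collecting each folder only once follows; it rebuilds the example tree in a temporary directory so it can run anywhere, and the names root and folders are assumptions made for the illustration:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # recreate the example tree shown earlier
    for rel in ("a/sub_a/sub_sub_a/file.txt", "b/sub_b/file2.txt", "c/file3.txt"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    # rglob('*') behaves like glob('**/*'); the set removes duplicate folders
    folders = sorted({f.parent.relative_to(root).as_posix()
                      for f in root.rglob("*") if f.is_file()})

print(folders)  # ['a/sub_a/sub_sub_a', 'b/sub_b', 'c']
```

Sorting is only there to make the result predictable, since glob order depends on the file system.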

It is definitely a superior approach.

Since I did not explain my question well at first (I did so in the comments afterwards), I had to work this out myself, and this piece of code finally does the job:

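import ntpath  # Windows path rules, whatever OS runs the script

# directory: an iterable of paths to the text files to process (not defined in the original post)
for file in directory:
    with open(file, 'r') as f:
        rows = f.readlines()
    array = []
    for i in rows:
        i = i.rstrip('\n')
        if i.endswith('.h'):
            # dirname drops the file name together with the backslash before it
            array.append(ntpath.dirname(i))
    # overwrite the same file with the folder paths only
    with open(file, 'w') as wf:
        for row in array:
            wf.write(row + '\n')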

So it goes through all the files inside a folder and turns this:

C:\d\projects\project1\folder1\folder2\folder5\folder1\file3.h
C:\d\projects\project1\folder1\folder2\folder4\folder2\file5.h
C:\d\projects\project1\folder1\folder2\folder3\folder3\file3.h
C:\d\projects\project1\folder1\folder2\folder2\folder4\file4.h
C:\d\projects\project1\folder1\folder2\folder1\folder5\file2.h

into this:

C:\d\projects\project1\folder1\folder2\folder5\folder1
C:\d\projects\project1\folder1\folder2\folder4\folder2
C:\d\projects\project1\folder1\folder2\folder3\folder3
C:\d\projects\project1\folder1\folder2\folder2\folder4
C:\d\projects\project1\folder1\folder2\folder1\folder5

Hopefully it will help someone!

This seems like a perfect situation for str.rfind(). It finds the right-most index of a given substring, in this case the backslash.

lines = [r"C:\d\projects\project1\folder1\folder2\folder5\folder1\file3.h",
         r"C:\d\projects\project1\folder1\folder2\folder4\folder2\file5.h",
         r"C:\d\projects\project1\folder1\folder2\folder3\folder3\file3.h",
         r"C:\d\projects\project1\folder1\folder2\folder2\folder4\file4.h",
         r"C:\d\projects\project1\folder1\folder2\folder1\folder5\file2.h"]
for line in lines:
    # slice up to, but not including, the last backslash
    line = line[0:line.rfind("\\")]
    print(line)

Output:

C:\d\projects\project1\folder1\folder2\folder5\folder1
C:\d\projects\project1\folder1\folder2\folder4\folder2
C:\d\projects\project1\folder1\folder2\folder3\folder3
C:\d\projects\project1\folder1\folder2\folder2\folder4
C:\d\projects\project1\folder1\folder2\folder1\folder5
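
One caveat with this approach: rfind() returns -1 when the substring is absent, and slicing up to -1 silently drops the last character of the line. A small guarded sketch follows; the helper name strip_filename is hypothetical and not part of the answer above:

```python
def strip_filename(line, sep="\\"):
    """Cut everything from the last separator onward; leave the line alone if there is none."""
    idx = line.rfind(sep)
    # rfind returns -1 when sep is absent; slicing with -1 would drop the last character
    return line if idx == -1 else line[:idx]


print(strip_filename(r"C:\d\projects\project1\folder1\file3.h"))  # C:\d\projects\project1\folder1
print(strip_filename("file3.h"))                                  # file3.h
```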


 