
How to group the files in a single line based on a string value in Python?

The data in the input file has this format:

XD56783E
clutch.pdf
brake0.pdf
brake1.pdf
XT6111GT
clutch.pdf
brake0.pdf
brake1.pdf
brake2.pdf
brake3.pdf

The expected output is:

XD56783E    clutch.pdf  brake0.pdf  brake1.pdf
XT6111GT    clutch.pdf  brake0.pdf  brake1.pdf  brake2.pdf  brake3.pdf

I wrote the following simple code, but it does not produce the output above:

check_filetype = ['c','b']
revisedData = None
temp = None

for keyvalue in open("sampl.txt"):
     if 'X' in keyvalue[0]:
        revisedData = keyvalue.strip() + ' '
     else:
        if 'c' or 'b' in keyvalue[0]:
            temp = keyvalue.strip()
            revisedData = revisedData + ' ' + temp                  
print(revisedData)

You can use the pandas library to read the data in and transform it as required, using .groupby and ' '.join to bring each group together. The expression that creates the groups, str.contains('X'), returns True for every row that contains a capital letter X and False otherwise. True and False are boolean values that behave as 1 and 0 respectively, so the cumulative sum of [True, False, False, False, True, False, False, False, False, False] is [1,1,1,1,2,2,2,2,2,2]. For example, the last 2 is the result of 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 0 + 0, while the second 1 is the result of 1 + 0. With this sample data you end up with group 1 and group 2, which you then pass to .groupby in order to join the values from the different rows of each group into one string.

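import pandas as pd

# Read one value per row; the file has no header line
df = pd.read_csv('sampl.txt', header=None)
# Number the groups with a running count of rows containing 'X', then join each group's rows with spaces
df = df.groupby(df[0].str.contains('X').cumsum().astype(str))[0].apply(' '.join)
# Note: this overwrites the input file with the grouped result
df.to_csv('sampl.txt', index=False)
df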

Out[1]:
0
1            XD56783E clutch.pdf brake0.pdf brake1.pdf
2    XT6111GT clutch.pdf brake0.pdf brake1.pdf brak...
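To make the cumulative-sum grouping easier to follow, here is a minimal sketch of the same idea in plain Python, without pandas. It is only an illustration: the lines list is the sample data typed in by hand, and the names is_header, group_ids and grouped are made up for this example.

```python
from itertools import accumulate, groupby

# Sample data from the question, typed in by hand for illustration.
lines = [
    "XD56783E", "clutch.pdf", "brake0.pdf", "brake1.pdf",
    "XT6111GT", "clutch.pdf", "brake0.pdf", "brake1.pdf",
    "brake2.pdf", "brake3.pdf",
]

# True where a line contains a capital X (the same test as str.contains('X')).
is_header = ["X" in line for line in lines]

# bool is a subclass of int (True == 1, False == 0), so a running sum counts
# the headers seen so far. map(int, ...) only makes the values plain integers.
group_ids = list(accumulate(map(int, is_header)))
print(group_ids)

# Join consecutive lines that share the same group id.
grouped = [
    " ".join(line for _, line in members)
    for _, members in groupby(zip(group_ids, lines), key=lambda pair: pair[0])
]
for row in grouped:
    print(row)
```

Each header line bumps the running total by one, so every line that follows it inherits the same group number until the next header appears.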

Your current code doesn't work because every time the program finds a line starting with X, revisedData is assigned a new value and the previous value is overwritten. To fix that, you can change the lines

revisedData = None
...
if 'X' in keyvalue[0]:
    revisedData = keyvalue.strip() + ' '

to

revisedData = ""
...
if 'X' in keyvalue[0]:
    revisedData += keyvalue.strip() + ' '

That way each new value is added to the end of the string. For building up a string piece by piece, though, I recommend collecting the parts in a list and joining them at the end, since that is more efficient than repeated concatenation.

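Apart from that, instead of if 'X' in keyvalue[0] you can use startswith to achieve the same effect. Note also that if 'c' or 'b' in keyvalue[0] is always true: Python reads it as 'c' or ('b' in keyvalue[0]), and the non-empty string 'c' is truthy. Passing a tuple of prefixes to startswith checks both letters correctly. Another problem with your code is that each line starting with X should begin a new line in the output. The revised code is as follows: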

check_filetype = ('c', 'b')

revisedData = []

with open("sampl.txt") as infile:
    for keyvalue in infile:
        if keyvalue.startswith('X'):
            # The newline goes in front of the string, so the entries that follow stay on the same line
            revisedData.append("\n" + keyvalue.strip())
        elif keyvalue.startswith(check_filetype):
            revisedData.append(keyvalue.strip())

# Joining the list with " " produces a single string with the elements separated by spaces
revisedData = " ".join(revisedData).strip()

print(revisedData)

Output

XD56783E clutch.pdf brake0.pdf brake1.pdf 
XT6111GT clutch.pdf brake0.pdf brake1.pdf brake2.pdf brake3.pdf
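If you want one clean row per code, closer to the column layout in the question, one possible variation is to keep a separate list per header and join each list on its own. This is only a sketch under a few assumptions: header lines start with X, file names start with c or b, and tabs are acceptable as the separator. The function group_lines and its parameters header_prefix and file_prefixes are hypothetical names, and the sample list stands in for the lines read from sampl.txt.

```python
def group_lines(lines, header_prefix="X", file_prefixes=("c", "b")):
    """Collect each header line together with the file names that follow it.

    group_lines, header_prefix and file_prefixes are illustrative names,
    not part of the original code.
    """
    groups = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(header_prefix):
            groups.append([line])        # a header starts a new group
        elif groups and line.startswith(file_prefixes):
            groups[-1].append(line)      # attach to the most recent header
    return groups


# Stand-in for the lines of sampl.txt (assumed content, from the question).
sample = [
    "XD56783E\n", "clutch.pdf\n", "brake0.pdf\n", "brake1.pdf\n",
    "XT6111GT\n", "clutch.pdf\n", "brake0.pdf\n", "brake1.pdf\n",
    "brake2.pdf\n", "brake3.pdf\n",
]

# One tab-separated row per header, with no trailing whitespace.
rows = ["\t".join(group) for group in group_lines(sample)]
print("\n".join(rows))
```

To read from the file instead, pass the open file object, for example group_lines(open("sampl.txt")) inside a with block. Keeping a list per header avoids the leading-newline trick, and file names that appear before any header are simply ignored.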

