
How can I get specific columns from a txt file and save them to a new file using Python?

I have a text file, sentences.txt, that contains the lines below:

a01-000u-s00-00 0 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from

a01-000u-s00-01 0 ok 156 19 395 932 1850 105 nominating|any|more|Labour|life|Peers

Each line has 10 space-separated columns. I want to use a pandas DataFrame to extract only the file name (column 0) and the corresponding text (the last column, index 9), with the | characters removed. I wrote this code:

def load() -> pd.DataFrame:

 df = pd.read_csv('sentences.txt',sep=' ', header=None)
 data = []
 with open('sentences.txt') as infile:
    for line in infile:
        file_name, _, _, _, _, _, _, _, _, text = line.strip().split(' ')
        data.append((file_name, cl_txt(text)))

 df = pd.DataFrame(data, columns=['file_name', 'text'])
 df.rename(columns={0: 'file_name', 9: 'text'}, inplace=True)
 df['file_name'] = df['file_name'].apply(lambda x: x + '.jpg')
 df = df[['file_name', 'text']]
 return df

def cl_txt(input_text: str) -> str:
 text = input_text.replace('+', '-')
 text = text.replace('|', ' ')
 return text

load()

The error I got:

ParserError: Error tokenizing data. C error: Expected 10 fields in line 4, saw 11

My expected process.txt output should look like this, without \n:

a01-000u-s00-00  A MOVE to stop Mr. Gaitskell from
a01-000u-s00-01  nominating any more Labour life Peers


IIUC, you just need pandas.read_csv to read your .txt and then select the two columns:

Try this:

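import pandas as pd

# Split on a number that is followed by whitespace and then a non-digit.
# The capture group keeps the matched number as its own column, which is
# why the file name is read from column 0 and the text from column 4.
df = (
    pd.read_csv("test.txt", header=None, sep=r"(\d+)\s(?=\D)", engine="python",
                usecols=[0, 4], names=["filename", "text"])
    .assign(
        # Drop the trailing space left by the split and append the extension.
        filename=lambda x: x["filename"].str.strip().add(".jpg"),
        # Replace "|" and double quotes with spaces, then collapse repeated whitespace.
        text=lambda x: x["text"].str.replace(r'[\|"]', " ", regex=True)
                                .str.replace(r"\s+", " ", regex=True),
    )
)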
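If the choice of usecols=[0, 4] looks arbitrary, it may help to apply the same pattern to one line with re.split. This is only a sketch: it assumes the python engine splits each line roughly the way re.split does, and the variable names are mine, not part of the answer above.

```python
import re

# Same separator pattern as in the read_csv call.
pattern = r"(\d+)\s(?=\D)"

# First sample line from the question.
line = "a01-000u-s00-00 0 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from"

# Because the pattern has a capture group, re.split keeps the matched
# number as a separate piece between the surrounding pieces.
parts = re.split(pattern, line)

for index, piece in enumerate(parts):
    print(index, repr(piece))
```

For this line the split gives five pieces: the file name (with a trailing space) at index 0 and the text at index 4. The pattern would also split inside the text wherever a number is followed by a space and a non-digit, so it is worth checking the column count on the real file.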

# Output:

print(df)

              filename                                         text
0  a01-000u-s00-00.jpg            A MOVE to stop Mr. Gaitskell from
1  a01-000u-s00-01.jpg        nominating any more Labour life Peers
2   a01-003-s00-01.jpg  large majority of Labour M Ps are likely to

# .txt used: (posted as an image in the original answer, not reproduced here)
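If you would rather avoid the regex separator, a plain-Python sketch along these lines may also work. It assumes the first nine fields never contain spaces, so splitting at most nine times leaves the whole text in the last field even when the text itself has a space (which would explain the "saw 11" error). The names parse_lines and format_rows are hypothetical helpers, and the numeric fields in the third sample line are placeholders, not real data.

```python
SAMPLE = """\
a01-000u-s00-00 0 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from
a01-000u-s00-01 0 ok 156 19 395 932 1850 105 nominating|any|more|Labour|life|Peers
a01-003-s00-01 0 ok 0 0 0 0 0 0 large|majority|of|Labour|M Ps|are|likely|to
"""


def parse_lines(lines, suffix=""):
    """Return (file_name, text) pairs from space-separated lines."""
    rows = []
    for line in lines:
        # Split at most 9 times so the 10th field keeps any spaces in the text.
        parts = line.strip().split(" ", 9)
        if len(parts) < 10:
            continue  # skip blank or malformed lines
        # Replace "|" with spaces and collapse repeated whitespace.
        text = " ".join(parts[9].replace("|", " ").split())
        rows.append((parts[0] + suffix, text))
    return rows


def format_rows(rows):
    """Format rows as 'file_name  text', one per line."""
    return "\n".join(f"{name}  {text}" for name, text in rows) + "\n"


rows = parse_lines(SAMPLE.splitlines())
output = format_rows(rows)
print(output, end="")
```

To save the result, write output to process.txt with open("process.txt", "w"). If you still want a DataFrame, pd.DataFrame(rows, columns=["file_name", "text"]) should accept the same rows, and suffix=".jpg" adds the extension.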
