How to get specific columns from a txt file and save them to a new file using Python

Question

I have this txt file, sentences.txt, which contains the following text:

a01-000u-s00-00 0 ok 154 19 408 746 1661 89 A|MOVE|to|stop|Mr.|Gaitskell|from

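It contains 10 columns. I want to use a pandas DataFrame to extract only the file name (column 0) and the corresponding text (the 10th column), without the (|) characters. I wrote this code: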

def load() -> pd.DataFrame:

 df = pd.read_csv('sentences.txt',sep=' ', header=None)
 data = []
 with open('sentences.txt') as infile:
    for line in infile:
        file_name, _, _, _, _, _, _, _, _, text = line.strip().split(' ')
        data.append((file_name, cl_txt(text)))

 df = pd.DataFrame(data, columns=['file_name', 'text'])
 df.rename(columns={0: 'file_name', 9: 'text'}, inplace=True)
 df['file_name'] = df['file_name'].apply(lambda x: x + '.jpg')
 df = df[['file_name', 'text']]
 return df

def cl_txt(input_text: str) -> str:
 text = input_text.replace('+', '-')
 text = text.replace('|', ' ')
 return text

load()

The error I get:

ParserError: Error tokenizing data. C error: Expected 10 fields in line 4, saw 11

My expected result in the process.txt file should look like this, without \n:

a01-000u-s00-00  A MOVE to stop Mr. Gaitskell from
a01-000u-s00-01  nominating any more Labour life Peers
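
The ParserError above says that at least one line splits into more space-separated fields than pandas expected. A minimal sketch for locating such lines, assuming a single space as separator and 10 expected fields (the helper name `odd_lines` is hypothetical, not part of pandas):

```python
def odd_lines(lines, expected=10, sep=" "):
    """Return (line_number, field_count) for lines without `expected` fields.

    Line numbers are 1-based positions in `lines`; they may not coincide
    exactly with the line number in the pandas error message.
    """
    result = []
    for number, line in enumerate(lines, start=1):
        n_fields = len(line.rstrip("\n").split(sep))
        if n_fields != expected:
            result.append((number, n_fields))
    return result
```

Passing the lines of sentences.txt (for example, the result of `infile.readlines()`) to `odd_lines` lists the lines that need special handling, such as comment lines or text containing spaces.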

Answer 1

IIUC, you just need pandas.read_csv to read your .txt file and then select the two columns.

Try this:

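import pandas as pd

# The regex separator splits at a run of digits plus one whitespace character
# that is followed by a non-digit. The capture group keeps those digits as
# fields of their own, so each sample line yields five fields and the text
# ends up in column 4.
df = (
    pd.read_csv("test.txt", header=None, sep=r"(\d+)\s(?=\D)", engine="python",
                usecols=[0, 4], names=["filename", "text"])
    .assign(
        # Drop the trailing space left by the split and append the extension.
        filename=lambda x: x["filename"].str.strip().add(".jpg"),
        # Replace pipes and double quotes with spaces, then collapse whitespace.
        text=lambda x: x["text"].str.replace(r'[\|"]', " ", regex=True)
                                .str.replace(r"\s+", " ", regex=True),
    )
)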

# Output:

print(df)

              filename                                         text
0  a01-000u-s00-00.jpg            A MOVE to stop Mr. Gaitskell from
1  a01-000u-s00-01.jpg        nominating any more Labour life Peers
2   a01-003-s00-01.jpg  large majority of Labour M Ps are likely to
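
The question also asks for the result to be written to process.txt, which the snippet above does not show. The following is a minimal sketch of a plain-Python alternative, not the answerer's method. It assumes that the first nine whitespace-separated fields are metadata, that everything after them is the text, and that lines starting with `#` are comments to skip. The helper names `parse_line`, `load`, and `save` are hypothetical. It writes the file name without the .jpg suffix, matching the expected output in the question.

```python
import pandas as pd


def parse_line(line: str, n_meta: int = 9):
    """Split one record into (file_name, text); return None if it is too short."""
    # maxsplit keeps everything after the first n_meta fields in one piece,
    # so spaces inside the text cannot create extra columns.
    parts = line.strip().split(maxsplit=n_meta)
    if len(parts) != n_meta + 1:
        return None
    # Replace the pipes with spaces and collapse repeated whitespace.
    text = " ".join(parts[-1].replace("|", " ").split())
    return parts[0], text


def load(path: str) -> pd.DataFrame:
    """Read the file line by line and build a two-column DataFrame."""
    rows = []
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            if line.startswith("#"):  # assumption: '#' marks a comment line
                continue
            parsed = parse_line(line)
            if parsed is not None:
                rows.append(parsed)
    return pd.DataFrame(rows, columns=["file_name", "text"])


def save(df: pd.DataFrame, path: str) -> None:
    """Write one 'file_name  text' line per row (two spaces as separator)."""
    with open(path, "w", encoding="utf-8") as outfile:
        for file_name, text in zip(df["file_name"], df["text"]):
            outfile.write(f"{file_name}  {text}\n")
```

With these helpers, `save(load("sentences.txt"), "process.txt")` would produce the two-column file. Writing the lines by hand avoids the quoting that `DataFrame.to_csv` applies when a field contains the separator character.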

Solution 1
2 Accepted 2022-11-21 12:15:57
