
Convert .txt file to .csv with specific columns PYTHON

I have a text file that I want to load into my Python code, but the format of the file is not suitable.

Here is what it contains:

SEQ  MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNY
SS3  CCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
     95024445656543114678678999999999999999888889998886
SS8  CCHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
     96134445555554311253378999999999999999999999999987
SA   EEEbBBBBBBBBBBbEbEEEeeEeBeEbBEEbbEeBeEbbeebBbBbBbb
     41012123422000000103006262214011342311110000030001
TA   bhHHHHHHHHHHHHHgIihiHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
     00789889988663201010099999999999999999898999998741
CD   NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
     54433221111112221122124212411342243234323333333333

I want to convert it into a pandas DataFrame with SEQ, SS3, SA, TA, CD and SS8 as the columns of the DataFrame and the strings next to those labels as the rows.

I tried pd.read_csv, but it doesn't give me the result I want.

Thank you!

Steps

  1. Use pd.read_fwf() to read the file, which is in a fixed-width format.
  2. Fill the missing labels in the first column with the last available value using ffill().
  3. Assign a group number gp, which becomes the row number in the output, using a groupby-cumcount construct.
  4. Move gp = (0, 1) to the columns with df.pivot, then transpose to get the desired output.

Note: this solution works for any number of consecutive lines (including zero) whose first column is blank, as long as there are not too many of them.

Code

import pandas as pd

# data (only the first 3 characters of the second column were used here)
file_path = "/mnt/ramdisk/input.txt"
df = pd.read_fwf(file_path, names=["col", "val"])

# fill the blank labels with the last label seen
df["col"] = df["col"].ffill()
# number the lines within each label: 0 for the first, 1 for the second
df["gp"] = df.groupby("col").cumcount()
# pivot the group numbers (0, 1) to columns and then transpose
df_ans = df.pivot(index="col", columns="gp", values="val").transpose()

Result

print(df_ans)  # show the first 3 characters only

col   CD   SA  SEQ  SS3  SS8   TA
gp                               
0    NNN  EEE  MSS  CCC  CCH  bhH
1    544  410  NaN  950  961  007

Then you can save the resulting DataFrame using df_ans.to_csv().

pandas.read_csv() expects delimited text, separated by commas by default. To read the file with it directly, the text file would have to look like this:
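As a self-contained sketch of the steps above, the following runs the same pipeline on an in-memory sample instead of a file. The sample is an assumption: it uses only the first 10 characters of three of the question's records, and dtype=str is added here (it is not in the answer) so that the digit strings keep any leading zeros.

```python
import io
import pandas as pd

# Hypothetical sample: first 10 characters of three records from the question.
sample = (
    "SEQ  MSSSSWLLLS\n"
    "SS3  CCCHHHHHHH\n"
    "     9502444565\n"
    "SS8  CCHHHHHHHH\n"
    "     9613444555\n"
)

# Step 1: read the fixed-width text; blank labels come back as NaN.
df = pd.read_fwf(io.StringIO(sample), names=["col", "val"], dtype=str)
# Step 2: carry the last label down onto the unlabelled lines.
df["col"] = df["col"].ffill()
# Step 3: number the lines within each label (0, 1, ...).
df["gp"] = df.groupby("col").cumcount()
# Step 4: one column per label, one row per line number.
wide = df.pivot(index="col", columns="gp", values="val").transpose()

# index=False drops the gp row numbers from the CSV text.
csv_text = wide.to_csv(index=False)
print(csv_text)
```

SEQ has no second line, so its cell in row 1 is missing and is written as an empty field in the CSV.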

 SEQ, SS3, ....
 MSSSSWLLLSLVAVTAAQSTIEEQ..., CCCHHHHHHHHHHHHCCCCCCHHHHHHH.....
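For illustration, a file already in that comma-separated shape loads directly with read_csv. This is only a sketch: the two fields below are hypothetical, truncated to the first 10 characters of the SEQ and SS3 strings, and are read from memory rather than from a file.

```python
import io
import pandas as pd

# Hypothetical comma-separated version of the data (10 characters per field).
csv_text = "SEQ,SS3\nMSSSSWLLLS,CCCHHHHHHH\n"

# The first line becomes the header, the second the only data row.
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())
print(df.loc[0, "SEQ"])
```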

You can use this script to load the .txt file into a DataFrame and save it as a CSV file:

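import pandas as pd


data = {}
with open('<your file.txt>', 'r') as f_in:
    for line in f_in:
        parts = line.split()
        # keep only "LABEL VALUE" lines; the unlabelled digit lines are skipped
        if len(parts) == 2:
            data[parts[0]] = [parts[1]]

# one column per label, a single row holding the strings
df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)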

This saves a CSV with one column per label and a single row containing the strings.
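The script above keeps only the labelled lines. If the unlabelled digit lines are wanted as a second row, one possible variant is sketched below. The helper name parse_labelled_lines is hypothetical, and the sketch assumes every labelled line has exactly two whitespace-separated fields and every continuation line has exactly one.

```python
import io
import pandas as pd

def parse_labelled_lines(lines):
    """Collect each label's value plus any unlabelled lines that follow it."""
    columns = {}
    current = None
    for raw in lines:
        parts = raw.split()
        if len(parts) == 2:
            # "LABEL VALUE" line: start a new column.
            current = parts[0]
            columns[current] = [parts[1]]
        elif len(parts) == 1 and current is not None:
            # Unlabelled line: belongs to the most recent label.
            columns[current].append(parts[0])
    # Pad shorter columns with None so all columns have equal length.
    height = max((len(v) for v in columns.values()), default=0)
    for values in columns.values():
        values.extend([None] * (height - len(values)))
    return pd.DataFrame(columns)

# Hypothetical sample: first 10 characters of three records from the question.
sample = io.StringIO(
    "SEQ  MSSSSWLLLS\n"
    "SS3  CCCHHHHHHH\n"
    "     9502444565\n"
    "SS8  CCHHHHHHHH\n"
    "     9613444555\n"
)
df_rows = parse_labelled_lines(sample)
print(df_rows)
```

Because the values are collected as Python strings, the digit lines stay text and keep any leading zeros; df_rows.to_csv('data.csv', index=False) would then write both rows.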

