Converting Text Tables Into CSVs in Python
I'm looking to convert tabular data into CSVs, but I'm hitting a roadblock when the table has rows with certain missing values. The input looks like the following table:
systemd 1 root cwd DIR 8|1 4096 2 /
systemd 1 root rtd DIR 8|1 4096 2 /
systemd 1 root txt REG 8|1 1612152 101375 /lib/systemd/systemd
systemd 1 root mem REG 8|1 1700792 26009 /lib/x86_64-linux-gnu/libm-2.27.so
systemd 1 root mem REG 8|1 121016 1715 /lib/x86_64-linux-gnu/libudev.so.1.6.9
node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1
node 697 698 user2 rtd DIR 8|1 4096 2 /
node 697 698 user1 txt REG 8|33 43680144 8003081 /home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node 697 698 user1 mem REG 8|1 101168 26021 /lib/x86_64-linux-gnu/libresolv-2.27.so
node 697 698 user1 mem REG 8|1 26936 26014 /lib/x86_64-linux-gnu/libnss_dns-2.27.so
I want to convert this into a CSV with the number of columns preserved; the output should look something like:
systemd,1,,root,cwd,DIR,8|1,4096,2,/
systemd,1,,root,rtd,DIR,8|1,4096,2,/
systemd,1,,root,txt,REG,8|1,1612152,101375,/lib/systemd/systemd
systemd,1,,root,mem,REG,8|1,1700792,26009,/lib/x86_64-linux-gnu/libm-2.27.so
systemd,1,,root,mem,REG,8|1,121016,1715,/lib/x86_64-linux-gnu/libudev.so.1.6.9
node,697,698,user1,cwd,DIR,8|33,4096,7995393,/home/user1
node,697,698,user2,rtd,DIR,8|1,4096,2,/
node,697,698,user1,txt,REG,8|33,43680144,8003081,/home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node,697,698,user1,mem,REG,8|1,101168,26021,/lib/x86_64-linux-gnu/libresolv-2.27.so
node,697,698,user1,mem,REG,8|1,26936,26014,/lib/x86_64-linux-gnu/libnss_dns-2.27.so
So far I've tried using the pandas read_fwf function and then converting to CSV, but it doesn't account for the missing column value. So instead of getting 10 values for every row in the CSV, I'm getting only the visible 9. The same thing happens when using the pandas read_table function. I also tried using regex patterns, but since I don't expect the table format to be the same every time, scaling the code to handle more tables becomes a problem.
Any method to solve this problem is highly appreciated. Thanks a lot!
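To make the failure mode concrete, here's a minimal sketch (no pandas involved) showing that splitting on whitespace simply yields 9 fields for the short rows, with no indication of which column is missing:

```python
rows = [
    "systemd 1 root cwd DIR 8|1 4096 2 /",  # one value missing
    "node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1",
]

# splitting on runs of whitespace gives 9 fields vs 10 --
# nothing marks where the missing value belongs
field_counts = [len(r.split()) for r in rows]
print(field_counts)  # [9, 10]
```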
You can make the problem smaller by splitting your data into valid and invalid rows. Valid rows will have the expected number of columns, and invalid rows will have one or more columns missing. I'm not sure you can automate this fully without knowing the exact delimiter between columns.
You mention that spaces can occur in description columns. You can't really tell whether
user1 cwd
is two separate columns or a single column containing a space. Rows like that would be put into the invalid list unless they happen to have a missing value to "balance" them out. It's pretty brittle, so it's best to either make sure you have a proper delimiter or at least that there are no spaces in your column values.
from io import StringIO
import pandas as pd
import re

data = StringIO("""
systemd 1 root cwd DIR 8|1 4096 2 /
systemd 1 root rtd DIR 8|1 4096 2 /
systemd 1 root txt REG 8|1 1612152 101375 /lib/systemd/systemd
systemd 1 root mem REG 8|1 1700792 26009 /lib/x86_64-linux-gnu/libm-2.27.so
systemd 1 root mem REG 8|1 121016 1715 /lib/x86_64-linux-gnu/libudev.so.1.6.9
node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1
node 697 698 user2 rtd DIR 8|1 4096 2 /
node 697 698 user1 txt REG 8|33 43680144 8003081 /home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node 697 698 user1 mem REG 8|1 101168 26021 /lib/x86_64-linux-gnu/libresolv-2.27.so
node 697 698 user1 mem REG 8|1 26936 26014 /lib/x86_64-linux-gnu/libnss_dns-2.27.so
""")

valid_rows = []
invalid_rows = []
num_of_columns = 10

for line in data.readlines():
    # note that in your data there is a newline at the end of each line,
    # which is also captured by \s+, so a valid 10-column row produces
    # 10 whitespace runs (9 separators plus the trailing newline)
    if len(re.findall(r"\s+", line)) == num_of_columns:
        valid_rows.append(line)
    else:
        invalid_rows.append(line)

df = pd.read_csv(StringIO("".join(valid_rows)), delim_whitespace=True, names=range(10))
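From there, one way to finish the job is to pad the short rows and write everything back out with the csv module. This is only a sketch: it assumes, as in the sample data, that an invalid row is missing exactly one value and that the missing value always belongs at the third column.

```python
import csv
import io

def pad_row(line, expected=10, missing_at=2):
    # assumption: a short row is missing exactly one value, at index 2
    # (where the sample's systemd rows are missing a value)
    fields = line.split()
    if len(fields) < expected:
        fields.insert(missing_at, "")
    return fields

out = io.StringIO()
writer = csv.writer(out)
for line in ["systemd 1 root cwd DIR 8|1 4096 2 /",
             "node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1"]:
    writer.writerow(pad_row(line))
print(out.getvalue())
# systemd,1,,root,cwd,DIR,8|1,4096,2,/
# node,697,698,user1,cwd,DIR,8|33,4096,7995393,/home/user1
```

The same loop can be pointed at the invalid_rows list collected above, with the valid rows written out unchanged.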