Converting Text Tables Into CSVs in Python

Question

I'm looking to convert tabular data into CSVs, but I'm hitting a roadblock when the table has rows with certain missing values. Input looks like the following table,

systemd       1                   root  cwd       DIR                8|1      4096          2 /
systemd       1                   root  rtd       DIR                8|1      4096          2 /
systemd       1                   root  txt       REG                8|1   1612152     101375 /lib/systemd/systemd
systemd       1                   root  mem       REG                8|1   1700792      26009 /lib/x86_64-linux-gnu/libm-2.27.so
systemd       1                   root  mem       REG                8|1    121016       1715 /lib/x86_64-linux-gnu/libudev.so.1.6.9
node        697   698             user1 cwd       DIR               8|33      4096    7995393 /home/user1
node        697   698             user2 rtd       DIR                8|1      4096          2 /
node        697   698             user1 txt       REG               8|33  43680144    8003081 /home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node        697   698             user1 mem       REG                8|1    101168      26021 /lib/x86_64-linux-gnu/libresolv-2.27.so
node        697   698             user1 mem       REG                8|1     26936      26014 /lib/x86_64-linux-gnu/libnss_dns-2.27.so

I want to convert this into a CSV with the number of columns preserved, the output should look something like,

systemd,1,,root,cwd,DIR,8|1,4096,2,/
systemd,1,,root,rtd,DIR,8|1,4096,2,/
systemd,1,,root,txt,REG,8|1,1612152,101375,/lib/systemd/systemd
systemd,1,,root,mem,REG,8|1,1700792,26009,/lib/x86_64-linux-gnu/libm-2.27.so
systemd,1,,root,mem,REG,8|1,121016,1715,/lib/x86_64-linux-gnu/libudev.so.1.6.9
node,697,698,user1,cwd,DIR,8|33,4096,7995393,/home/user1
node,697,698,user2,rtd,DIR,8|1,4096,2,/
node,697,698,user1,txt,REG,8|33,43680144,8003081,/home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node,697,698,user1,mem,REG,8|1,101168,26021,/lib/x86_64-linux-gnu/libresolv-2.27.so
node,697,698,user1,mem,REG ,8|1,26936,2601,/lib/x86_64-linux-gnu/libnss_dns-2.27.so

So far I've tried it using pandas read_fwf function and then converting it to CSV, but it's not evaluating the missing column value. So instead of getting 10 values for every row in the CSV, I'm getting only the visible 9. The same thing happens while using pandas read_table function as well. I also tried using Regex Patterns but I'm not expecting the table format to be same every time, upscaling the code to incorporate more tables becomes a problem

Any method to solve this problem is highly appreciated. Thanks a lot!

Answer 1

You can make the problem smaller by splitting your data into valid and invalid rows. Valid rows will have the expected number of columns and invalid will have one or more columns missing. Not sure if you can automate this fully without knowing the exact delimiter between columns.

You mention that spaces can occur in description columns. You can't really differentiate between user1 cwd which are two separate columns and a space inside a single column. Rows like that would be put into invalid list unless they happen to have a missing value to "balance" it out. It's pretty brittle so best to either make sure you have a proper delimiter or that at least there are no spaces in your column values.

from io import StringIO
import pandas as pd
import re

data = StringIO("""
systemd       1                   root  cwd       DIR                8|1      4096          2 /
systemd       1                   root  rtd       DIR                8|1      4096          2 /
systemd       1                   root  txt       REG                8|1   1612152     101375 /lib/systemd/systemd
systemd       1                   root  mem       REG                8|1   1700792      26009 /lib/x86_64-linux-gnu/libm-2.27.so
systemd       1                   root  mem       REG                8|1    121016       1715 /lib/x86_64-linux-gnu/libudev.so.1.6.9
node        697   698             user1 cwd       DIR               8|33      4096    7995393 /home/user1
node        697   698             user2 rtd       DIR                8|1      4096          2 /
node        697   698             user1 txt       REG               8|33  43680144    8003081 /home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node        697   698             user1 mem       REG                8|1    101168      26021 /lib/x86_64-linux-gnu/libresolv-2.27.so
node        697   698             user1 mem       REG                8|1     26936      26014 /lib/x86_64-linux-gnu/libnss_dns-2.27.so
""")

valid_rows = []
invalid_rows = []
num_of_columns = 10

for line in data.readlines():
    # note that in your data there is a new line
    # at the end of each line which is also captured by \s
    if len(re.findall(r"\s+", line)) == num_of_columns:
        valid_rows.append(line)
    else:
        invalid_rows.append(line)        

df = pd.read_csv(StringIO("".join(valid_rows)), delim_whitespace=True, names=range(10))

Converting Text Tables Into CSVs in Python

Question

1 answers

solution1
0 2020-10-18 21:41:12

Converting Text Tables Into CSVs in Python

Question

1 answers

solution1 0 2020-10-18 21:41:12

solution1
0 2020-10-18 21:41:12