I'm looking to convert tabular data into CSV, but I'm hitting a roadblock when the table has rows with missing values. The input looks like the following:
systemd 1 root cwd DIR 8|1 4096 2 /
systemd 1 root rtd DIR 8|1 4096 2 /
systemd 1 root txt REG 8|1 1612152 101375 /lib/systemd/systemd
systemd 1 root mem REG 8|1 1700792 26009 /lib/x86_64-linux-gnu/libm-2.27.so
systemd 1 root mem REG 8|1 121016 1715 /lib/x86_64-linux-gnu/libudev.so.1.6.9
node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1
node 697 698 user2 rtd DIR 8|1 4096 2 /
node 697 698 user1 txt REG 8|33 43680144 8003081 /home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node 697 698 user1 mem REG 8|1 101168 26021 /lib/x86_64-linux-gnu/libresolv-2.27.so
node 697 698 user1 mem REG 8|1 26936 26014 /lib/x86_64-linux-gnu/libnss_dns-2.27.so
I want to convert this into a CSV with the number of columns preserved. The output should look something like this:
systemd,1,,root,cwd,DIR,8|1,4096,2,/
systemd,1,,root,rtd,DIR,8|1,4096,2,/
systemd,1,,root,txt,REG,8|1,1612152,101375,/lib/systemd/systemd
systemd,1,,root,mem,REG,8|1,1700792,26009,/lib/x86_64-linux-gnu/libm-2.27.so
systemd,1,,root,mem,REG,8|1,121016,1715,/lib/x86_64-linux-gnu/libudev.so.1.6.9
node,697,698,user1,cwd,DIR,8|33,4096,7995393,/home/user1
node,697,698,user2,rtd,DIR,8|1,4096,2,/
node,697,698,user1,txt,REG,8|33,43680144,8003081,/home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node,697,698,user1,mem,REG,8|1,101168,26021,/lib/x86_64-linux-gnu/libresolv-2.27.so
node,697,698,user1,mem,REG,8|1,26936,26014,/lib/x86_64-linux-gnu/libnss_dns-2.27.so
So far I've tried using pandas' read_fwf function and then converting the result to CSV, but it doesn't account for the missing column value: instead of getting 10 values for every row in the CSV, I'm getting only the 9 visible ones. The same thing happens with pandas' read_table function. I also tried regex patterns, but since I can't expect the table format to be the same every time, scaling the code to handle more tables becomes a problem.
Any method to solve this is highly appreciated. Thanks a lot!
You can make the problem smaller by splitting your data into valid and invalid rows: valid rows have the expected number of columns, and invalid rows have one or more columns missing. I'm not sure you can fully automate this without knowing the exact delimiter between columns.
You mention that spaces can occur in description columns. You can't really tell whether user1 cwd is two separate columns or a space inside a single column, so rows like that would end up in the invalid list unless a missing value happens to "balance" them out. This is pretty brittle, so it's best to either ensure a proper delimiter or make sure there are no spaces in your column values.
from io import StringIO
import pandas as pd
import re
data = StringIO("""
systemd 1 root cwd DIR 8|1 4096 2 /
systemd 1 root rtd DIR 8|1 4096 2 /
systemd 1 root txt REG 8|1 1612152 101375 /lib/systemd/systemd
systemd 1 root mem REG 8|1 1700792 26009 /lib/x86_64-linux-gnu/libm-2.27.so
systemd 1 root mem REG 8|1 121016 1715 /lib/x86_64-linux-gnu/libudev.so.1.6.9
node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1
node 697 698 user2 rtd DIR 8|1 4096 2 /
node 697 698 user1 txt REG 8|33 43680144 8003081 /home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node 697 698 user1 mem REG 8|1 101168 26021 /lib/x86_64-linux-gnu/libresolv-2.27.so
node 697 698 user1 mem REG 8|1 26936 26014 /lib/x86_64-linux-gnu/libnss_dns-2.27.so
""")
valid_rows = []
invalid_rows = []
num_of_columns = 10
for line in data.readlines():
    # each line ends with a newline, which \s+ also matches,
    # so a full 10-column row yields exactly 10 separator runs
    if len(re.findall(r"\s+", line)) == num_of_columns:
        valid_rows.append(line)
    else:
        invalid_rows.append(line)

# sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas
df = pd.read_csv(StringIO("".join(valid_rows)), sep=r"\s+", names=range(num_of_columns))
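If the short rows in your data always omit the same field (in the sample, the systemd rows lack the third, thread-ID column), you can pad them instead of discarding them. This is a minimal sketch under that assumption; `missing_at` is a hypothetical parameter you would set per table, and it still breaks if column values can contain spaces:

```python
import csv
from io import StringIO

# abbreviated sample: one short row, one full row
data_text = """\
systemd 1 root cwd DIR 8|1 4096 2 /
node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1
"""

num_of_columns = 10
missing_at = 2  # hypothetical: index where the absent field belongs

rows = []
for line in data_text.splitlines():
    fields = line.split()
    if not fields:  # skip blank lines
        continue
    if len(fields) == num_of_columns - 1:
        fields.insert(missing_at, "")  # pad the short row with an empty value
    rows.append(fields)

out = StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

This prints the CSV with every row at 10 columns, e.g. `systemd,1,,root,cwd,DIR,8|1,4096,2,/`. Rows that are short by more than one field, or short for a different reason, would still need manual inspection.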