Converting Text Tables Into CSVs in Python
I'm looking to convert tabular data into CSVs, but I'm hitting a roadblock when the table has rows with certain missing values. The input looks like the following table:
systemd 1 root cwd DIR 8|1 4096 2 /
systemd 1 root rtd DIR 8|1 4096 2 /
systemd 1 root txt REG 8|1 1612152 101375 /lib/systemd/systemd
systemd 1 root mem REG 8|1 1700792 26009 /lib/x86_64-linux-gnu/libm-2.27.so
systemd 1 root mem REG 8|1 121016 1715 /lib/x86_64-linux-gnu/libudev.so.1.6.9
node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1
node 697 698 user2 rtd DIR 8|1 4096 2 /
node 697 698 user1 txt REG 8|33 43680144 8003081 /home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node 697 698 user1 mem REG 8|1 101168 26021 /lib/x86_64-linux-gnu/libresolv-2.27.so
node 697 698 user1 mem REG 8|1 26936 26014 /lib/x86_64-linux-gnu/libnss_dns-2.27.so
I want to convert this into a CSV with the number of columns preserved; the output should look something like:
systemd,1,,root,cwd,DIR,8|1,4096,2,/
systemd,1,,root,rtd,DIR,8|1,4096,2,/
systemd,1,,root,txt,REG,8|1,1612152,101375,/lib/systemd/systemd
systemd,1,,root,mem,REG,8|1,1700792,26009,/lib/x86_64-linux-gnu/libm-2.27.so
systemd,1,,root,mem,REG,8|1,121016,1715,/lib/x86_64-linux-gnu/libudev.so.1.6.9
node,697,698,user1,cwd,DIR,8|33,4096,7995393,/home/user1
node,697,698,user2,rtd,DIR,8|1,4096,2,/
node,697,698,user1,txt,REG,8|33,43680144,8003081,/home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node,697,698,user1,mem,REG,8|1,101168,26021,/lib/x86_64-linux-gnu/libresolv-2.27.so
node,697,698,user1,mem,REG,8|1,26936,26014,/lib/x86_64-linux-gnu/libnss_dns-2.27.so
So far I've tried using the pandas read_fwf function and then converting to CSV, but it doesn't account for the missing column value. So instead of getting 10 values for every row in the CSV, I'm getting only the visible 9. The same thing happens when using the pandas read_table function. I also tried using regex patterns, but since I don't expect the table format to be the same every time, scaling the code to handle more tables becomes a problem.
Any method to solve this problem is highly appreciated. Thanks a lot!
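To make the failure mode concrete, here's a minimal sketch (no pandas involved) showing that splitting on whitespace simply yields 9 fields for the short rows, with no indication of which column is missing:

```python
rows = [
    "systemd 1 root cwd DIR 8|1 4096 2 /",  # one value missing
    "node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1",
]

# splitting on runs of whitespace gives 9 fields vs 10 --
# nothing marks where the missing value belongs
field_counts = [len(r.split()) for r in rows]
print(field_counts)  # [9, 10]
```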
You can make the problem smaller by splitting your data into valid and invalid rows. Valid rows will have the expected number of columns, and invalid rows will have one or more columns missing. I'm not sure you can automate this fully without knowing the exact delimiter between columns.
You mention that spaces can occur in description columns. You can't really tell whether
user1 cwd
is two separate columns or a single column containing a space. Rows like that would be put into the invalid list unless they happen to have a missing value to "balance" them out. It's pretty brittle, so it's best to either make sure you have a proper delimiter or at least that there are no spaces in your column values.
from io import StringIO
import pandas as pd
import re

data = StringIO("""
systemd 1 root cwd DIR 8|1 4096 2 /
systemd 1 root rtd DIR 8|1 4096 2 /
systemd 1 root txt REG 8|1 1612152 101375 /lib/systemd/systemd
systemd 1 root mem REG 8|1 1700792 26009 /lib/x86_64-linux-gnu/libm-2.27.so
systemd 1 root mem REG 8|1 121016 1715 /lib/x86_64-linux-gnu/libudev.so.1.6.9
node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1
node 697 698 user2 rtd DIR 8|1 4096 2 /
node 697 698 user1 txt REG 8|33 43680144 8003081 /home/user1/.vscode-server/bin/26076a4de974ead31f97692a0d32f90d735645c0/node
node 697 698 user1 mem REG 8|1 101168 26021 /lib/x86_64-linux-gnu/libresolv-2.27.so
node 697 698 user1 mem REG 8|1 26936 26014 /lib/x86_64-linux-gnu/libnss_dns-2.27.so
""")

valid_rows = []
invalid_rows = []
num_of_columns = 10

for line in data.readlines():
    # note that in your data there is a newline at the end of each line,
    # which is also captured by \s+, so a valid 10-column row produces
    # 10 whitespace runs (9 separators plus the trailing newline)
    if len(re.findall(r"\s+", line)) == num_of_columns:
        valid_rows.append(line)
    else:
        invalid_rows.append(line)

df = pd.read_csv(StringIO("".join(valid_rows)), delim_whitespace=True, names=range(10))
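From there, one way to finish the job is to pad the short rows and write everything back out with the csv module. This is only a sketch: it assumes, as in the sample data, that an invalid row is missing exactly one value and that the missing value always belongs at the third column.

```python
import csv
import io

def pad_row(line, expected=10, missing_at=2):
    # assumption: a short row is missing exactly one value, at index 2
    # (where the sample's systemd rows are missing a value)
    fields = line.split()
    if len(fields) < expected:
        fields.insert(missing_at, "")
    return fields

out = io.StringIO()
writer = csv.writer(out)
for line in ["systemd 1 root cwd DIR 8|1 4096 2 /",
             "node 697 698 user1 cwd DIR 8|33 4096 7995393 /home/user1"]:
    writer.writerow(pad_row(line))
print(out.getvalue())
# systemd,1,,root,cwd,DIR,8|1,4096,2,/
# node,697,698,user1,cwd,DIR,8|33,4096,7995393,/home/user1
```

The same loop can be pointed at the invalid_rows list collected above, with the valid rows written out unchanged.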