Creating a DataFrame from a .txt file
I need some wisdom here!
I'm trying to create a script that takes two .txt files with the same format, appends one to the other, and then creates a DataFrame from the resulting file so I can manipulate it.
The files are inventory results, but they are a little bit messy.
From these files I only need the product rows, nothing more. To get them I'm using:
listados = ["analisis_diferencias.txt", "no_contadas.txt"]

def unir_listados(listados):
    with open("df_final.txt", "w+") as merge:
        for item in listados:
            with open(item) as readable:
                for line in readable:
                    if line[4] in ["1","2","3","4","5","6","7","8","9"]:
                        merge.write(line)
The result is a new .txt that looks perfect, since it only contains the lines where there is a product code.
But I just can't turn it into a normal DataFrame, or any other structure that has columns.
The farthest I've gotten is a single-column df created with pd.read_table, with no idea how to split each row into columns.
I tried replacing the whitespace with ";" so I could later delete the empty columns it would generate, but then I got a huge list of one row and more than 6k columns...
I also tried replacing it with "\t", but nothing.
The pd.read_csv method isn't working either:
a = pd.read_csv("df_final.txt", header=None, encoding="latin-1")
ParserError: Expected 18 fields in line 3, saw 19. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I've seen a solution online that, instead of creating a new .txt, builds a new df value by value while parsing every line of the original .txt.
But I understand there should be a simpler method once the data is laid out as I have it right now.
Thanks in advance for any help you can provide.
PS: BTW, when appending the lines to my new .txt, if I used str([1,2,3,4,5,6,7,8,9]) it would select every single row, since it detected that the empty string char "" was in the array. Any idea on this?
EDIT:
I added some rows of the final .txt, as requested.
68.17.28 D-AA SPLIT HAIER TUNDRA AS-18 ] 0 1 1 562,00 562,000
42.50.10 Z-CAMARA INSTANT. FUJI INSTAX ] 1 3 2 111,80 55,900
54.15.88 Z-CAMARA INSTANT. FUJI INSTAX ] 2 2 0 0,00 59,900
67.05.04 A-CAMARA INSTANT. FUJI INSTAX ] 1 1 0 0,00 54,500
72.29.13 C-CAMARA INSTANT. FUJI INSTAX ] 1 1 0 0,00 121,950
21.08.75 D-MEMORIA MICRO SD ULTRA SANDI] 7 7 0 0,00 15,699
21.09.35 B-MEMORIA MICRO SD ULTRA SANDI] 16 16 0 0,00 3,616
21.09.70 D-MEMORIA MICRO SD ULTRA SANDI] 11 23 12 56,18 4,682
21.11.33 D-MEMORIA MICRO SD ULTRA SANDI] 4 4 0 0,00 7,830
23.36.92 A-MICROSD SAMSUNG EVO 32GB(MB-] 9 9 0 0,00 6,811
Without a sample of the text file, it is hard to know for sure. But could you try:
pd.read_table("df_final.txt", sep=r'\s+', header=None, encoding="latin-1")
This should split the file into columns on any run of whitespace. Note that a description containing spaces would then be split across several columns as well.
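As for the PS in the question, a small sketch of what is most likely happening: str([1,2,...,9]) is the text of the list, brackets, commas and spaces included, and `in` on a string is a substring test. Any line whose fifth character is a space, a comma or a bracket therefore passes the filter. The helper name `starts_product_code` and the `ALLOWED` set are made up for this illustration.

```python
# The string form of the list contains more than the digits.
digits_as_text = str([1, 2, 3, 4, 5, 6, 7, 8, 9])

# `in` on a string checks for a substring, so a single space matches too.
space_matches = " " in digits_as_text
comma_matches = "," in digits_as_text

# A set of the allowed characters avoids the accidental matches.
ALLOWED = set("123456789")

def starts_product_code(line, position=4):
    """Return True if the character at `position` is a digit from 1 to 9.

    Hypothetical helper: it also guards against lines that are too short,
    which would otherwise raise IndexError on line[position].
    """
    return len(line) > position and line[position] in ALLOWED
```

With this check, a header or blank-padded line no longer slips through just because its fifth character is a space.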
How about using the delimiter \s{2,}? Inside a description such as D-AA SPLIT HAIER TUNDRA AS-18 ] the words are separated by only 1 space, so it stays in one column.
df = pd.read_csv(file, sep=r'\s{2,}', header=None, engine='python')
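A minimal sketch of that call on in-memory data. The real column spacing is not visible in the post, so the sample below assumes at least two spaces between columns and single spaces inside the description. Treating the comma as a decimal separator in the last two columns is also an assumption.

```python
import io
import pandas as pd

# Assumed layout: two or more spaces between columns (not confirmed by the post).
sample = (
    "68.17.28  D-AA SPLIT HAIER TUNDRA AS-18 ]  0  1  1  562,00  562,000\n"
    "21.09.70  D-MEMORIA MICRO SD ULTRA SANDI]  11  23  12  56,18  4,682\n"
)

# A regex separator requires the python engine.
df = pd.read_csv(io.StringIO(sample), sep=r"\s{2,}", header=None, engine="python")

# Assumption: "562,00" means 562.00, so swap the comma for a dot before casting.
for col in (5, 6):
    df[col] = df[col].str.replace(",", ".", regex=False).astype(float)
```

If the product code and the description turn out to be separated by a single space in the real file, they would land in the same column and need a second split.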
Another way:
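import re
import pandas as pd

# read the file into a single column, one line per row
# (sep='\n' is rejected by recent pandas versions, so read the lines directly)
with open(file, encoding="latin-1") as fh:
    obj = pd.Series(fh.read().splitlines())

def handle_row(row):
    row_list = re.split(r'\s+', row.strip())
    # the first 2 columns: product code and description
    prt1 = ' '.join(row_list[:-5]).split(' ', maxsplit=1)
    # the last 5 columns
    prt2 = row_list[-5:]
    return prt1 + prt2

df = pd.DataFrame(obj.map(handle_row).tolist())
print(df)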
0 1 2 3 4 5 6
0 68.17.28 D-AA SPLIT HAIER TUNDRA AS-18 ] 0 1 1 562,00 562,000
1 42.50.10 Z-CAMARA INSTANT. FUJI INSTAX ] 1 3 2 111,80 55,900
2 54.15.88 Z-CAMARA INSTANT. FUJI INSTAX ] 2 2 0 0,00 59,900
3 67.05.04 A-CAMARA INSTANT. FUJI INSTAX ] 1 1 0 0,00 54,500
4 72.29.13 C-CAMARA INSTANT. FUJI INSTAX ] 1 1 0 0,00 121,950
5 21.08.75 D-MEMORIA MICRO SD ULTRA SANDI] 7 7 0 0,00 15,699
6 21.09.35 B-MEMORIA MICRO SD ULTRA SANDI] 16 16 0 0,00 3,616
7 21.09.70 D-MEMORIA MICRO SD ULTRA SANDI] 11 23 12 56,18 4,682
8 21.11.33 D-MEMORIA MICRO SD ULTRA SANDI] 4 4 0 0,00 7,830
9 23.36.92 A-MICROSD SAMSUNG EVO 32GB(MB-] 9 9 0 0,00 6,811
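The same idea can be sketched without a regex by splitting each line from the right, which works however many spaces sit between the columns. The function name `parse_row` is made up for this example, and the numeric conversion assumes that the three middle columns are integers and that the comma in the last two is a decimal separator.

```python
def parse_row(line):
    """Split one inventory line into code, description and five numbers."""
    # The last five whitespace-separated tokens are the numeric columns.
    head, *numbers = line.rsplit(None, 5)
    # The first token is the product code, the remainder is the description.
    code, description = head.split(None, 1)
    counts = [int(n) for n in numbers[:3]]
    # Assumption: "562,00" means 562.00 (comma as decimal separator).
    amounts = [float(n.replace(",", ".")) for n in numbers[3:]]
    return [code, description, *counts, *amounts]

# Two lines taken from the sample in the question.
sample_lines = [
    "68.17.28 D-AA SPLIT HAIER TUNDRA AS-18 ] 0 1 1 562,00 562,000\n",
    "21.09.70 D-MEMORIA MICRO SD ULTRA SANDI] 11 23 12 56,18 4,682\n",
]
rows = [parse_row(line) for line in sample_lines]
```

The resulting list of lists can be passed straight to pd.DataFrame(rows), optionally with a columns= argument once the real column names are known.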