Python Pandas read_csv to dataframe without separator

I'm new to the Pandas library.
I have shared code that works off of a dataframe.

Is there a way to read a gzip file line by line without any delimiter (use the full line, which can include commas and other characters) as a single row and use it in the dataframe? It seems that you have to provide a delimiter, and when I provide "\n" it is able to read the file, but error_bad_lines complains with something like "Skipping line xxx: expected 22 fields but got 23", since each line is different.

I want it to treat each line as a single row in the dataframe. How can this be achieved? Any tips would be appreciated.

If you just want each line to be one row and one column, then don't use read_csv. Just read the file line by line and build the data frame from it.

You could do this manually by creating an empty data frame with a single column header, then iterating over each line in the file and appending it to the data frame.

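# Explicitly iterate over each line in the file, appending it to the df.
# DataFrame.append was removed in pandas 2.0, so each row is added with .loc instead.
import pandas as pd
with open("query4.txt") as myfile:
    df = pd.DataFrame([], columns=['line'])
    for line in myfile:
        # len(df) is the next free integer label, so this adds one new row
        df.loc[len(df)] = [line]
    print(df)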

This will work for large files, since we process only one line at a time while building the dataframe and never hold a second copy of the file in memory. It probably isn't the most efficient approach, because growing the dataframe one row at a time involves a lot of copying, but it certainly works.

However, we can do this more cleanly, since the pandas dataframe can take an iterable as the input for data.

#create a list to feed the data to the dataframe.
import pandas as pd
with open("query4.txt") as myfile:
    mydata = [line for line in myfile]
    df = pd.DataFrame(mydata, columns=['line'])
    print(df)

Here we read all the lines of the file into a list and then pass the list to pandas to create the data frame from. The downside is that if our file is very large, we essentially have two copies of it in memory: one in the list and one in the data frame.

Since pandas accepts an iterable for the data, we can use a generator expression that feeds each line of the file to the data frame. The data frame is then built by reading the lines one at a time from the file, without us creating an intermediate list ourselves (pandas may still collect the lines internally before building the frame, so the memory saving is not guaranteed).

#create a generator to feed the data to the dataframe.
import pandas as pd
with open("query4.txt") as myfile:
    mydata = (line for line in myfile)
    df = pd.DataFrame(mydata, columns=['line'])
    print(df)
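
The question mentions a gzip file, whereas the examples here open a plain text file. As a minimal sketch, the generator approach should carry over by swapping open for gzip.open in text mode, which is part of the Python standard library. The helper name gzip_lines_to_frame, the demo file name and its contents are assumptions made up for illustration, not part of the original answer.

```python
import gzip
import os
import tempfile

import pandas as pd


def gzip_lines_to_frame(path, column="line"):
    """Hypothetical helper: one DataFrame row per line of a gzip text file."""
    # mode "rt" decompresses and decodes to text, so iteration yields str lines
    with gzip.open(path, mode="rt", encoding="utf-8") as fh:
        return pd.DataFrame((line for line in fh), columns=[column])


# Demo with a throwaway file; the name and contents are invented for this sketch.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "query4.txt.gz")
    with gzip.open(demo_path, mode="wt", encoding="utf-8") as out:
        out.write("a,b,c\nonly one field\nx,y\n")
    df = gzip_lines_to_frame(demo_path)
    print(df)
```

Lines with different numbers of commas cause no trouble here, because nothing ever splits them into fields.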

In all three cases there is no need to use read_csv, since the data you want to load isn't a CSV. Each solution produces the same data frame output.

SOURCE DATA

this is some data
this is other data
data is fun
data is weird
this is the 5th line

DATA FRAME

                   line
0   this is some data\n
1  this is other data\n
2         data is fun\n
3       data is weird\n
4  this is the 5th line
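
As the output shows, every row except the last keeps its trailing \n, because iterating over a file yields each line together with its terminator. If that is unwanted, one possible sketch strips it while building the frame. Here io.StringIO stands in for the opened file, an assumption made only so the example runs on its own.

```python
import io

import pandas as pd

# io.StringIO stands in for the opened file so the sketch is self-contained.
source = io.StringIO(
    "this is some data\n"
    "this is other data\n"
    "data is fun\n"
    "data is weird\n"
    "this is the 5th line"
)

# rstrip("\n") removes only the trailing newline and leaves commas,
# spaces and other characters inside the line untouched.
df = pd.DataFrame((line.rstrip("\n") for line in source), columns=["line"])
print(df)
```

The same rstrip call can be dropped into the list or generator versions when reading from a real file.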
