Importing data from .txt file with line breaks in between columns using pandas

I am importing the Boston Housing Data into a pandas DataFrame. The last 3 values of every record are wrapped onto the next line of the file. Is there a way to import the data with pd.read_csv so that these wrapped values end up in the same row? Here is my code:

import pandas as pd
path = '/Users/Main/Desktop/boston.txt'
df = pd.read_csv(path, skiprows=21, sep=r'\s+', header=None)

This gives me a DataFrame with 11 columns, but I need 14. Also, is there a better way to skip all the text at the top of the file without manually counting the rows?

First of all, you can just use the Boston housing dataset from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html (note that load_boston was removed in scikit-learn 1.2, so this needs an older version). If you still want to use the text file, unfortunately I think you will have to do some processing on it to remove the line breaks. Here is an example of the kind of processing needed.

# Read the file into a list of lines.
with open('boston.txt', 'r') as f:
    text = f.readlines()

# Starting from the first row of data, strip the trailing newline from
# every even-numbered line and append the line that follows it.
start_row = 22
data = text[start_row:]
new_rows = []
# Stepping by 2 and stopping at len(data) - 1 avoids an IndexError
# if the file ends with an unpaired line.
for i in range(0, len(data) - 1, 2):
    new_rows.append(data[i].rstrip('\n') + data[i + 1])

new_data = ''.join(new_rows)

# Finally, save the data.
with open('boston_new.txt', 'w') as f:
    f.write(new_data)

Now you can read the data easily. delim_whitespace=True is equivalent to sep='\s+' (newer pandas versions deprecate delim_whitespace in favour of sep).

col_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df = pd.read_csv('boston_new.txt', delim_whitespace=True, header=None, names=col_names)

After doing this once, you should save the data in a proper .csv format that pandas can read without so many parameters.

# to_csv is a DataFrame method, not a top-level pandas function.
df.to_csv('boston_final.csv', index=False)
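
On the second part of the question (skipping the header text without counting lines by hand), one possibility is to scan for the first line that contains nothing but numbers. The following is only a sketch: first_data_line is a hypothetical helper name, the sample lines are made up rather than copied from boston.txt, and it assumes that no header line consists solely of numbers.

```python
def first_data_line(lines):
    """Return the index of the first line made up solely of numbers."""
    for i, line in enumerate(lines):
        tokens = line.split()
        if not tokens:
            continue  # blank line, still part of the header
        try:
            for tok in tokens:
                float(tok)
        except ValueError:
            continue  # at least one token is not a number
        return i
    raise ValueError("no purely numeric line found")


# Made-up stand-in for the top of boston.txt (assumption, not the real file).
sample = [
    "The Boston house-price data\n",
    "\n",
    " CRIM     per capita crime rate by town\n",
    " 0.00632  18.00   2.310\n",
    " 396.90   4.98  24.00\n",
]
start_row = first_data_line(sample)
```

The returned index could then replace the hard-coded start_row, or be passed as skiprows to pd.read_csv.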

I ended up trying the same idea, appending each overflow line to the line before it.

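import pandas as pd

boston = pd.read_csv("FILE_LOCATION", sep=r'\s+', header=None)

oklist = []

# Rows come in pairs: 11 values on the first, the remaining 3 on the second.
for row in range(0, len(boston) - 1, 2):
    rowa = boston.iloc[row]
    rowb = boston.iloc[row + 1]

    # Series.append was removed in pandas 2.0; pd.concat does the same job.
    new_row = pd.concat([rowa, rowb])
    clean_list = new_row.iloc[0:14].tolist()
    oklist.append(clean_list)

pd.DataFrame(oklist)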
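
Since the line break in the middle of each record carries no information, another option is to ignore line boundaries entirely: split everything into whitespace-separated values and regroup them 14 at a time. This is a rough sketch rather than a tested recipe for the real file. regroup_records is a hypothetical helper name, the sample values are made up, and it assumes the header lines have already been dropped and that every record has exactly 14 numeric values.

```python
def regroup_records(lines, n_fields=14):
    """Flatten whitespace-separated values and regroup them n_fields at a time."""
    tokens = [float(tok) for line in lines for tok in line.split()]
    if len(tokens) % n_fields:
        raise ValueError(
            f"{len(tokens)} values do not divide evenly into rows of {n_fields}"
        )
    return [tokens[i:i + n_fields] for i in range(0, len(tokens), n_fields)]


# Two wrapped records in the layout described in the question:
# 11 values on one line, the last 3 on the next. Values are made up.
sample = [
    " 1 2 3 4 5 6 7 8 9 10 11\n",
    " 12 13 14\n",
    " 21 22 23 24 25 26 27 28 29 30 31\n",
    " 32 33 34\n",
]
records = regroup_records(sample)
```

The resulting list of lists can be handed to pd.DataFrame(records, columns=col_names) to get the 14-column frame. The length check is there so that a stray or truncated line raises an error instead of silently shifting every later row.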
