Import data from a .txt file using pandas when line breaks split the columns

Question

I am importing the Boston Housing data into a pandas DataFrame. The last 3 items of every row are wrapped onto the next line. Is there a way to import the data using pd.read_csv so that these overflow items are included? Here is my code:

import pandas as pd
path = '/Users/Main/Desktop/boston.txt'
df = pd.read_csv(path, skiprows=21, sep=r'\s+', header=None)

This gives me a DataFrame with 11 columns, but I need 14. Also, is there a better way to skip all the text at the top of the file without manually counting the rows?

Answer 1

First of all, you can just use the Boston housing dataset from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html . Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this only works on older versions. If you still want to use the text file, unfortunately I think you will have to do some processing on it to remove the line breaks. Here is an example of the kind of processing needed.

# Read the file and split it into lines.
with open('boston.txt', 'r') as f:
    text = f.readlines()

# Starting from the first row of data, strip the trailing newline from every
# even-numbered line and append the following (overflow) line to it.
start_row = 22
data = text[start_row:]
new_rows = []
for i in range(0, len(data) - 1, 2):
    newl = data[i].rstrip('\n') + data[i + 1]
    new_rows.append(newl)

new_data = ''.join(new_rows)

# finally save the data.
with open('boston_new.txt', 'w') as f:
    f.write(new_data)

Now you can read the data easily. Passing delim_whitespace=True is equivalent to using sep=r'\s+'; newer pandas releases (2.2 and later) deprecate delim_whitespace in favor of the sep form.

col_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df = pd.read_csv('boston_new.txt', delim_whitespace=True, header=None, names=col_names)

After doing this once, you should save the data in a proper .csv format that pandas can read without so many parameters. Note that to_csv is a DataFrame method, not a top-level pandas function.

df.to_csv('boston_final.csv', index=False)
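
The question also asks how to skip the descriptive text without counting rows. One possible approach, sketched below, is to keep only the lines whose tokens all parse as numbers, join them in pairs, and parse the result in memory. The helper names is_numeric_line and read_wrapped are made up for this illustration, the sample lines are invented values in the same layout rather than real data, and the sketch assumes that no line of the header text consists solely of numbers.

```python
import io

import pandas as pd

COL_NAMES = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
             'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']


def is_numeric_line(line):
    """Return True if the line is non-empty and every token parses as a float."""
    tokens = line.split()
    if not tokens:
        return False
    try:
        for tok in tokens:
            float(tok)
    except ValueError:
        return False
    return True


def read_wrapped(lines, col_names=COL_NAMES):
    """Join each pair of physical lines into one record and parse the result."""
    # Drop everything that is not purely numeric (header text, blank lines).
    data = [ln.rstrip('\n') for ln in lines if is_numeric_line(ln)]
    # Line 0 + line 1 form record 0, line 2 + line 3 form record 1, and so on.
    merged = [a + ' ' + b for a, b in zip(data[0::2], data[1::2])]
    return pd.read_csv(io.StringIO('\n'.join(merged)), sep=r'\s+',
                       header=None, names=col_names)


# Tiny invented sample in the same layout (not real Boston values).
sample = [
    "Some descriptive header text\n",
    " Variables in order:\n",
    "\n",
    " 0.1 18.0 2.3 0 0.5 6.5 65.2 4.09 1 296.0 15.3\n",
    "  396.9 4.98 24.0\n",
    " 0.2 0.0 7.0 0 0.4 6.4 78.9 4.96 2 242.0 17.8\n",
    "  396.9 9.14 21.6\n",
]
df = read_wrapped(sample)
print(df.shape)
```

With a real file you would pass the open file object (or f.readlines()) instead of the sample list. This avoids both the hard-coded start row and the intermediate boston_new.txt file.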

Answer 2

I ended up trying the same idea, appending each overflow line to the line before it.

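import pandas as pd

# Add skiprows=... here if the file still contains its descriptive header.
boston = pd.read_csv("FILE_LOCATION", sep=r'\s+', header=None)

oklist = []

# Rows come in pairs: an 11-value line followed by a 3-value overflow line.
for row in range(0, len(boston) - 1, 2):
    rowa = boston.iloc[row]
    rowb = boston.iloc[row + 1]

    # Series.append was removed in pandas 2.0; pd.concat does the same job.
    new_row = pd.concat([rowa, rowb])
    # The first 14 entries are the 11 values plus the 3 overflow values.
    clean_list = new_row.iloc[0:14].tolist()
    oklist.append(clean_list)

pd.DataFrame(oklist)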
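
The same pairing can also be done without a Python loop. The sketch below is one possible vectorized variant, not part of the original answer: it slices the even and odd rows with NumPy and stacks them side by side. The sample text is invented for illustration, and the approach assumes the parsed frame has an even number of rows, with each 11-value line followed by its 3-value overflow line.

```python
import io

import numpy as np
import pandas as pd

col_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
             'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Invented sample in the wrapped layout (header text already removed).
sample = io.StringIO(
    " 0.1 18.0 2.3 0 0.5 6.5 65.2 4.09 1 296.0 15.3\n"
    "  396.9 4.98 24.0\n"
    " 0.2 0.0 7.0 0 0.4 6.4 78.9 4.96 2 242.0 17.8\n"
    "  396.9 9.14 21.6\n"
)

# Short overflow lines are padded with NaN, so the raw frame has 11 columns.
raw = pd.read_csv(sample, sep=r'\s+', header=None)

values = raw.to_numpy()
# Even rows hold the first 11 values; odd rows hold the last 3 in columns 0-2.
wide = np.hstack([values[0::2, :], values[1::2, :3]])
result = pd.DataFrame(wide, columns=col_names)
print(result.shape)
```

Because the NaN padding forces every column to float, integer-like columns such as CHAS and RAD come out as floats here and can be cast back with astype if needed.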

Answer 1
0, accepted, 2018-10-11 05:24:43

Answer 2
0, 2019-04-26 16:36:47
