
Python Programming Error for DataScience DataFrame

I am reading my data from a CSV file using pandas, and it works well as long as I read no more than 700 rows. As soon as I go above 700 and try to append to a Python list, I get "list index out of range". The CSV has around 500K rows. Can anyone help me understand why this is happening? Thanks in advance.

import pandas as pd

df_email = pd.read_csv('emails.csv',nrows=800)
test_email = df_email.iloc[:,-1]


list_of_emails = []

for i in range(len(test_email)):    
    var_email = test_email[i].split("\n") #this code takes one single email splits based on a new line giving a python list of all the strings in the email


    email = {}
    message_body = ''

    for _ in var_email:
        if ":" in _:
            var_sentence = _.split(":") #this part actually uses the ":" to find the elements in the list that have ":" present

            for j in range(len(var_sentence)):           
                if var_sentence[j].lower().strip() == "from":
                    email['from'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif  var_sentence[j].lower().strip() == "to":  
                    email['to'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif var_sentence[j].lower().strip() == 'subject':
                    if var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip() == 're':
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+2])].lower().strip()
                    else:
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()

        elif ":" not in _:
            message_body += _.strip()
            email['body'] = message_body

    list_of_emails.append(email)

I am not sure what you are trying to say here (it would help to add example inputs and outputs), but I came across a problem a few weeks ago that might be of the same nature.

CSV files are comma-separated, which means the parser treats every comma in a line as a column boundary. If some string fields in your CSV file contain dirty input, such as stray unquoted commas, the columns will no longer line up the way you expect.
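As a small illustration of that point, here is a sketch using Python's standard `csv` module. The two-column data (`id`, `message`) is made up for the example and is not taken from the asker's file.

```python
import csv
import io

# Hypothetical data: an id column and a free-text message column.
unquoted = "id,message\n1,Hello, world\n"
quoted = 'id,message\n1,"Hello, world"\n'

rows_unquoted = list(csv.reader(io.StringIO(unquoted)))
rows_quoted = list(csv.reader(io.StringIO(quoted)))

# The comma inside the unquoted text is read as a column boundary,
# so the data row has one more field than the header.
print(len(rows_unquoted[0]), len(rows_unquoted[1]))

# With quotes around the text, the row keeps the expected two fields.
print(len(rows_quoted[0]), len(rows_quoted[1]))
```

If a row ends up with more or fewer fields than the header, taking "the last column" may return something other than the text you expected.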

The best solution here is to write some code that cleans up your CSV file, change its delimiter to another character (probably '|', '&', or anything else that does not occur in the data), and revise your code to reflect these changes to the CSV.
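A minimal sketch of the delimiter change, using only the standard `csv` module. The function name `change_delimiter` and the sample data are assumptions made for illustration, and the sketch only helps if text fields in the source are already quoted correctly; truly malformed rows still need to be cleaned by hand first.

```python
import csv
import io


def change_delimiter(src, dst, new_delimiter="|"):
    """Copy CSV rows from src to dst, writing them with new_delimiter."""
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter=new_delimiter, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# In-memory stand-ins for the real input and output files.
source = io.StringIO('id,message\n1,"Hello, world"\n')
target = io.StringIO()
change_delimiter(source, target)
print(target.getvalue())
```

With real files you would pass handles opened with `newline=""`, and then read the result in pandas with `pd.read_csv(path, sep="|")`.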

Use the pandas library to read the file.

It is very efficient and saves you the time of writing the parsing code yourself.

For example:

import pandas as pd
training_data = pd.read_csv( "train.csv", sep = ",", header = None )
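Reading the file is only one part of the question, though. In the loop from the question, `var_sentence[j+1]` and `var_sentence[j+2]` are looked up whenever a segment equals "from", "to" or "subject", even when that segment is the last one on the line. A line such as `please reply: to` splits into two segments, the second one matches "to", and `var_sentence[j+1]` then raises IndexError. Whether this is what happens past row 700 cannot be confirmed without the data, so treat the following as a sketch of a more defensive parser rather than a confirmed fix. The function name `parse_email` is made up here, and it deliberately simplifies the original: only the text before the first ":" on a line is checked as a header name.

```python
def parse_email(raw):
    """Split one raw email string into from/to/subject/body fields."""
    email = {}
    body_parts = []
    for line in raw.split("\n"):
        # partition never raises: sep is "" when the line has no colon.
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in ("from", "to", "subject"):
            value = value.strip().lower()
            # Drop a leading "re:" from subjects, as the original loop intends.
            if key == "subject" and value.startswith("re:"):
                value = value[3:].strip()
            email[key] = value
        elif not sep:
            # Lines without a colon are treated as body text, as in the question.
            body_parts.append(line.strip())
    email["body"] = "".join(body_parts)
    return email


sample = "From: A@x.com\nTo: b@x.com\nSubject: Re: Hello\n\nHi there\nplease reply: to"
print(parse_email(sample))
```

With the column from the question this could be applied as `list_of_emails = [parse_email(text) for text in test_email]`, assuming every entry in that column is a string.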


 