
Read data from text format into Python Pandas dataframe

I am running Python 2.7 on Windows.

I have a large text file (2 GB) that refers to 500K+ emails. The file has no explicit file type and is in the format:

email_message#: 1
email_message_sent: 10/10/1991 02:31:01
From: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For |xyz company|
email_message#: 2
email_message_sent: 10/12/1991 01:28:12
From: timt@abc.com| Tim Tee |abc company|
To: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For|xyz company|
email_message#: 3
email_message_sent: 10/13/1991 12:01:16
From: benfor12@xyz.com| Ben For |xyz company|
To: tomfoo@abc.com| Tom Foo |abc company|
To: t212@123.com| Tatiana Xocarsky |numbers firm |
...

As you can see, each email has the following data associated with it:

1) the time it was sent

2) the email address that sent it

3) the name of the person who sent it

4) the company that person works for

5) every email address that received the email

6) the name of every person who received the email

7) the company of every person who received the email

In the text files there are 500K+ emails, and emails can have up to 16K recipients. There is no pattern in the emails in how they refer to names of people or the company they work at.

I would like to process this large file in Python so that it ends up as a pandas DataFrame, laid out like the Excel screenshot below:

[Image: sample data structure]

EDIT

My plan is to write a "parser" that reads the text file line by line and assigns the text in each line to a particular column of a pandas DataFrame.

I plan to write something like the code below. Can someone confirm that this is the right way to go about it? I want to make sure I am not missing a built-in pandas function or a function from a different module.

# open the file
data = open('.../Emails', 'r')

# build empty dataframe
import pandas as pd
df = pd.DataFrame()

# read the file line by line and put pieces of text into the
# correct column of the dataframe
# (iterate over the file directly; calling data.readline() inside
# this loop as well would skip lines)
for line in data:
    if line.startswith("email_message#:"):
        pass  # put a slice of the text into a dataframe
    elif line.startswith("email_message_sent:"):
        pass  # put a slice of the text into a dataframe
    elif line.startswith("From:"):
        pass  # put slices of the text into a dataframe
    elif line.startswith("To:"):
        pass  # put slices of the text into a dataframe

I don't know the absolute best way to do this. You're certainly not overlooking an obvious one-liner, which may reassure you.

It looks like your current parser (call it my_parse) does all the processing. In pseudocode:

finished_df = my_parse(original_text_file)

However, for such a large file, this is a little like cleaning up after a hurricane using tweezers. A two-stage solution may be faster: first roughly hew the file into the structure you want, then use pandas Series operations to refine the rest. Continuing the pseudocode, you could do something like the following:

rough_df = rough_parse(original_text_file)
finished_df = refine(rough_df)

Here rough_parse uses only the Python standard library, and refine uses pandas Series operations, particularly the Series.str methods.

I would suggest that the main goal of rough_parse should simply be to reach a one-email-per-row structure. Basically, you go through the file and replace every newline character with a unique delimiter that appears nowhere else in the file, such as "$%$%$", except where the text right after the newline is "email_message#:".

Then Series.str is really good at wrangling the rest of the strings how you want them.
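
A minimal sketch of that two-stage idea, assuming the file layout shown in the question. The names rough_parse and refine come from the pseudocode above; the "$%$%$" delimiter, the output column names and the regular expressions are illustrative assumptions, and none of this has been tried on the real 2 GB file.

```python
import io

import pandas as pd

DELIM = "$%$%$"  # assumed never to occur in the data


def rough_parse(buf):
    """Stage 1: join the lines of each email into one string per email."""
    rows, current = [], []
    for line in buf:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # A new "email_message#:" line closes the previous email.
        if line.startswith("email_message#:") and current:
            rows.append(DELIM.join(current))
            current = []
        current.append(line)
    if current:
        rows.append(DELIM.join(current))
    return pd.DataFrame({"raw": rows})


def refine(rough_df):
    """Stage 2: pull fields out of each row with Series.str methods."""
    raw = rough_df["raw"]
    out = pd.DataFrame(index=rough_df.index)
    out["email_message#"] = raw.str.extract(
        r"email_message#:\s*(\d+)", expand=False).astype(int)
    out["email_message_sent"] = pd.to_datetime(
        raw.str.extract(r"email_message_sent:\s*([^$]+)", expand=False),
        format="%m/%d/%Y %H:%M:%S")
    out["from_address"] = raw.str.extract(
        r"From:\s*([^|]+)\|", expand=False).str.strip()
    # One list of recipient addresses per email.
    out["to_addresses"] = raw.str.findall(r"To:\s*([^|]+)\|")
    return out


sample = io.StringIO(
    "email_message#: 1\n"
    "email_message_sent: 10/10/1991 02:31:01\n"
    "From: tomf@abc.com| Tom Foo |abc company|\n"
    "To: adee@abc.com| Alex Dee |abc company|\n"
    "To: benfor12@xyz.com| Ben For |xyz company|\n"
    "email_message#: 2\n"
    "email_message_sent: 10/12/1991 01:28:12\n"
    "From: timt@abc.com| Tim Tee |abc company|\n"
    "To: tomf@abc.com| Tom Foo |abc company|\n"
)
finished_df = refine(rough_parse(sample))
print(finished_df.to_string())
```

This version keeps the recipients of each email as a list in a single cell; names and companies could be pulled out the same way with further Series.str calls.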

I could not resist the itch so here is my approach.

from __future__ import unicode_literals

import io

import pandas as pd

# String type check that works on both Python 2 and Python 3.
try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str  # Python 3


def iter_fields(buf):
    # Split each line into a (key, value) pair at the first colon;
    # lines without a colon (blank lines, for example) are skipped.
    for line in buf:
        line = line.rstrip('\n\r')
        if ':' in line:
            yield line.split(':', 1)


def iter_messages(buf):
    # Yield one tuple per (message, recipient) pair.
    n = date = from_add = from_name = from_comp = None
    for k, v in iter_fields(buf):
        if k == 'email_message#':
            n = int(v)
        elif k == 'email_message_sent':
            date = pd.Timestamp(v.strip())
        elif k == 'From':
            # The trailing '|' leaves an empty last piece, hence [:-1].
            from_add, from_name, from_comp = [
                s.strip() for s in v.split('|')[:-1]]
        elif k == 'To':
            to_add, to_name, to_comp = [
                s.strip() for s in v.split('|')[:-1]]
            yield (n, date, from_add, from_name, from_comp,
                   to_add, to_name, to_comp)


def _read_email_headers(buf):
    columns = ['email_message#', 'email_message_sent',
               'from_address', 'from_name', 'from_company',
               'to_address', 'to_name', 'to_company']
    return pd.DataFrame(iter_messages(buf), columns=columns)


def read_email_headers(path_or_buf):
    # Accept either a file path or an already open file-like object.
    close_buf = False
    if isinstance(path_or_buf, string_types):
        path_or_buf = io.open(path_or_buf)
        close_buf = True
    try:
        return _read_email_headers(path_or_buf)
    finally:
        if close_buf:
            path_or_buf.close()

This is how you would use it:

df = read_email_headers('.../data_file')

Just call it with the path to your file and you have your dataframe.

Now, what follows is for test purposes only. You wouldn't do this to work with your actual data in real life.

Since I (or a random StackOverflow reader) do not have a copy of your file, I have to fake it using a string:

text = '''email_message#: 1
email_message_sent: 10/10/1991 02:31:01
From: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For |xyz company|
email_message#: 2
email_message_sent: 10/12/1991 01:28:12
From: timt@abc.com| Tim Tee |abc company|
To: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For|xyz company|'''

Then I can create a file-like object and pass it to the function:

df = read_email_headers(io.StringIO(text))
print(df.to_string())

   email_message#  email_message_sent  from_address from_name from_company        to_address   to_name   to_company
0               1 1991-10-10 02:31:01  tomf@abc.com   Tom Foo  abc company      adee@abc.com  Alex Dee  abc company
1               1 1991-10-10 02:31:01  tomf@abc.com   Tom Foo  abc company  benfor12@xyz.com   Ben For  xyz company
2               2 1991-10-12 01:28:12  timt@abc.com   Tim Tee  abc company      tomf@abc.com   Tom Foo  abc company
3               2 1991-10-12 01:28:12  timt@abc.com   Tim Tee  abc company      adee@abc.com  Alex Dee  abc company
4               2 1991-10-12 01:28:12  timt@abc.com   Tim Tee  abc company  benfor12@xyz.com   Ben For  xyz company

Or, if I wanted to work with an actual file:

with io.open('test_file.txt', 'w') as f:
    f.write(text)

df = read_email_headers('test_file.txt')
print(df.to_string())  # Same output as before.

But, again, you do not have to do this to use the function with your data. Just call it with a file path.
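
Because the result has one row per (message, recipient) pair, per-email summaries are a groupby away. A small sketch, using a hand-built frame with two of the column names from above as a stand-in for the real output; the summary names recipient_counts and recipients are illustrative only:

```python
import pandas as pd

# Hand-built stand-in for the frame returned by read_email_headers
# (only two of its columns are reproduced here).
df = pd.DataFrame({
    "email_message#": [1, 1, 2, 2, 2],
    "to_address": ["adee@abc.com", "benfor12@xyz.com",
                   "tomf@abc.com", "adee@abc.com", "benfor12@xyz.com"],
})

# Number of recipients per email.
recipient_counts = df.groupby("email_message#")["to_address"].count()

# All recipients of each email collapsed into one delimited string.
recipients = df.groupby("email_message#")["to_address"].agg("; ".join)

print(recipient_counts)
print(recipients)
```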
