I am running Python 2.7 on Windows.
I have a large text file (2 GB) that contains 500K+ emails. The file has no explicit file type and is in the format:
email_message#: 1
email_message_sent: 10/10/1991 02:31:01
From: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For |xyz company|
email_message#: 2
email_message_sent: 10/12/1991 01:28:12
From: timt@abc.com| Tim Tee |abc company|
To: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For|xyz company|
email_message#: 3
email_message_sent: 10/13/1991 12:01:16
From: benfor12@xyz.com| Ben For |xyz company|
To: tomfoo@abc.com| Tom Foo |abc company|
To: t212@123.com| Tatiana Xocarsky |numbers firm |
...
As you can see, each email has the following data associated with it:
1) the time it was sent
2) the email address who sent it
3) the name of the person who sent it
4) the company that person works for
5) every email address that received the email
6) the name of every person who received the email
7) the company of every person who received the email
The file contains 500K+ emails, and a single email can have up to 16K recipients. There is no consistent pattern in how the emails refer to people's names or the companies they work at.
I would like to take this large file and manipulate it in Python so that it ends up as a pandas DataFrame. I would like the DataFrame in a format like the Excel screenshot below:
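In other words, one row per (message, recipient) pair. A hand-built sketch of the desired result for the first message above (exact column names are up for grabs):

```python
import pandas as pd

# desired shape: the sender's fields repeated once per recipient
df = pd.DataFrame(
    [(1, '10/10/1991 02:31:01', 'tomf@abc.com', 'Tom Foo', 'abc company',
      'adee@abc.com', 'Alex Dee', 'abc company'),
     (1, '10/10/1991 02:31:01', 'tomf@abc.com', 'Tom Foo', 'abc company',
      'benfor12@xyz.com', 'Ben For', 'xyz company')],
    columns=['email_message#', 'email_message_sent',
             'from_address', 'from_name', 'from_company',
             'to_address', 'to_name', 'to_company'])
```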
EDIT
My plan to solve this is to write a "parser" that takes this text file, reads each line, and assigns the text in each line to particular columns of a pandas DataFrame. I plan to write something like the below. Can someone confirm that this is the correct way to go about executing this? I want to make sure I am not missing a built-in pandas function or a function from a different module.
import pandas as pd

# connect to the file object
data = open('.../Emails', 'r')

# build an empty dataframe
df = pd.DataFrame()

# read each line of the file and put pieces of the text into the
# correct column of the dataframe
for line in data:
    if line.startswith("email_message#:"):
        # put a slice of the text into the dataframe
        pass
    elif line.startswith("email_message_sent:"):
        # put a slice of the text into the dataframe
        pass
    elif line.startswith("From:"):
        # put slices of the text into the dataframe
        pass
    elif line.startswith("To:"):
        # put slices of the text into the dataframe
        pass
I don't know the absolute best way to do this, but you're certainly not overlooking an obvious one-liner, which may reassure you.
It looks like your current parser (call it my_parse) does all the processing itself. In pseudocode:
finished_df = my_parse(original_text_file)
However, for such a large file, this is a little like cleaning up after a hurricane with tweezers. A two-stage solution may be faster: first roughly hew the file into the structure you want, then use pandas Series operations to refine the rest. Continuing the pseudocode, you could do something like the following:
rough_df = rough_parse(original_text_file)
finished_df = refine(rough_df)
where rough_parse uses Python standard-library stuff and refine uses pandas Series operations, particularly the Series.str methods.
I would suggest that the main goal of rough_parse be simply to achieve a one-email--one-row structure: go through and replace every newline character with some unique delimiter that appears nowhere else in the file, like "$%$%$", except where the next thing after the newline is "email_message#:".
Series.str is then really good at wrangling the rest of each string into the shape you want.
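As a minimal sketch of the idea (the function names and the "$%$%$" delimiter are illustrative, and the delimiter is assumed not to appear in the data):

```python
import io
import pandas as pd

def rough_parse(buf):
    # one entry per email: start a new entry at each "email_message#:"
    # line, and glue every other line onto the current entry with the
    # delimiter in place of the newline
    rows = []
    for line in buf:
        line = line.rstrip('\n\r')
        if line.startswith('email_message#:'):
            rows.append(line)
        else:
            rows[-1] += '$%$%$' + line
    return pd.DataFrame({'raw': rows})

text = ('email_message#: 1\n'
        'email_message_sent: 10/10/1991 02:31:01\n'
        'From: tomf@abc.com| Tom Foo |abc company|\n'
        'To: adee@abc.com| Alex Dee |abc company|\n')

rough_df = rough_parse(io.StringIO(text))

# refine() would then lean on the Series.str methods, e.g.:
sent = rough_df['raw'].str.extract(r'email_message_sent: (.*?)\$%\$%\$',
                                   expand=False)
```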
I could not resist the itch, so here is my approach.
from __future__ import unicode_literals
import io

import pandas as pd
from pandas.compat import string_types


def iter_fields(buf):
    # yield (field_name, rest_of_line) pairs, one per input line
    for l in buf:
        yield l.rstrip('\n\r').split(':', 1)


def iter_messages(buf):
    # yield one flat record per (message, recipient) pair;
    # the StopIteration raised by next() at end of input ends the generator
    it = iter_fields(buf)
    k, v = next(it)
    while True:
        n = int(v)              # "email_message#" field
        _, v = next(it)
        date = pd.Timestamp(v)  # "email_message_sent" field
        _, v = next(it)
        from_add, from_name, from_comp = v.split('|')[:-1]
        k, v = next(it)
        while k == 'To':
            to_add, to_name, to_comp = v.split('|')[:-1]
            yield (n, date, from_add[1:], from_name[1:-1], from_comp,
                   to_add[1:], to_name[1:-1], to_comp)
            k, v = next(it)
def _read_email_headers(buf):
    columns = ['email_message#', 'email_message_sent',
               'from_address', 'from_name', 'from_company',
               'to_address', 'to_name', 'to_company']
    return pd.DataFrame(iter_messages(buf), columns=columns)


def read_email_headers(path_or_buf):
    # accept either a path string or an open file-like object
    close_buf = False
    if isinstance(path_or_buf, string_types):
        path_or_buf = io.open(path_or_buf)
        close_buf = True
    try:
        return _read_email_headers(path_or_buf)
    finally:
        if close_buf:
            path_or_buf.close()
This is how you would use it:
df = read_email_headers('.../data_file')
Just call it with the path to your file and you have your dataframe.
Now, what follows is for test purposes only; you would not do this to work with your actual data in real life.
Since I (or a random Stack Overflow reader) do not have a copy of your file, I have to fake one using a string:
text = '''email_message#: 1
email_message_sent: 10/10/1991 02:31:01
From: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For |xyz company|
email_message#: 2
email_message_sent: 10/12/1991 01:28:12
From: timt@abc.com| Tim Tee |abc company|
To: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For|xyz company|'''
Then I can create a file-like object and pass it to the function:
df = read_email_headers(io.StringIO(text))
print(df.to_string())
email_message# email_message_sent from_address from_name from_company to_address to_name to_company
0 1 1991-10-10 02:31:01 tomf@abc.com Tom Foo abc company adee@abc.com Alex Dee abc company
1 1 1991-10-10 02:31:01 tomf@abc.com Tom Foo abc company benfor12@xyz.com Ben For xyz company
2 2 1991-10-12 01:28:12 timt@abc.com Tim Tee abc company tomf@abc.com Tom Foo abc company
3 2 1991-10-12 01:28:12 timt@abc.com Tim Tee abc company adee@abc.com Alex Dee abc company
4 2 1991-10-12 01:28:12 timt@abc.com Tim Tee abc company benfor12@xyz.com Ben Fo xyz company
Or, if I wanted to work with an actual file:
with io.open('test_file.txt', 'w') as f:
    f.write(text)
df = read_email_headers('test_file.txt')
print(df.to_string()) # Same output as before.
But, again, you do not have to do this to use the function with your data. Just call it with a file path.
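Once you have the dataframe, ordinary pandas operations take over. For instance, a hypothetical follow-up (using a hand-built stand-in for the parser's output, with the column names from the printed dataframe above):

```python
import pandas as pd

# stand-in for read_email_headers(...) output: message ids and recipients
df = pd.DataFrame(
    {'email_message#': [1, 1, 2, 2, 2],
     'to_address': ['adee@abc.com', 'benfor12@xyz.com', 'tomf@abc.com',
                    'adee@abc.com', 'benfor12@xyz.com']})

# number of recipients per message
recipients = df.groupby('email_message#')['to_address'].count()
```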