
I need advice on parsing a large text file (6GB in size)

I need advice on parsing a large text file, 6GB in size.

What I have done is download all my Gmail using Thunderbird. I now have an mbox file with all my email in it. This is a text file, 6GB in size.

I need to parse this file and pull out specific data that follows a specific pattern.

First question: what language should I use? I've searched some other threads similar to this one and understand that Perl or Python (and one or two others) would be fine.

Second question, though: I read in one of the post replies that it might be better to load the text file into a database and let the database search through the text.

I need to have a CSV generated as output.

So... is it wiser for me to go the DB route?

Third question: how long is a piece of string... erm, I mean... how long will it take to go through my 6GB file? OK, not possible to answer without some details!

I need to pull out the following data:

First Name: 
Last Name: 

Address:

Telephone: 
Mobile:
Email:

So... I need to know whether I will have to run the script and leave my machine running overnight. I'm not sure if the above is a really dumb question or not, but I thought I'd ask anyway.

ANY replies would be great.

Thanks,

Omar

  1. You should use whichever language you're more familiar with. In terms of performance, Perl programs can generally parse text data faster than Python.

  2. You need to parse the data whether or not you use a database. If you're going to be doing a lot of queries/searches afterwards, then you should consider loading the parsed data into a database.

  3. It depends on how complex the pattern you're trying to match is. Probably no more than 1 hour.
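To make the points above concrete, here is a minimal Python sketch of the "parse once, write CSV" approach. It is only an illustration under several assumptions: that the wanted data appears in the emails as plain "Label: value" lines using the labels from the question, that each message starts with a line beginning "From " (the usual mbox separator), and that the bodies are not base64 or quoted-printable encoded. The names `iter_records` and `mbox_to_csv` are made up for this example.

```python
import csv
import re

# Labels taken from the question. The "Label: value" layout inside the
# emails is an assumption; adjust the pattern to the real data.
FIELDS = ["First Name", "Last Name", "Address", "Telephone", "Mobile", "Email"]
FIELD_RE = re.compile(
    r"^\s*(" + "|".join(re.escape(f) for f in FIELDS) + r")\s*:\s*(.*?)\s*$"
)


def iter_records(lines):
    """Yield one dict per message that contains at least one wanted field."""
    record = {}
    for line in lines:
        # In mbox files a new message starts with a line beginning "From ".
        if line.startswith("From "):
            if record:
                yield record
            record = {}
            continue
        match = FIELD_RE.match(line)
        if match:
            label, value = match.groups()
            # Keep only the first occurrence of each label per message.
            record.setdefault(label, value)
    if record:
        yield record


def mbox_to_csv(mbox_path, csv_path):
    """Stream the mbox file line by line and write the matches to a CSV."""
    count = 0
    with open(mbox_path, "r", encoding="utf-8", errors="replace") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS, restval="")
        writer.writeheader()
        for record in iter_records(src):
            writer.writerow(record)
            count += 1
    return count
```

Calling `mbox_to_csv("mail.mbox", "contacts.csv")` (both file names are placeholders) reads the file one line at a time, so memory use stays small regardless of the file size. If the messages turn out to be encoded, Python's standard `mailbox` module can decode them, at some cost in speed.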

You can use the Aperture project to query the mbox contents:

http://aperture.sourceforge.net/


 