How to convert pipe delimited to CSV or JSON
I have a ~4GB txt file which is pipe delimited. I am trying to import this text into MongoDB, but as you know MongoDB supports only JSON and CSV files. Below is the code so far.
import pandas as pd
import csv
from pymongo import MongoClient
url = "mongodb://localhost:27017"
client = MongoClient(url)
# Creating Database Office
db = client.Office
# Creating Collection Customers
customers = db.Customers
filename = "Names.txt"
data_df = pd.read_fwf(filename, sep="|", engine="python", encoding="latin-1")
fileout = "Names.csv"
output = data_df.to_csv(fileout, sep=",")
print("Finished")
fin = open("Names.csv", "r")
file_data = fin.read()
file_csv = csv.reader(file_data)
Customers.insert_many(file_csv)
The input file "Names.txt" looks like this:
Reg|Name|DOB|Friend|Nationality|Profession^M
1122|Sam|01/01/2001|John|USA|Lawyer^M
2456|George|05/10/1999|Pit|Canada|Engineer^M
5645|Brad|02/06/2000|Adam|UK|Doctor^M
If the provided text file is CSV, then simply import it to MongoDB; if the txt file is pipe delimited (or uses any other delimiter), then import it to MongoDB only after processing the text file into a CSV file. When the CSV file that I get in fileout is imported manually to MongoDB, the result looks like this:
col1 col2
id Reg|Name|DOB|Friend|Nationality|Profession
1 1122|Sam|01/01/2001|John|USA|Lawyer
2 2456|George|05/10/1999|Pit|Canada|Engineer
3 5645|Brad|02/06/2000|Adam|UK|Doctor
What I want to achieve is shown below. This was done with the sed command. First I replaced any "," in the txt file with "-" using the command:
sed -i 's/,/-/g' Names.txt
then I replaced the pipe delimiter with ",":
sed -i 's/|/,/g' Names.txt
col1 col2 col3 col4 col5 col6 col7
id Reg Name DOB Friend Nationality Profession
1 1122 Sam 01/01/2001 John USA Lawyer
2 2456 George 05/10/1999 Pit Canada Engineer
3 5645 Brad 02/06/2000 Adam UK Doctor
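For what it's worth, the same two sed substitutions can be sketched in plain Python as well (a sketch only; the helper name is mine, not from the question, and streaming line by line keeps the ~4GB file out of memory):

```python
def pipe_to_csv_line(line: str) -> str:
    """Apply the two sed substitutions above to one line of the file."""
    # sed -i 's/,/-/g' : protect any commas already present in the data
    line = line.replace(",", "-")
    # sed -i 's/|/,/g' : turn the pipe delimiter into the CSV comma
    return line.replace("|", ",")

# Streaming keeps memory use flat for a ~4GB file:
# with open("Names.txt", "r", encoding="latin-1") as src, \
#         open("Names.csv", "w", encoding="latin-1") as dst:
#     for line in src:
#         dst.write(pipe_to_csv_line(line))
```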
I know that the code is not doing anything, but I can't figure out how to make it work.
I am new to all types of programming, and I have searched through various answers to this question and other related questions on the site, but none fits my needs.
UPDATE
import csv
import json
from pymongo import MongoClient
url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []
with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
jsonfile = json.loads(jsonString)
customer.insert_many(jsonfile)
This is the new code I came up with after getting some ideas from the comments. But now the only problem is that I get this error:
Traceback (most recent call last):
File "E:\Anaconda Projects\Mongo Projects\Office Tool\csvtojson.py", line 16, in <module>
jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
File "C:\Users\Predator\anaconda3\lib\json\__init__.py", line 234, in dumps
return cls(
File "C:\Users\Predator\anaconda3\lib\json\encoder.py", line 201, in encode
chunks = list(chunks)
MemoryError
Pandas read_fwf() is for data files where the data sits in fixed-width columns. Sometimes those files have a separator as well (usually a pipe character, to make the data table easier to read).
You can read a pipe-separated file with read_csv(). Just use sep='|':
df = pd.read_csv(filename, sep='|')
Now you can insert the data into the mongo collection, converting the dataframe to a dict this way:
Customers.insert_many( df.to_dict(orient='records') )
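On a ~4GB file, building the whole dataframe and converting it to one big list of dicts can exhaust memory, much like the MemoryError in the update. A minimal sketch of the same read_csv idea done in chunks (the helper name and chunk size are my own choices, not from the answer):

```python
import pandas as pd

def import_pipe_file(filename, collection, chunksize=100_000):
    """Stream a pipe-delimited file into a MongoDB collection chunk by chunk."""
    total = 0
    for chunk in pd.read_csv(filename, sep="|", chunksize=chunksize):
        records = chunk.to_dict(orient="records")
        collection.insert_many(records)  # one round trip per chunk
        total += len(records)
    return total

# Usage (assumes a local mongod is running):
# from pymongo import MongoClient
# customers = MongoClient("mongodb://localhost:27017").Office.Customers
# import_pipe_file("Names.txt", customers)
```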
Finally found the solution.
I tested it on a 5GB file; although slow, it still works. It imports all data from a pipe-delimited txt file into MongoDB.
import csv
import json
from pymongo import MongoClient
url_mongo = "mongodb://localhost:27017"
client = MongoClient(url_mongo)
db = client.Office
customer = db.Customer
jsonArray = []
file_txt = "Text.txt"
rowcount = 0
with open(file_txt, "r") as txt_file:
    csv_reader = csv.DictReader(txt_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        rowcount += 1
        jsonArray.append(row)
for i in range(rowcount):
    jsonString = json.dumps(jsonArray[i], indent=1, separators=(",", ":"))
    jsonfile = json.loads(jsonString)
    customer.insert_one(jsonfile)
print("Finished")
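A possible speedup for the solution above (a sketch, not part of the original answer): insert rows in batches with insert_many instead of one insert_one per row, and skip the json.dumps/json.loads round trip, since csv.DictReader already yields dicts that pymongo can insert directly. The helper name and batch size here are my own:

```python
import csv
import itertools

def batched_import(path, collection, batch_size=1000):
    """Insert pipe-delimited rows in batches of batch_size documents."""
    inserted = 0
    with open(path, "r") as txt_file:
        reader = csv.DictReader(txt_file, delimiter="|", quoting=csv.QUOTE_NONE)
        while True:
            # islice pulls the next batch_size rows without reading the whole file
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                break
            collection.insert_many(batch)
            inserted += len(batch)
    return inserted
```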
Thank you all for your ideas!