
How to convert pipe delimited to CSV or JSON

I have a ~4GB txt file which is pipe delimited. I am trying to import this text into MongoDB, but as you know MongoDB supports only JSON and CSV files. Below is the code so far.

import pandas as pd
import csv
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
# Creating Database Office
db = client.Office
# Creating Collection Customers
customers = db.Customers

filename = "Names.txt"
data_df = pd.read_fwf(filename, sep="|", engine="python", encoding="latin-1")
fileout = "Names.csv"
output = data_df.to_csv(fileout, sep=",")
print("Finished")
fin = open("Names.csv", "r")
file_data = fin.read()
file_csv = csv.reader(file_data)
customers.insert_many(file_csv)

The input file "Names.txt" looks like this:

Reg|Name|DOB|Friend|Nationality|Profession^M
1122|Sam|01/01/2001|John|USA|Lawyer^M
2456|George|05/10/1999|Pit|Canada|Engineer^M
5645|Brad|02/06/2000|Adam|UK|Doctor^M

If the provided text file is CSV, then simply import it to MongoDB; if the txt file is pipe delimited or uses any other delimiter, then import it to MongoDB only after processing the text file into a CSV file. A sketch of detecting the delimiter first follows the table below. The CSV file that I get in fileout, when imported manually into MongoDB, looks like this:

col1          col2
id    Reg|Name|DOB|Friend|Nationality|Profession
1     1122|Sam|01/01/2001|John|USA|Lawyer
2     2456|George|05/10/1999|Pit|Canada|Engineer
3     5645|Brad|02/06/2000|Adam|UK|Doctor
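
Since the requirement is to branch on whether the file is already CSV, one option is to sniff the delimiter from a small sample before deciding how to process it. This is a minimal sketch using csv.Sniffer; the sample size and the candidate delimiter list are assumptions, not part of the original code:

import csv

def detect_delimiter(path, sample_size=64 * 1024):
    # Sniff the delimiter from a small sample; reading the whole
    # ~4GB file is unnecessary for this.
    with open(path, "r", encoding="latin-1") as f:
        sample = f.read(sample_size)
    return csv.Sniffer().sniff(sample, delimiters=",|;\t").delimiter

if detect_delimiter("Names.txt") == ",":
    print("already CSV - import directly")
else:
    print("needs conversion first")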

What I want to achieve is shown below. This was done with the sed command. First, I replaced any "," in the txt file with "-" using the command:

sed -i 's/,/-/g' Names.txt

then I replaced the pipe delimiter with "," (a Python equivalent of this preprocessing is sketched after the table below):

sed -i 's/|/,/g' Names.txt

After importing the converted file manually, MongoDB shows:

col1 col2  col3   col4       col5    col6        col7
id   Reg   Name   DOB        Friend  Nationality Profession
1    1122  Sam    01/01/2001 John    USA         Lawyer
2    2456  George 05/10/1999 Pit     Canada      Engineer
3    5645  Brad   02/06/2000 Adam    UK          Doctor
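
For reference, the same two substitutions can be done in Python without sed, streaming line by line so the ~4GB file never has to fit in memory. This is only a sketch; the latin-1 encoding is assumed to match the code above:

# Mirror the two sed commands: protect existing commas as "-", then
# turn the pipe delimiters into commas, writing a new CSV file.
with open("Names.txt", "r", encoding="latin-1") as src, \
        open("Names.csv", "w", encoding="latin-1", newline="") as dst:
    for line in src:
        dst.write(line.replace(",", "-").replace("|", ","))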

I know that the code is not doing anything, but I can't figure out how to make it work.

I am new to all types of programming, and I have searched through various answers regarding this question and other related questions on the site, but none fits my needs.

UPDATE

import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []

with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
    jsonfile = json.loads(jsonString)
    customer.insert_many(jsonfile)

This is the new code I came up with after getting some ideas from the comments. But now the only problem is that I get this error:

Traceback (most recent call last):
  File "E:\Anaconda Projects\Mongo Projects\Office Tool\csvtojson.py", line 16, in <module>
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
  File "C:\Users\Predator\anaconda3\lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\Users\Predator\anaconda3\lib\json\encoder.py", line 201, in encode
    chunks = list(chunks)
MemoryError

Pandas read_fwf() is for data files where the data is in fixed-width columns. Sometimes they might have a separator as well (usually a pipe character, to make the data table easier to read).

You can read a pipe-separated file with read_csv(). Just use sep='|':

df = pd.read_csv(filename, sep='|')

Now you can insert the data into the mongo collection, converting the dataframe to a dict this way:

Customers.insert_many( df.to_dict(orient='records') )
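
For a ~4GB file, loading the whole dataframe at once may exhaust memory. A sketch of the same approach using read_csv's chunksize parameter; the 100,000-row chunk size is an assumption:

import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client.Office.Customers

# Stream the pipe-delimited file in chunks so it never has to fit
# in memory at once, inserting each chunk as a batch.
for chunk in pd.read_csv("Names.txt", sep="|", encoding="latin-1",
                         chunksize=100_000):
    customers.insert_many(chunk.to_dict(orient="records"))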

Finally found the solution.

I tested it on a 5GB file and, although slow, it still works. It imports all data from a pipe-delimited txt file to MongoDB.

import csv
import json

from pymongo import MongoClient

url_mongo = "mongodb://localhost:27017"
client = MongoClient(url_mongo)
db = client.Office
customer = db.Customer
jsonArray = []
file_txt = "Text.txt"
rowcount = 0
with open(file_txt, "r") as txt_file:
    # Parse the pipe-delimited file; each row becomes a dict keyed by the header row.
    csv_reader = csv.DictReader(txt_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        rowcount += 1
        jsonArray.append(row)
    # Serialize and insert one row at a time; dumping the whole array
    # at once is what caused the MemoryError earlier.
    for i in range(rowcount):
        jsonString = json.dumps(jsonArray[i], indent=1, separators=(",", ":"))
        jsonfile = json.loads(jsonString)
        customer.insert_one(jsonfile)
print("Finished")

Thank you all for your ideas.
