
InvalidBSON on MongoDB import - Pandas

I'm currently working with Pandas (0.14.1) in Python 3.4.2 importing data from a Mongo database using pymongo (2.8). Upon a simple import,

cur = db.collection.find()
df = pd.DataFrame(list(cur))

I'm getting the following error:

InvalidBSON: 'utf-8' codec can't decode byte 0xed in position 3123: invalid continuation byte

Important note: Previously, I ran the same task (importing the same collections into a pandas DataFrame for processing) with pandas under Python 2.7+, and every import worked without issue. For other reasons, I would now prefer to stay in the 3.4+ environment.

While I cannot share the data, I can say that it consists of UTF-8 encoded, line-delimited JSON documents that I bulk imported into MongoDB, which makes the error confusing. Some of the fields contain many Unicode characters. Until now, doing read-only work against the database from the mongo console and from Python 2.7+, I have not run into this problem. As a check, after getting the error in Python 3.4, I ran the same code in 2.7 on the same collection and it imported fine.

Is anyone able to provide some insight into what is happening, and perhaps provide some support to remedy the problem? I am willing to provide any additional information I can.

Update:

I identified the offending document using

for doc in cur.sort([('_id', 1)]): print(doc['_id'])

and taking the _id following the last one listed. However, there is some odd behavior. Specifically, if I create a DataFrame using

pd.DataFrame(list(db.collection.find({'_id' : ObjectId('offending _id')})))

it works fine. The same document exists in several collections, and the error is thrown in each one when I attempt to import the full collection.

Document:

{"app_name" : "Tiles", "description" : "Tiles is a sliding tile puzzle, also known as a \"15 Puzzle\". Using Tiles, you choose photos from your Photo Library on your iPhone or iPod Touch, or use the built-in camera on your iPhone.  Tiles then cuts the photo into tiles and scrambles them into a fun puzzle for you to solve!  Your job is to slide the tiles around and re-assemble the photo!\n\nSee if you can re-assemble the photo in the least number of moves or the fastest time possible!  Challenge your friends to beat your time!  Choose from an infinite number of images you create yourself, and up to 4 different puzzle configurations.\n\nFeatures:\n\n* 9, 16, 25, or 36 Tile Selections\n* Integrated with the built in iPhone camera and Photo Library so you can use your photos for puzzles.\n\nBy Request: A standard \"15\" Puzzle image can be downloaded at http://www.random-ideas.net/Software/Tiles/16.png simply download it and sync it with your phone (via iTunes) touse it.\n\nIn keeping with our company mission, we will be donating 5% of the pre-tax net profits from Tiles to charity.  The selected charity for Tiles will be to benefit autism.\n  \n  \n", "whats_new" : "Fixed a rare crashing bug while selecting a new image.\n  \n  \n"}

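I don't believe there is a way of applying an encoding to a cursor object while loading it directly into pandas. You may want to use mongoexport to dump your data into a CSV file first. CSV export needs an explicit field list; the field names below are taken from the document shown in the question, so adjust them to your collection:

mongoexport --host localhost --db dbname --collection name --csv --fields app_name,description,whats_new > test.csv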

...and then you load that data in as utf-8.

df = pd.read_csv('test.csv', encoding = "utf-8")
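One possible cause, which cannot be confirmed from the post, is that some field contains a character stored as a UTF-16 surrogate pair written out byte by byte. Python 2's UTF-8 decoder accepts such bytes, while Python 3's strict decoder rejects them with exactly this "invalid continuation byte" message on 0xed. The sketch below reproduces that behavior with plain bytes, no MongoDB involved. The sample bytes and the helper names `find_bad_utf8` and `decode_lenient` are made up for illustration.

```python
# Hypothetical sample: U+1F600 written as two surrogates (D83D, DE00),
# each encoded as a 3-byte sequence starting with 0xED.
raw = b"ok \xed\xa0\xbd\xed\xb8\x80 end"


def find_bad_utf8(data):
    """Return (offset, reason) for the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


def decode_lenient(data):
    """Decode as UTF-8, substituting U+FFFD for bytes that cannot be decoded."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(find_bad_utf8(raw))          # offset of the first bad byte and the reason
    print(repr(decode_lenient(raw)))   # text with replacement characters
```

If the exported file trips over the same bytes, opening it with open('test.csv', encoding='utf-8', errors='replace') and passing the file handle to pd.read_csv is one way to apply the lenient decoding. PyMongo releases later than the 2.8 used here also expose a unicode_decode_error_handler setting on CodecOptions, which may be worth checking if upgrading is an option.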
