
socket.timeout mongoDB pyspark

I am trying to run a Python file in Spark using a MongoDB connector. The Python file runs a query to fetch some data from MongoDB and then processes that data with a map operation in Spark.

The execution stops with the error message "socket.timeout: timed out" while the map operation is being executed. This is the output I get:

Traceback (most recent call last):
  File "/home/ana/computational_tools_for_big_data/project/review_analysis.py", line 27, in <module>
    bad_reviews = reviews_1.rdd.map(lambda r: r.text).collect()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 777, in collect
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 142, in _load_from_socket
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in load_stream
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 156, in _read_with_length
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 543, in read_int
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
socket.timeout: timed out

I get this problem because the file I am querying is very big (2.3 GB). I tried the same with a 1 GB file and got the same problem, but it works with a smaller 400 MB file.

Is it possible to change the timeout, or do something else, to make it work? Is there any other way to process a large amount of data faster?

Your issue is that the socket operation is taking longer than the specified timeout. Refer to the documentation linked below to change the timeouts and other settings.

The property you want to change is:
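To see this failure mode in isolation, here is a minimal sketch using only the Python standard library. It is not the PySpark or MongoDB code path; the helper name `recv_with_timeout` is made up for illustration. It reads from a connected socket whose peer never sends anything, so the read exceeds the timeout and raises `socket.timeout`, the same exception type shown in the traceback.

```python
import socket


def recv_with_timeout(timeout_s):
    """Try to read from a peer that never sends; report what happened.

    Hypothetical helper for illustration only.
    """
    # A connected pair of sockets; nothing is ever written to `writer`.
    reader, writer = socket.socketpair()
    try:
        # After this, any blocking call on `reader` gives up after timeout_s.
        reader.settimeout(timeout_s)
        try:
            reader.recv(4)
            return "received"
        except socket.timeout:
            # Raised when no data arrives within the timeout.
            return "timed out"
    finally:
        reader.close()
        writer.close()


print(recv_with_timeout(0.2))  # prints "timed out"
```

A larger timeout only helps if the data eventually arrives; it does not make the transfer itself any faster.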

socketTimeoutMS: (integer or None) Controls how long (in milliseconds) the driver will wait for a response after sending an ordinary (non-monitoring) database operation before concluding that a network error has occurred. Defaults to None (no timeout).

E.g. MongoClient('localhost', 27017, socketTimeoutMS=60000)

Of course, depending on how long the 2.3 GB transfer actually takes, you might want to go above the one minute (60000 ms) used in this example.
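Since socketTimeoutMS is given in milliseconds, it is easy to be off by a factor of ten. As a rough sketch, the timeout can also be put into the MongoDB connection string, where `socketTimeoutMS` and `connectTimeoutMS` are standard options. The helper name `mongo_uri`, the database name `test`, and the default values below are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def mongo_uri(host="localhost", port=27017, database="test",
              socket_timeout_ms=60000, connect_timeout_ms=10000):
    """Build a MongoDB connection string with explicit timeouts.

    Hypothetical helper; both timeouts are in milliseconds.
    """
    options = urlencode({
        "socketTimeoutMS": socket_timeout_ms,
        "connectTimeoutMS": connect_timeout_ms,
    })
    return "mongodb://{}:{}/{}?{}".format(host, port, database, options)


# Five minutes, written out so the unit conversion is visible.
uri = mongo_uri(socket_timeout_ms=5 * 60 * 1000)
print(uri)
# prints mongodb://localhost:27017/test?socketTimeoutMS=300000&connectTimeoutMS=10000
```

A string like this can be passed to `MongoClient(uri)` in PyMongo. The Spark connector is also configured through a connection URI, but the exact configuration key depends on the connector version, so check its documentation before relying on this.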

Documentation of MongoClient (Node.js driver)

https://mongodb.github.io/node-mongodb-native/driver-articles/mongoclient.html

Documentation of PyMongo MongoClient

http://api.mongodb.com/python/current/api/pymongo/mongo_client.html

