RuntimeError while reading tab-separated text file into pandas DataFrame

Question

I am reading a tab-separated text file into a pandas DataFrame and I get a RuntimeError while doing so. I have gone through the posts related to this error, and all of them point to the rule that one should not modify a dict while iterating over it. In my case, all I am doing is reading a file. How is this problem connected to iterating over and changing a dict?

>>> import pandas as pd
>>> df=pd.read_csv("dummy_data.txt",header=None,chunksize=10000,error_bad_lines=False,warn_bad_lines=True,engine='c',sep="\t",encoding="latin-1")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    df=pd.read_csv("dummy_data.txt",header=None,chunksize=10000,error_bad_lines=False,warn_bad_lines=True,engine='c',sep="\t",encoding="latin-1")
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/parsers.py", line 431, in _read
    compression = _infer_compression(filepath_or_buffer, compression)
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/common.py", line 270, in _infer_compression
    filepath_or_buffer = _stringify_path(filepath_or_buffer)
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/common.py", line 157, in _stringify_path
    from py.path import local as LocalPath
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/py/__init__.py", line 148, in <module>
    'Syslog'             : '._log.log:Syslog',
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/py/_vendored_packages/apipkg.py", line 63, in initpkg
    for module in sys.modules.values():
RuntimeError: dictionary changed size during iteration

Edit 1: When reading the file in interactive mode, I hit the same error on the first two attempts. On the third attempt, the same line runs without any error. What could be the reason for such unstable behavior?

>>> df=pd.read_csv("product_name.txt",header=None,chunksize=10000,error_bad_lines=False,warn_bad_lines=True,engine='c',sep="\t",encoding="latin-1")

Edit 2: To replicate the error, here is a link to a 1000-row dataset: S3 link to the dataset

Edit 3: I found a link with a similar issue: Pandas CSV file with occasional extra column. However, the flag mentioned there (error_bad_lines) does not seem to work in my case.

>>> df = pd.read_csv("unclean.csv", error_bad_lines=False, header=None)

Edit 4: I have written a script that loads the dummy data (mentioned in Edit 2) into a pandas DataFrame and then saves it to an HDF5 file. I ran this script 20 times and did not encounter a RuntimeError even once. Reading the file in interactive mode, on the other hand, raises the RuntimeError and behaves unstably. What could be the reason for the different behavior of a Python script vs. interactive mode? I am using pandas==0.22.0, Python==3.5.2, and tables==3.4.4.

import pandas as pd
import tables

df=pd.read_csv("dummy.txt",header=None,error_bad_lines=False,warn_bad_lines=False,engine='c',sep="\t",encoding="latin-1",names=["product_name_id","current_product_name_id","product_n","active_f","create_d","create_user_n","change_d","change_user_n","ft_timestamp"])

df.to_hdf(path_or_buf="/home/avadhut/data_files/dummy_data.h5",key="dummy",mode="a",format="table")

df=pd.read_hdf("/home/avadhut/data_files/dummy_data.h5",key="dummy")
print(df.head(100))

Answer 1

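Run the code in the default Python interpreter and see whether the error still occurs. This looks like a bpython bug, because I could not reproduce the problem in the default Python interpreter.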
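A quick way to confirm which interpreter you are in is to look for bpython in sys.modules, which is the same condition the apipkg code in the traceback tests. This is only a minimal sketch; the helper name running_under_bpython is made up for illustration and is not part of pandas, py, or bpython.

```python
import sys


def running_under_bpython(modules=None):
    """Return True if a module named 'bpython' has been imported.

    `modules` defaults to sys.modules; passing a plain dict makes the
    check easy to try out without actually starting bpython.
    """
    modules = sys.modules if modules is None else modules
    return "bpython" in modules


# In the default interpreter this is expected to print False,
# inside a bpython session it is expected to print True.
print(running_under_bpython())
```

If this reports True in the session where read_csv fails and False in the script that works, that supports the idea that the difference is the interpreter, not the data.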

Answer 2

The issue is with your data: the file contains an inconsistent number of tabs per line. After cleaning the data, I was able to load the file into pandas. You need to clean the data and make sure every row has the same number of columns before loading.
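To check whether a file really has a varying number of fields per line, something like the sketch below could be used. The function name inconsistent_lines is hypothetical, and the split is naive: it assumes plain tab-separated text with no quoted fields that contain tabs.

```python
from collections import Counter


def inconsistent_lines(lines, sep="\t"):
    """Return (line_number, field_count) for every line whose field count
    differs from the most common field count in the input."""
    counts = [len(line.rstrip("\r\n").split(sep)) for line in lines]
    if not counts:
        return []
    # Treat the most frequent field count as the expected one.
    expected = Counter(counts).most_common(1)[0][0]
    return [(i, n) for i, n in enumerate(counts, start=1) if n != expected]


# Small in-memory sample: lines 3 and 4 deviate from the usual 3 fields.
sample = ["a\tb\tc\n", "1\t2\t3\n", "4\t5\n", "6\t7\t8\t9\n"]
print(inconsistent_lines(sample))
```

For the real file, the same function can be fed an open file handle, for example one opened with encoding="latin-1" as in the question.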

Answer 3

I had the same issue. What worked for me was simply to comment out the offending lines in the file named in the traceback: "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/py/_vendored_packages/apipkg.py", line 63

Comment out all of the following lines:

    # eagerload in bypthon to avoid their monkeypatching breaking packages
    if 'bpython' in sys.modules or eager:
        for module in sys.modules.values():
            if isinstance(module, ApiModule):
                module.__dict__

Unfortunately, I have no idea what these lines are supposed to achieve, so this dirty fix might cause other problems later on. Does anyone know?
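For what it is worth, the error message itself can be reproduced without pandas. The sketch below is not the apipkg code; it uses a plain dict as a stand-in for sys.modules and a hypothetical lazy_import callback as a stand-in for an attribute access that ends up importing something. My assumption is that touching module.__dict__ on a lazily loaded module can add entries to sys.modules while the loop is still running, which is exactly the situation the RuntimeError complains about. Iterating over a snapshot made with list(...) avoids it.

```python
def visit_live(registry, on_visit):
    """Iterate the live dict view; breaks if on_visit adds keys."""
    for value in registry.values():
        on_visit(registry, value)


def visit_snapshot(registry, on_visit):
    """Iterate a copied list of values; safe even if on_visit adds keys."""
    for value in list(registry.values()):
        on_visit(registry, value)


def lazy_import(registry, value):
    """Hypothetical stand-in for an access that triggers an import,
    i.e. something that adds a new key to the registry."""
    registry["loaded_by_" + value] = "extra"


modules = {"a": "a", "b": "b"}
try:
    visit_live(modules, lazy_import)
except RuntimeError as exc:
    # Same message as in the traceback from the question.
    print("live view failed:", exc)

modules = {"a": "a", "b": "b"}
visit_snapshot(modules, lazy_import)
print(sorted(modules))
```

So a gentler local patch than commenting the block out would be to wrap sys.modules.values() in list(...), though editing files inside site-packages is still a workaround and will be lost on reinstall.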

3 answers

solution1
2 ACCEPTED 2018-08-03 08:46:10

solution2
1 2018-08-02 07:22:36

solution3
0 2019-06-26 13:33:20
