I am trying to collect data using chromedriver.
I am using the URL http://web.mta.info/developers/turnstile.html to get my data: I extract the file links and then load the files into two tables based on the date of the data. This is the code I am trying to execute:
record_cnt = 0
for link in data_list_post:
    data = pd.read_table(link, sep=',')
    print('%s:%s rows %s columns' % (link[-10:-4], data.shape[0], data.shape[1]))
    record_cnt += data.shape[0]
    data.to_sql(name='post', con=conPost, flavor='sqlite', if_exists='append')
Traceback:
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-9-6f5adea38bf9> in <module>()
3 data = pd.read_table(link, sep=',')
4 record_cnt += data.shape[0]
----> 5 data.to_sql(name='post', con=conPost, flavor='sqlite', if_exists='append')
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/core/generic.py in to_sql(self, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
1199 sql.to_sql(self, name, con, flavor=flavor, schema=schema,
1200 if_exists=if_exists, index=index, index_label=index_label,
-> 1201 chunksize=chunksize, dtype=dtype)
1202
1203 def to_pickle(self, path):
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/io/sql.py in to_sql(frame, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
468 pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
469 index_label=index_label, schema=schema,
--> 470 chunksize=chunksize, dtype=dtype)
471
472
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype)
1501 dtype=dtype)
1502 table.create()
-> 1503 table.insert(chunksize)
1504
1505 def has_table(self, name, schema=None):
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/io/sql.py in insert(self, chunksize)
662
663 chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
--> 664 self._execute_insert(conn, keys, chunk_iter)
665
666 def _query_iterator(self, result, chunksize, columns, coerce_float=True,
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/io/sql.py in _execute_insert(self, conn, keys, data_iter)
1289 def _execute_insert(self, conn, keys, data_iter):
1290 data_list = list(data_iter)
-> 1291 conn.executemany(self.insert_statement(), data_list)
1292
1293 def _create_table_setup(self):
OperationalError: table post has no column named A002
Your problem is that you want to pull the table from each link on that page and compile them all into a single database table, but the tables in your links are different. Links towards the top of the list, like
http://web.mta.info/developers/data/nyct/turnstile/turnstile_160312.txt
have as their first/header row:
C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
whereas links towards the bottom of the page, like
http://web.mta.info/developers/data/nyct/turnstile/turnstile_121222.txt
have very different-looking first rows, such as:
A002,R051,02-00-00,12-15-12,03:00:00,REGULAR,003911852,001349428,12-15-12,07:00:00,REGULAR,003911868,001349432,12-15-12,11:00:00,REGULAR,003911930,001349538,12-15-12,15:00:00,REGULAR,003912146,001349600,12-15-
At first it looked like the second file above was just missing a header row, but its top row (and every other row) doesn't look like the data rows from the first group either. Can you work out what the fields should be called for the rows in the second group?
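To see where the error message itself comes from: pandas treats the first row of a file as column names by default, so for a headerless file the insert names a column A002 that the existing table does not have. Here is a minimal sketch of that mechanism using only sqlite3. The in-memory database, the column types, and the hand-written INSERT are assumptions for illustration; the INSERT stands in for whatever statement pandas actually generates.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database, for illustration only

# A table shaped like the header row of the newer files (column types are assumed).
con.execute(
    'CREATE TABLE post ("C/A" TEXT, "UNIT" TEXT, "SCP" TEXT, "STATION" TEXT, '
    '"LINENAME" TEXT, "DIVISION" TEXT, "DATE" TEXT, "TIME" TEXT, "DESC" TEXT, '
    '"ENTRIES" INTEGER, "EXITS" INTEGER)'
)

# An older file has no header row, so its first data value "A002"
# ends up being used as a column name in the insert.
try:
    con.execute('INSERT INTO post ("A002") VALUES (?)', ("A002",))
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(error_message)
```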
Basically, there is a set of links (generally lower down the list) that you will have to treat differently from the top ones, because their tables are different.
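One way to separate the two groups before loading anything is to look at the first line of each file and compare it with the header row shown above. This is only a sketch: the helper names (has_expected_header, read_first_line, split_links) are made up for this example, and it assumes every newer-format file starts with exactly that header.

```python
import urllib.request

# Header row observed in the newer files (e.g. turnstile_160312.txt).
EXPECTED_HEADER = "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS"


def has_expected_header(first_line):
    """Return True if the first line matches the newer-format header row."""
    # Strip each name, since the header line may carry trailing padding or a newline.
    got = [name.strip() for name in first_line.split(",")]
    return got == EXPECTED_HEADER.split(",")


def read_first_line(url):
    """Fetch only the first line of a remote text file."""
    with urllib.request.urlopen(url) as resp:
        return resp.readline().decode("utf-8", errors="replace")


def split_links(links, first_line_of=read_first_line):
    """Partition links into (with_header, without_header) lists."""
    with_header, without_header = [], []
    for link in links:
        if has_expected_header(first_line_of(link)):
            with_header.append(link)
        else:
            without_header.append(link)
    return with_header, without_header
```

With this, something like with_header, without_header = split_links(data_list_post) would give you one list that is safe to append to the 'post' table as-is and another list that needs its own column names (or its own table) once you know what those fields mean.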