I have the following code, which currently runs as ordinary single-process Python:
def remove_missing_rows(app_list):
    """Remove any row that has missing data in the name, id, or description column."""
    print("########### Missing row removal ###########")
    missing_rows = []
    for row in app_list:
        if not row[1]:
            missing_rows.append(row)
            continue  # Skip to the next row; no need to check more columns
        if not row[5]:
            missing_rows.append(row)
            continue  # Skip to the next row; no need to check more columns
        if not row[4]:
            missing_rows.append(row)
    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method
    # Remove the missing_rows from the original data
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list
Now, after writing this for a smaller sample, I wish to run it on a very large data set, and I thought it would be useful to utilise the multiple cores of my computer. I'm struggling to implement this with the multiprocessing module, though. The idea is that Core 1 works through the first half of the data set while Core 2 works through the second half, and so on, in parallel. Is this possible?
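For reference, the halves-per-core split the question describes can be done with multiprocessing.Pool, although for a filter this cheap the cost of pickling the data between processes often outweighs the gain. This is a minimal sketch, assuming rows are tuples and the same column positions as above; the names keep_complete and remove_missing_rows_parallel are invented here. On Windows and macOS the call must sit under an if __name__ == "__main__" guard.

```python
from multiprocessing import Pool


def keep_complete(rows):
    # Keep rows whose name (1), description (4), and id (5) columns are non-empty
    return [row for row in rows if all((row[1], row[4], row[5]))]


def remove_missing_rows_parallel(app_list, workers=2):
    # Split the data into one chunk per worker, filter the chunks in parallel,
    # then stitch the results back together in the original order.
    chunk_size = max(1, (len(app_list) + workers - 1) // workers)
    chunks = [app_list[i:i + chunk_size] for i in range(0, len(app_list), chunk_size)]
    with Pool(workers) as pool:
        filtered = pool.map(keep_complete, chunks)
    return [row for chunk in filtered for row in chunk]
```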
This is probably not CPU-bound. Try the code below. I've used a set for very fast (hash-based) membership tests; you rely on one when you write if row not in missing_rows, which is very slow against a long list. Note that the csv module yields lists, which aren't hashable, so convert each row to a tuple first (tuple(row)); after that, not many changes are needed:
def remove_missing_rows(app_list):
    print("########### Missing row removal ###########")
    filterfunc = lambda row: not all([row[1], row[4], row[5]])
    missing_rows = set(filter(filterfunc, app_list))
    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method
    # Remove the missing_rows from the original data
    # note: should be a lot faster with a set
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list
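To see the difference the set makes, here is a small self-contained timing sketch (the sizes are illustrative only):

```python
import timeit

# Build 5000 distinct tuple rows (tuples are hashable, so they can go in a set)
rows = [(str(i), "name", "x", "y", "desc", "id") for i in range(5000)]
missing_list = rows[::2]         # every other row, stored in a plain list
missing_set = set(missing_list)  # the same rows, stored in a set

# List membership is O(n) per lookup; set membership is O(1) on average
t_list = timeit.timeit(lambda: [r for r in rows if r not in missing_list], number=1)
t_set = timeit.timeit(lambda: [r for r in rows if r not in missing_set], number=1)
# t_set should come out far smaller than t_list
```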
You can also use filter directly, so you don't iterate over the data twice:
def remove_missing_rows(app_list):
    filter_func = lambda row: all((row[1], row[4], row[5]))
    return list(filter(filter_func, app_list))
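A quick standalone check of the one-pass filter approach; the column positions (1 = name, 4 = description, 5 = id) and the sample rows are assumed from the question:

```python
# Keep a row only if name, description, and id are all non-empty
filter_func = lambda row: all((row[1], row[4], row[5]))

rows = [
    ("0", "App A", "x", "y", "first app", "id1"),
    ("1", "", "x", "y", "second app", "id2"),  # missing name
    ("2", "App C", "x", "y", "", "id3"),       # missing description
]
cleaned = list(filter(filter_func, rows))
# cleaned keeps only the fully populated "App A" row
```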
But if you are doing data analysis, you should probably have a look at pandas. There you could do something like this:
import pandas as pd
df = pd.read_csv('your/csv/data/file', usecols=(1, 4, 5))
df = df.dropna() # remove missing values
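Here is a self-contained version of the same idea using an in-memory CSV; the column names and sample values are invented for illustration:

```python
import io

import pandas as pd

# In-memory stand-in for the real CSV file
csv_data = io.StringIO(
    "rank,name,c2,c3,description,id\n"
    "0,App A,x,y,first app,id1\n"
    "1,,x,y,second app,id2\n"  # empty name -> parsed as NaN
    "2,App C,x,y,,id3\n"       # empty description -> parsed as NaN
)
df = pd.read_csv(csv_data, usecols=(1, 4, 5))  # keep only name, description, id
df = df.dropna()  # remove rows with any missing value in those columns
# df is left with just the "App A" row
```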