Iterate over a list of files, extracting their contents? (SparkContext error)

I need to iterate over a large list of files on disk, open each file and parse it. I have a file with filenames and I only need to iterate over those filenames.

I pass this function to map():

%python

def parse(filename):
  try:
    tf = sc.textFile(filename)
    # run parsing code, produce text
    return text
  except:
    return None

When I try to run the following:

parsed_contents = filenames.map(parse)
parsed_contents.top(5)

I get this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

The code inside the try block works if I run it separately, specifying a filename.

How should I iterate over a specified list of files, extracting their contents?

When you apply a transformation to an RDD (here, your call filenames.map(parse)), the driver serializes the function and ships it to the workers, which apply it to each partition of the RDD. In the code you've provided, parse calls sc.textFile, so it references the SparkContext from code that is running on the workers, and that is what raises the error. The SparkContext exists only on the driver, so calls such as sc.textFile have to be made there.
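One way to keep map() while avoiding the SparkContext is to read each file with ordinary Python I/O inside the function. The following is a minimal sketch, not the asker's actual parser: the parsing step is elided in the question, so this version simply returns the raw text, and it assumes every path in the list is readable from each worker (for example on a shared or local filesystem).

```python
def parse(filename):
    """Read one file with plain Python I/O; no SparkContext is involved."""
    try:
        with open(filename, "r") as handle:
            text = handle.read()
        # The real parsing code from the question would go here;
        # this sketch returns the file contents unchanged.
        return text
    except (IOError, OSError):
        # Missing or unreadable file: return None, as the original does.
        return None
```

With this definition, filenames.map(parse) no longer touches sc on the workers. Entries for unreadable files come back as None and can be dropped with a filter before calling an action.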

sc.textFile accepts a comma-delimited string of paths, so you can pass it all the filenames you want to read in a single call. You could do something like:

filenames = sc.textFile("filesToRead.txt")

parsed_contents = sc.textFile(",".join(filenames.collect()))

parsed_contents.top(5)
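If the list file may contain blank lines or stray whitespace, the joined string can end up with empty path entries. A small helper like the one below tidies the names before joining. The name build_path_argument is hypothetical, and the sketch assumes no filename itself contains a comma, since that would be read as two paths.

```python
def build_path_argument(names):
    """Join file names into one comma-delimited string of paths."""
    # Strip surrounding whitespace and newlines from every entry.
    cleaned = [name.strip() for name in names]
    # Skip blank entries so the result has no empty path components.
    return ",".join(name for name in cleaned if name)
```

It would be used as sc.textFile(build_path_argument(filenames.collect())), with the collect() and the join both happening on the driver.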

You can also pass a glob pattern to sc.textFile. For example:

parsed_contents = sc.textFile("file[0-5].txt")
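To see which names a bracket range such as [0-5] selects, the pattern can be tried against a few sample names with Python's fnmatch module. This is only an approximation for illustration: Spark resolves the pattern through Hadoop's glob rules rather than fnmatch, and the candidate names below are made up.

```python
from fnmatch import fnmatchcase

pattern = "file[0-5].txt"
# Hypothetical file names, used only to illustrate the pattern.
candidates = ["file0.txt", "file3.txt", "file5.txt", "file6.txt", "file12.txt"]

# [0-5] matches exactly one character in the range 0 through 5.
matches = [name for name in candidates if fnmatchcase(name, pattern)]
print(matches)  # ['file0.txt', 'file3.txt', 'file5.txt']
```

Note that file12.txt is not matched, because the bracket expression stands for a single character.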

UPDATE: To keep only the files that exist on disk:

def check_exists(name):
    # Return True only if the file can be opened for reading.
    try:
        with open(name, 'r'):
            return True
    except (IOError, OSError):
        return False

existingFiles = filenames.filter(check_exists)
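Because filter runs on the workers, the check above tests each worker's view of the filesystem. If the list of names is small enough to collect, the same check can instead be done on the driver before any path is handed to sc.textFile. The sketch below assumes plain local paths (not hdfs:// or other URIs), and split_by_existence is a hypothetical helper name.

```python
import os


def split_by_existence(names):
    """Partition names into (existing, missing) lists of regular files."""
    existing, missing = [], []
    for name in names:
        # os.path.isfile is False for directories and for missing paths.
        if os.path.isfile(name):
            existing.append(name)
        else:
            missing.append(name)
    return existing, missing
```

For example, existing, missing = split_by_existence(filenames.collect()) gives a list that can be joined with commas for sc.textFile, plus a list of the names that were skipped.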
