
Splitting a text file based on empty lines in Spark

I am working with a very large text file, almost 2 GB. It looks something like this:

#*MOSFET table look-up models for circuit simulation
#t1984
#cIntegration, the VLSI Journal
#index1

#*The verification of the protection mechanisms of high-level language machines
#@Virgil D. Gligor
#t1984
#cInternational Journal of Parallel Programming
#index2

#*Another view of functional and multivalued dependencies in the relational database model
#@M. Gyssens, J. Paredaens
#t1984
#cInternational Journal of Parallel Programming
#index3

#*Entity-relationship diagrams which are in BCNF
#@Sushil Jajodia, Peter A. Ng, Frederick N. Springsteel
#t1984
#cInternational Journal of Parallel Programming
#index4

I want to read this file in Spark, split it on the empty lines, and turn each block into a single record in PySpark, like this:

#*Entity-relationship diagrams which are in BCNF #@Sushil Jajodia, Peter A. Ng, Frederick N. Springsteel #t1984 #cInternational Journal of Parallel Programming #index4

The code I have written so far is:

rdd = sc.textFile('acm.txt').flatMap(lambda x: x.split("\n\n"))

From what I understand, you want to read this text file in Spark and have one record per paragraph. Your flatMap has no effect because textFile already splits the input into records on \n, so no single record ever contains "\n\n". Instead, you can change the record delimiter (which is \n by default) like this:

In Scala:

// Tell Hadoop's TextInputFormat to split records on blank lines instead of single newlines
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")
val rdd = sc.textFile("acm.txt")

In Python (you need to go through the Java Spark context to reach the Hadoop configuration):

# Same setting, applied through the underlying Java Spark context
sc._jsc.hadoopConfiguration().set("textinputformat.record.delimiter", "\n\n")
rdd = sc.textFile("acm.txt")
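
Each element of rdd is now one paragraph of the file. As a minimal sketch of what you could do next, here is one way to turn each paragraph into a dictionary of named fields; the parse_block helper and the field names (title, authors, ...) are my own choices, with the prefixes (#*, #@, #t, #c, #index) taken from your sample data:

# Hypothetical helper: map each prefixed line of a paragraph to a named field.
def parse_block(block):
    prefixes = {"#*": "title", "#@": "authors", "#t": "year",
                "#c": "venue", "#index": "index"}
    record = {}
    for line in block.strip().split("\n"):
        for prefix, name in prefixes.items():
            if line.startswith(prefix):
                record[name] = line[len(prefix):].strip()
                break
    return record

records = rdd.map(parse_block)

One thing to keep in mind: the delimiter is set on the shared Hadoop configuration, so it also applies to any textFile call made afterwards on the same context.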
