
Spark: find a particular string in all lines of a CSV file (Python)

I am using PySpark and I have a large CSV file. The CSV file contains multiple lines like:

<ABCosmswkmwPQR>
<ABCasdfasdfadsPQR>
 ...
 ...

I need to iterate through each line and find the text between particular marker strings in it. I am using a regex to do it:

import re
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('MyFirstStandaloneApp')
sc = SparkContext(conf=conf)

text_file = sc.textFile("file:///path/subset.tsv")
s = text_file.first()
links = re.findall(r'ABC(.*?)PQR', s)

But I am only able to do this for the first line. How do I do it for all lines of the file? I need to iterate line by line and write the output of the matched regex to a list if it fits into memory, or to a file otherwise.

I have opened the file using the SparkContext, and I have to keep it that way, since I have to read the file from HDFS.
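For the sample lines above, for example, the text I want to extract would be osmswkmw and asdfasdfads.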

Try something like this:

import re

read_lines = open("file.csv", "r")
for line in read_lines:
    for match in re.findall(r'ABC(.*?)PQR', line):
        print(match)  # do something with each match
read_lines.close()

read_lines is a file object, and the for loop iterates over it one line at a time, so the whole file never has to be held in memory at once. You just have to plug in the regex code.
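Since the question says the file has to come from HDFS, the same line-by-line idea can be expressed directly on the RDD instead of a local open(). A minimal sketch, assuming the ABC/PQR markers from the question and a placeholder HDFS path:

import re
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('MyFirstStandaloneApp')
sc = SparkContext(conf=conf)

# findall runs on the executors; flatMap flattens each line's list of
# matches and drops lines that match nothing.
matches = sc.textFile("hdfs:///path/subset.tsv") \
            .flatMap(lambda line: re.findall(r'ABC(.*?)PQR', line))

result = matches.collect()                    # into a list, if it fits in memory
# matches.saveAsTextFile("hdfs:///path/out")  # or to a file otherwise

Note that collect() pulls everything to the driver, so prefer saveAsTextFile for results that may not fit in memory.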

You can use regexp_extract from the module pyspark.sql.functions . If your file is temp.csv :

from pyspark.sql.functions import regexp_extract
from pyspark.sql.types import StringType

df = spark.createDataFrame(sc.textFile("temp.csv"), schema=StringType())
df.select(regexp_extract("value", r"ABC(.*?)PQR", 1))  # the single string column is named "value"
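For reference, a self-contained sketch of that approach using a local SparkSession and the sample lines from the question; note that regexp_extract takes the column first, then the pattern, then the group index:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("RegexExtract").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["<ABCosmswkmwPQR>", "<ABCasdfasdfadsPQR>"])
df = spark.createDataFrame(lines, StringType())  # single column named "value"

# Group 1 of the pattern is the text between ABC and PQR.
result = df.select(regexp_extract("value", r"ABC(.*?)PQR", 1).alias("match"))
result.show()  # rows: osmswkmw, asdfasdfads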
