
PySpark - Remove first row from Dataframe

I have a .txt file with a header, which I'd like to remove. The file looks like this:

Entry  Per  Account     Description               
 16524  01  3930621977  TXNPUES                     
191675  01  2368183100  OUNHQEX            
191667  01  3714468136  GHAKASC             
191673  01  2632703881  PAHFSAP              
 80495  01  2766389794  XDZANTV                    
 80507  01  4609266335  BWWYEZL                   
 80509  01  1092717420  QJYPKVO                  
 80497  01  3386366766  SOQLCMU                  
191669  01  5905893739  FYIWNKA             
191671  01  2749355876  CBMJTLP 

# Create spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("fixed-width")\
                                            .config("spark.some.config.option", "some-value")\
                                            .getOrCreate()

# Read in fixed-width text file into DataFrame
# (the "header" and "inferSchema" options have no effect on .text)
df = spark.read.option("header", "true")\
               .option("inferSchema", "true")\
               .text(file)
df.show()
df.printSchema()

Which returns:

+--------------------+
|               value|
+--------------------+
|Entry  Per  Accou...|
| 16524  01  39306...|
|191675  01  23681...|
|191667  01  37144...|
|191673  01  26327...|
| 80495  01  27663...|
| 80507  01  46092...|
| 80509  01  10927...|
| 80497  01  33863...|
|191669  01  59058...|
|191671  01  27493...|
+--------------------+

root
 |-- value: string (nullable = true)

I can grab the header:

header = df.first()
header

which returns:

Row(value='Entry  Per  GL Account  Description               ')

and then split into distinct columns:

# Take the fixed width file and split into 3 distinct columns
sorted_df = df.select(
    df.value.substr( 1,  6).alias('Entry'      ),
    df.value.substr( 8,  3).alias('Per'        ),
    df.value.substr(12, 11).alias('GL Account' ),
    df.value.substr(24, 11).alias('Description'),
)

sorted_df.show()
sorted_df.printSchema()

which returns:

+------+---+-----------+-----------+
| Entry|Per| GL Account|Description|
+------+---+-----------+-----------+
|Entry |Per| GL Account| Descriptio|
| 16524| 01| 3930621977| TXNPUES   |
|191675| 01| 2368183100| OUNHQEX   |
|191667| 01| 3714468136| GHAKASC   |
|191673| 01| 2632703881| PAHFSAP   |
| 80495| 01| 2766389794| XDZANTV   |
| 80507| 01| 4609266335| BWWYEZL   |
| 80509| 01| 1092717420| QJYPKVO   |
| 80497| 01| 3386366766| SOQLCMU   |
|191669| 01| 5905893739| FYIWNKA   |
|191671| 01| 2749355876|   CBMJTLP |
+------+---+-----------+-----------+

Now you see that the header still appears as the first line in my dataframe here. I'm unsure of how to remove it.

.iloc is not available, and I often see this approach, but it only works on an RDD:

header = rdd.first()
rdd.filter(lambda line: line != header)
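
A DataFrame-level version of that filter looks something like this (a rough sketch, reusing the df and header from above):

from pyspark.sql.functions import col

# keep every row whose raw text differs from the captured header row
df_no_header = df.filter(col("value") != header.value)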

So which alternatives are available?

You can use .csv, .text, or .textFile for this case.

Read the file with the .csv method so that Spark can read the header (then we don't have to filter the header out).

1. Using .csv:

.csv results in a DataFrame (df):

df=spark.read.option("header","true").csv("path")
df.show(10,False)
#+----------------------------------------------------+
#|Entry  Per  Account     Description                 |
#+----------------------------------------------------+
#| 16524  01  3930621977  TXNPUES                     |
#|191675  01  2368183100  OUNHQEX                     |
#|191667  01  3714468136  GHAKASC                     |
#|191673  01  2632703881  PAHFSAP                     |
#| 80495  01  2766389794  XDZANTV                     |
#| 80507  01  4609266335  BWWYEZL                     |
#| 80509  01  1092717420  QJYPKVO                     |
#| 80497  01  3386366766  SOQLCMU                     |
#|191669  01  5905893739  FYIWNKA                     |
#|191671  01  2749355876  CBMJTLP                     |
#+----------------------------------------------------+
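
Note that because the fixed-width file contains no commas, .csv reads each whole line into a single column whose name is the header text, so splitting into fields still needs positional slicing. A minimal sketch, reusing the substr positions from the question:

from pyspark.sql.functions import col

#the whole line lands in one column (named after the header text), so slice it by position
line = col(df.columns[0])
sorted_df = df.select(
    line.substr( 1,  6).alias("Entry"),
    line.substr( 8,  3).alias("Per"),
    line.substr(12, 11).alias("Account"),
    line.substr(24, 11).alias("Description"),
)
sorted_df.show(10, False)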

2. Using .text:

.text results in a DataFrame (df):

from pyspark.sql.functions import col

#.text can't read the header as column names
df=spark.read.text("path")
#get the header
header=df.first()[0]
#filter the header out from the data
df.filter(~col("value").contains(header)).show(10,False)
#+----------------------------------------------------+
#|value                                               |
#+----------------------------------------------------+
#| 16524  01  3930621977  TXNPUES                     |
#|191675  01  2368183100  OUNHQEX                     |
#|191667  01  3714468136  GHAKASC                     |
#|191673  01  2632703881  PAHFSAP                     |
#| 80495  01  2766389794  XDZANTV                     |
#| 80507  01  4609266335  BWWYEZL                     |
#| 80509  01  1092717420  QJYPKVO                     |
#| 80497  01  3386366766  SOQLCMU                     |
#|191669  01  5905893739  FYIWNKA                     |
#|191671  01  2749355876  CBMJTLP                     |
#+----------------------------------------------------+

Then use:

sorted_df = df.select(
    df.value.substr( 1,  6).alias('Entry'      ),
    df.value.substr( 8,  3).alias('Per'        ),
    df.value.substr(12, 11).alias('GL Account' ),
    df.value.substr(24, 11).alias('Description'),
)

sorted_df.show()
sorted_df.printSchema()

3. Using .textFile:

.textFile results in an RDD (rdd):

#get the header into a variable
header=spark.sparkContext.textFile("path").first()

#.textFile, filter out the header, then split each line on whitespace
spark.sparkContext.textFile("path").\
filter(lambda l: not str(l).startswith(header)).\
map(lambda x: x.split()).\
map(lambda x: (str(x[0].strip()), str(x[1].strip()), str(x[2].strip()), str(x[3].strip()))).\
toDF(["Entry","Per","Account","Description"]).\
show()
#+------+---+----------+-----------+
#| Entry|Per|   Account|Description|
#+------+---+----------+-----------+
#| 16524| 01|3930621977|    TXNPUES|
#|191675| 01|2368183100|    OUNHQEX|
#|191667| 01|3714468136|    GHAKASC|
#|191673| 01|2632703881|    PAHFSAP|
#| 80495| 01|2766389794|    XDZANTV|
#| 80507| 01|4609266335|    BWWYEZL|
#| 80509| 01|1092717420|    QJYPKVO|
#| 80497| 01|3386366766|    SOQLCMU|
#|191669| 01|5905893739|    FYIWNKA|
#|191671| 01|2749355876|    CBMJTLP|
#+------+---+----------+-----------+
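
As an aside (not part of the approaches above): if matching on the header's content feels fragile, you can instead drop exactly the first record by index with zipWithIndex. A rough sketch:

#zip each record with its index and keep everything after index 0
rdd = spark.sparkContext.textFile("path")
no_header = rdd.zipWithIndex().filter(lambda pair: pair[1] > 0).map(lambda pair: pair[0])
no_header.map(lambda x: tuple(x.split())).toDF(["Entry","Per","Account","Description"]).show()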
