PySpark - Remove first row from Dataframe
I have a .txt file with a header, which I'd like to remove. The file looks like this:
Entry Per Account Description
16524 01 3930621977 TXNPUES
191675 01 2368183100 OUNHQEX
191667 01 3714468136 GHAKASC
191673 01 2632703881 PAHFSAP
80495 01 2766389794 XDZANTV
80507 01 4609266335 BWWYEZL
80509 01 1092717420 QJYPKVO
80497 01 3386366766 SOQLCMU
191669 01 5905893739 FYIWNKA
191671 01 2749355876 CBMJTLP
from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession.builder.master("local").appName("fixed-width")\
    .config("spark.some.config.option", "some-value")\
    .getOrCreate()

# Read the fixed-width text file into a DataFrame
# (note: the header and inferSchema options are ignored by the text reader)
df = spark.read.option("header", "true")\
    .option("inferSchema", "true")\
    .text(file)
df.show()
df.printSchema()
Which returns:
+--------------------+
| value|
+--------------------+
|Entry Per Accou...|
| 16524 01 39306...|
|191675 01 23681...|
|191667 01 37144...|
|191673 01 26327...|
| 80495 01 27663...|
| 80507 01 46092...|
| 80509 01 10927...|
| 80497 01 33863...|
|191669 01 59058...|
|191671 01 27493...|
+--------------------+
root
|-- value: string (nullable = true)
I can grab the header:
header = df.first()
header
which returns:
Row(value='Entry Per GL Account Description ')
and then split it into distinct columns:
# Take the fixed-width file and split into 4 distinct columns
sorted_df = df.select(
    df.value.substr( 1,  6).alias('Entry'),
    df.value.substr( 8,  3).alias('Per'),
    df.value.substr(12, 11).alias('GL Account'),
    df.value.substr(24, 11).alias('Description'),
)
sorted_df.show()
sorted_df.printSchema()
which returns:
+------+---+-----------+-----------+
| Entry|Per| GL Account|Description|
+------+---+-----------+-----------+
|Entry |Per| GL Account| Descriptio|
| 16524| 01| 3930621977| TXNPUES |
|191675| 01| 2368183100| OUNHQEX |
|191667| 01| 3714468136| GHAKASC |
|191673| 01| 2632703881| PAHFSAP |
| 80495| 01| 2766389794| XDZANTV |
| 80507| 01| 4609266335| BWWYEZL |
| 80509| 01| 1092717420| QJYPKVO |
| 80497| 01| 3386366766| SOQLCMU |
|191669| 01| 5905893739| FYIWNKA |
|191671| 01| 2749355876| CBMJTLP |
+------+---+-----------+-----------+
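As a quick sanity check of those offsets outside Spark: `Column.substr(start, length)` is 1-based, so each call maps to the 0-based Python slice `line[start-1 : start-1+length]`. A minimal sketch on one sample row, padded to the fixed-width layout the offsets above assume:

```python
# Plain-Python check of the fixed-width offsets used with substr().
# substr(start, length) is 1-based, equivalent to line[start-1 : start-1+length].
line = " 16524  01  3930621977 TXNPUES    "  # sample row in the assumed layout

entry       = line[0:6]    # substr(1, 6)
per         = line[7:10]   # substr(8, 3)
gl_account  = line[11:22]  # substr(12, 11)
description = line[23:34]  # substr(24, 11)

print([entry.strip(), per.strip(), gl_account.strip(), description.strip()])
# → ['16524', '01', '3930621977', 'TXNPUES']
```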
Now you can see that the header still appears as the first line of my dataframe. I'm unsure of how to remove it.
.iloc is not available, and I often see the following approach, but it only works on an RDD:
header = rdd.first()
rdd.filter(lambda line: line != header)
So which alternatives are available?
You can use .csv, .text, or .textFile for this case.

Read the file with the .csv method so that Spark can read the header (we don't have to filter out the header).

1. Using .csv:

.csv results in a DataFrame.
df = spark.read.option("header", "true").csv("path")
df.show(10, False)
#+----------------------------------------------------+
#|Entry Per Account Description |
#+----------------------------------------------------+
#| 16524 01 3930621977 TXNPUES |
#|191675 01 2368183100 OUNHQEX |
#|191667 01 3714468136 GHAKASC |
#|191673 01 2632703881 PAHFSAP |
#| 80495 01 2766389794 XDZANTV |
#| 80507 01 4609266335 BWWYEZL |
#| 80509 01 1092717420 QJYPKVO |
#| 80497 01 3386366766 SOQLCMU |
#|191669 01 5905893739 FYIWNKA |
#|191671 01 2749355876 CBMJTLP |
#+----------------------------------------------------+
2. Using .text:

.text results in a DataFrame.
from pyspark.sql.functions import col

# the text reader can't read the header
df = spark.read.text("path")
# get the header
header = df.first()[0]
# filter the header out of the data
df.filter(~col("value").contains(header)).show(10, False)
#+----------------------------------------------------+
#|value |
#+----------------------------------------------------+
#| 16524 01 3930621977 TXNPUES |
#|191675 01 2368183100 OUNHQEX |
#|191667 01 3714468136 GHAKASC |
#|191673 01 2632703881 PAHFSAP |
#| 80495 01 2766389794 XDZANTV |
#| 80507 01 4609266335 BWWYEZL |
#| 80509 01 1092717420 QJYPKVO |
#| 80497 01 3386366766 SOQLCMU |
#|191669 01 5905893739 FYIWNKA |
#|191671 01 2749355876 CBMJTLP |
#+----------------------------------------------------+
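The `~col("value").contains(header)` filter drops every row whose text includes the header string. The same predicate traced in plain Python on a few sample lines (the sample values are illustrative):

```python
# Drop any line that contains the header text, mirroring
# ~col("value").contains(header) from the snippet above.
header = "Entry Per Account Description"
lines = [
    "Entry Per Account Description",
    "16524 01 3930621977 TXNPUES",
    "191675 01 2368183100 OUNHQEX",
]
data = [l for l in lines if header not in l]
print(data)  # → ['16524 01 3930621977 TXNPUES', '191675 01 2368183100 OUNHQEX']
```

Note that contains is looser than an equality check (`l != header`): it also removes any data row that happens to embed the full header text.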
Then use:
sorted_df = df.select(
    df.value.substr( 1,  6).alias('Entry'),
    df.value.substr( 8,  3).alias('Per'),
    df.value.substr(12, 11).alias('GL Account'),
    df.value.substr(24, 11).alias('Description'),
)
sorted_df.show()
sorted_df.printSchema()
3. Using .textFile:

.textFile results in an RDD.
# get the header into a variable
header = spark.sparkContext.textFile("path").first()

# read with .textFile and filter out the header
spark.sparkContext.textFile("path").\
    filter(lambda l: not str(l).startswith(header)).\
    map(lambda x: x.split()).\
    map(lambda x: (x[0].strip(), x[1].strip(), x[2].strip(), x[3].strip())).\
    toDF(["Entry", "Per", "Account", "Description"]).\
    show()
#+------+---+----------+-----------+
#| Entry|Per| Account|Description|
#+------+---+----------+-----------+
#| 16524| 01|3930621977| TXNPUES|
#|191675| 01|2368183100| OUNHQEX|
#|191667| 01|3714468136| GHAKASC|
#|191673| 01|2632703881| PAHFSAP|
#| 80495| 01|2766389794| XDZANTV|
#| 80507| 01|4609266335| BWWYEZL|
#| 80509| 01|1092717420| QJYPKVO|
#| 80497| 01|3386366766| SOQLCMU|
#|191669| 01|5905893739| FYIWNKA|
#|191671| 01|2749355876| CBMJTLP|
#+------+---+----------+-----------+
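The same filter/split/map steps can be traced in plain Python on a few sample lines (the Spark pipeline just runs them distributed, and toDF then names the columns):

```python
# Mirror of the RDD pipeline: drop the header line, split each remaining
# line on whitespace, strip the fields, and collect
# (Entry, Per, Account, Description) tuples.
lines = [
    "Entry Per Account Description",
    "16524 01 3930621977 TXNPUES",
    "191675 01 2368183100 OUNHQEX",
]
header = lines[0]
rows = [
    tuple(field.strip() for field in l.split())
    for l in lines
    if not l.startswith(header)
]
print(rows)
# → [('16524', '01', '3930621977', 'TXNPUES'),
#    ('191675', '01', '2368183100', 'OUNHQEX')]
```

Note that `.split()` with no argument already discards surrounding whitespace, so the extra `.strip()` calls in the RDD pipeline are belt-and-braces rather than strictly necessary.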