
pyspark dataframe split dynamic columns

(I'm not a Python developer.)

We have a library, developed by an external party, that recalculates errors in temperature measurements. This works fine, but we noticed that our two sensor types produce different log files (they differ between Europe and Australia).

Bottom line: we would like to transform the dataframe before passing it to the library. I was able to skip the first line, which should not be used as a header, with this code:

from pyspark.sql.functions import col

# Read without a header, then drop the first line by filtering out rows containing the header text
data21 = spark.read.option("header", False).format("csv").load("abfss://oper-iot-uploads@eurdtadcoglb907.dfs.core.windows.net/belgium/dessel/c3/kiln/temp/Auto202012101237.TXT")
header2 = data21.first()[0]
data2 = data21.filter(~col("_c0").contains(header2))
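
A quick sanity check I'd add here (my own suggestion, assuming the header text does not also occur inside any data row) is to compare the row counts before and after the filter:

# Expect exactly one row, the header, to have been removed
print(data21.count() - data2.count())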

This results in a good dataframe with only a single _c0 column, but the European version uses a comma as the delimiter while the Australian one uses a semicolon. So the European version was parsed into columns from _c0 up to _c980 (or fewer, depending on the model).
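
As an aside (my own suggestion, not part of the original question): if you know up front which region a file comes from, you can tell the CSV reader the delimiter explicitly with the standard sep option, so the semicolon files also parse straight into _c0.._cN. The path_au variable below is a hypothetical placeholder for an Australian file path.

# Semicolon-delimited files now parse directly into _c0, _c1, ...
data_au = spark.read.option("header", False).option("sep", ";").format("csv").load(path_au)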

I'm looking for a way to split this data2 into multiple columns. I found several solutions using .split, but they are mostly hardcoded: positions 1-5 = field1, 6-xx = field2, and so on.

I would like to find an instruction that walks the complete line and creates columns up to _cXXX, where XXX is the number of columns found on the line...

Any suggestions?

This is a sample of such a _c0 row:

136; 136; 136; 126; 126; 124; 124; 118; 118; 113; 113; 113; 112; 112; 118; 118; 132; 132; 150; 150; 167; 167; 167; 174; 174; 173; 173; 173; 173; 176; 176; 183; 183; 183; 194; 194; 207; 207; 221; 221; 233; 233; 242; 242; 249; 249; 253; 253; 258; 258; 258; 261; 261; 265; 265; 270; 270; 275; 275; 279; 279; 284; 284; 287; 287; 290; 290; 291; 291; 293; 293; 295; 295; 297; 297; 299; 299; 302; 302; 304; 304; 305; 305; 306; 306; 308; 308; 310; 310; 312; 312; 314; 314; 315; 315; 318; 318; 320; 320; 322; 322; 325; 325; 327; 327; 329; 329; 330; 330; 331; 331; 333; 333; 334; 336; 336; 338; 338; 339; 339; 341; 341; 344; 344; 347; 347; 350; 350; 351; 350; 350; 347; 347; 342; 342; 328; 328; 299; 299; 262; 235; 235; 223; 223; 219; 219; 216; 216; 215; 220; 220; 225; 225; 232; 232; 240; 240; 256; 273; 273; 284; 284; 292; 292; 307; 327; 327; 340; 340; 344; 344; 344; 343; 343; 341; 341; 338; 333; 333; 328; 328; 323; 317; 317; 309; 309; 300; 300; 291; 282; 282; 274; 274; 265; 257; 257; 250; 250; 244; 238;...

split should do it:

from pyspark.sql import functions as F

# Split the single string column on '; ' into an array column,
# and keep the result so the next step can work on the array
data2 = data2.select(F.split('_c0', '; ').alias('_c0'))

If the number of columns is not consistent, get the array length per row and iterate over the maximum length:

# Array length of each row
col_sizes = data2.select(F.size('_c0').alias('_c0'))
# The widest row determines how many columns we need
col_max = col_sizes.agg(F.max('_c0'))
columns = col_max.collect()[0][0]

# One output column per array position; shorter rows get null
data2.select(*[data2['_c0'][i] for i in range(columns)])
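
Putting the two snippets together with explicit column names (a sketch under my own assumptions: the values stay strings, the _cN aliases via f-strings are my choice, and indexing past the end of a shorter row yields null in Spark):

from pyspark.sql import functions as F

# data2 already holds the array column produced by the split above
n = data2.agg(F.max(F.size('_c0'))).collect()[0][0]
wide = data2.select(*[data2['_c0'][i].alias(f'_c{i}') for i in range(n)])
wide.printSchema()  # _c0 ... up to the widest row; shorter rows are padded with null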
