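I'm still new to Beam, but how exactly do you read data from CSV files in a GCS bucket? I'm basically using Beam to convert these files into Pandas DataFrames and then apply an sklearn model to "train" on that data. Most examples I have seen define the header in advance, and I want this Beam pipeline to generalize to any file, where the headers will definitely differ. There is a library called beam_utils that does what I want, but then I ran into this error: AttributeError: module 'apache_beam.io.fileio' has no attribute 'CompressionTypes'
Code example: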
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# The error occurs in this import
from beam_utils.sources import CsvFileSource

options = {
    'project': 'my-project',
    'runner': 'DirectRunner',
    'streaming': False
}
pipeline_options = PipelineOptions(flags=[], **options)

class Printer(beam.DoFn):
    def process(self, element):
        print(element)

# Create the Pipeline with the specified options.
with beam.Pipeline(options=pipeline_options) as p:
    data = (p
            | 'Read File From GCS' >> beam.io.textio.ReadFromText('gs://my-csv-files')
            )
    _ = (data | "Print the data" >> beam.ParDo(Printer()))
# Leaving the with block runs the pipeline and waits for it to finish,
# so no explicit p.run() / wait_until_finish() call is needed.
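The Apache Beam module fileio recently went through backward-incompatible changes, and the library beam_utils has not been updated yet.
I went through the question raised by @Pablo and the source code of beam_utils (also written by Pablo) to replicate its behavior using the filesystems module.
Below are two versions of the code that use Pandas to build a DataFrame.
The csv used for the examples: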
a,b
1,2
3,4
5,6
Read the csv and create one DataFrame with all of its contents
import apache_beam as beam
import pandas as pd
import csv
import io

def create_dataframe(readable_file):
    # Open a channel to read the file from GCS
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    # Read it as csv, you can also use csv.reader
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))
    # Create the DataFrame
    dataFrame = pd.DataFrame(csv_dict)
    print(dataFrame.to_string())

p = beam.Pipeline()
(p | beam.Create(['gs://my-bucket/my-file.csv'])
   | beam.FlatMap(create_dataframe)
)
p.run()
Resulting DataFrame
   a  b
0  1  2
1  3  4
2  5  6
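The header-agnostic part of this version can be tried without Beam or GCS. The following is a minimal sketch, assuming the handle returned by FileSystems.open behaves like a readable binary file object; io.BytesIO stands in for it here, and read_rows is a hypothetical helper name, not part of Beam or beam_utils.

```python
import csv
import io


def read_rows(binary_file):
    """Parse a binary CSV stream into a list of dicts keyed by its header row."""
    # TextIOWrapper decodes bytes to text; newline="" leaves line-ending
    # handling to the csv module, as the csv documentation recommends.
    text_file = io.TextIOWrapper(binary_file, encoding="utf-8", newline="")
    # DictReader takes the field names from the first row, so the header
    # does not have to be declared in advance.
    return list(csv.DictReader(text_file))


# io.BytesIO is a stand-in for the GCS file handle (assumption of this sketch).
sample = io.BytesIO(b"a,b\n1,2\n3,4\n5,6\n")
rows = read_rows(sample)
print(rows)
```

Note that every value comes back as a string, so numeric columns would likely need converting before they are handed to an sklearn model.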
Read the csv and create the DataFrames in a separate transform
def get_csv_reader(readable_file):
    # Open a channel to read the file from GCS
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    # Return the csv reader
    return csv.DictReader(io.TextIOWrapper(gcs_file))

p = beam.Pipeline()
(p | beam.Create(['gs://my-bucket/my-file.csv'])
   | beam.FlatMap(get_csv_reader)
   | beam.Map(lambda x: pd.DataFrame([x]))  # Create the DataFrame from each csv row
   | beam.Map(lambda x: print(x.to_string()))
)
p.run()
Resulting DataFrames
   a  b
0  1  2
   a  b
0  3  4
   a  b
0  5  6
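The second version prints one single-row DataFrame per csv row because FlatMap emits every item of the returned iterable as its own element. A plain-Python analogy of that behavior, as a rough sketch only: flat_map and get_rows are hypothetical names used for illustration and are not Beam APIs, and the csv content is passed in as a string instead of being read from GCS.

```python
import csv
import io
from itertools import chain


def get_rows(csv_text):
    # Stand-in for get_csv_reader: returns an iterable of row dicts.
    return csv.DictReader(io.StringIO(csv_text))


def flat_map(fn, elements):
    # Analogy of FlatMap: apply fn to each input element, then emit every
    # item of each returned iterable as a separate output element.
    return list(chain.from_iterable(fn(e) for e in elements))


# One input element (the whole file) becomes one output element per csv row.
elements = flat_map(get_rows, ["a,b\n1,2\n3,4\n5,6\n"])
print(len(elements))
```

Each of those per-row elements is what the following Map step wraps in its own DataFrame.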