How to split a string of words and numbers into columns of words and numbers
I am trying to split strings from a text file
"Cost of Goods Sold (COGS) incl. D&A 142.26B 131.51B 141.7B 163.83B 162.26B"
"Depreciation & Amortization Expense 10.5B 9.8B 9.4B 9.3B 11.3B"
into a table, such as:
Metric 2019 2018 2017 2016 2015
Cost of Goods Sold (COGS) incl. D&A: 142.26B 131.51B 141.7B 163.83B 162.26B
Depreciation & Amortization Expense 10.5B 9.8B 9.4B 9.3B 11.3B
I used this command:
df = pd.read_csv(fileName, sep="\s+", names=['Metric','Y4','Y3','Y2','Y1'])
But I get this output:
Metric Y4 Y3 Y2 Y1
Cost of Goods Sold (COGS) incl.
COGS excluding D&A 27.56B 26.83B 26.77B
Depreciation & Amortization Expense 5.48B 5.95B
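Is there a simple way to split this text into text + numbers? I could split the string into a list and rebuild the string manually, but that gets complicated because the "Metric" consists of multiple words.
Thanks!
Alan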
We can solve this in a few steps:
First, read your file into a list of lines (here the file is named file.txt):
with open('file.txt') as f:
    data = f.read().split('\n')
print(data)
['Cost of Goods Sold (COGS) incl. D&A 142.26B 131.51B 141.7B 163.83B 162.26B', 'Depreciation & Amortization Expense 10.5B 9.8B 9.4B 9.3B 11.3B']
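The lines in the question are wrapped in double quotes, while the list printed above has none. If the file really contains the quotes (or a trailing blank line), a small cleaning pass may be needed before splitting. The following is a minimal sketch under that assumption; `read_clean_lines` is a hypothetical helper name, and `io.StringIO` stands in for the real file.

```python
import io

# Stand-in for open('file.txt'); the two lines are the samples from the question.
sample = io.StringIO(
    '"Cost of Goods Sold (COGS) incl. D&A 142.26B 131.51B 141.7B 163.83B 162.26B"\n'
    '"Depreciation & Amortization Expense 10.5B 9.8B 9.4B 9.3B 11.3B"\n'
    '\n'
)

def read_clean_lines(handle):
    """Return non-empty lines with surrounding whitespace and double quotes removed."""
    cleaned = []
    for raw in handle:
        # strip() drops the newline, strip('"') drops the wrapping quotes
        line = raw.strip().strip('"')
        if line:  # skip blank lines
            cleaned.append(line)
    return cleaned

data = read_clean_lines(sample)
print(data)
```

With a real file, the same helper could be called as `read_clean_lines(f)` inside the `with open(...)` block.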
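Next, split each line on a whitespace character (' ') that is preceded by two or more non-digit characters. We use a regular expression for this; the capturing group makes re.split keep the matched metric text in the result:
import re
import pandas as pd

df = pd.DataFrame([[value for value in re.split(r'(\D{2,})\s', line) if value != '']
                   for line in data], columns=['Metric', 'Years'])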
                                Metric                                   Years
0  Cost of Goods Sold (COGS) incl. D&A  142.26B 131.51B 141.7B 163.83B 162.26B
1  Depreciation & Amortization Expense              10.5B 9.8B 9.4B 9.3B 11.3B
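To see why the empty strings have to be filtered out, it can help to run the split on a single line. This sketch uses only the standard library and the first sample line from the question; the variable names are illustrative.

```python
import re

line = 'Cost of Goods Sold (COGS) incl. D&A 142.26B 131.51B 141.7B 163.83B 162.26B'

# The group (\D{2,}) grabs the longest run of non-digits that is still
# followed by a whitespace character; because it is a capturing group,
# re.split returns the captured text as an element of the result.
parts = re.split(r'(\D{2,})\s', line)
print(parts)
# ['', 'Cost of Goods Sold (COGS) incl. D&A', '142.26B 131.51B 141.7B 163.83B 162.26B']

# The match starts at position 0, so the text "before" it is an empty
# string, which is why the answer drops '' values.
metric, years = [part for part in parts if part != '']
```

One caveat: the pattern relies on the metric name containing no digits. A name with a digit in it would be split in more than one place, so this approach should be checked against the real file.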
Then use Series.str.split with expand=True to split your years into their own columns:
df = df.join(df.pop('Years').str.split(expand=True))
                                Metric        0        1       2        3        4
0  Cost of Goods Sold (COGS) incl. D&A  142.26B  131.51B  141.7B  163.83B  162.26B
1  Depreciation & Amortization Expense    10.5B     9.8B    9.4B     9.3B    11.3B
Finally, rename the columns:
df.columns = ['Metric'] + list(range(2019, 2014, -1))
                                Metric     2019     2018    2017     2016     2015
0  Cost of Goods Sold (COGS) incl. D&A  142.26B  131.51B  141.7B  163.83B  162.26B
1  Depreciation & Amortization Expense    10.5B     9.8B    9.4B     9.3B    11.3B
Another solution is to use str.rsplit with maxsplit=5 to split the string from the right:
import pandas as pd

txt = '''
"Cost of Goods Sold (COGS) incl. D&A 142.26B 131.51B 141.7B 163.83B 162.26B"
"Depreciation & Amortization Expense 10.5B 9.8B 9.4B 9.3B 11.3B"
'''

lines = []
for line in map(str.strip, txt.splitlines()):
    if not line:  # skip empty lines
        continue
    # [1:-1] removes the surrounding quotes ("); rsplit then peels off the five values from the right
    lines.append(line[1:-1].rsplit(maxsplit=5))

df = pd.DataFrame(lines, columns=['Metric', 'Y5', 'Y4', 'Y3', 'Y2', 'Y1'])
print(df)
Prints:
                                Metric       Y5       Y4      Y3       Y2       Y1
0  Cost of Goods Sold (COGS) incl. D&A  142.26B  131.51B  141.7B  163.83B  162.26B
1  Depreciation & Amortization Expense    10.5B     9.8B    9.4B     9.3B    11.3B
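The same right-hand split can also be done inside pandas without an explicit loop, using Series.str.rsplit with n=5 and expand=True. This is a sketch, assuming the lines are already free of quotes and that every line ends in exactly five values; the year labels are taken from the table in the question.

```python
import pandas as pd

# Sample lines from the question, quotes already removed.
lines = pd.Series([
    'Cost of Goods Sold (COGS) incl. D&A 142.26B 131.51B 141.7B 163.83B 162.26B',
    'Depreciation & Amortization Expense 10.5B 9.8B 9.4B 9.3B 11.3B',
])

# Split on whitespace from the right at most five times, so everything
# left of the last five tokens stays together as the metric name.
df = lines.str.rsplit(n=5, expand=True)
df.columns = ['Metric', 2019, 2018, 2017, 2016, 2015]
print(df)
```

If a line carries fewer than five values, the split would start eating into the metric name, so the row lengths are worth validating first.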