Need help extracting a string from text
I'm trying to extract financial data from a wall of text. Basically, I have a function that splits the text three times. I know there is a more efficient way of doing this, but I cannot figure it out. The curly braces really throw a wrench into my plan, because I'm trying to format a string.
I want to pass my function a string such as:
"totalCashflowsFromInvestingActivities"
and extract the following raw number:
"-2478000"
This is my current function, which works but is not efficient at all:
def splitting(value, text):
    x = text.split('"{}":'.format(value))[1]
    y = x.split(',"fmt":')[0]
    z = y.split(':')[1]
    return z
Any help would be greatly appreciated!
Sample text:
"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}
Here is a solution using regex. It assumes the format is always the same, with the `raw` value always immediately after the title and separated from it by `":{`.
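import re

def get_value(value_name, text):
    """Find all occurrences of the passed `value_name`
    and return the `raw` values as strings."""
    # re.escape keeps any special characters in the name literal;
    # \d+ requires at least one digit, so an empty match is never returned.
    pattern = re.escape(value_name) + r'":\{"raw":(-?\d+)'
    return re.findall(pattern, text)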
text = '"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}'
val = get_value('totalCashflowsFromInvestingActivities', text)
print(val)
['-2478000']
You can cast that result to a numeric type with `map` by replacing the `return` line:
    return list(map(int, re.findall(pattern, text)))
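Putting the pieces together, a minimal sketch of the whole function might look like this. The name `get_raw_values` is hypothetical, the sample string is a shortened copy of the text from the question, and the sketch still assumes every value appears as `"name":{"raw":<integer>`.

```python
import re

def get_raw_values(value_name, text):
    """Return every integer `raw` value that directly follows `value_name`."""
    # Assumption: the layout is always "<name>":{"raw":<integer>, as in the sample.
    # re.escape makes sure the name is matched literally.
    pattern = '"' + re.escape(value_name) + r'":\{"raw":(-?\d+)'
    return [int(match) for match in re.findall(pattern, text)]

# Shortened copy of the sample text from the question.
text = ('"changeToLiabilities":{"raw":66049000,"fmt":"66.05M"},'
        '"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M"}')

print(get_raw_values('totalCashflowsFromInvestingActivities', text))
```

If the name is not present, the function returns an empty list rather than raising, which the original three-split version does not do.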
If Buran is right and your source is JSON, you might find this helpful:
import json
s = '{"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}}]}}'
j = json.loads(s)
for i in j["cashflowStatementHistory"]["cashflowStatements"]:
    if "totalCashflowsFromInvestingActivities" in i:
        print(i["totalCashflowsFromInvestingActivities"]["raw"])
In this way you can find anything in the wall of text.
Take a look at this too: https://www.w3schools.com/python/python_json.asp
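If you do not know in advance where a key sits in the structure, one option is to walk the parsed JSON recursively. The following is only a sketch: `find_raw` is a hypothetical helper, the string is a shortened version of the sample, and it assumes the text really is complete, valid JSON.

```python
import json

def find_raw(obj, key):
    """Yield the `raw` value of every `key` entry found anywhere in `obj`."""
    if isinstance(obj, dict):
        for name, value in obj.items():
            if name == key and isinstance(value, dict) and "raw" in value:
                yield value["raw"]
            else:
                # Keep descending into nested dicts and lists.
                yield from find_raw(value, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_raw(item, key)

# Shortened version of the sample, wrapped so that it is valid JSON.
s = ('{"cashflowStatementHistory":{"cashflowStatements":[{'
     '"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M"},'
     '"netBorrowings":{"raw":-31652000,"fmt":"-31.65M"}}]}}')

print(list(find_raw(json.loads(s), "netBorrowings")))
```

Because `json.loads` has already converted the numbers, the values come back as `int` and need no further casting.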