
Need help extracting a string from text

I'm trying to extract financial data from a wall of text. Basically, I have a function that splits the text three times. I know there must be a more efficient way, but I can't figure it out. Some curly braces really throw a wrench into my plan, because I'm trying to format a string.

I want to pass my function a string such as:

"totalCashflowsFromInvestingActivities"

and extract the following raw number:

"-2478000"

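This is my current function, which works but is not efficient at all:

def splitting(value, text):
    # Keep everything after '"<value>":'
    x = text.split('"{}":'.format(value))[1]
    # Drop everything from ',"fmt":' onwards, leaving '{"raw":<number>'
    y = x.split(',"fmt":')[0]
    # Take the part after the colon, i.e. the raw number as a string
    z = y.split(':')[1]
    return z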

Any help would be greatly appreciated!

Sample text:

"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}

Here is a solution using regex. It assumes the format is always the same, with the raw value always immediately after the title and separated from it by ":{.

import re

def get_value(value_name, text):
    """Find all occurrences of the passed `value_name`
    and return their `raw` values as strings."""
    # \d+ (rather than \d*) requires at least one digit, so an empty match is never returned
    pattern = value_name + r'":{"raw":(-?\d+)'
    return re.findall(pattern, text)

text = '"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}'

val = get_value('totalCashflowsFromInvestingActivities', text)
print(val)
['-2478000']

You can cast the results to a numeric type with map by replacing the return line:

return list(map(int, re.findall(pattern, text)))
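Putting the two pieces together, a slightly more defensive variant might look like the sketch below. The name get_raw_values is made up for illustration, and the sketch assumes the key may contain regex metacharacters (hence re.escape) and that optional whitespace may appear around the punctuation; neither is required by the sample text.

```python
import re

def get_raw_values(value_name, text):
    """Return every `raw` number that follows `value_name`, converted to int."""
    # re.escape guards against regex metacharacters in the key name;
    # \s* tolerates optional whitespace around the punctuation (an assumption).
    pattern = r'"' + re.escape(value_name) + r'"\s*:\s*\{\s*"raw"\s*:\s*(-?\d+)'
    return [int(match) for match in re.findall(pattern, text)]

# Shortened version of the sample text from the question.
sample = ('"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M"},'
          '"netBorrowings":{"raw":-31652000,"fmt":"-31.65M"}')

print(get_raw_values("totalCashflowsFromInvestingActivities", sample))
# prints [-2478000]
```

A key that does not occur in the text simply yields an empty list.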

If Buran is right and your source is JSON, you might find this helpful:

import json

s = '{"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}}]}}'

j = json.loads(s)
for i in j["cashflowStatementHistory"]["cashflowStatements"]:
    if "totalCashflowsFromInvestingActivities" in i:
        print(i["totalCashflowsFromInvestingActivities"]["raw"])

In this way you can find anything in the wall of text.
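If you do not know in advance where a key sits in the structure, one possible generalisation is to walk the parsed data recursively. This is only a sketch: find_key is a hypothetical helper, not part of the json module, and it assumes the text is complete, valid JSON.

```python
import json

def find_key(node, key):
    """Yield every value stored under `key`, at any depth of the parsed JSON."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending: the same key may also appear deeper down.
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)

# Shortened version of the sample text, wrapped so that it is valid JSON.
s = ('{"cashflowStatementHistory":{"cashflowStatements":[{'
     '"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M"},'
     '"netBorrowings":{"raw":-31652000,"fmt":"-31.65M"}}]}}')

data = json.loads(s)
raw_values = [entry["raw"] for entry in find_key(data, "netBorrowings")]
print(raw_values)
# prints [-31652000]
```

Because find_key is a generator, it returns all matches, which helps when the same key appears in several statements.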

Take a look at this too: https://www.w3schools.com/python/python_json.asp
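The sample text in the question is only a fragment, so json.loads would reject it as it stands. One way around that, sketched below, is to locate the key yourself and let json.JSONDecoder.raw_decode parse just the object that follows it. The helper name raw_from_fragment is made up, and the sketch assumes there is no whitespace between the colon and the opening brace, as in the sample.

```python
import json

def raw_from_fragment(value_name, text):
    """Parse only the object that follows '"value_name":' and return its raw field."""
    marker = '"{}":'.format(value_name)
    start = text.find(marker)
    if start == -1:
        return None  # key not present in the text
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever comes after it, so the rest may be incomplete.
    obj, _end = json.JSONDecoder().raw_decode(text, start + len(marker))
    return obj["raw"]

# A fragment with unbalanced braces, similar to the question's sample text.
fragment = ('"cashflowStatementHistory":{"cashflowStatements":[{'
            '"changeToLiabilities":{"raw":66049000,"fmt":"66.05M"},'
            '"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M"},'
            '"netBorrowings":{"raw":-31652000')

print(raw_from_fragment("totalCashflowsFromInvestingActivities", fragment))
# prints -2478000
```

Unlike the regex version, this returns the number as an int directly, but it only looks at the first occurrence of the key.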
