Need help extracting a string from text

Question

I'm trying to extract financial data from a wall of text. Basically, I have a function that splits the text three times. I know there is a more efficient way of doing this, but I cannot figure it out. The curly braces in the text really throw a wrench into my plan, because I'm trying to format a string.

I want to pass my function a string such as:

"totalCashflowsFromInvestingActivities"

and extract the following raw number:

"-2478000"

This is my current function. It works, but it is not efficient at all:

def splitting(value, text):
    x = text.split('"{}":'.format(value))[1]
    y = x.split(',"fmt":')[0]
    z = y.split(':')[1]
    return z

Any help would be greatly appreciated!

Sample text:

"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}

Answer 1

Here is a solution using regex. It assumes the format is always the same: the raw value always comes immediately after the title, with ":{ between the two.

import re

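def get_value(value_name, text):
    """Find all occurrences of the passed `value_name`
    and return their `raw` values as strings."""
    # re.escape keeps any regex metacharacters in the name literal;
    # \d+ requires at least one digit, so an empty match is never returned.
    pattern = re.escape(value_name) + r'":\{"raw":(-?\d+)'
    return re.findall(pattern, text)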

text = '"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}'

val = get_value('totalCashflowsFromInvestingActivities', text)
print(val)  # ['-2478000']

You can cast the results to a numeric type with map by replacing the return line:

return list(map(int, re.findall(pattern, text)))
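
Putting both pieces together, a minimal sketch might look like the following. The name get_raw_values is hypothetical, and the pattern assumes that every raw value is a whole number, as in the sample text. It also tolerates optional whitespace around the colons and braces, which the sample text does not contain but pretty-printed data might.

```python
import re

def get_raw_values(value_name, text):
    """Return every integer `raw` value stored under `value_name` (hypothetical helper)."""
    # Assumption: the raw value is a whole number that directly follows the key.
    # re.escape keeps the key literal; \s* allows optional whitespace.
    pattern = r'"' + re.escape(value_name) + r'"\s*:\s*\{\s*"raw"\s*:\s*(-?\d+)'
    return [int(match) for match in re.findall(pattern, text)]

sample = (
    '"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M"},'
    '"netBorrowings":{"raw":-31652000,"fmt":"-31.65M"}'
)
print(get_raw_values("totalCashflowsFromInvestingActivities", sample))  # [-2478000]
print(get_raw_values("missingKey", sample))  # []
```

A key that does not appear in the text simply yields an empty list, so the caller can check for that instead of catching an IndexError as with the split-based version.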

Answer 2

If Buran is right and your source is JSON, you might find this helpful:

import json

s = '{"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}}]}}'

j = json.loads(s)
for i in j["cashflowStatementHistory"]["cashflowStatements"]:
    if "totalCashflowsFromInvestingActivities" in i:
        print(i["totalCashflowsFromInvestingActivities"]["raw"])

In this way you can find anything in the wall of text.
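
The loop above hard-codes the path to the cash flow statements. If the location of a key is not known in advance, a recursive search is one option. The sketch below assumes the whole text is valid JSON (the fragment in the question would first need its missing braces and brackets); find_raw_values is a hypothetical helper name, not part of the json module.

```python
import json

def find_raw_values(node, key):
    """Collect the `raw` field of every object stored under `key`, at any depth."""
    found = []
    if isinstance(node, dict):
        for name, child in node.items():
            # Record the value when the key matches and carries a "raw" field.
            if name == key and isinstance(child, dict) and "raw" in child:
                found.append(child["raw"])
            # Keep descending, since the key may also occur deeper down.
            found.extend(find_raw_values(child, key))
    elif isinstance(node, list):
        for child in node:
            found.extend(find_raw_values(child, key))
    return found

s = (
    '{"cashflowStatementHistory":{"cashflowStatements":['
    '{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M"},'
    '"netBorrowings":{"raw":-31652000,"fmt":"-31.65M"}}]}}'
)
data = json.loads(s)
print(find_raw_values(data, "netBorrowings"))  # [-31652000]
```

Because json.loads already converts numbers, the values come back as int without any casting.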

Take a look at this too: https://www.w3schools.com/python/python_json.asp
