
Separating a string into numbers and letters in python

I started learning Python two days ago. Today I built a web scraping script that pulls data from Yahoo Finance and writes it to a CSV file. The problem is that some values end up as strings, because Yahoo Finance displays them with a unit suffix.

For example: Revenue: 806.43M

When I copy them into the CSV file I can't use them in calculations. Is it possible to separate the "806.43" from the "M", keeping both so that the unit of the number is still visible, and put them in two different columns?

To write to the CSV file I use this line:

f.write(revenue + "," + revenue_value + "\n")

where:

print(revenue)
Revenue (ttm)
print(revenue_value)
806.43M

So in the end I would like to be able to write something like this:

f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")

where revenue_value is 806.43 and revenue_unit is M

I hope someone can help with this problem.

I believe the easiest way is to treat the number as a string and convert it to a float, scaled according to the unit at the end of the string.


The following should do the trick:

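def parse_number(number_str: str) -> float:
    # Multiplier for each unit suffix used by Yahoo Finance.
    mapping = {
        "K": 1000,
        "M": 1000000,
        "B": 1000000000
    }

    # The last character is the unit; everything before it is the number.
    unit = number_str[-1]
    number_float = float(number_str[:-1])

    # Raises KeyError if the suffix is not one of K, M or B.
    return number_float * mapping[unit]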

And here's an example:

my_number = "806.43M"
print(parse_number(my_number))

Output:

806430000.0
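
If the goal is two separate CSV columns rather than one scaled number, the same suffix idea can return the value and the unit separately. This is only a sketch: `split_value_unit` is a hypothetical helper name, and it assumes the unit is at most one trailing letter, as in the "806.43M" example from the question.

```python
from typing import Tuple


def split_value_unit(number_str: str) -> Tuple[float, str]:
    """Split a string such as "806.43M" into (806.43, "M")."""
    text = number_str.strip()
    # Assumption: the unit, if present, is a single trailing letter.
    if text and text[-1].isalpha():
        return float(text[:-1]), text[-1]
    # No suffix: the whole string is the number and the unit is empty.
    return float(text), ""


revenue_value, revenue_unit = split_value_unit("806.43M")
print(revenue_value, revenue_unit)  # prints: 806.43 M
```

Because `revenue_value` is now a float, it has to be converted back with `str(revenue_value)` before being concatenated in the `f.write(...)` line from the question.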

You can always try regular expressions.

Here's a pretty good online tool to let you practice using Python-specific standards.

import re

sample = "Revenue (ttm): 806.43M"

# Note: the `(?P<name here>)` section is a named group. That way we can identify what we want to capture.
financials_pattern = r'''
    (?P<category>.+?):?\s+?     # Capture everything up until the colon
    (?P<value>[\d\.]+)          # Capture only numeric values and decimal points
    (?P<unit>[\w]*)?            # Capture a trailing unit type (M, MM, etc.)
'''

# Flags:
#     re.I -> Ignore character case (upper vs lower)
#     re.X -> Allows for 'verbose' pattern construction, as seen above
res = re.search(financials_pattern, sample, flags = re.I | re.X)

Print our dictionary of values:

res.groupdict()

Output:

{'category': 'Revenue (ttm)',
'value': '806.43',
'unit': 'M'}

We can also use .groups() to list the results in a tuple.

res.groups()

Output:

('Revenue (ttm)', '806.43', 'M')

In this case, we'll immediately unpack those results into your variable names.

revenue = None  # If this is still None afterwards, don't print anything.

# re.search returns None when nothing matches, so check before unpacking.
if res:
    revenue, revenue_value, revenue_unit = res.groups()

We'll use f-strings to print out both your f.write() call and the same call with the captured results filled in.

if revenue:
    print(f'f.write(revenue + "," + revenue_value + "," + revenue_unit + "\\n")\n')
    print(f'f.write("{revenue}" + "," + "{revenue_value}" + "," + "{revenue_unit}" + "\\n")')

Output:

f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")

f.write("Revenue (ttm)" + "," + "806.43" + "," + "M" + "\n")
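
The captured groups can also be written with the standard-library `csv` module instead of joining strings by hand, which takes care of quoting if a field ever contains a comma. The following is a minimal sketch, not the method used above: it assumes a simplified one-line variant of the pattern, the helper name `to_row` and the header names are made up for illustration, and an in-memory buffer stands in for the real output file.

```python
import csv
import io
import re

# Simplified, non-verbose variant of the pattern: label, number, optional unit.
PATTERN = re.compile(
    r"(?P<category>.+?):?\s+(?P<value>\d+(?:\.\d+)?)(?P<unit>[A-Za-z]*)$"
)


def to_row(line: str):
    """Return [category, value, unit] for a line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return [match["category"], float(match["value"]), match["unit"]]


row = to_row("Revenue (ttm): 806.43M")

# An in-memory buffer keeps the sketch free of side effects; a file opened
# with open(path, "w", newline="") can be passed to csv.writer the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "value", "unit"])
if row is not None:
    writer.writerow(row)

print(buffer.getvalue())
```

Storing the value as a float and the unit in its own column keeps the number usable for calculations while the unit stays visible next to it.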
