How to fix a regular expression for scraped URL data in Python?

I am trying to clean my URL data using regular expressions. I have already cleaned most of it, but there is one last problem that I don't know how to solve.

The data was scraped from a news hub, and each URL consists of a theme part and a source part.

I need to extract the theme part from each URL and leave out the source part, so that I can put it into a NumPy array for further analysis.

My scraped urls look like this:

/video/36225009-report-cnbc-russian-sanctions-ukraine/

/health/36139780-cancer-rates-factors-of-stomach/

/business/36187789-in-EU-IMF-reports-about-world-economic-environment/

/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1

/head/36214416-GB-brexit-may-stops-process-by/

/cis/36189830-kiev-arrested-property-in-crymea/

/incidents/36173928-traffic-collapse-by-trucks-incident/

..............................................................

I have tried the following code, but it doesn't work: it prints the whole string instead of just the theme part.

import numpy as np
import pandas as pd
import re

regex = r"^/(\b(\w*)\b)"

pattern_two = regex
prog_two = re.compile( pattern_two )

with open('urls.txt', 'r') as f:

    for line in f:
        line = line.strip()
    
    if prog_two.match( line ):
          print( line )

I have also tested other regular expressions on regex101.com, such as regex = r"^/(\\b(\\w*)\\b)" and regex = r"^/[az]{0,9}./" , but they don't work properly either. I don't have strong regex skills, so maybe I am doing something wrong?

The result I expect is the following:

video
health
business
video
head
cis
incidents  
...........

Thank you very much for helping!

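Your code prints the whole line whenever the pattern matches, rather than the captured group. Capture the first path segment and print group(1) instead: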

regex = r"^/([^/]+)"
pat = re.compile(regex)

with open('urls.txt', 'r') as f:
    for line in f:
        line = line.strip()
        m = pat.search(line)
        if m:
            print(m.group(1))
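
As a self-contained illustration of the regex approach above, here is a minimal sketch that collects the themes into a list instead of printing them. The sample paths are copied from the question; the names SAMPLE_URLS, FIRST_SEGMENT and extract_themes are made up for this example and are not part of the original code.

```python
import re

# Sample paths copied from the question; in practice they would be read from urls.txt.
SAMPLE_URLS = [
    "/video/36225009-report-cnbc-russian-sanctions-ukraine/",
    "/health/36139780-cancer-rates-factors-of-stomach/",
    "/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1",
    "/cis/36189830-kiev-arrested-property-in-crymea/",
]

# Capture everything between the leading slash and the next slash.
FIRST_SEGMENT = re.compile(r"^/([^/]+)")


def extract_themes(lines):
    """Return the first path segment of every line that starts with '/'."""
    themes = []
    for line in lines:
        m = FIRST_SEGMENT.match(line.strip())
        if m:
            themes.append(m.group(1))
    return themes


themes = extract_themes(SAMPLE_URLS)
print(themes)  # ['video', 'health', 'video', 'cis']
```

If a NumPy array is needed for the later analysis, np.array(themes) should convert the resulting list.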

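Or without a regex, using built-in string methods:

...
for line in f:
    line = line.strip()
    if line.startswith('/'):
        # '/video/123-x/'.split('/', 2) gives ['', 'video', '123-x/'],
        # so the theme is at index 1 (index 0 is the empty string before the first slash)
        print(line.split('/', 2)[1])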

You might be able to just use split() here:

with open('urls.txt', 'r') as f:
    for line in f:
        line = line.strip()   # this might be optional
        if line.startswith('/'):
            print(line.split("/")[1])

In general, when plain string methods can do the job, prefer them over invoking the regex engine.
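
To check that the string-based approach really gives the same result as a regex on this kind of input, here is a rough sketch that runs both on a few paths from the question plus some edge cases. The helper names theme_with_split and theme_with_regex are hypothetical, and returning None for lines without a usable first segment is an assumption about the desired behavior.

```python
import re

FIRST_SEGMENT = re.compile(r"^/([^/]+)")


def theme_with_regex(line):
    """First path segment via regex, or None if the line has none."""
    m = FIRST_SEGMENT.match(line.strip())
    return m.group(1) if m else None


def theme_with_split(line):
    """First path segment via str.split, or None if the line has none."""
    line = line.strip()
    if not line.startswith("/"):
        return None
    # "/video/123-x/".split("/", 2) -> ["", "video", "123-x/"]
    segment = line.split("/", 2)[1]
    return segment or None  # "/" and "//x" yield an empty segment


samples = [
    "/business/36187789-in-EU-IMF-reports-about-world-economic-environment/",
    "/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1",
    "/incidents/36173928-traffic-collapse-by-trucks-incident/\n",
    "/video",       # no trailing slash
    "/",            # empty first segment
    "no-slash",     # does not start with '/'
]

for s in samples:
    assert theme_with_split(s) == theme_with_regex(s)
```

Both helpers agree on these inputs, so for this task the choice between them is mostly a matter of readability.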
