I'm trying to solve a problem that I have already solved in PHP, but I'm not sure how to do it in Python.
In the following three rows, we would like to match URLs based on these two rules:
only vine.co and twitter.com URLs (other domains should be ignored)
only URLs that are followed by a comma (the last URL in each row should be ignored)
Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1
The output should be an array (a list in Python). The output below comes from PHP:
array(3) {
[0]=>
string(30) "https://vine.co/v/5W2Dg3XPX7a
"
[1]=>
string(64) "https://twitter.com/dog_rates/status/836677758902222849/photo/1
"
[2]=>
string(63) "https://twitter.com/dog_rates/status/835264098648616962/photo/1"
}
$input = 'Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1';
$array = preg_split('/Row\s\d:\s/s', $input);
$output = array();
foreach ($array as $key => $value) {
if (strlen($value) > 1) {
$URL_arrays = explode(',', $value);
foreach ($URL_arrays as $key => $value) {
// Skip the last URL in the row (strict comparison, not assignment).
if ($key === sizeof($URL_arrays) - 1) {
unset($URL_arrays[sizeof($URL_arrays) - 1]);
} else {
$match = preg_match('/twitter\.com|vine\.co/s', $value);
if ($match) {
array_push($output, $value);
}
}
}
}
}
var_dump($output);
This question is based on this RegEx problem; you may answer either one.
You can use this regex to capture all URLs with a vine.co or twitter.com domain that are immediately followed by a comma:
https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)
As you wanted, the key point is the positive lookahead (?=,), which ensures that the URL is immediately followed by a comma without including that comma in the match.
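To see what the lookahead contributes, here is a small sketch that runs the same pattern with and without (?=,) on a single row. The names row, base, without_lookahead, and with_lookahead are made up for this illustration, and the row is simply Row 3 from the question.

```python
import re

# Hypothetical one-row sample: the three URLs of Row 3 from the question.
row = ("https://www.gofundme.com/lolas-life-saving-surgery-funds,"
       "https://twitter.com/dog_rates/status/835264098648616962/photo/1,"
       "https://twitter.com/dog_rates/status/835264098648616962/photo/1")

# Same pattern as above, minus the lookahead.
base = r'https://(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*'

# Without the lookahead, both twitter.com URLs match, including the last one.
without_lookahead = re.findall(base, row)

# With the lookahead, only the URL that is followed by a comma matches.
with_lookahead = re.findall(base + r'(?=,)', row)

print(len(without_lookahead))  # 2
print(with_lookahead)
```

The gofundme.com URL is rejected by the domain alternation in both cases; the lookahead is only what removes the final URL of the row.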
Python code extracting the URLs using re.findall:
import re
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''
print(re.findall(r'https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)', s))
Output:
['https://vine.co/v/5W2Dg3XPX7a', 'https://twitter.com/dog_rates/status/836677758902222849/photo/1', 'https://twitter.com/dog_rates/status/835264098648616962/photo/1']
Because you don't need to keep duplicates, I would suggest using a set instead of a list (note that a set does not preserve order). This removes the repeated URLs by deduplication rather than by dropping the last URL of each row:
{url for x in s.split('\n') for url in x.split(': ')[1].split(',') if 'vine.co' in url or 'twitter.com' in url}
Code:
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''
print({url for x in s.split('\n') for url in x.split(': ')[1].split(',') if 'vine.co' in url or 'twitter.com' in url})
# {'https://twitter.com/dog_rates/status/835264098648616962/photo/1',
# 'https://twitter.com/dog_rates/status/836677758902222849/photo/1',
# 'https://vine.co/v/5W2Dg3XPX7a'}
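If order matters and the last URL of each row really has to be dropped, a split-based sketch along these lines may be closer to what the PHP loop was aiming for. It is only an illustration: the helper name wanted_urls and the ALLOWED_HOSTS set are hypothetical, and it assumes every row carries the "Row N: " prefix shown in the question. It checks the hostname with urllib.parse.urlparse instead of a substring test, so a lookalike domain such as vine.com would not slip through.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of hostnames (without a leading "www.").
ALLOWED_HOSTS = {"vine.co", "twitter.com"}

def wanted_urls(text):
    """Return allowed URLs that are not last in their row, in input order."""
    result = []
    for line in text.splitlines():
        # Drop the "Row N: " prefix; skip lines that do not have one.
        _, sep, urls = line.partition(": ")
        if not sep:
            continue
        # [:-1] discards the last URL of the row.
        for url in urls.split(",")[:-1]:
            url = url.strip()
            host = urlparse(url).hostname or ""
            if host.startswith("www."):
                host = host[4:]
            if host in ALLOWED_HOSTS:
                result.append(url)
    return result

s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''

print(wanted_urls(s))
```

Unlike the set version, this keeps the rows' original order and would keep a URL that legitimately appears in two different rows.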