
Python Regex Scrape & Replace String

Hi, I would like to write a small helper tool in Python. It should process the following content:

<tr>
 <td><p>L1</p></td>
 <td><p>(4.000x2.300x500;   4,6m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.221 kg</p></td>
 </tr>
 <tr>
 <td><p>L2</p></td>
 <td><p>(4.250x2.300x500;   4,9m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.279 kg</p></td>
 </tr>
 <tr>
 <td><p>L3</p></td>
 <td><p>(4.500x2.300x500;   5,2m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.321 kg</p></td>
 </tr>
 <tr>
 <td><p>L4</p></td>
 <td><p>(4.750x2.300x500;   5,5m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.364 kg</p></td>
 </tr>

It should replace the &nbsp; in each table row with the volume, in this case everything between the ; and the ) in the second table data cell of each row.

I started coding it in Python as shown below, and I can already scrape the volume with a regex, but I am stuck on how to put the values in the right place. Any ideas? Here is my code:

import BeautifulSoup
import re

with open('3mmcontainer.html') as f:
    content = f.read()
f.close()

#print content

contentsoup = BeautifulSoup.BeautifulSoup(content)

for tablerow in contentsoup.findAll('tr'):
    inhalt = str(tablerow.contents[3])
    print inhalt


    match = re.findall('\;(.*?)\)', inhalt)


    print match
# for x in match:
#    volumen = x.lstrip()
#    print volumen

   #f = open('3mmcontainer.html', 'w')
   #newdata = f.replace("&nbsp;", volumen)
   #f.write(newdata)
   #f.close()


#m = re.search('\;(.*?)\)', inhalt)
# print m

# volumen = re.compile(r'\;(.*?)\)')
# volumen.match(tablerow.contents[3])

NB: you don't need to call close() because the with statement will do it for you.

You can use a simple function to transform each row ( <tr/> ):

import re


def parse_inhalt(content):
    td_list = re.findall(r"<td>(?:(?!</td>).)+</td>", content)
    vol_content = td_list[1]
    vol = re.findall(r";([^)]+)", vol_content)[0]
    return content.replace("&nbsp;", vol)

The code is straightforward:

  • Extract every cell of the row into td_list
  • Get the content of the second cell, which contains the volume
  • Find the volume contained between ";" and ")" (excluding those characters)
  • Replace the &nbsp; with the volume

For instance:

inhalt = u"""\
<tr>
<td><p>L4</p></td>
<td><p>(4.750x2.300x500;   5,5m³)</p></td>
<td><p>&nbsp;</p></td>
<td><p> 1.364 kg</p></td>
</tr>"""

print(parse_inhalt(inhalt))

You get:

<tr>
<td><p>L4</p></td>
<td><p>(4.750x2.300x500;   5,5m³)</p></td>
<td><p>   5,5m³</p></td>
<td><p> 1.364 kg</p></td>
</tr>

You can drop the spaces by using:

vol = re.findall(r";\s*([^)]+)", vol_content)[0]
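
To run this over a whole document rather than a single row, one option is to let re.sub hand each <tr>...</tr> block to the function. This is only a minimal sketch, not part of the answer above: it assumes rows are not nested and that every row has a volume in its second cell, and the name replace_all_rows is made up for illustration.

```python
import re


def parse_inhalt(content):
    # Same function as above, using the variant that drops the leading spaces.
    td_list = re.findall(r"<td>(?:(?!</td>).)+</td>", content)
    vol_content = td_list[1]
    vol = re.findall(r";\s*([^)]+)", vol_content)[0]
    return content.replace("&nbsp;", vol)


def replace_all_rows(html):
    # Hypothetical helper: rewrite every <tr>...</tr> block with parse_inhalt.
    # re.DOTALL lets "." cross the line breaks inside a row.
    return re.sub(
        r"<tr>.*?</tr>",
        lambda m: parse_inhalt(m.group(0)),
        html,
        flags=re.DOTALL,
    )


html = """\
<tr>
<td><p>L1</p></td>
<td><p>(4.000x2.300x500;   4,6m³)</p></td>
<td><p>&nbsp;</p></td>
<td><p> 1.221 kg</p></td>
</tr>
<tr>
<td><p>L2</p></td>
<td><p>(4.250x2.300x500;   4,9m³)</p></td>
<td><p>&nbsp;</p></td>
<td><p> 1.279 kg</p></td>
</tr>"""

result = replace_all_rows(html)
print(result)
```

Passing a function as the replacement means the returned row is inserted literally, so nothing in the HTML is interpreted as a backreference.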

An alternative approach.

First, find all of the table cells and the p elements within them. The p elements that hold the volume are characterised by the presence of m³ within their text, so watch for those; the p element that follows immediately is the one you must change. When you encounter a volume cell, capture the volume (called area in the code below) and note the ordinal number of that p element. Then, when you reach the next p element, change its text by assigning area to its string attribute.

If you prefer regex, you could extract the volume (area) like this, after import re:

area = re.search(r';\s+([^)]+)', p.text).group(1)
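
As a quick check of that pattern on the text of a single cell, here is a small sketch that uses the standard re module directly. The sample strings and the name extract_volume are only for illustration. Note that re.search returns None when nothing matches, so it is worth guarding against that:

```python
import re


def extract_volume(text):
    # Hypothetical helper: return the text between ";" and ")", or None.
    m = re.search(r";\s+([^)]+)", text)
    return m.group(1) if m else None


print(extract_volume("(4.000x2.300x500;   4,6m³)"))  # 4,6m³
print(extract_volume(" 1.221 kg"))                    # None
```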


>>> import bs4
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'lxml')
>>> k = None
>>> for i, p in enumerate(soup.select('td > p')):
...     if 'm³' in p.text:
...         area = p.text[1+p.text.rfind(';'):-1].strip()
...         k = i
...     if k is not None and i == k + 1:
...         p.string = area
... 
>>> soup
<html><body><tr>
<td><p>L1</p></td>
<td><p>(4.000x2.300x500;   4,6m³)</p></td>
<td><p>4,6m³</p></td>
<td><p> 1.221 kg</p></td>
</tr>
<tr>
<td><p>L2</p></td>
<td><p>(4.250x2.300x500;   4,9m³)</p></td>
<td><p>4,9m³</p></td>
<td><p> 1.279 kg</p></td>
</tr>
<tr>
<td><p>L3</p></td>
<td><p>(4.500x2.300x500;   5,2m³)</p></td>
<td><p>5,2m³</p></td>
<td><p> 1.321 kg</p></td>
</tr>
<tr>
<td><p>L4</p></td>
<td><p>(4.750x2.300x500;   5,5m³)</p></td>
<td><p>5,5m³</p></td>
<td><p> 1.364 kg</p></td>
</tr></body></html>
>>> 

If a brute-force regex is acceptable:

s='''
<tr>
 <td><p>L1</p></td>
 <td><p>(4.000x2.300x500;   4,6m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.221 kg</p></td>
 </tr>
 <tr>
 <td><p>L2</p></td>
 <td><p>(4.250x2.300x500;   4,9m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.279 kg</p></td>
 </tr>
 <tr>
 <td><p>L3</p></td>
 <td><p>(4.500x2.300x500;   5,2m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.321 kg</p></td>
 </tr>
 <tr>
 <td><p>L4</p></td>
 <td><p>(4.750x2.300x500;   5,5m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.364 kg</p></td>
 </tr>
'''

import re

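# Groups: 1 = "(dimensions", 2 = "; " separator, 3 = volume,
#         4 = markup up to the third cell's <p>, 5 = the &nbsp; placeholder
p = r'(\([0-9x.]+)(; +)([0-9,m³]+)(\)</p></td>\n <td><p>)(&nbsp;)'

# Not sure which output is preferred.
# Variant 1: keep the second cell unchanged and copy the volume into the third cell.
x = re.sub(p, r'\g<1>\g<2>\g<3>\g<4>\g<3>', s)
print(x)

# Variant 2: move the volume out of the second cell and into the third.
y = re.sub(p, r'\g<1>\g<4>\g<3>', s)
print(y)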
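
The pattern above depends on the exact line break and single space between the two cells. If the indentation of the file may vary, a slightly more tolerant variant could look like the following. It is only a sketch, assuming each cell keeps the <td><p>...</p></td> shape; the group names head and vol are made up for readability.

```python
import re

s = """\
<tr>
 <td><p>L1</p></td>
 <td><p>(4.000x2.300x500;   4,6m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.221 kg</p></td>
 </tr>
<tr>
    <td><p>L2</p></td>
    <td><p>(4.250x2.300x500;   4,9m³)</p></td>
    <td><p>&nbsp;</p></td>
    <td><p> 1.279 kg</p></td>
</tr>
"""

# head: everything from the ";" up to the opening <p> of the next cell.
# vol:  the volume itself; "<" is excluded so a match cannot run across tags.
pattern = re.compile(
    r"(?P<head>;\s*(?P<vol>[^)<]+)\)</p>\s*</td>\s*<td>\s*<p>)&nbsp;"
)

# Keep everything up to the placeholder and put the captured volume in its place.
result = pattern.sub(r"\g<head>\g<vol>", s)
print(result)
```

This corresponds to the first variant above: the second cell is left untouched and the volume is copied into the third.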
