How to turn a list into a DataFrame without losing text format in Python?
I scraped this webpage.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)
data = []
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({'text': soup.select('main .section p:not([class])')})
print(data)
df = pd.DataFrame(data)
# results (it may not be the same text)
[... <p><strong>Duisenberg:</strong> My answer is, well, in the first place when something is before the courts you do not comment. I don't comment and particularly not when it concerns such an esteemed colleague of mine. So, on the hypothetical question whether other people would be eligible for the job, I think it is wise not to go into that either. </p>]
The problem is that when I turn data into a DataFrame, it remains in a list format, which is difficult to handle. I would like it to be saved as a single object without losing its markup (</p>, </strong>).
If I do this, it loses the division into paragraphs and the bold text that will be needed for manipulation.
data = []
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({
'text':' '.join([p.text for p in soup.select('main .section p:not([class])')])
})
df = pd.DataFrame(data)
# with this however I lose the breakdown in paragraphs, bold characters etc. I'd like to keep them in the text.
Can anyone help me with this?
Thanks!
Not sure if I understand it correctly, but if you want to convert the ResultSet to text while keeping the tags, you can do it like this:
''.join([str(e) for e in soup.select('main .section p:not([class])')])
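To see why this keeps the markup, here is a minimal offline sketch comparing .text with str() on each element. The html string is a hypothetical stand-in for the scraped page (not the real ECB content), so no network call is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page (assumption, not the real content).
html = (
    "<main><div class='section'>"
    "<p><strong>Speaker:</strong> First answer.</p>"
    "<p>Second paragraph.</p>"
    "</div></main>"
)
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.select("main .section p:not([class])")

# .text strips every tag, so paragraph and bold boundaries are gone.
plain = " ".join(p.text for p in paragraphs)

# str() serializes each Tag back to HTML, so <p> and <strong> survive.
markup = "".join(str(p) for p in paragraphs)

print(plain)
print(markup)
```

The joined string is an ordinary Python str, so it fits in a single DataFrame cell.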
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)
data = []
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({'text':''.join([str(e) for e in soup.select('main .section p:not([class])')])})
pd.DataFrame(data)
text
<p>Good afternoon, the Vice-President and I welcome you to our press conference. </p><p id="_Hlk93669934">The euro area economy is continuing to r...
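Because the cell holds plain HTML text, the paragraph and bold structure can be recovered later by parsing the string again. A rough sketch, assuming a 'text' column shaped like the one above; the cell value here is a hypothetical placeholder, not real scraped output.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical cell value shaped like the 'text' column above (assumption).
df = pd.DataFrame(
    [{"text": "<p><strong>Speaker:</strong> Hello.</p><p>Second paragraph.</p>"}]
)

# Re-parse the stored string to get the structure back.
cell_soup = BeautifulSoup(df.loc[0, "text"], "html.parser")

# One entry per paragraph, and the bold parts on their own.
paragraphs = [p.get_text() for p in cell_soup.find_all("p")]
bold_parts = [s.get_text() for s in cell_soup.find_all("strong")]

print(paragraphs)
print(bold_parts)
```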