
How to turn list into dataframe without losing text format in Python?

I web-scraped this webpage.


from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)

data = []

u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
# append the raw result set (a list of <p> tags) for the first linked page
data.append(soup.select('main .section p:not([class])'))

print(data)

df = pd.DataFrame(data)

# results (the text may not be the same)
[... <p><strong>Duisenberg:</strong> My answer is, well, in the first place when something is before the courts you do not comment. I don't comment and particularly not when it concerns such an esteemed colleague of mine. So, on the hypothetical question whether other people would be eligible for the job, I think it is wise not to go into that either. </p>]

The problem is that when I turn data into a dataframe, it remains in a list format, which is difficult to handle. I would like it to be saved as a single object without losing its markup ( </p> , </strong> ).

If I do this, it loses the division into paragraphs and the bold markup, which I will need for later manipulation.

data = []

u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({
    'text':' '.join([p.text for p in soup.select('main .section p:not([class])')])
})

df = pd.DataFrame(data)

# with this however I lose the breakdown in paragraphs, bold characters etc. I'd like to keep them in the text.

Can anyone help me with this?

Thanks!

Not sure if I understand it correctly, but if you would like to convert the result set to text while keeping the tags, you can do it like this:

''.join([str(e) for e in soup.select('main .section p:not([class])')])
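
To see why this keeps the markup, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for the scraped page (an assumption, so the example runs without a network call), and the variable names are only illustrative. Calling str() on a tag serializes it with its tags, while .text returns only the text content.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a scraped page, so the example runs offline.
html = """
<main><div class="section">
<p><strong>Duisenberg:</strong> My answer is no.</p>
<p class="meta">skipped</p>
<p>Second paragraph.</p>
</div></main>
"""

soup = BeautifulSoup(html, "html.parser")

# Same selector as above: <p> elements without a class attribute.
paragraphs = soup.select("main .section p:not([class])")

# .text drops the tags, so paragraph and bold boundaries are lost.
plain = " ".join(p.text for p in paragraphs)

# str() serializes each tag, so <p> and <strong> survive in the string.
markup = "".join(str(p) for p in paragraphs)

print(plain)
print(markup)
```

The markup string is an ordinary Python str, so it fits in a single dataframe cell.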

Example

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)

data = []

u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({'text':''.join([str(e) for e in soup.select('main .section p:not([class])')])})

pd.DataFrame(data)

Output

text
<p>Good afternoon, the Vice-President and I welcome you to our press conference. </p><p id="_Hlk93669934">The euro area economy is continuing to r...
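
Because the cell holds an HTML string, it can be parsed again later when the paragraphs or bold parts are needed. The following is only a sketch under assumptions: the dataframe content is a made-up sample shaped like the output above, and the helper names split_paragraphs and bold_parts are hypothetical, not part of bs4 or pandas.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stored value, shaped like the 'text' column shown above.
df = pd.DataFrame(
    {"text": ["<p><strong>Question:</strong> First.</p><p>Second.</p>"]}
)

def split_paragraphs(html):
    """Return the plain text of each <p> element in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text().strip() for p in soup.find_all("p")]

def bold_parts(html):
    """Return the text of each <strong> element in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [s.get_text() for s in soup.find_all("strong")]

# Each cell becomes a list recovered from the stored markup.
df["paragraphs"] = df["text"].apply(split_paragraphs)
df["bold"] = df["text"].apply(bold_parts)

print(df.loc[0, "paragraphs"])
print(df.loc[0, "bold"])
```

Re-parsing on demand keeps the stored column as a single string per page, which seems to be what the question asks for, while still allowing paragraph-level work afterwards.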
