How to convert a list to a dataframe without losing text formatting in Python?

Question

I web-scraped this webpage.


from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)

data = []

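# Collect the press-conference links, then fetch the first one
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
# Append the ResultSet (a list of <p> tags) as-is
data.append(soup.select('main .section p:not([class])'))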

print(data)

df = pd.DataFrame(data)

# Results (the text may not be identical)
[... <p><strong>Duisenberg:</strong> My answer is, well, in the first place when something is before the courts you do not comment. I don't comment and particularly not when it concerns such an esteemed colleague of mine. So, on the hypothetical question whether other people would be eligible for the job, I think it is wise not to go into that either. </p>]

The problem is that when I turn data into a dataframe, it remains in a list format, which is difficult to handle. I would like it to be saved as a single object without losing its markup ( </p> , </strong> ).

If I do this, it loses the division into paragraphs and the bold markup that I will need for later manipulation.

data = []

u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({
    # .text strips every tag, so <p> and <strong> are lost here
    'text': ' '.join([p.text for p in soup.select('main .section p:not([class])')])
})

df = pd.DataFrame(data)

# With this, however, I lose the breakdown into paragraphs, bold characters, etc. I'd like to keep them in the text.

Can anyone help me with this?

Thanks!

Answer 1

Not sure if I understand correctly, but if you would like to convert the ResultSet to text, you can do it like this:

''.join([str(e) for e in soup.select('main .section p:not([class])')])
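
To see why this keeps the markup, here is a minimal offline sketch. The sample HTML is a hypothetical stand-in for the scraped page (it is not the real ECB content), and the `html.parser` backend is an assumption. Calling `str()` on a tag serializes it with its markup, while `.text` returns only the text nodes:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page.
html = """
<main><div class="section">
<p><strong>Duisenberg:</strong> My answer is no.</p>
<p class="note">Skipped because it has a class attribute.</p>
<p>Second paragraph.</p>
</div></main>
"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.select("main .section p:not([class])")

# .text drops the tags; str() keeps <p> and <strong> intact.
plain = " ".join(p.text for p in paragraphs)
markup = "".join(str(p) for p in paragraphs)

print(plain)
print(markup)
```

The `markup` string is a plain Python `str`, so it can be stored in a single dataframe cell and parsed again later with BeautifulSoup if needed.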

Example

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)

data = []

u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({'text':''.join([str(e) for e in soup.select('main .section p:not([class])')])})

pd.DataFrame(data)

Output

text
<p>Good afternoon, the Vice-President and I welcome you to our press conference. </p><p id="_Hlk93669934">The euro area economy is continuing to r...
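
If the paragraph breakdown itself needs to stay addressable, one possible variation (not part of the answer above) is to store one row per paragraph rather than a single joined string. The sketch below uses hypothetical inline HTML and assumed column names `html` and `text`:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample markup; the real page would be fetched with requests.
html = (
    "<main><div class='section'>"
    "<p><strong>Q:</strong> First.</p>"
    "<p>Second.</p>"
    "</div></main>"
)
soup = BeautifulSoup(html, "html.parser")

# One row per paragraph: the raw markup plus a tag-free copy for searching.
rows = [
    {"html": str(p), "text": p.get_text()}
    for p in soup.select("main .section p:not([class])")
]
df = pd.DataFrame(rows)
print(df)
```

Keeping both columns lets later steps filter on the plain text while still having the `<p>` and `<strong>` tags available for each paragraph.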
