
Extracting some text in a sentence from a website in Python

I am stuck trying to extract some of the text in a sentence from this website.

import pandas as pd
import requests
from bs4 import BeautifulSoup


res = requests.get('http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/')
soup4 = BeautifulSoup(res.content, 'html.parser')

soup4.findAll('div','excerpt')

Below is the output. I only want to extract the sentence that comes before Translation: in each HTML tag, and then add those sentences to a pandas DataFrame.


[<div class="excerpt">
 <p>A ki i fi ara eni se oogun alokunna. Translation: One does not use oneself as an ingredient in a medicine requiring that the ingredients be pulverized. Meaning; Self-preservation is a compulsory project for all.</p>
 </div>, <div class="excerpt">
 <p>A ki i fi ai-mo-we mookun. Translation: One does not dive under water without knowing how to swim. Meaning: Never engage in a project for which you lack the requisite skills.</p>
 </div>, <div class="excerpt">
 <p>A ki i fi agba sile sin agba. Translation: One does not leave one elder sitting to walk another elder part of his way. meaning: One should not slight one person in order to humor another.</p>
 </div>, <div class="excerpt">
 <p>A ki i fa ori lehin olori. Translation: One does not shave a head in the absence of the owner. Meaning: One does not settle a matter in the absence of the person most concerned.</p>
 </div>, <div class="excerpt">
 <p>A ki i duni loye ka fona ile-e Baale hanni. Translation: One does not compete with another for a chieftaincy title and also show the way to the king’s house to the competitor. Meaning: A person should be treated either as an adversary or as an ally, not as both.</p>
 </div>, <div class="excerpt">
 <p>A ki i du ori olori ki awodi gbe teni lo. Translation: One does not fight to save another person’s head only to have a kite carry one’s own away. Meaning: One should not save other’s at the cost of one’s own safety.</p>
 </div>, <div class="excerpt">
 <p>A ki i da eru ikun pa ori. Translation: One does not weigh the head down with a load that belongs to the belly. Meaning: Responsibilities should rest where they belong.</p>
 </div>, <div class="excerpt">
 <p>A ki i da aro nisokun ala la nlo. Translation: One does not engage in a dyeing trade in (isokun) people there wear only white. Meaning Wherever one might be, one should respect the manners and habits of the place.</p>
 </div>, <div class="excerpt">
 <p>A ki  bo sinu omi tan ka maa sa fun otutu. Translation: Does not enter into the water and then run from the cold. Meaning: Precautions are useful only before the event.</p>
 </div>, <div class="excerpt">
 <p>A fun o lobe o tami si; o gbon ju olobe lo. Translation: You are given some stew and you add water; you must be wiser than the cook. Meaning: Adding water is a means of stretching stew. A person who thus stretches the stew he or she is given would seem to know better than the person who served it how much would suffice for the meal.</p>
 </div>]

One solution is to add the text to a DataFrame and then clean the data with .str.extract():

import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/')
soup4 = BeautifulSoup(res.content, 'html.parser')

df = pd.DataFrame([div.get_text(strip=True) for div in soup4.findAll('div','excerpt')], columns=['Proverb'])

df['Proverb'] = df['Proverb'].str.extract(r'^(.*)\s+Translation')
print(df)

Prints:

                                       Proverb
0         A ki i fi ara eni se oogun alokunna.
1                   A ki i fi ai-mo-we mookun.
2                A ki i fi agba sile sin agba.
3                   A ki i fa ori lehin olori.
4  A ki i duni loye ka fona ile-e Baale hanni.
5    A ki i du ori olori ki awodi gbe teni lo.
6                   A ki i da eru ikun pa ori.
7            A ki i da aro nisokun ala la nlo.
8   A ki  bo sinu omi tan ka maa sa fun otutu.
9  A fun o lobe o tami si; o gbon ju olobe lo.
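
The `.str.extract()` step can be tried offline on a couple of the excerpt strings shown in the question, without fetching the page. The following is a minimal sketch, assuming the texts are already in a list (the name `samples` is only illustrative). It passes the pattern as a raw string and uses `expand=False` so that a Series comes back for the single capture group:

```python
import pandas as pd

# Two excerpt texts copied from the question's output (illustrative sample only).
samples = [
    "A ki i da eru ikun pa ori. Translation: One does not weigh the head down with a load that belongs to the belly. Meaning: Responsibilities should rest where they belong.",
    "A ki i fa ori lehin olori. Translation: One does not shave a head in the absence of the owner. Meaning: One does not settle a matter in the absence of the person most concerned.",
]

df = pd.DataFrame(samples, columns=["Proverb"])

# Capture everything before the whitespace that precedes "Translation".
# expand=False returns a Series when there is a single capture group.
df["Proverb"] = df["Proverb"].str.extract(r"^(.*)\s+Translation", expand=False)

print(df["Proverb"].tolist())
```

Rows whose text does not contain "Translation" would come back as NaN from `.str.extract()`, so it may be worth checking for missing values before using the column.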

Or use the re module beforehand:

import re

df = pd.DataFrame([re.sub(r'^(.*)\s+Translation:.*', r'\1', div.get_text(strip=True)) for div in soup4.findAll('div','excerpt')], columns=['Proverb'])
print(df)
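
The `re.sub` idea can also be checked on plain strings first. This is a sketch only: `proverb_only` is a hypothetical helper, and the pattern is a variation on the one above (a non-greedy group, so it stops at the first "Translation:"), not the answer's exact expression:

```python
import re

# Capture the text before the first "Translation:", then consume the rest of the string.
PATTERN = re.compile(r"^(.*?)\s*Translation:.*", flags=re.DOTALL)


def proverb_only(text):
    """Return the part of `text` before 'Translation:' (hypothetical helper).

    If the marker is absent, re.sub finds nothing to replace and the
    stripped text is returned unchanged.
    """
    return PATTERN.sub(r"\1", text.strip())


print(proverb_only("A ki i fi ai-mo-we mookun. Translation: One does not dive under water without knowing how to swim."))
print(proverb_only("No marker here."))
```

Unlike `.str.extract()`, this leaves a string without the marker as it is, which may or may not be the behaviour you want for malformed entries.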

# Another approach: split each paragraph's text on 'Translation:' and keep the first part.
import pandas as pd
import requests
from bs4 import BeautifulSoup


res = requests.get('http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/')
soup4 = BeautifulSoup(res.content, 'html.parser')

data = soup4.findAll('div','excerpt')
for i in data:
    # i.p.text is the full paragraph; the piece before 'Translation:' is the proverb
    print(i.p.text.split('Translation:')[0])
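
The split-based loop prints the proverbs but does not yet put them in a DataFrame, which is what the question asked for. A possible way to do that is sketched below, using inline HTML (two excerpts copied from the question) in place of the live page so that it runs without network access. The names `html` and `proverbs` are illustrative:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML standing in for the fetched page (two excerpts copied from the question).
html = """
<div class="excerpt">
 <p>A ki i fi agba sile sin agba. Translation: One does not leave one elder sitting to walk another elder part of his way. meaning: One should not slight one person in order to humor another.</p>
</div>
<div class="excerpt">
 <p>A ki i da eru ikun pa ori. Translation: One does not weigh the head down with a load that belongs to the belly. Meaning: Responsibilities should rest where they belong.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep the text before 'Translation:' and strip the trailing space left by the split.
proverbs = [
    div.p.get_text().split("Translation:")[0].strip()
    for div in soup.find_all("div", "excerpt")
]

df = pd.DataFrame(proverbs, columns=["Proverb"])
print(df)
```

With the real page, `html` would be replaced by `res.content` from the request above.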

