Searching for language-specific words in Python

I have a program that needs to loop through files and find a Danish word. It then needs to extract the text it finds (between HTML tags) and write it to a CSV file.

My code so far:

from bs4 import BeautifulSoup
import re
import csv
import glob

#def get_danish(text):
#    return re.compile(r'\b({0})\b'.format(text), flags=re.IGNORECASE).search

with open('dk_snip.csv', 'w', newline='') as f_out:
    csv_out = csv.writer(f_out)
#    csv_out.writerow(["Nessus", "ID", "Descrip"])

    for filename in glob.glob('/home/rj/Documents/snip/snips/*'):
        print("Processing:", filename)

        with open(filename) as f_in:
            soup = BeautifulSoup(f_in, 'html5lib')

            var1 = soup.find('li', text = re.compile('Scan vendor:'), attrs = {'class' : 'property_name'})
            var2 = soup.find('li', text = re.compile('Vendor ID:'), attrs = {'class' : 'property_name'})

            vendor = var1.find_next('li').get_text(strip=True)
            vend_id = var2.find_next('li').get_text(strip=True)

#    rows = [[vendor, vend_id, dk_desc.get_text(strip=True)] for dk_desc in soup.find_all("textarea")[:3]]


            for textarea in soup.find_all("textarea"):
                print(len(textarea))
#                if !not textarea[7]:
#                    desc = textarea[7].get_text(strip=True)
#                elif not textarea[7]:
#                    desc = textarea[6].get_text(strip=True)
#                elif not textarea[6]:
#                    desc = "unknown"
#            csv_out.writerow([vendor, vend_id])


#    for elem in soup.select("textarea"):
#        if "disse" in elem:
#            second_text = elem.text.get_text(Strip=True)
#            print(second_text)

#    textarea = soup.find_all(re.compile("textarea"))
#    second_text = textarea[6].text.rstrip(' ')

#    wr.writerow([fin, fin2, second_text])

The files contain a varying number of "textarea" tags, sometimes 5, sometimes 7. The Danish text is not present in all of them, but when it is, it is always in the 6th or 7th. I don't know whether it is easier to search for a given word and extract the text from the tag where it is found, or whether finding it by counting tags is best.

A sample of the HTML:

 <!DOCTYPE html> <html lang="en"> <head> <li class="property_name"> <label for="id_194-description"> Description: </label> </li> <li class="property_value"> <textarea class="mceNoEditor" cols="40" id="id_194-description" name="194-description" rows="10" style="width:100%">According to its version, the installation of Oracle Database on the remote host is no longer supported. Lack of support implies that no new security patches for the product will be released by the vendor. As a result, it is likely to contain security vulnerabilities.</textarea> </li> <li class="property_name"> <label for="id_194-consequence"> Consequence: </label> </li> <li class="property_value"> <textarea class="mceNoEditor" cols="40" id="id_194-consequence" name="194-consequence" rows="10" style="width:100%">The remote host is running an unsupported version of a database server.</textarea> </li> <li class="property_name"> <label for="id_194-solution"> Solution: </label> </li> <li class="property_value"> <textarea class="mceNoEditor" cols="40" id="id_194-solution" name="194-solution" rows="10" style="width:100%">Upgrade to a version of Oracle Database that is currently supported.</textarea> </li> <li class="property_name"> <label for="id_194-cve_id"> Cve id: </label> </li> <li class="property_value"> <textarea class="mceNoEditor" cols="40" id="id_194-cve_id" maxlength="8192" name="194-cve_id" rows="10" style="width:100%; height:80px"></textarea> </li> <input id="id_194-override" name="194-override" type="hidden" value="11953"/> <input id="id_194-priority" name="194-priority" type="hidden"/> <li class="property_name"> Vulnerability priority </li> <li class="property_value"> <select name="prio_194"> <option selected="selected" value="0"> 0 </option> </select> : Oracle Database Unsupported (Nessus) <br/> </li> <li class="property_name"> Save </li> <li class="property_value"> <input type="submit" value="Save vulnerability changes"/> </li> </ul> </form> <br style="clear:both"/> </div> <div 
class="box"> <h4> Related vulnerabilities </h4> <hr/> <h5> Oracle Database Unsupported (Nessus) </h5> <ul> <li class="property_name"> Description </li> <li class="property_value"> According to its version, the installation of Oracle Database on the remote host is no longer supported. <br/> <br/> Lack of support implies that no new security patches for the product will be released by the vendor. As a result, it is likely to contain security vulnerabilities. </li> <li class="property_name"> Consequence </li> <li class="property_value"> The remote host is running an unsupported version of a database server. </li> <li class="property_name"> Solution </li> <li class="property_value"> Upgrade to a version of Oracle Database that is currently supported. </li> </ul> <br style="clear:both"/> </div> <div class="box"> <h4> Create new snippet </h4> <form action="/report/vulnerabilityEditor/? action=edit&amp; id=194&amp; sid=&amp; model=snippet" method="POST"> <ul> <li class="property_name"> <label for="id_language"> Language: </label> </li> <li class="property_value"> <select id="id_language" name="language" style="width:100%"> <option selected="" value="1"> Danish (DK) </option> <option value="2"> English (EN) </option> <option value="3"> Icelandic (IS) </option> </select> </li> <input id="id_vulnerability" name="vulnerability" type="hidden" value="194"/> <li class="property_name"> <label for="id_title"> Title: </label> </li> <li class="property_value"> <input id="id_title" maxlength="100" name="title" style="width:100%" type="text"/> </li> <li class="property_name"> <label for="id_recommendation"> Recommendation: </label> </li> <li class="property_value"> <input id="id_recommendation" maxlength="255" name="recommendation" style="width:100%" type="text"/> </li> <li class="property_name"> <label for="id_snippet"> Snippet: </label> </li> <li class="property_value"> <textarea cols="40" id="id_snippet" name="snippet" rows="10" style="width:100%"></textarea> </li> <li 
class="property_name"> Scan type </li> <li class="property_value"> <select multiple="multiple" name="scan_type" size="6" style="width:100%"> <option selected="selected" value="5"> COMPANY PCI </option> <option selected="selected" value="7"> Other </option> <option selected="selected" value="8"> Firewall Audit </option> <option selected="selected" value="6"> Penetration Test </option> <option selected="selected" value="9"> WIFI Test </option> <option selected="selected" value="10"> APP Test </option> <option selected="selected" value="1"> External Security Analysis </option> <option selected="selected" value="2"> Internal Security Analysis </option> <option selected="selected" value="3"> Web Application Test </option> <option selected="selected" value="4"> Host Discovery Analysis </option> </select> -- Use ctrl to mark multiple types </li> <li class="property_name"> Save </li> <li class="property_value"> <input type="submit" value="Save new snippet"/> </li> </ul> <br style="clear:both;"/> </form> </div> <div class="box"> <h4> Edit snippets </h4> <input id="property_vulnerability_id" type="hidden" value="194"/> <input id="property_url_filter_snippets" type="hidden" value="/report/filterSnippets/"/> <ul> <li class="property_name"> Language </li> <li class="property_value"> <select id="language" name="language"> <option value="0"> All </option> <option value="1"> Danish </option> <option value="2"> English </option> <option value="3"> Icelandic </option> </select> </li> <li class="property_name"> Scan Type </li> <li class="property_value"> <select id="scantype" name="scantype"> <option value="0"> All </option> <option value="5"> COMPANY PCI </option> <option value="7"> Other </option> <option value="8"> Firewall Audit </option> <option value="6"> Penetration Test </option> <option value="9"> WIFI Test </option> <option value="10"> APP Test </option> <option value="1"> External Security Analysis </option> <option value="2"> Internal Security Analysis </option> <option 
value="3"> Web Application Test </option> <option value="4"> Host Discovery Analysis </option> </select> </li> </ul> <br style="clear:both;"/> <div class="snippet"> <form action="/report/vulnerabilityEditor/?action=edit&amp;id=194&amp;sid=1290&amp;model=snippet" method="POST"> <input id="id_1290-vulnerability" name="1290-vulnerability" type="hidden" value="194"/> <hr/> <ul> <li class="property_name"> <label for="id_1290-language"> Language: </label> </li> <li class="property_value"> <select id="id_1290-language" name="1290-language" style="width:100%"> <option value="1"> Danish (DK) </option> <option selected="" value="2"> English (EN) </option> <option value="3"> Icelandic (IS) </option> </select> </li> <li class="property_name"> <label for="id_1290-title"> Title: </label> </li> <li class="property_value"> <input id="id_1290-title" maxlength="100" name="1290-title" style="width:100%" type="text" value="Oracle Database Unsupported"/> </li> <li class="property_name"> <label for="id_1290-recommendation"> Recommendation: </label> </li> <li class="property_value"> <input id="id_1290-recommendation" maxlength="255" name="1290-recommendation" style="width:100%" type="text" value="Upgrade to a version of Oracle Database that is currently supported."/> </li> <li class="property_name"> <label for="id_1290-snippet"> Snippet: </label> </li> <li class="property_value"> <a href="https://cyberopswiki/index.php/How_to:_Add_figure_number_in_snippet" target="_blank"> How to: Add figure number in snippet. 
</a> </li> <li class="property_value"> <textarea cols="40" id="id_1290-snippet" name="1290-snippet" rows="10" style="width:100%">&lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;It has been detected, that the installed version of Oracle Application Server is&amp;nbsp;&lt;strong&gt;XXXX.&amp;nbsp;&lt;/strong&gt;This version is known to be vulnerable to a number of unspecified vulnerabilities, categorized as 'urgent'.&lt;/span&gt;&lt;/p&gt; &lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;As this version is no longer supported for this platform, updates or patches may no longer be released, which have the consequence that vulnerabilities can not be patched, leaving the system vulnerable.&lt;/span&gt;&lt;/p&gt; &lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;In version 10.1.2.0.2 there are, according to http://www.cvedetails.com more than 54 vulnerabilities which affects the installed version.&lt;/span&gt;&lt;/p&gt; &lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: center; line-height: normal;" align="center"&gt;&lt;strong&gt;&lt;em&gt;&lt;span lang="EN-US" style="font-size: 8pt;"&gt;Figure 1: &lt;/span&gt;&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;&lt;span lang="EN-US" style="font-size: 8pt;"&gt;Oracle Application Server version.&lt;/span&gt;&lt;/em&gt;&lt;/p&gt; &lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;More information on these vulnerabilities can be found at:&amp;nbsp;&lt;/span&gt;&lt;span 
style="font-size: 10pt;"&gt;&lt;a href="http://www.cvedetails.com/vulnerability-list/vendor_id-93/product_id-707/version_id-26592/Oracle-Application-Server-10.1.2.0.2.html"&gt;&lt;span lang="EN-US" style="color: blue; mso-ansi-language: EN-US;"&gt;http://www.cvedetails.com/vulnerability-list/vendor_id-93/product_id-707/version_id-26592/Oracle-Application-Server-10.1.2.0.2.html&lt;/span&gt;&lt;/a&gt;&lt;a href="http://www.cvedetails.com/vulnerability-list/vendor_id-93/product_id-707/version_id-26592/Oracle-Application-Server-10.1.2.0.2.html"&gt;&lt;span lang="EN-US" style="color: blue; mso-ansi-language: EN-US;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;.&lt;/span&gt;&lt;/p&gt; &lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&amp;nbsp;&lt;/p&gt; &lt;p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; text-align: justify; line-height: normal;"&gt;&lt;span lang="EN-US" style="font-size: 10pt;"&gt;It is recommended that the installed version is updated as soon as possible to the latest version.&lt;/span&gt;&lt;/p&gt;</textarea> </li> <li class="property_name"> Scan type </li> <li class="property_value"> <select multiple="multiple" name="scan_type" size="6" style="width:100%"> <option selected="selected" value="5"> COMPANY PCI </option> <option selected="selected" value="7"> Other </option> <option selected="selected" value="8"> Firewall Audit </option> <option selected="selected" value="6"> Penetration Test </option> <option selected="selected" value="9"> WIFI Test </option> <option selected="selected" value="10"> APP Test </option> <option selected="selected" value="1"> External Security Analysis </option> <option selected="selected" value="2"> Internal Security Analysis </option> <option selected="selected" value="3"> Web Application Test </option> <option selected="selected" value="4"> Host 
Discovery Analysis </option> </select> -- Use ctrl to mark multiple types </li> <li class="property_name"> Update </li> <li class="property_value"> <input type="submit" value="Update snippet"/> </li> </ul> </form> <br style="clear:both;"/> </div> <div class="snippet"> <form action="/report/vulnerabilityEditor/?action=edit&amp;id=194&amp;sid=172&amp;model=snippet" method="POST"> <input id="id_172-vulnerability" name="172-vulnerability" type="hidden" value="194"/> <hr/> <ul> <li class="property_name"> <label for="id_172-language"> Language: </label> </li> <li class="property_value"> <select id="id_172-language" name="172-language" style="width:100%"> <option selected="" value="1"> Danish (DK) </option> <option value="2"> English (EN) </option> <option value="3"> Icelandic (IS) </option> </select> </li> <li class="property_name"> <label for="id_172-title"> Title: </label> </li> <li class="property_value"> <input id="id_172-title" maxlength="100" name="172-title" style="width:100%" type="text" value="Forældet Oracle Application Server 10g"/> </li> <li class="property_name"> <label for="id_172-recommendation"> Recommendation: </label> </li> <li class="property_value"> <input id="id_172-recommendation" maxlength="255" name="172-recommendation" style="width:100%" type="text"/> </li> <li class="property_name"> <label for="id_172-snippet"> Snippet: </label> </li> <li class="property_value"> <a href="https://cyberopswiki/index.php/How_to:_Add_figure_number_in_snippet" target="_blank"> How to: Add figure number in snippet. 
</a> </li> <li class="property_value"> <textarea cols="40" id="id_172-snippet" name="172-snippet" rows="10" style="width:100%">&lt;p style="font-size: 13px;"&gt;Det konstateret, at den installerede version af Oracle Application Server er&amp;nbsp;&lt;strong&gt;XXXX.&amp;nbsp;&lt;/strong&gt;Denne version indeholder flere kendte samt uspecificeret s&amp;aring;rbarheder, der kategoriseres som v&amp;aelig;rende 'yderst kritiske' og 'kritiske'.&lt;/p&gt; &lt;p style="font-size: 13px;"&gt;Da der ikke l&amp;aelig;ngere komme opdateringer til denne platform, vil disse s&amp;aring;rbarheder ikke blive udbedret, hvorfor systemet er meget udsat.&lt;/p&gt; &lt;p style="font-size: 13px;"&gt;I version 10.1.2.0.2 findes der if&amp;oslash;lge http://www.cvedetails.com ikke mindre end 54 s&amp;aring;rbarheder, der ber&amp;oslash;rer denne version. Mere information om disse findes p&amp;aring; adressen&amp;nbsp;&lt;a href="http://www.cvedetails.com/vulnerability-list/vendor_id-93/product_id-707/version_id-26592/Oracle-Application-Server-10.1.2.0.2.html"&gt;http://www.cvedetails.com/vulnerability-list/vendor_id-93/product_id-707/version_id-26592/Oracle-Application-Server-10.1.2.0.2.html&lt;/a&gt;&lt;a href="http://www.cvedetails.com/vulnerability-list/vendor_id-93/product_id-707/version_id-26592/Oracle-Application-Server-10.1.2.0.2.html"&gt;&amp;nbsp;&lt;/a&gt;.&lt;/p&gt; &lt;p style="font-size: 13px;"&gt;Det anbefales leverand&amp;oslash;ren af software l&amp;oslash;sningen kontakts, s&amp;aring; der hurtigst muligt kan opgraderes til en nyere, supporteret version.&amp;nbsp;&lt;/p&gt;</textarea> </li> 

Question: Searching for language-specific words

"I don't know whether it is easier to search for a given word and extract the text from the tag where it is found, or whether finding it by counting tags is best."

Neither of them:

  • "Searching for a given word" fails if the word is not present, and you have to inspect the text beforehand to know which word to look for.
  • "Counting tags" fails as soon as the position in the list or the page layout changes.

The given HTML is organized mainly into <form>...</form> elements.
Within each <form> there is a <select> element containing <option> elements, one of which carries selected=''. The <form> whose selected <option> is Danish (DK) is the one you are searching for.

  1. Find all <form>...</form> elements:

     forms = soup.find_all('form')
  2. Loop over the form elements:

     for form in forms:
  3. Check whether an <option> tag has a selected attribute and its .text includes 'Danish (DK)':

         danish = [True for option in form.find_all("option", selected=True) if 'Danish (DK)' in option.text]
  4. If danish is not empty, the option with the text 'Danish (DK)' is selected, so print the text of the form's first <textarea>:

         if danish:
             print('MATCH:{}'.format(danish))
             print('{}'.format(form.textarea.text))

Output:

    MATCH:[True]
    <p style="font-size: 13px;">Det konstateret, at den installerede version... (omitted for brevity)
    <p style="font-size: 13px;">Da der ikke l&aelig;ngere komme opdateringer... (omitted for brevity)
    <p style="font-size: 13px;">I version 10.1.2.0.2 findes der if&oslash;lg... (omitted for brevity)
    <p style="font-size: 13px;">Det anbefales leverand&oslash;ren af softwar... (omitted for brevity)
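The textarea text printed above is itself HTML markup, with entities such as &aelig; still in it. If plain Danish text is what should end up in the CSV file, one option is to parse that string a second time. A minimal sketch, assuming the built-in html.parser is good enough for these small fragments; snippet_to_text is a hypothetical helper name and not part of the original answer.

```python
from bs4 import BeautifulSoup


def snippet_to_text(markup):
    """Strip tags and decode entities from a snippet taken from a <textarea>."""
    fragment = BeautifulSoup(markup, "html.parser")
    # Join the text nodes with single spaces and drop surrounding whitespace.
    return fragment.get_text(" ", strip=True)


# Shortened sample input, modelled on the output shown above.
raw = '<p style="font-size: 13px;">Da der ikke l&aelig;ngere komme opdateringer</p>'
print(snippet_to_text(raw))
```

Inside the loop from step 4, this would be called as snippet_to_text(form.textarea.text).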

Tested with Python 3.4.2
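The four steps can also be combined into one function and wired to csv.writer, which is what the question ultimately asks for. This is only a sketch under assumptions: danish_snippets, the inline SAMPLE markup and the "sample.html" label are hypothetical, the built-in html.parser is used instead of html5lib, and forms whose <textarea> is empty (such as a "Create new snippet" form) are skipped.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, heavily reduced stand-in for one of the real files.
SAMPLE = """
<form>
  <select><option value="1">Danish (DK)</option>
          <option selected="" value="2">English (EN)</option></select>
  <textarea>English snippet</textarea>
</form>
<form>
  <select><option selected="" value="1">Danish (DK)</option>
          <option value="2">English (EN)</option></select>
  <textarea>Dansk tekst</textarea>
</form>
"""


def danish_snippets(soup):
    """Return the textarea text of every form whose selected language is Danish (DK)."""
    found = []
    for form in soup.find_all("form"):
        selected = form.find_all("option", selected=True)
        is_danish = any("Danish (DK)" in option.text for option in selected)
        # form.textarea is None when the form has no <textarea> at all.
        if is_danish and form.textarea is not None:
            text = form.textarea.text.strip()
            if text:  # skip forms with an empty snippet
                found.append(text)
    return found


soup = BeautifulSoup(SAMPLE, "html.parser")
snippets = danish_snippets(soup)

# In the real script this would be the open 'dk_snip.csv' file object.
f_out = io.StringIO()
csv_out = csv.writer(f_out)
for snippet in snippets:
    csv_out.writerow(["sample.html", snippet])  # placeholder first column
print(f_out.getvalue())
```

In the question's loop, vendor and vend_id would take the place of the placeholder column, and soup would be built from each file in turn.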
