Regular expression in Python, reading text from a file

Question

I have a regular expression that scans some data from HTML files. The code removes the HTML tags using BeautifulSoup and returns the following text (only part of the text is shown):

/Semester: 2011 / 1 Number : 20112222

Name : XXXX XXXX XXXX XXXX Advisor

This is a sample of my code:

import re,glob,os
from bs4 import BeautifulSoup
import nltk

path = 'C:\\xampp\\htdocs\\data_tools\\transcripts'
os.chdir(path)
delch=','

def scantext(text,snum) :
    re_semstudent = re.compile("Semester:\s*(\d*)\s*\/\s*(\d)\s*Number\s*:\s*(\d{8})\s*Name\s*:\s*(.*)\s*Advisor")
    semesters = text.split("Year")

    for ind in range(1,len(semesters)):
        s = semesters[ind]
        x = re.search(re_semstudent,s)
        if x :
            year=x.group(1)
            semester=x.group(2)
            studentid=x.group(3)
            studentname=x.group(4)

        print year+"#"+semester

    return 0

ii=1
for fname in glob.glob("*.html") :
    f = open (fname)        
    text = BeautifulSoup(f.read(), 'html.parser').getText()
    scantext(text,ii)

When I try re.search with the text as a fixed string, it works fine. But when I pass the text to the scantext function and use semesters = text.split("Year"), I can print the text of each split, yet the regular expression does not match anything!

Answer 1

You need the re.U / re.UNICODE flag:

  re_semstudent = re.compile("Semester:\s*(\d*)\s*\/\s*(\d)\s*Number\s*:\s*(\d{8})\s*Name\s*:\s*(.*)\s*Advisor",re.U)

If you run the code after that change, it should give you something like:

<_sre.SRE_Match object at 0x7fe9fb721df8>
2011#1
<_sre.SRE_Match object at 0x7fe9fb721d50>
2011#2
<_sre.SRE_Match object at 0x7fe9fb721df8>
2012#1
<_sre.SRE_Match object at 0x7fe9fb721d50>
2012#2
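One plausible reason the flag matters, stated here as an assumption because the question does not show the raw HTML: the extracted text may contain non-breaking spaces (U+00A0, typically from &nbsp; entities). In Python 2, \s only matches those when re.U is set. The sketch below targets Python 3, where re.ASCII reproduces the ASCII-only behaviour that Python 2 uses by default. The names PATTERN and sample, and the sample text itself, are illustrative and not taken from the asker's files.

```python
import re

# Equivalent to the pattern in the question, written as a raw string.
PATTERN = r"Semester:\s*(\d*)\s*/\s*(\d)\s*Number\s*:\s*(\d{8})\s*Name\s*:\s*(.*)\s*Advisor"

# Hypothetical sample: most gaps are non-breaking spaces (U+00A0), not ASCII spaces.
sample = (u"Semester:\xa02011\xa0/\xa01\xa0Number\xa0:\xa020112222"
          u"\xa0Name\xa0:\xa0XXXX XXXX\xa0Advisor")

# re.ASCII (Python 3 only) restricts \s to ASCII whitespace,
# which approximates what Python 2 does when re.U is missing.
ascii_only = re.compile(PATTERN, re.ASCII)
# re.UNICODE lets \s match Unicode whitespace such as U+00A0.
unicode_aware = re.compile(PATTERN, re.UNICODE)

no_match = ascii_only.search(sample)
match = unicode_aware.search(sample)

print(no_match)  # the ASCII-only pattern finds nothing in this sample
if match:
    print(match.group(1) + "#" + match.group(2))
```

If the real text turns out to use ordinary spaces, the cause lies elsewhere, so printing repr() of one split segment is a quick way to check.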

You might also need to open the file with encoding="utf-8":

from io import open
for fname in glob.glob("*.html") :
    with open(fname, encoding="utf-8") as f:
        text = BeautifulSoup(f.read(), 'html.parser').getText()
        scantext(text, ii)
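To check what the file really contains, it can help to read it with an explicit encoding and look at the repr() of the decoded text, which makes invisible characters visible. This is a minimal Python 3 sketch under stated assumptions: the temporary file and its one-line content are made up for illustration, and html.unescape from the standard library stands in for the entity decoding that an HTML parser performs, so it is not the asker's BeautifulSoup pipeline.

```python
import io
import os
import tempfile
from html import unescape

# Hypothetical file content: an &nbsp; entity hides between the label and the year.
raw = u"Semester:&nbsp;2011 / 1".encode("utf-8")

# Write the bytes to a temporary .html file so the read step is realistic.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(raw)
    tmp_path = tmp.name

# io.open with an explicit encoding returns decoded text instead of raw bytes.
with io.open(tmp_path, encoding="utf-8") as f:
    text = unescape(f.read())  # decode entities, roughly as a parser would

os.remove(tmp_path)

# repr() exposes characters that look like a plain space when printed.
print(repr(text))
```

A \xa0 in the printed repr would indicate that the gaps are non-breaking spaces rather than ASCII spaces.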
