How can I determine if it's OK to change this word… Python script

The goal is to read through HTML files and change all instances of MyWord to Myword, except that the word must NOT be changed if it is found inside, or as part of, a path, file name, or script:

href="..."
src="..."
url(...)
class="..."
id="..."
script inline or linked (file name) --> <script ...></script>
styles inline or linked (file name) --> <link ...>   <style></style>  

Now the question of all questions: how do you determine whether an instance of the word is in a position where it's OK to change it? (Or, how do you determine whether the word is inside one of the locations listed above and should not be changed?)

Here is my code. It can be changed to read line by line, etc., but I just cannot think of how to define and enforce a rule that matches the above...

Here it is:

#!/usr/bin/python

import os
import time
from stat import *

def fileExtension(s):
   i = s.rfind('.')
   if i == -1:
      return ''
   tmp = '|' + s[i+1:] + '|'
   return tmp

def changeFiles():
   # get all files in current directory with desired extension
   files = [f for f in os.listdir('.') if extStr.find(fileExtension(f)) != -1]

   for f in files:
      if os.path.isdir(f):
         continue

      st = os.stat(f)
      atime = st[ST_ATIME] # org access time
      mtime = st[ST_MTIME] # org modification time

      fw = open(f, 'r+')
      tmp = fw.read().replace(oldStr, newStr)
      fw.seek(0)
      fw.write(tmp)
      fw.truncate() # drop leftover bytes if the new text is shorter
      fw.close()

      # put file timestamp back to org timestamp
      os.utime(f,(atime,mtime))

   # if we want to check subdirectories (once per directory, after all files)
   if checkSubDirs :
      dirs = [d for d in os.listdir('.') if os.path.isdir(d)]

      for d in dirs :
         os.chdir(d)
         changeFiles()
         os.chdir('..')

# ==============================================================================
# ==================================== MAIN ====================================

oldStr = 'MyWord'
newStr = 'Myword'
extStr = '|html|htm|'
checkSubDirs = True

changeFiles()  

Does anybody know how? Any suggestions? ANY help is appreciated; I've been racking my brain for 2 days now and just cannot think of anything.

lxml helps with this kind of task.

html = """
<html>
<body>
    <h1>MyWord</h1>
    <a href="http://MyWord">MyWord</a>
    <img src="images/MyWord.png"/>
    <div class="MyWord">
        <p>MyWord!</p>
        MyWord
    </div>
    MyWord
</body><!-- MyWord -->
</html>
"""

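import re
import lxml.etree as etree

tree = etree.fromstring(html)
for elem in tree.iter():
    # elem.text is the text right after the opening tag and elem.tail is
    # the text after the closing tag; attribute values are never touched
    if elem.text:
        elem.text = re.sub(r'MyWord', 'Myword', elem.text)
    if elem.tail:
        elem.tail = re.sub(r'MyWord', 'Myword', elem.tail)

print(etree.tostring(tree, encoding='unicode'))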

The above prints this:

<html>
<body>
    <h1>Myword</h1>
    <a href="http://MyWord">Myword</a>
    <img src="images/MyWord.png"/>
    <div class="MyWord">
        <p>Myword!</p>
        Myword
    </div>
    Myword
</body><!-- Myword -->
</html>
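
Note that the loop above also rewrites text inside <script> and <style> elements and inside comments, because those are ordinary text nodes as far as the tree is concerned. If the rule from the question is that scripts and styles must stay untouched, one option is a streaming rewriter. The following is only a minimal sketch using the standard library's html.parser instead of lxml; the names WordRewriter and replace_in_text are made up for illustration, and it assumes that comments should be left alone too.

```python
from html.parser import HTMLParser


class WordRewriter(HTMLParser):
    """Re-emit HTML as parsed, replacing a word only in visible text."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self, old, new):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared,
        # so href, src, class and id values pass through unchanged
        self.out.append(self.get_starttag_text())
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # the parser lowercases tag names, so </DIV> comes out as </div>
        self.out.append("</%s>" % tag)
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            data = data.replace(self.old, self.new)
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)  # assumption: leave comments alone

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def replace_in_text(html, old="MyWord", new="Myword"):
    parser = WordRewriter(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling replace_in_text(open(f).read()) in place of the plain str.replace in the question's changeFiles() would be one way to wire it in. Unlike etree.fromstring, html.parser does not require the input to be well-formed XML.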

Note: You'll need to make the above code a little more complex if you also need special processing for the contents of script tags, such as the following:

<script>
    var title = "MyWord"; // this should change to "Myword"
    var hoverImage = "images/MyWord-hover.png"; // this should not change
</script>
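
One possible way to handle a case like this is to look only at the string literals inside the script text and skip any literal that looks like a path. The sketch below is a heuristic, not a JavaScript parser: replace_in_script and the "looks like a path" rule (contains a slash, or ends in a short file extension) are assumptions made for illustration, and quotes or apostrophes inside comments can confuse it.

```python
import re

# A single- or double-quoted string literal on one line; \\. skips escaped chars
STRING_LITERAL = re.compile(r'''(["'])((?:\\.|(?!\1).)*)\1''')

# Assumed heuristic: a literal is path-like if it contains a slash or
# backslash, or ends with something like ".png" or ".html"
PATH_LIKE = re.compile(r'[/\\]|\.\w{2,4}$')


def replace_in_script(js, old="MyWord", new="Myword"):
    def fix(match):
        quote, body = match.group(1), match.group(2)
        if PATH_LIKE.search(body):
            return match.group(0)  # leave paths and file names untouched
        return quote + body.replace(old, new) + quote

    return STRING_LITERAL.sub(fix, js)


js = '''var title = "MyWord"; // this should change to "Myword"
var hoverImage = "images/MyWord-hover.png"; // this should not change'''
print(replace_in_script(js))
```

With lxml, this could be applied to elem.text whenever elem.tag == 'script', while the plain substitution is kept for every other element.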

You can use a regex. Here is an example to start with; I hope it helps:

import re

html = """
    <href="MyWord" />
    MyWord
"""

re.sub(r'(?<!href=")MyWord', 'Myword', html)
output: '\n    <href="MyWord" />\n    Myword\n'
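
The lookbehind only protects a word that comes immediately after href=", so something like src="images/MyWord.png" would still be changed. A common regex trick for this is to match the spans that must be protected first and copy them through unchanged. The following is a rough sketch of that idea; the PROTECTED pattern and the function name replace_outside_tags are illustrative, and it assumes that no attribute value contains a literal ">".

```python
import re

# Group 1 matches anything that must be copied through unchanged: whole
# <script>/<style> blocks (case-insensitive), comments, and any single tag.
# The bare MyWord alternative only matches outside those spans.
PROTECTED = re.compile(
    r'((?i:<script\b.*?</script>|<style\b.*?</style>)|<!--.*?-->|<[^>]*>)|MyWord',
    re.DOTALL,
)


def replace_outside_tags(html, new="Myword"):
    # If group 1 matched, return it as-is; otherwise the match is the word
    return PROTECTED.sub(lambda m: m.group(1) if m.group(1) else new, html)


html = """
    <href="MyWord" />
    MyWord
"""
print(repr(replace_outside_tags(html)))
```

This is still pattern matching rather than parsing, so a real HTML parser is the safer choice for messy markup.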

ref: http://docs.python.org/library/re.html
