Notepad++ deleting tags with specific text inside

I have a large XML file with products inside. I'm trying to delete all products which are out of stock. The file size is over 20 MB.

<product>
  <name>bla1</name>
  <price>50$</price>
  <stock>yes</stock>
  <description>bla</description>
</product>

<product>
  <name>bla2</name>
  <price>60$</price>
  <stock>no</stock>
  <description>bla</description>
</product>

...

Is it possible to delete them using Notepad++'s regex, or should I use SimpleXML (PHP) or something similar?

My basic PHP code:

$url = 'input/products.xml';
$xml = new SimpleXMLElement(file_get_contents($url));

foreach ($xml->product->children() as $product) {

    //finding out of stock products and deleting them

}
$xml->asXml('output/products.xml');

Foreword

Pattern matching with regular expressions is not ideal for markup. If you have access to PHP, I recommend using a proper XML parsing tool instead. That said, here is a solution you can use in Notepad++.

Description

<product\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s?\/?>(?:(?!</product).)*<stock\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s?\/?>no</stock>(?:(?!</product).)*<\/product>

Replace with: nothing

[Figure: regular expression visualization]

This regular expression will do the following:

  • find the entire product section
  • require the subtag stock
  • require the subtag stock to have a value of no
  • avoid the awkward edge cases that make pattern matching in HTML difficult
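
If you would rather run the same replacement from a script than from the Notepad++ dialog, the expression can be handed to PHP's preg_replace. The following is only a sketch, assuming the whole file fits in memory. The `~` delimiters and the `s` modifier (so that `.` also matches line breaks) are choices made for this example, and the sample data is simply the two products from the question.

```php
<?php
// Sample data: the two products from the question.
$xml = <<<'XML'
<product>
  <name>bla1</name>
  <price>50$</price>
  <stock>yes</stock>
  <description>bla</description>
</product>

<product>
  <name>bla2</name>
  <price>60$</price>
  <stock>no</stock>
  <description>bla</description>
</product>
XML;

// The expression from above, kept in a nowdoc so the quotes and backslashes
// need no extra escaping. "~" is the delimiter; "s" lets "." match newlines.
$pattern = <<<'REGEX'
~<product\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s?\/?>(?:(?!</product).)*<stock\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s?\/?>no</stock>(?:(?!</product).)*<\/product>~s
REGEX;

$result = preg_replace($pattern, '', $xml);
if ($result === null) {
    // preg_replace returns null on failure, e.g. when a PCRE limit is hit.
    throw new RuntimeException('Regex failed, error code ' . preg_last_error());
}

// Only the in-stock product (bla1) should remain.
echo trim($result), PHP_EOL;
```

One caveat when adapting it: because `<product` is followed by an optional attribute group, the expression also accepts an opening tag such as `<products>` as the start of a match, so it is worth trying the replacement on a copy of the file first.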

From Notepad++

Note that you should be using Notepad++ version 6.1 or later, as older versions had problems with regular expressions that have since been fixed.

  1. Press Ctrl+H to enter the find and replace mode

  2. Select the Regular Expression option, and tick ". matches newline" so that . can span the line breaks inside a product

  3. In the "Find what" field place the regular expression

  4. Leave the "Replace with" field empty

  5. Click Replace All

Example

Live Demo

https://regex101.com/r/cW9nC5/1

Sample text

<product>
  <name>bla1</name>
  <price>50$</price>
  <stock>yes</stock>
  <description>bla</description>
</product>

<product>
  <name>bla2</name>
  <price>60$</price>
  <stock>no</stock>
  <description>bla</description>
</product>

After Replace

<product>
  <name>bla1</name>
  <price>50$</price>
  <stock>yes</stock>
  <description>bla</description>
</product>

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <product                 '<product'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the least amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*?                      end of grouping
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  \/?                      '/' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      </product                '</product'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  <stock                   '<stock'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the least amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*?                      end of grouping
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  \/?                      '/' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  >no</stock>              '>no</stock>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      </product                '</product'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  <                        '<'
----------------------------------------------------------------------
  \/                       '/'
----------------------------------------------------------------------
  product>                 'product>'
----------------------------------------------------------------------

I guess Notepad++ will be easier, i.e.:

FIND : <product>\s+<name>.*?<\/name>\s+<price>.*?<\/price>\s+<stock>no<\/stock>\s+<description>.*?<\/description>\s+<\/product>
REPLACE : with nothing


DEMO

https://regex101.com/r/fH0mM7/1


NOTE

Make sure you check Regular Expression at the bottom
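
This shorter pattern relies on the child elements appearing in exactly the order name, price, stock, description, each on its own line. The sketch below illustrates that assumption with PHP's preg_match. The `~` delimiter (which makes the `\/` escapes unnecessary) and the reordered sample product are made up for this example.

```php
<?php
// Same pattern as in FIND, written with "~" delimiters.
$pattern = '~<product>\s+<name>.*?</name>\s+<price>.*?</price>\s+<stock>no</stock>\s+<description>.*?</description>\s+</product>~';

// Children in the order the pattern expects.
$sameOrder = <<<'XML'
<product>
  <name>bla2</name>
  <price>60$</price>
  <stock>no</stock>
  <description>bla</description>
</product>
XML;

// Hypothetical product with <stock> placed before <price>.
$reordered = <<<'XML'
<product>
  <name>bla2</name>
  <stock>no</stock>
  <price>60$</price>
  <description>bla</description>
</product>
XML;

$matchesSameOrder = preg_match($pattern, $sameOrder) === 1;
$matchesReordered = preg_match($pattern, $reordered) === 1;

var_dump($matchesSameOrder);  // bool(true)
var_dump($matchesReordered);  // bool(false)
```

If the products in your file do not all share one layout, the out-of-stock ones that differ will be left in place without any warning.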

You can do this with PHP using the code below:

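<?php
    $url = 'input/products.xml';
    $xml = new SimpleXMLElement(file_get_contents($url));

    // Walk the products backwards so that unset() does not shift
    // the indexes of the products still to be checked.
    for ($i = count($xml->product) - 1; $i >= 0; --$i) {
        // Cast to string to compare the element's text content.
        if ((string) $xml->product[$i]->stock === 'no') {
            unset($xml->product[$i]);
        }
    }
    $xml->asXML('output/products.xml');
?>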
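
As a variation on the same idea, the out-of-stock products can be selected with an XPath expression through PHP's DOM extension instead of looping over indexes. This is a minimal sketch, not a drop-in replacement: the `<products>` root element and the inline sample string are assumptions, since the question does not show the document root.

```php
<?php
// Inline sample; the <products> root element is assumed for illustration.
$source = <<<'XML'
<products>
  <product><name>bla1</name><price>50$</price><stock>yes</stock><description>bla</description></product>
  <product><name>bla2</name><price>60$</price><stock>no</stock><description>bla</description></product>
</products>
XML;

$doc = new DOMDocument();
$doc->loadXML($source);  // for a file: $doc->load('input/products.xml');

$xpath = new DOMXPath($doc);

// Select every <product> that has a <stock> child whose text is "no".
// The matches are copied into an array first, then removed one by one.
$outOfStock = iterator_to_array($xpath->query('//product[stock="no"]'));
foreach ($outOfStock as $node) {
    $node->parentNode->removeChild($node);
}

$result = $doc->saveXML();  // for a file: $doc->save('output/products.xml');
echo $result;
```

The XPath predicate keeps the selection rule in one place, so changing the condition later (for example to a different stock value) means editing only the query string.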
