
Finding text strings in JavaScript

I have a large valid JavaScript file (utf-8), from which I need to extract all text strings automatically.

For simplicity, the file doesn't contain any comment blocks, only valid ES6 JavaScript code.

Once I find an occurrence of ', " or `, I have to scan for the end of the text block, and that is where I got stuck, given all the possible variations, like "'" , '"' , "\'", '\"', '" , `\``, etc.

Is there a known and/or reusable algorithm for detecting the end of a valid ES6 JavaScript text block?

UPDATE-1: My JavaScript file isn't just large; I also have to process it as a stream, in chunks, so Regex is absolutely not usable. I didn't want to complicate my question by mentioning joined chunks of code. I will figure that out myself if I have an algorithm that works for a single piece of code that's in memory.

UPDATE-2: I got this working initially, thanks to the advice given here, but then I got stuck again because of regular expressions.

Examples of regular expressions that break every text detection technique suggested so far:

/'/
/"/
/\`/

Having studied the matter more closely by reading "How does JavaScript detect regular expressions?", I'm afraid that detecting regular expressions in JavaScript is a whole new ball game, worth a separate question, or else it gets too complicated. But I would very much appreciate it if somebody could point me in the right direction with this issue...

UPDATE-3: After much research I found, with regret, that I cannot come up with an algorithm that would work in my case, because the presence of regular expressions makes the task far more complicated than initially thought. According to "When parsing Javascript, what determines the meaning of a slash?", determining the beginning and end of regular expressions in JavaScript is one of the most complex and convoluted tasks. And without it we cannot figure out whether the symbols ', " and ` are opening a text block or are inside a regular expression.
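To make the difficulty concrete, here is a minimal sketch of the usual "look at the previous token" heuristic for deciding what a slash means. The function name slashStartsRegex, the prevToken parameter and the keyword list are assumptions made for illustration; they do not come from any library, and the heuristic is knowingly incomplete (a slash after ")" or "}" is ambiguous without real parsing).

```javascript
// Heuristic only: guesses whether a "/" starts a regular expression literal,
// based on the previous significant token.
// `prevToken` (hypothetical input) is the text of the last token before the
// slash that is not whitespace or a comment; '' means start of input.
const KEYWORDS_BEFORE_REGEX = new Set([
  'return', 'typeof', 'instanceof', 'in', 'of', 'new', 'delete',
  'void', 'throw', 'case', 'do', 'else', 'yield', 'await',
]);

function slashStartsRegex(prevToken) {
  if (prevToken === '') return true;                      // nothing before it
  if (KEYWORDS_BEFORE_REGEX.has(prevToken)) return true;  // e.g. return /x/
  if (/[\w$]$/.test(prevToken)) return false;             // identifier or number: division
  if (/['"`]$/.test(prevToken)) return false;             // string or template: division
  // Closing brackets are treated as the end of an expression. This is wrong
  // for cases such as `if (x) /re/.test(y)`, which need a real parser.
  if (prevToken === ')' || prevToken === ']' || prevToken === '}') return false;
  return true;                                            // operators and opening punctuation
}

console.log(slashStartsRegex('='));  // true:  var r = /'/
console.log(slashStartsRegex('x'));  // false: x / 2
```

Even this small function needs a tokenizer in front of it to supply prevToken, which is why the problem cannot be solved by scanning for quote characters alone.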

The only way to parse JavaScript is with a JavaScript parser. Even if you were able to use regular expressions, they are not powerful enough to do what you are trying to do here.

You could either use one of several existing parsers, which are very easy to use, or you could write your own, simplified to focus on the string extraction problem. I can hardly imagine you want to write your own parser, even a simplified one. You will spend much more time writing and maintaining it than you might think.

For instance, an existing parser will handle something like the following without breaking a sweat.

`foo${"bar"+`baz`}`

The obvious candidate parsers are esprima and babel.
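To see why a hand-rolled scanner cannot simply look for the next backtick in the nested template example, here is a minimal sketch of what finding the end of such a literal involves. The names skipLiteral and skipSubstitution are hypothetical, and the sketch assumes no comments or regular expression literals appear inside the ${...} parts, which a real parser would also have to handle.

```javascript
// Returns the index just past the literal that starts at src[start],
// where src[start] is one of ' " `.
function skipLiteral(src, start) {
  const quote = src[start];
  let i = start + 1;
  while (i < src.length) {
    const c = src[i];
    if (c === '\\') { i += 2; continue; }       // skip the escaped character
    if (c === quote) return i + 1;               // closing delimiter found
    if (quote === '`' && c === '$' && src[i + 1] === '{') {
      i = skipSubstitution(src, i + 2);          // jump over ${ ... }
      continue;
    }
    i++;
  }
  throw new SyntaxError('unterminated literal at index ' + start);
}

// Returns the index just past the "}" closing a ${ ... } substitution
// whose body starts at src[start]. Nested literals are skipped recursively.
function skipSubstitution(src, start) {
  let depth = 1;
  let i = start;
  while (i < src.length) {
    const c = src[i];
    if (c === "'" || c === '"' || c === '`') { i = skipLiteral(src, i); continue; }
    if (c === '{') depth++;
    if (c === '}' && --depth === 0) return i + 1;
    i++;
  }
  throw new SyntaxError('unterminated substitution at index ' + start);
}

const src = '`foo${"bar"+`baz`}` + x';
console.log(src.slice(0, skipLiteral(src, 0)));  // `foo${"bar"+`baz`}`
```

The two functions call each other because a template can contain an expression that contains another template, to any depth. That mutual recursion is the part an existing parser already gets right.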

By the way, what are you planning to do with these strings once you extract them?

If you only need an approximate answer, or if you want to get the string literals exactly as they appear in the source code, then a regular expression can do the job.

Given the string literal "\n" , do you expect a single-character string containing a newline or the two characters backslash and n?

  • In the former case you need to interpret escape sequences exactly like a JavaScript interpreter does. What you need is a lexer for JavaScript, and many people have already programmed this piece of code.
  • In the latter case the regular expression has to recognize escape sequences like \x40 and \… , so even in that case you should copy the code from an existing JavaScript lexer.

See https://github.com/douglascrockford/JSLint/blob/master/jslint.js , function tokenize .
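As a rough illustration of the first case (turning the raw text between the quotes into the string value), here is a minimal sketch of an escape decoder. The name decodeEscapes and the SIMPLE_ESCAPES table are made up for this example and are not taken from JSLint. The sketch assumes well-formed input: it does not reject malformed escapes the way a real lexer would, and it ignores legacy octal escapes.

```javascript
// Single-character escapes and the characters they stand for.
const SIMPLE_ESCAPES = { n: '\n', t: '\t', r: '\r', b: '\b', f: '\f', v: '\v', 0: '\0' };

// Decodes escape sequences in the raw body of a string literal
// (the source text between the quotes).
function decodeEscapes(raw) {
  return raw.replace(
    /\\(?:u\{([0-9a-fA-F]+)\}|u([0-9a-fA-F]{4})|x([0-9a-fA-F]{2})|(\r\n|[\s\S]))/g,
    (match, codePoint, unicode, hex, other) => {
      if (codePoint !== undefined) return String.fromCodePoint(parseInt(codePoint, 16));
      if (unicode !== undefined) return String.fromCharCode(parseInt(unicode, 16));
      if (hex !== undefined) return String.fromCharCode(parseInt(hex, 16));
      // A backslash followed by a line terminator is a line continuation.
      if (/^[\r\n\u2028\u2029]/.test(other)) return '';
      // Known single-character escape, otherwise the character itself (\' \" \\ ...).
      return SIMPLE_ESCAPES[other] ?? other;
    }
  );
}

console.log(decodeEscapes('\\x40'));         // @
console.log(decodeEscapes('\\n') === '\n');  // true
```

The input strings in the two calls are written with a doubled backslash only because they are themselves JavaScript string literals; the function receives a single backslash followed by x40 or n.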

Try the code below:

    var txt = "var z,b \n;z=10;\n b='321`1123`321321';\n c='321`321`312`3123`';";

    function fetchStrings(txt) {
      var result = [];
      for (var i = 0; i < txt.length; i++) {
        // Characters that can open a string literal
        if (txt[i] === "'" || txt[i] === '"' || txt[i] === "`") {
          // Find the matching closing quote (escaped quotes are not handled)
          var end = txt.indexOf(txt[i], i + 1);
          if (end === -1) break; // unterminated string: stop scanning
          result.push(txt.slice(i + 1, end));
          // Jump to the closing quote; the loop increment moves past it
          i = end;
        }
      }
      return result;
    }

    console.log(fetchStrings(txt));

Can I let you test it yourself? I believe you should be able to use this solution with chunks after a few tweaks (for example, resetting i to 0 for each new chunk might be a good starting point). I'm happy to keep working on your question, though I'd like you to tell me whether I'm heading in the right direction :-)

This code uses recursion to keep track of the current state (code, string, comment or regex). I'm not familiar with processing big files, so I'm afraid it can lead to a stack overflow. As a workaround, you could save the state in a global variable and do all of this iteratively.

    var strings = [];
    code(document.getElementsByTagName('script')[0].textContent, 0);
    document.write('<pre>' + JSON.stringify(strings, null, 2) + '</pre>');

    // Plain code: look for the start of a string, comment or regex
    function code (text, i) {
      if (i < text.length) {
        var c = text.charAt(i);
        if (/`|'|"/.test(c)) {
          strings.push('');
          string(text, i + 1, text.charAt(i));
        } else if (c == '/') {
          slash(text, i + 1);
        } else {
          code(text, i + 1);
        }
      }
    }

    // Inside a string: collect characters until the matching quote
    function string (text, i, quote) {
      if (i < text.length) {
        var step, c = text.charAt(i);
        if (c == quote) {
          code(text, i + 1);
        } else {
          // A backslash escapes the next character, so take both at once
          step = c == '\\' ? 2 : 1;
          strings[strings.length - 1] += text.substr(i, step);
          string(text, i + step, quote);
        }
      }
    }

    // After a slash: decide between comment and regex
    function slash (text, i) {
      if (i < text.length) {
        var c = text.charAt(i);
        if (c == '/') {
          singlelinecomment(text, i + 1);
        } else if (c == '*') {
          multilinecomment(text, i + 1, '');
        } else {
          regex(text, i + 1);
        }
      }
    }

    function singlelinecomment (text, i) {
      if (i < text.length) {
        var c = text.charAt(i);
        if (c == '\n') {
          code(text, i + 1);
        } else {
          singlelinecomment(text, i + 1);
        }
      }
    }

    function multilinecomment (text, i, prev) {
      if (i < text.length) {
        var c = text.charAt(i);
        if (prev == '*' && c == '/') {
          code(text, i + 1);
        } else {
          multilinecomment(text, i + 1, c);
        }
      }
    }

    // Inside a regex: skip to the closing slash (escapes are not handled)
    function regex (text, i) {
      if (i < text.length) {
        var c = text.charAt(i);
        if (c == '/') {
          code(text, i + 1);
        } else {
          regex(text, i + 1);
        }
      }
    }

    <script>
    var s = "";
    var r = /'allo'/; // "single line comment"
    var f = function(){ return '`a str\'ing`'; };
    /** 'multi line' `comment` **/
    var o = { "prop": "va\"'lue" };
    var l = '\
    a\
    multi\
    line\
    string';
    </script>
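The iterative workaround mentioned above could look roughly like the following sketch. It keeps the same states as the recursive snippet but stores them in variables, so input can be fed in chunks and no recursion is involved. The name createStringScanner and its write/strings interface are hypothetical. The limitations are inherited from the recursive version: every slash not followed by "/" or "*" is assumed to start a regex (so division is misread), regex character classes are not handled, and template substitutions (${...}) are not parsed.

```javascript
// Iterative, chunk-friendly state machine (a sketch, not a full lexer).
function createStringScanner() {
  const strings = [];     // raw text of every string found so far
  let state = 'code';     // code | string | slash | lineComment | blockComment | regex
  let quote = '';         // delimiter of the string being read
  let escaped = false;    // previous character was an unconsumed backslash
  let prevStar = false;   // previous character in a block comment was "*"
  let current = '';       // raw text of the string being read

  function write(chunk) {
    for (const c of chunk) {
      switch (state) {
        case 'code':
          if (c === "'" || c === '"' || c === '`') { state = 'string'; quote = c; current = ''; }
          else if (c === '/') state = 'slash';
          break;
        case 'string':
          if (escaped) { current += c; escaped = false; }
          else if (c === '\\') { current += c; escaped = true; }
          else if (c === quote) { strings.push(current); state = 'code'; }
          else current += c;
          break;
        case 'slash':
          if (c === '/') state = 'lineComment';
          else if (c === '*') { state = 'blockComment'; prevStar = false; }
          else { state = 'regex'; escaped = c === '\\'; }
          break;
        case 'lineComment':
          if (c === '\n') state = 'code';
          break;
        case 'blockComment':
          if (prevStar && c === '/') state = 'code';
          prevStar = c === '*';
          break;
        case 'regex':
          if (escaped) escaped = false;
          else if (c === '\\') escaped = true;
          else if (c === '/') state = 'code';
          break;
      }
    }
  }

  return { write, strings };
}

// A string split across two chunks is still collected as one entry.
const scanner = createStringScanner();
scanner.write('var a = "hel');
scanner.write('lo"; var b = \'x\';');
console.log(scanner.strings);  // [ 'hello', 'x' ]
```

Because all state lives in the closure, a chunk boundary can fall anywhere, including in the middle of a string or right after a backslash, without any special handling.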
