In Perl, how can I extract certain strings from a minified JavaScript source file?

I have this ugly file:

{message:"What this does is, every time the mouse moves in the canvas area, it sets mouseX and mouseY to the location of the mouse.",},{message:"Then, when each ball is updated, it figures out how far away from the mouse it is, and accelerates toward it.",},{message:"The acceleration is the square root of the distance, so it pulls harder when it is really far away. Imagine all the balls being connected to the mouse by little rubber bands or springs. It's a little like that.",},{message:"Try making the balls smaller! And add more of them! I like it with about 40 small balls chasing the mouse.",},{message:"Great job! Like what you learned? Was it fun?",code:"",hiddenCode:"var c = document.getElementById('pane').getContext('2d');\nfunction rgba(r,g,b,a) {return 'rgba('+[r,g,b,a].join(',')+')';}\nfunction rgb(r,g,b,a) {return 'rgb('+[r,g,b].join(',')+')';}\n\n",lessonSection:"The End",},{message:"Wow, you did everything! Congratulations, nice work! A lot of these are really hard. I'm impressed you finished! I hope you enjoyed it!",code:'var pane = document.getElementById(\'pane\');\nvar s = 3;\n\npane.onmousemove = function(evt) {\n c.fillStyle = randomRGBA();\n var x = evt.clientX;\n var y = evt.clientY;\n c.fillRect(x - s / 2, y - s / 2, s, s);};\n\nfunction randomRGBA() {\n var r = randInt(255);\n var g = randInt(255);\n var b = randInt(255);\n var a = Math.random();\n var rgba = [r,g,b,a].join(",");\n return "rgba(" + rgba + ")";\n}\nfunction randInt(limit) {\n var x =

I am trying to use a Perl regex to extract the body of each message.

I have spent two or three hours on it, but I can't seem to extract it.

My goal is to translate the messages from English into other languages, so I want the message strings in a clean file instead of working on this ugly file that combines both messages and code.

I was trying to use this code:

use strict;
use warnings;

my $filename = 'test.txt';
my $row = '';

if (open(my $fh, '<:encoding(UTF-8)', $filename)) {
  while ($row = <$fh>) {
    if ($row =~/message:(.*)/)
    {
        print $1 . "\n";
    }
  }
} 
else {
  warn "Could not open file '$filename' $!";
}

It gives me basically the entire file as output. I tried \W+ and \s+, which gave me only the first word.

Any ideas?

Your problem is that the .* that you use in your regex is "greedy". It grabs as much of the input data as possible - which goes right to the end of the file.
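A quick way to see this on a short string, alongside the non-greedy form discussed next. This is only a sketch: the `$sample` string is made-up data in the same shape as the file, not taken from it.

```perl
use strict;
use warnings;

# Hypothetical sample in the same shape as the minified file: no newlines.
my $sample = '{message:"first",},{message:"second",},';

# Greedy: .* runs to the end of the string, because there is no newline to stop it.
my ($greedy) = $sample =~ /message:(.*)/;

# Non-greedy: .*? stops at the first point where the rest of the pattern matches.
my ($lazy) = $sample =~ /message:"(.*?)"/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture contains everything after the first `message:`, while the non-greedy one contains only `first`.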

You need to change that to .*? so that it grabs as little as possible. But you also need to define better markers for the beginning and end of the match. It looks to me like your message is always in double quotes, so let's use that.

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

my $input = do { local $/; <> };

# Look for 'message:', then capture the following " and
# the minimal amount of text until you get the next ". Also
# check for a following comma - to be safe. The /g flag makes
# each iteration continue from where the previous match ended.
while ($input =~ /message:(".*?"),/g) {
  say $1;
}

This will work unless your messages have embedded double-quote marks (which will presumably be escaped as \"). If that's the case, you'll need something more complex.
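One possible shape for that "more complex" pattern is sketched below. It assumes embedded quotes are escaped with a backslash, as in ordinary JavaScript string literals; the `$input` string is hypothetical sample data, since the fragment in the question contains no such message.

```perl
use strict;
use warnings;

# Hypothetical input: the second message contains escaped double quotes.
my $input = '{message:"plain",},{message:"say \"hi\" now",code:"",},';

# Inside the quotes, accept either a character that is neither a quote nor a
# backslash, or a backslash followed by any single character.
my @messages = $input =~ /message:"((?:[^"\\]|\\.)*)"/gs;

# Undo the escaping of double quotes for display.
s/\\"/"/g for @messages;

print "$_\n" for @messages;
```

Because a backslash is only ever consumed together with the character after it, an escaped quote can never be mistaken for the closing quote.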

The problem is that there are no newlines in the data, so your .* matches the whole of the rest of the file. Try /message:"([^"]*)/, which matches only characters that aren't double quotes.

I wrote this:

use strict;
use warnings;
use 5.010;

my $data = do {
    local $/;
    <DATA>;
};

say "$1: $2" while $data =~ /[{,](\w+):"([^"]*)/g;

__DATA__
{message:"What this does is, every time the mouse moves in the canvas area, it sets mouseX and mouseY to the location of the mouse.",},{message:"Then, when each ball is updated, it figures out how far away from the mouse it is, and accelerates toward it.",},{message:"The acceleration is the square root of the distance, so it pulls harder when it is really far away. Imagine all the balls being connected to the mouse by little rubber bands or springs. It's a little like that.",},{message:"Try making the balls smaller! And add more of them! I like it with about 40 small balls chasing the mouse.",},{message:"Great job! Like what you learned? Was it fun?",code:"",hiddenCode:"var c = document.getElementById('pane').getContext('2d');\nfunction rgba(r,g,b,a) {return 'rgba('+[r,g,b,a].join(',')+')';}\nfunction rgb(r,g,b,a) {return 'rgb('+[r,g,b].join(',')+')';}\n\n",lessonSection:"The End",},{message:"Wow, you did everything! Congratulations, nice work! A lot of these are really hard. I'm impressed you finished! I hope you enjoyed it!",code:'var pane = document.getElementById(\'pane\');\nvar s = 3;\n\npane.onmousemove = function(evt) {\n c.fillStyle = randomRGBA();\n var x = evt.clientX;\n var y = evt.clientY;\n c.fillRect(x - s / 2, y - s / 2, s, s);};\n\nfunction randomRGBA() {\n var r = randInt(255);\n var g = randInt(255);\n var b = randInt(255);\n var a = Math.random();\n var rgba = [r,g,b,a].join(",");\n return "rgba(" + rgba + ")";\n}\nfunction randInt(limit) {\n var x =

which produced this output:

message: What this does is, every time the mouse moves in the canvas area, it sets mouseX and mouseY to the location of the mouse.
message: Then, when each ball is updated, it figures out how far away from the mouse it is, and accelerates toward it.
message: The acceleration is the square root of the distance, so it pulls harder when it is really far away. Imagine all the balls being connected to the mouse by little rubber bands or springs. It's a little like that.
message: Try making the balls smaller! And add more of them! I like it with about 40 small balls chasing the mouse.
message: Great job! Like what you learned? Was it fun?
code: 
hiddenCode: var c = document.getElementById('pane').getContext('2d');\nfunction rgba(r,g,b,a) {return 'rgba('+[r,g,b,a].join(',')+')';}\nfunction rgb(r,g,b,a) {return 'rgb('+[r,g,b].join(',')+')';}\n\n
lessonSection: The End
message: Wow, you did everything! Congratulations, nice work! A lot of these are really hard. I'm impressed you finished! I hope you enjoyed it!

No doubt the syntax, whatever it is, allows for embedding double quotes within each string, but there is no example of that in this fragment.

I do not know why you need to do this with the minified and concatenated source code, but you can reverse the minification:

#!/usr/bin/env perl

use strict;
use warnings;

use Path::Class;
use JavaScript::Beautifier qw/js_beautify/;

# Path::Class::File::slurp takes named arguments, so the mode goes in 'iomode'.
my $js = file('combined.min.js')->slurp(iomode => '<:encoding(UTF-8)');

my $pretty_js = js_beautify($js);

my @messages = ($pretty_js =~ /message: (.+?)\n/g);

print "$_\n" for @messages;

You already have some Perl answers, but you may also be interested in the xgettext tool, which is designed specifically to extract strings for internationalisation. Run it like this:

xgettext -a --from-code UTF-8 combined.min.js -o - 

For each string, it gives you output like this:

#: combined.min.js:36
msgid ""
"Here is a ball that sticks to the mouse.  Every time the mouse moves, the "
"ball redraws on top of the mouse."
msgstr ""

It is part of the GNU gettext package. See GNU gettext.
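If you then want the plain strings back out of that output, a small Perl filter can join the quoted continuation lines of each msgid. This is a minimal sketch, not a full PO parser: it uses the entry shown above as sample data, does not decode PO escape sequences such as \n or \", and would also pick up the empty msgid of the header entry that xgettext normally writes.

```perl
use strict;
use warnings;

# Sample xgettext output, copied from the entry shown above.
my $po = <<'PO';
#: combined.min.js:36
msgid ""
"Here is a ball that sticks to the mouse.  Every time the mouse moves, the "
"ball redraws on top of the mouse."
msgstr ""
PO

my @msgids;
my $in_msgid = 0;
for my $line (split /\n/, $po) {
    if ($line =~ /^msgid "(.*)"$/) {
        # Start of a new msgid; its text may continue on the following lines.
        push @msgids, $1;
        $in_msgid = 1;
    }
    elsif ($in_msgid && $line =~ /^"(.*)"$/) {
        # Quoted continuation line: append it to the current msgid.
        $msgids[-1] .= $1;
    }
    else {
        # Comment, msgstr or blank line ends the current msgid.
        $in_msgid = 0;
    }
}

print "$_\n" for @msgids;
```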
