Parsing a mix of structured and unstructured text
I need to parse blocks of text in a format something like this:
Today the weather is excellent bla bla bla.
<temperature>35</temperature>.
I'm in a great mood today.
<item>Desk</item>
I want to parse text like this and translate it into an array that looks something like this:
$array[0]['text'] = 'Today the weather is excellent bla bla bla. ';
$array[0]['type'] = 'normalText';
$array[1]['text'] = '35';
$array[1]['type'] = 'temperature';
$array[2]['text'] = ". I'm in a great mood today.";
$array[2]['type'] = 'normalText';
$array[3]['text'] = 'Desk';
$array[3]['type'] = 'item';
Essentially, I want the array to contain all of the text in the same order as in the original, but split into types: normal text (anything that wasn't between tags) and other types such as temperature or item, which are determined by the tags the text was between.
Is there a way to do this (i.e. separate the text into normal text and other types using regular expressions), or should I convert the text behind the scenes into properly structured text, like:
<normal>Today the weather is excellent bla bla bla.</normal>
<temperature>35</temperature>.
<normal> I'm in a great mood today.</normal><item>Desk</item>
Before it tries to parse the text?
EDIT: Now it works exactly as expected!
Solution:
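<?php
$code = <<<'CODE'
Today the weather is excellent bla bla bla.
<temperature>35</temperature>.
I'm in a great mood today.
<item>Desk</item>
CODE;

// Split on complete <tag>...</tag> pairs. PREG_SPLIT_DELIM_CAPTURE keeps the
// tagged chunks in the result next to the plain text. The limit is -1 (no
// limit) rather than null, which is deprecated for this parameter in PHP 8.1+.
$chunks = preg_split('/(<[^>]+>[^<]+<\/[^>]+>)/', $code, -1, PREG_SPLIT_DELIM_CAPTURE);

$result = array_filter(
    array_map(
        function ($element) {
            // Compare against '' so that a chunk such as "0" is not dropped.
            if ($element !== '') {
                // Tagged chunk: capture the tag name and the enclosed text.
                if (preg_match('/^<([^>]+)>([^<]+)</', $element, $matches)) {
                    return array('text' => $matches[2],
                                 'type' => $matches[1]);
                }
                // Anything else is plain text.
                return array('text' => $element,
                             'type' => 'normal');
            }
            return false; // empty chunk, removed by array_filter()
        },
        $chunks
    )
);

print_r($result);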
Output:
Array
(
[0] => Array
(
[text] => Today the weather is excellent bla bla bla.
[type] => normal
)
[1] => Array
(
[text] => 35
[type] => temperature
)
[2] => Array
(
[text] => .
I'm in a great mood today.
[type] => normal
)
[3] => Array
(
[text] => Desk
[type] => item
)
)
Try reading through the text line by line. There are two cases: adding normal text, and adding text that sits inside a special tag. While appending normal text to a variable, look for an opening tag with a regular expression:
preg_match('/<(\w+)>/', $line_from_text, $matches)
This matches the tag, and the parentheses capture the tag name into $matches[1], which you can use as the type in your array. Note the + after \w: without it the pattern only matches one-letter tag names. From there, keep appending text to a variable until you meet the end tag. Hope this helps.
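For what it's worth, here is a minimal sketch of that line-by-line approach. The function name parseMixedText is made up for illustration, and the sketch assumes that tags are never nested, that tag names consist of word characters only, and that every opening tag has a matching end tag. Treat it as one possible way to do it, not a definitive implementation.

```php
<?php
// Hypothetical helper (not from the original answer): walks the text line by
// line, buffering text and switching state whenever a tag is found.
function parseMixedText(string $text): array
{
    $result = [];
    $buffer = '';        // text collected for the current chunk
    $type   = 'normal';  // 'normal' outside tags, otherwise the open tag's name

    // Store the buffered chunk (if any) under the current type and reset it.
    $flush = function () use (&$result, &$buffer, &$type) {
        if ($buffer !== '') {
            $result[] = ['text' => $buffer, 'type' => $type];
            $buffer = '';
        }
    };

    $lines = explode("\n", $text);
    $last  = count($lines) - 1;

    foreach ($lines as $i => $line) {
        if ($i < $last) {
            $line .= "\n"; // put back the newline that explode() removed
        }
        while ($line !== '') {
            // Outside a tag, look for any opening tag; inside, for its end tag.
            $pattern = $type === 'normal'
                ? '/<(\w+)>/'
                : '/<\/' . preg_quote($type, '/') . '>/';

            if (!preg_match($pattern, $line, $m, PREG_OFFSET_CAPTURE)) {
                $buffer .= $line; // no tag in the rest of this line
                break;
            }

            // Keep the text before the tag, store the chunk, switch state.
            $buffer .= substr($line, 0, $m[0][1]);
            $flush();
            $type = $type === 'normal' ? $m[1][0] : 'normal';
            $line = substr($line, $m[0][1] + strlen($m[0][0]));
        }
    }
    $flush(); // store whatever is left after the last line

    return $result;
}
```

Calling print_r(parseMixedText($code)) on the sample text from the question should give the same four chunks as the preg_split solution, with line breaks kept inside the normal text. Unlike that solution, a tag's content may span several lines here, since the state is carried from one line to the next.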