Parsing a mix of structured and unstructured text

Question

I need to parse blocks of text formatted like this:

Today the weather is excellent bla bla bla.
<temperature>35</temperature>. 
I'm in a great mood today. 
<item>Desk</item>

I would like to parse text like this and turn it into an array similar to this:

$array[0]['text'] = 'Today the weather is excellent bla bla bla. ';
$array[0]['type'] = 'normalText';

$array[1]['text'] = '35';
$array[1]['type'] = 'temperature';

$array[2]['text'] = ". I'm in a great mood today.";
$array[2]['type'] = 'normalText';

$array[3]['text'] = 'Desk';
$array[3]['type'] = 'item';

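Basically, I want the array to contain all of the text in the same order as in the original, but split into types: normal text (meaning anything that is not between tags) and other types such as temperature or item, which are determined by the tags surrounding the text.

Is there a way to do this with regular expressions (that is, split the text into normal text and other types), or should I convert the text behind the scenes into properly structured text, such as: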

<normal>Today the weather is excellent bla bla bla.</normal>
<temperature>35</temperature>.
<normal> I'm in a great mood today.</normal><item>Desk</item>

before attempting to parse it?

Answer 1

Edit: it now works exactly as expected!

Solution:

<?php

$code = <<<'CODE'
Today the weather is excellent bla bla bla.
<temperature>35</temperature>. 
I'm in a great mood today. 
<item>Desk</item>
CODE;

$result = array_filter(
    array_map(
        function ($element) {
            if (!empty($element)) {
                if (preg_match('/^\<([^\>]+)\>([^\<]+)\</', $element, $matches)) {
                    return array('text' => $matches[2],
                                 'type'    => $matches[1]);
                } else {
                    return array('text' => $element,
                                 'type'    => 'normal');
                }
            }
            return false;
        },
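        // Split on whole <tag>content</tag> elements and keep them in the result.
        // The limit is -1 (no limit); passing null here is deprecated since PHP 8.1.
        preg_split('/(\<[^\>]+\>[^\<]+\<\/[^\>]+\>)/', $code, -1, PREG_SPLIT_DELIM_CAPTURE)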
    )
);

print_r($result);

Output:

Array
(
    [0] => Array
        (
            [text] => Today the weather is excellent bla bla bla.

            [type] => normal
        )

    [1] => Array
        (
            [text] => 35
            [type] => temperature
        )

    [2] => Array
        (
            [text] => . 
I'm in a great mood today. 

            [type] => normal
        )

    [3] => Array
        (
            [text] => Desk
            [type] => item
        )

)
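A variant of the same idea, in case it helps: the sketch below uses preg_match_all with offsets instead of preg_split, requires the closing tag to match the opening one through a backreference, and labels untagged text 'normalText' as in the question. The function name parseMixedText is hypothetical, and the sketch assumes PHP 7.1 or later and tags without attributes or nesting.

```php
<?php

// Hypothetical helper (not part of the answer above): returns a list of
// ['text' => ..., 'type' => ...] segments in document order.
function parseMixedText(string $text): array
{
    $segments = [];
    $offset = 0; // byte position just after the last element consumed

    // <(\w+)> opens a tag, (.*?) is its content, <\/\1> requires the same name to close it.
    preg_match_all('/<(\w+)>(.*?)<\/\1>/s', $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    foreach ($matches as $match) {
        [$whole, $start] = $match[0];

        // Anything between the previous element and this one is plain text.
        if ($start > $offset) {
            $segments[] = ['text' => substr($text, $offset, $start - $offset), 'type' => 'normalText'];
        }

        $segments[] = ['text' => $match[2][0], 'type' => $match[1][0]];
        $offset = $start + strlen($whole);
    }

    // Plain text after the last element, if any.
    if ($offset < strlen($text)) {
        $segments[] = ['text' => substr($text, $offset), 'type' => 'normalText'];
    }

    return $segments;
}

$text = "Today the weather is excellent bla bla bla.\n"
      . "<temperature>35</temperature>. \n"
      . "I'm in a great mood today. \n"
      . "<item>Desk</item>";

print_r(parseMixedText($text));
```

Because segments are appended one by one, the resulting keys are always consecutive, and empty pieces between adjacent tags never enter the array, so no array_filter pass is needed.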

Answer 2

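Try reading the text line by line. There are two cases: adding normal text, and adding text that has a special tag. While you are adding normal text to a variable, look for a tag with a regexp.

preg_match("/\<(\w+)\>/", $line_from_text, $matches);

This matches the opening tag, and the () captures the tag name so that it is available in the $matches array. Now just keep adding text to a variable until you encounter the closing tag. Hope this helps.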
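The line-by-line approach described here could look roughly like the sketch below. It is only one possible reading of the idea: the function name parseLineByLine, the buffer variables, and the 'normalText' label are assumptions, and it assumes PHP 7.1 or later and tags without attributes or nesting.

```php
<?php

// Hypothetical sketch of the line-by-line idea; all names are illustrative.
function parseLineByLine(string $text): array
{
    $segments = [];
    $buffer = '';    // text collected for the segment being built
    $openTag = null; // name of the tag we are inside, or null for normal text

    // Split after each newline so every line keeps its line ending.
    foreach (preg_split('/(?<=\n)/', $text) as $line) {
        while ($line !== '') {
            if ($openTag === null) {
                // Case 1: normal text; look for an opening tag on this line.
                if (preg_match('/<(\w+)>/', $line, $m, PREG_OFFSET_CAPTURE)) {
                    $buffer .= substr($line, 0, $m[0][1]);
                    if ($buffer !== '') {
                        $segments[] = ['text' => $buffer, 'type' => 'normalText'];
                    }
                    $buffer = '';
                    $openTag = $m[1][0];
                    $line = substr($line, $m[0][1] + strlen($m[0][0]));
                } else {
                    $buffer .= $line;
                    $line = '';
                }
            } else {
                // Case 2: inside a tag; collect text until its closing tag.
                $close = "</$openTag>";
                $pos = strpos($line, $close);
                if ($pos === false) {
                    $buffer .= $line;
                    $line = '';
                } else {
                    $segments[] = ['text' => $buffer . substr($line, 0, $pos), 'type' => $openTag];
                    $buffer = '';
                    $openTag = null;
                    $line = substr($line, $pos + strlen($close));
                }
            }
        }
    }

    // Flush leftover text; an unclosed tag keeps its tag name as the type.
    if ($buffer !== '') {
        $segments[] = ['text' => $buffer, 'type' => $openTag ?? 'normalText'];
    }

    return $segments;
}

print_r(parseLineByLine("Today is fine.\n<temperature>35</temperature>. \nGood mood.\n<item>Desk</item>"));
```

Keeping the state in $openTag lets tagged content span several lines, which a single per-line regexp would miss.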

Answer 1: score 3, accepted, 2012-10-19 07:17:02

Answer 2: score 1, 2012-10-19 05:36:13
