Matching the HTML lang attribute with preg_match in PHP

Question

I want to use preg_match in PHP to parse the site language out of an HTML document.

My preg_match call:

$sitelang = preg_match('!<html lang="(.*?)">!i', $result, $matches) ? $matches[1] : 'Site Language not detected';

This works when the tag has a single attribute, with no class or ID. For example, input:

<html lang="de">

Output:

de

But when the html tag carries other attributes, like this one, it breaks. Input:

<html lang="en" class="desktop-view not-mobile-device text-size-normal anon">

Output:

en" class="desktop-view not-mobile-device text-size-normal anon

I only need the language code (en, de, en-En, de-DE).

Thanks for any suggestions or code.

Update

Another example, where lang is not the first attribute:

<html data-n-head-ssr lang="en">

Output:
Site Language not detected

Answer 1

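Apart from the standard disclaimer about parsing HTML with regular expressions, you probably need two things. First, drop the closing angle bracket (>) from the pattern. Once you have the closing quote, the rest of the line doesn't matter. Second, make sure the content inside the quotes does not itself contain a quote.

Current: an opening quote, then anything, then a closing quote followed by >: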

preg_match('!<html lang="(.*?)">!i', $result, $matches)

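This means that with lang="foo" class="bar"> you get foo" class="bar as the match. The quantifier .*? is lazy, but the pattern also requires "> directly after the capture, so the capture keeps growing until it reaches the first quote that is immediately followed by >.

New: inside the quotes, one or more of anything except a quote: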

preg_match('!<html lang="([^"]+)"!i', $result, $matches)

If you want to be more flexible, change the literal space into one or more whitespace characters:

preg_match('!<html\s+lang="([^"]+)"!i', $result, $matches)
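As a rough check, here is a minimal sketch that runs this last pattern against the three inputs from the question. The helper name detectLang is hypothetical and returning null on failure is my own choice. Note that the third input, where lang is not the first attribute, still fails to match because the pattern expects lang directly after <html.

```php
<?php
// Hypothetical helper name; the pattern is the one from this answer.
function detectLang(string $html): ?string
{
    // [^"]+ stops at the first closing quote, so later attributes are ignored.
    return preg_match('!<html\s+lang="([^"]+)"!i', $html, $m) ? $m[1] : null;
}

$samples = [
    '<html lang="de">',
    '<html lang="en" class="desktop-view not-mobile-device text-size-normal anon">',
    '<html data-n-head-ssr lang="en">', // lang is not the first attribute
];

foreach ($samples as $html) {
    var_dump(detectLang($html)); // "de", "en", then NULL for the third sample
}
```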

Answer 2

You can use this code to detect it correctly:

preg_match('!<html.*\s+lang="([^"]+)"!i', $result, $matches)

It also works for your last sample.
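One caveat that may be worth checking (my observation, not something the answer states): .* is greedy and not limited to the tag, so on a document with no line breaks it can run past <html ...> and pick up a lang attribute on a later element. The sketch below uses a made-up one-line sample and a variant with [^>]* that stays inside the opening tag. The variant is an assumption of mine and would itself misbehave if an attribute value contained a > character.

```php
<?php
// Made-up single-line document with a second lang attribute further down.
$html = '<html data-n-head-ssr lang="en"><body><p lang="fr">Bonjour</p></body></html>';

// Pattern from the answer: greedy .* backtracks from the end of the line,
// so the last whitespace-preceded lang="..." on that line wins.
preg_match('!<html.*\s+lang="([^"]+)"!i', $html, $greedy);

// Variant (not from the answer): [^>]* cannot leave the <html ...> tag.
preg_match('!<html\b[^>]*\s+lang="([^"]+)"!i', $html, $confined);

var_dump($greedy[1]);   // "fr"
var_dump($confined[1]); // "en"
```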

Answer 3

When parsing arbitrary HTML, it is better to use an HTML parser such as DOMDocument.

$dom = new DOMDocument();
@$dom->loadHTML($html);

$lang = $dom->getElementsByTagName('html')[0]->getAttribute('lang');

See the PHP demo on tio.run (@ is used to suppress errors in case anything goes wrong).
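If you would rather not silence errors with @, a slightly more defensive sketch could look like the following. The function name htmlLang and the null return value are hypothetical choices of mine. libxml_use_internal_errors() and DOMElement::hasAttribute() are standard PHP functions, and the code assumes the dom extension is enabled.

```php
<?php
// Hypothetical helper: returns the lang attribute of <html>, or null if absent.
function htmlLang(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() does not accept an empty string
    }

    // Collect libxml parse warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) returns null when there is no <html> element at all.
    $root = $dom->getElementsByTagName('html')->item(0);
    if ($root === null || !$root->hasAttribute('lang')) {
        return null;
    }
    return $root->getAttribute('lang');
}

var_dump(htmlLang('<html data-n-head-ssr lang="en"><body></body></html>'));
var_dump(htmlLang('<html><body></body></html>'));
```

Checking hasAttribute() first lets the caller tell a missing lang attribute apart from an empty one, since getAttribute() returns an empty string in both cases.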

If you insist on using a regular expression, here is a broader pattern that matches more cases:

$pattern = '~<html\b[^><]+?\blang\s*=\s*["\']\s*\K[^"\']+~i';

$lang = preg_match($pattern, $html, $out) ? $out[0] : "";

\K resets the start of the reported match, so we don't need a capture group.
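To see what the broader pattern accepts, here is a small sketch that applies it to the three inputs from the question plus one made-up input with single quotes and spaces around the equals sign. Because of \K, the language code is read from $out[0] rather than from a capture group.

```php
<?php
// Pattern from this answer; \K drops everything matched so far from the reported match.
$pattern = '~<html\b[^><]+?\blang\s*=\s*["\']\s*\K[^"\']+~i';

$samples = [
    '<html lang="de">',
    '<html lang="en" class="desktop-view not-mobile-device text-size-normal anon">',
    '<html data-n-head-ssr lang="en">',
    "<html class='x' lang = 'de-DE'>", // made-up sample: single quotes, spaces around =
];

foreach ($samples as $html) {
    // $out[0] is the whole reported match, which starts after \K.
    $lang = preg_match($pattern, $html, $out) ? $out[0] : '';
    var_dump($lang);
}
```

One thing to keep in mind: \b treats a hyphen as a word boundary, so an attribute such as data-lang placed before lang would be picked up first by this pattern.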

See the regex demo at regex101 (explanation on the right) or the PHP demo at tio.run.

FYI: your pattern <html lang="(.*?)"> lazily matches anything from <html lang=" up to the first ">.