Why does my Markov chain produce identical sentences from the corpus?

I am using the markovify Markov chain generator in Python, and when I run the example code given there, it produces a lot of repeated sentences and I don't know why.


import markovify

# Get raw text as string.
with open("testtekst.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print 20 randomly generated sentences.
for i in range(20):
    print(text_model.make_sentence())

This gives me the following output:

Time included him on their list of the world's highest-paid athlete by ESPN from 2016 to 2019.
He assumed full captaincy of the world's most marketable and famous athletes, Ronaldo was named the best Portuguese player of all time by the Portuguese Football Federation.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
He is the first footballer and the FIFA Club World Cup at age 23, he won his first season.
One of the tournament.
He also led them to victory in the world in 2014.
In 2015, Ronaldo was ranked the world's most famous athlete by ESPN from 2016 to 2019.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
In 2015, Ronaldo was ranked the world's most famous athlete by ESPN from 2016 to 2019.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
He is the first footballer and the FIFA Club World Cup at age 23, he won his first international goal at Euro 2004, where he helped Portugal reach the final.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
One of the world's highest-paid athlete by ESPN from 2016 to 2019.
Time included him on their list of the national team in July 2008.
He is the first footballer and the FIFA Club World Cup at age 23, he won his first season.
Time included him on their list of the national team in July 2008.
He also led them to victory in the world in 2014.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
One of the world's most marketable and famous athletes, Ronaldo was ranked the world's most famous athlete by Forbes in 2016 and 2017 and the FIFA Club World Cup at age 23, he won his first international goal at Euro 2016, and received the Silver Boot as top scorer of Euro 2020.

testtekst.txt is ANSI-encoded and contains the following corpus:

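Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP, before signing with Manchester United in 2003, aged 18, winning the FA Cup in his first season. He would also go on to win three consecutive Premier League titles, the Champions League and the FIFA Club World Cup at age 23, he won his first Ballon d'Or. Ronaldo was the subject of the then-most expensive association football transfer when he signed for Real Madrid in 2009 in a transfer worth €94 million (£80 million), where he won 15 trophies, including two La Liga titles, two Copa del Rey and four Champions Leagues, and became the club's all-time top goalscorer. He also finished as runner-up for the Ballon d'Or three times, behind Lionel Messi (his perceived career rival), and won the Ballon d'Or in 2013 and 2014, and again in 2016 and 2017. In 2018, he signed for Juventus in a transfer worth an initial €100 million (£88 million), the most expensive transfer for an Italian club and the most expensive transfer for a player over 30 years old. He won two Serie A titles, two Supercoppa Italiana and a Coppa Italia, before returning to Manchester United in 2021. Ronaldo made his senior international debut for Portugal in 2003 at the age of 18, and has since earned over 180 caps, making him Portugal's most-capped player. With more than 100 goals at international level, he is also the nation's all-time top goalscorer. Ronaldo has played in and scored at 11 major tournaments, he scored his first international goal at Euro 2004, where he helped Portugal reach the final. He assumed full captaincy of the national team in July 2008. In 2015, Ronaldo was named the best Portuguese player of all time by the Portuguese Football Federation. The following year, he led Portugal to their first major tournament title at Euro 2016, and received the Silver Boot as the second-highest goalscorer of the tournament. He also led them to victory in the inaugural UEFA Nations League in 2019, and later received the Golden Boot as top scorer of Euro 2020. One of the world's most marketable and famous athletes, Ronaldo was ranked the world's highest-paid athlete by Forbes in 2016 and 2017 and the world's most famous athlete by ESPN from 2016 to 2019. Time included him on their list of the 100 most influential people in the world in 2014. He is the first footballer and the third sportsman to earn US$1 billion in his career.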

As you can see in the output, several identical sentences are printed, and I don't know why. The default state size should be 2.

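The answer is that my state size was too large: setting it to 1 produces unique sentences. I also didn't know that Markovify always starts a new sentence with a word that begins a sentence in the corpus.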
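To see why the state size matters so much on a small corpus, here is a minimal sketch in plain Python. It is not markovify's implementation: the toy corpus, the `BEGIN`/`END` markers and the helpers `build_chain` and `all_sentences` are all made up for illustration. It builds a word-level chain for a given state size and enumerates every sentence the chain could produce when generation always starts from a sentence beginning.

```python
from collections import defaultdict

# Hypothetical toy corpus (not the Ronaldo text): three short sentences
# that share some words, with no word repeated inside a sentence.
CORPUS = [
    "he won the cup in 2004",
    "he won the league in 2008",
    "he led the team in 2016",
]

BEGIN, END = "<s>", "</s>"  # illustrative sentence-boundary markers


def build_chain(sentences, state_size):
    """Map each state (a tuple of `state_size` words) to the set of words seen after it."""
    chain = defaultdict(set)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].add(words[i + state_size])
    return chain


def all_sentences(chain, state_size):
    """Enumerate every sentence reachable from the BEGIN state.

    This terminates only because the toy chain has no cycles; a real
    corpus would need a length limit.
    """
    results = set()
    stack = [((BEGIN,) * state_size, [])]
    while stack:
        state, words = stack.pop()
        for nxt in chain[state]:
            if nxt == END:
                results.add(" ".join(words))
            else:
                stack.append((state[1:] + (nxt,), words + [nxt]))
    return results


for size in (1, 2):
    chain = build_chain(CORPUS, size)
    print(f"state_size={size}: {len(all_sentences(chain, size))} distinct sentences")
```

On this toy corpus the sketch finds 18 distinct sentences for a state size of 1 and only 3 for a state size of 2, and those 3 are the original sentences themselves. With few branch points, a random walk keeps landing on the same handful of outputs. In markovify the state size is set when building the model, e.g. `markovify.Text(text, state_size=1)`.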

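That's right, Markovify always starts a new sentence with a word that begins a sentence in the corpus. The answer is that your state size is too large. You answered it yourself. Still, you got a nice result. I read the text about Ronaldo carefully. Well done.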
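If duplicates still show up after lowering the state size, one complementary workaround (not something suggested in the thread itself) is to filter them out while generating. The sketch below assumes only that the generator is a zero-argument callable returning a string or `None`, which is how `text_model.make_sentence` behaves; the helper name `collect_unique` and the stand-in `fake_generator` are hypothetical.

```python
import random


def collect_unique(make_sentence, wanted, max_attempts=1000):
    """Call `make_sentence` until `wanted` distinct, non-None results are
    collected or `max_attempts` calls have been made. Order is preserved."""
    unique = []
    seen = set()
    for _ in range(max_attempts):
        if len(unique) >= wanted:
            break
        sentence = make_sentence()
        # Skip failed generations (None) and anything already produced.
        if sentence is None or sentence in seen:
            continue
        seen.add(sentence)
        unique.append(sentence)
    return unique


def fake_generator():
    """Stand-in for a real model: returns repeats and occasional None."""
    return random.choice(["A.", "B.", "B.", None, "C."])


for sentence in collect_unique(fake_generator, wanted=3):
    print(sentence)
```

With a markovify model, the call would be `collect_unique(text_model.make_sentence, 20)`. The attempt limit matters: on a small corpus with a large state size there may simply be fewer than 20 distinct sentences to find, and the helper then returns what it has.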


