
Why does my Markov chain produce identical sentences from a corpus?

I am using the markovify Markov chain generator in Python. When I run the example code from its documentation, it produces a lot of duplicate sentences, and I don't know why.

The code is as follows:

import markovify

# Get raw text as string.
with open("testtekst.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print 20 randomly generated sentences.
for i in range(20):
    print(text_model.make_sentence())

This gives me the following output:

Time included him on their list of the world's highest-paid athlete by ESPN from 2016 to 2019.
He assumed full captaincy of the world's most marketable and famous athletes, Ronaldo was named the best Portuguese player of all time by the Portuguese Football Federation.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
He is the first footballer and the FIFA Club World Cup at age 23, he won his first season.
One of the tournament.
He also led them to victory in the world in 2014.
In 2015, Ronaldo was ranked the world's most famous athlete by ESPN from 2016 to 2019.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
In 2015, Ronaldo was ranked the world's most famous athlete by ESPN from 2016 to 2019.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
He is the first footballer and the FIFA Club World Cup at age 23, he won his first international goal at Euro 2004, where he helped Portugal reach the final.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
One of the world's highest-paid athlete by ESPN from 2016 to 2019.
Time included him on their list of the national team in July 2008.
He is the first footballer and the FIFA Club World Cup at age 23, he won his first season.
Time included him on their list of the national team in July 2008.
He also led them to victory in the world in 2014.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
One of the world's most marketable and famous athletes, Ronaldo was ranked the world's most famous athlete by Forbes in 2016 and 2017 and the FIFA Club World Cup at age 23, he won his first international goal at Euro 2016, and received the Silver Boot as top scorer of Euro 2020.

The file testtekst.txt is ANSI-encoded and contains the following corpus:

Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP, before signing with Manchester United in 2003, aged 18, winning the FA Cup in his first season. He would also go onto win three consecutive Premier League titles, the Champions League and the FIFA Club World Cup at age 23, he won his first Ballon d'Or. Ronaldo was the subject of the then-most expensive association football transfer when he signed for Real Madrid in 2009 in a transfer worth €94 million (£80 million), where he won 15 trophies, including two La Liga titles, two Copa del Rey and four Champions Leagues, and became the club's all-time top goalscorer. He also finished runner-up for the Ballon d'Or three times, behind Lionel Messi (his perceived career rival), and won back-to-back Ballons d'Or in 2013 and 2014, and again in 2016 and 2017. In 2018, he signed for Juventus in a transfer worth an initial €100 million (£88 million), the most expensive transfer for an Italian club and the most expensive transfer for a player over 30 years old. He won two Serie A titles, two Supercoppe Italiana and a Coppa Italia, before returning to Manchester United in 2021. Ronaldo made his senior international debut for Portugal in 2003 at the age of 18 and has since earned over 180 caps, making him Portugal's most-capped player. With more than 100 goals at international level, he is also the nation's all-time top goalscorer. He has played in and scored at 11 major tournaments, he scored his first international goal at Euro 2004, where he helped Portugal reach the final. He assumed full captaincy of the national team in July 2008. In 2015, Ronaldo was named the best Portuguese player of all time by the Portuguese Football Federation. The following year, he led Portugal to their first major tournament title at Euro 2016, and received the Silver Boot as the second-highest goalscorer of the tournament. 
He also led them to victory in the inaugural UEFA Nations League in 2019, and later received the Golden Boot as top scorer of Euro 2020. One of the world's most marketable and famous athletes, Ronaldo was ranked the world's highest-paid athlete by Forbes in 2016 and 2017 and the world's most famous athlete by ESPN from 2016 to 2019. Time included him on their list of the 100 most influential people in the world in 2014. He is the first footballer and the third sportsman to earn US $1 billion in his career.

As you can see in the output, several identical sentences are printed, and I have no idea why. As far as I know, the default state size is 2.

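The answer is that my state size was too big for such a small corpus: with a state size of 2, most word pairs in the text have only one possible follower, so the chain has very few places where it can branch. After setting the state size to 1, it produced unique sentences. I also did not know that Markovify always starts generating new sentences with the first words of sentences in the corpus.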
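The effect of state size can be illustrated without markovify at all. The following is a minimal sketch of a toy chain, not markovify's actual implementation; the names `build_chain`, `reachable_sentences`, and the four-sentence `toy` corpus are made up for illustration. It enumerates every sentence each chain could possibly produce and counts how many of them are not copies of the input.

```python
from collections import defaultdict

BEGIN, END = "<BEGIN>", "<END>"  # hypothetical sentinel tokens for this sketch


def build_chain(sentences, state_size):
    """Map each state (tuple of state_size words) to the set of words that follow it."""
    chain = defaultdict(set)
    for sentence in sentences:
        padded = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(padded) - state_size):
            state = tuple(padded[i:i + state_size])
            chain[state].add(padded[i + state_size])
    return chain


def reachable_sentences(chain, state_size, max_words=30):
    """Enumerate every sentence the chain can emit (capped at max_words words)."""
    results = set()
    stack = [((BEGIN,) * state_size, ())]
    while stack:
        state, words = stack.pop()
        for nxt in chain[state]:
            if nxt == END:
                results.add(" ".join(words))
            elif len(words) < max_words:
                # Slide the state window forward by one word.
                stack.append((state[1:] + (nxt,), words + (nxt,)))
    return results


# Toy corpus, invented for this example.
toy = [
    "He won the cup in 2016",
    "He won the league in 2017",
    "He lost the cup in 2018",
    "She won the league in 2016",
]

counts = {}
for n in (1, 2):
    generated = reachable_sentences(build_chain(toy, n), n)
    counts[n] = len(generated - set(toy))  # sentences that are not verbatim input
    print(f"state_size={n}: {counts[n]} novel sentences possible")
```

On this toy input the chain with state size 1 can reach more than twice as many novel sentences as the chain with state size 2. In markovify itself, the state size is set when building the model, for example `markovify.Text(text, state_size=1)`.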

That's right: Markovify always starts generating new sentences with the first words of sentences in the corpus. You answered your own question, the state size was too big. Still, you got a good result. I read the text on Ronaldo carefully. Well done.
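Because every generated sentence begins at one of the corpus's sentence starts, repeats remain possible even with a smaller state size. One way to guarantee distinct output is to filter duplicates on the caller's side. The helper below is a hypothetical sketch (`unique_sentences` is not part of markovify); it accepts any zero-argument function that returns a sentence or `None`.

```python
import itertools


def unique_sentences(make_sentence, wanted, max_attempts=1000):
    """Call make_sentence() until `wanted` distinct sentences are collected.

    Gives up after max_attempts calls, so the result may be shorter than `wanted`.
    """
    seen = []
    for _ in range(max_attempts):
        if len(seen) >= wanted:
            break
        sentence = make_sentence()
        # Skip failed generations (None) and sentences already collected.
        if sentence is not None and sentence not in seen:
            seen.append(sentence)
    return seen


# Stand-in generator for demonstration; it repeats itself and sometimes fails.
fake = itertools.cycle(["a b c.", None, "a b c.", "d e f."])
result = unique_sentences(lambda: next(fake), wanted=2)
print(result)
```

With markovify, the same helper could be called as `unique_sentences(text_model.make_sentence, 20)`. The `None` check matters there, since `make_sentence()` returns `None` when it cannot build an acceptable sentence within its allowed number of tries.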
