Parsing Yelp using lxml - ignore html tag

I am trying to run the code below to extract Yelp reviews:

from lxml import html
import requests
import csv

page = requests.get('http://www.yelp.com/biz/guisados-los-angeles')
tree = html.fromstring(page.content)  # parse the response body into an element tree

review = tree.xpath('//p[@itemprop="description"]/text()')

Now, I have a review like the one below:

These tacos are the business.

We ventured into an unpretentious, relatively small restaurant who offered a photographic menu (VERY helpful) of the different tacos they have.

This single review is being split into the list below:

[
    'These tacos are the business.', 
    'We ventured into an unpretentious, relatively small restaurant who offered a photographic menu (VERY helpful) of the different tacos they have.'
]

How do I get lxml's text() to ignore the <br> in the review text? Any pointers, please?

As far as I understand, you want each review text as a single string.

Iterate over the p elements with itemprop="description" and call .text_content() on each:

for review in tree.xpath('//p[@itemprop="description"]'):
    print(review.text_content())  # alternatively: ' '.join(review.xpath('text()'))
    print("----")

Prints:

These tacos are the business.We ventured into an unpretentious, relatively small restaurant who offered a photographic menu (VERY helpful) of the different tacos they have.  The friendliest of cashiers and servers greeted us.  My group and I each got the sampler with additional pescado and camarones tacos, and a quesadilla.  We pretty much ordered the whole menu and they were patient as we picked out the individual tacos for our samplers.  To echo what others have said, the corn tortillas are BOMB.com, but the braised meats also hold their own.  The whole experience is one amazing party in your mouth.  Their horchata is also a must-order.The habanero salsa that they have (I forget what they call it) really is a thing of beauty.  The spice kicks you on the tongue well after the salsa has slid down your throat.  If you're a chili eater like me, you need to get an extra side of this!  It won't let you down.  Also, they serve Stumptown coffee!  A+
----
We've been meaning to pay this place a visit for over a year. Now I wonder why we waited so long? This isn't your "traditional" taqueria, so if you're craving street taco style food, go to Tacos Gavilan. If you want a trip down memory lane with the bursting flavors and spices of old-school guisos in the form of tacos, then this is the spot. My partner and I each had a sampler, each trying a different variation of the plate.All of the options are delicious, the only one that left me feeling just a bit disappointed was the tacos de calabasitas, it was ok, but it just felt a bit bland. If you like spicy, try the tinga. The cochinita pibil also comes in a very spicy sauce, but for the sampler they use a very mild sauce. My personal favorite was the Hongos con Cilantro. Yes, it's meatless and I'm addicted. Pure perfection, bursting with flavor. To drink, we ordered an Horchata and  an Armando Palmero. The latter is their version of an Arnold Palmer, mixing Jamaica with Lemonade, it was good and refreshing, best as a summer drink. However, I'm hooked on their horchata, this is the real deal, none of that nasty powdered fake horchata. This horchata is sweetened just right and you can taste and feel the graininess of the toasted rice, pure oldschool deliciousness. Best Horchata in LA! Overall, we plan on returning here and recommend you try it at least once.
----
...
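To see the difference without hitting the network, here is a minimal sketch that runs both approaches on a small inline snippet. The SAMPLE markup is an assumption standing in for one Yelp review paragraph; the real page structure may differ.

```python
from lxml import html

# Hypothetical markup standing in for one Yelp review paragraph.
SAMPLE = (
    '<div><p itemprop="description">These tacos are the business.'
    '<br><br>We ventured into a small restaurant.</p></div>'
)

tree = html.fromstring(SAMPLE)
paragraph = tree.xpath('//p[@itemprop="description"]')[0]

# text() yields one string per text node, so each <br> boundary splits the review.
fragments = paragraph.xpath('text()')

# text_content() concatenates every text node under the element into one string.
joined = paragraph.text_content()

print(fragments)
print(joined)
```

The xpath text() call returns a list with two entries here, while text_content() returns a single string with the two sentences run together.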

Note that the spaces and newlines are not preserved in the review text. This is something you can fix, if needed.
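One possible way to keep the breaks, sketched under the assumption that each <br> should become a newline: collect the text nodes with itertext() and join them with a separator of your choice. The helper name review_text and the SAMPLE markup are made up for illustration and are not part of lxml or of the Yelp page.

```python
from lxml import html

# Hypothetical markup standing in for one Yelp review paragraph.
SAMPLE = (
    '<div><p itemprop="description">These tacos are the business.'
    '<br><br>We ventured into a small restaurant.</p></div>'
)


def review_text(element, separator="\n"):
    """Join the element's text nodes with `separator`, skipping empty pieces."""
    pieces = (piece.strip() for piece in element.itertext())
    return separator.join(piece for piece in pieces if piece)


tree = html.fromstring(SAMPLE)
for paragraph in tree.xpath('//p[@itemprop="description"]'):
    print(review_text(paragraph))
    print("----")
```

Passing separator=" " instead gives a single line with the sentences separated by spaces, which may be more convenient when writing the reviews to CSV.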
