How do I create a list of all occurrences of strings ending in a specific file type found in a text file?

Question

I'm trying to extract all of the links to image files from a text file. All of the image files end in either .jpg or .gif, and are surrounded by quotation marks. I want to find the first occurrence of .jpg or .gif, and then copy all of the characters between the first quotation mark located before .jpg (or .gif) and the first quotation mark found after .jpg (or .gif). Then I want to add this link to an array or to another text file, and repeat the process for every instance of .jpg or .gif in the original text file.

Here's an example of what the text file might look like:

d/scriript type="texft/javascript">
    $(document).fready(function () {
        $('#post-contfainer-1720130 .post-assets .thumb A').lightBox({
            txtImafge:      'Image',
            txtOf:          'of',
            overflayOpacity:    0       });
<div class="thumb"><a href""="#">="**https://imaginepilgrimages.com/asset/image/resize/2/32/32/1/c331065jt99875146b0a1fg9140.jpg**"riript type="texft/javascript">
    $(document).freadriript type="texft/javascript">
    $(document).fread
d/scriript type="texft/javascript">
    $(document).fready(function () {
        $('#post-contfainer-1720130 .post-assets .thumb A').lightBox({
            txtImafge:      'Image',
            txtOf:          'of',
            overflayOpacity:    0       });
<div class="thumb"><a href""="#">="**https://imaginepilgrimages.com/asset/image/resize/2/32/32/75146b0a1fg9140.gif**"riript type="texft/javascript">
    $(document).freadriript type="texft/javascript">
    $(document).fread
d/scriript type="texft/javascript">
    $(document).fready(function () {
        $('#post-contfainer-1720130 .post-assets .thumb A').lightBox({
            txtImafge:      'Image',
            txtOf:          'of',
            overflayOpacity:    0       });
<div class="thumb"><a href""="#">="https://imaginepilgrimages.com/asset/image/resize/2/32/32/1/c331065jt99fgfgage55h6u7rrth6875146b0a1fg9140.jpg"riript type="texft/javascript">
    $(document).freadriript type="texft/javascript">
    $(document).fread

I've just started using Python and I've been stuck on this for a while. Can anybody help me with this? Thanks in advance for your time!

Answer 1

Something like the following should work:

import re
re.findall(r'"([^"]*\.(?:gif|jpg)[^"]*)"', text)  # text: the file contents as one string

Don't expect it to be particularly flexible or robust; for that you'd probably want an actual parser.
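As a quick check, here is a minimal sketch of that pattern applied to a string shaped like the sample in the question. The `example.com` URLs and the shortened lines are made up for illustration; only the pattern comes from the answer.

```python
import re

# Hypothetical sample text, shortened from the shape of the question's file.
text = (
    '<a href""="#">="**https://example.com/img/one.jpg**"riript type="x">\n'
    '<a href""="#">="https://example.com/img/two.gif"riript type="x">\n'
)

# A quote, any run of non-quote characters containing .gif or .jpg, a quote.
pattern = re.compile(r'"([^"]*\.(?:gif|jpg)[^"]*)"')

links = pattern.findall(text)
print(links)
# ['**https://example.com/img/one.jpg**', 'https://example.com/img/two.gif']
```

Note that the `**` markers around the first link are kept, because `[^"]*` accepts any character other than a quotation mark on either side of the extension.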

Answer 2

This will give you the image links, but it doesn't attempt to trim off the leading/trailing '**':

import re

images = []
with open('test.dat') as f:
    for line in f:
        # Collect every quoted string on this line that contains .jpg or .gif
        images.extend(re.findall(r'"([^"]*\.(?:jpg|gif)[^"]*)"', line))

The regular expression looks for a quotation mark, then grabs everything up to the next quotation mark, and only matches if '.jpg' or '.gif' appears somewhere in that string.
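If the '**' markers are unwanted and the links should end up in another text file, as the question asks, one possible extension is sketched below. The function names `extract_image_links` and `save_image_links` are hypothetical helpers, not part of the original answer, and the sketch assumes the asterisks only ever appear at the very start or end of a quoted link.

```python
import re

# A quoted string containing .jpg or .gif, captured without the quotes.
IMAGE_LINK = re.compile(r'"([^"]*\.(?:jpg|gif)[^"]*)"')


def extract_image_links(lines):
    """Collect quoted .jpg/.gif strings from an iterable of lines, minus '*' padding."""
    links = []
    for line in lines:
        for match in IMAGE_LINK.findall(line):
            # str.strip('*') removes asterisks from both ends only.
            links.append(match.strip('*'))
    return links


def save_image_links(src_path, dst_path):
    """Read src_path and write one extracted link per line to dst_path."""
    with open(src_path) as src:
        links = extract_image_links(src)
    with open(dst_path, 'w') as dst:
        dst.writelines(link + '\n' for link in links)
    return links
```

Calling `save_image_links('test.dat', 'links.txt')` would then write the cleaned links to `links.txt`, which is an assumed output name; `extract_image_links` can also be used on its own when a plain list is enough.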

Answer 1
2 (accepted) 2012-06-13 16:12:12

Answer 2
2 2012-06-13 16:12:44
