How to extract file names from a field that contains HTML content in SQL Server?

We have a CMS that writes HTML content blocks into a SQL Server database. I know the table name and field name where these HTML content blocks reside. Some of the HTML contains links (<a> tags) to PDF files. Here is a fragment:

<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>

I need to extract the PDF file names from all such HTML content blocks. At the end I need to get a list:

Tuition-Reimbursement-Deferred.pdf
Some-other-file.pdf

of all PDF file names from that field.

Any help is appreciated. Thanks.

UPDATE

I have received many replies, thank you so much, but I forgot to mention that we are still using SQL Server 2000 here. So this has to be done using T-SQL that runs on SQL Server 2000.

Create this function:

create function dbo.extract_filenames_from_a_tags (@s nvarchar(max))
returns @res table (pdf nvarchar(max)) as
begin
-- assumes there are no single quotes or double quotes in the PDF filename
declare @i int, @j int, @k int, @tmp nvarchar(max);
set @i = charindex(N'.pdf', @s);
while @i > 0
begin
  select @tmp = left(@s, @i+3);
  select @j = charindex('/', reverse(@tmp)); -- directory delimiter
  select @k = charindex('"', reverse(@tmp)); -- start of href
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  select @k = charindex('''', reverse(@tmp)); -- start of href (single-quote*)
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  insert @res values (substring(@tmp, len(@tmp)-@j+2, len(@tmp)));
  select @s = stuff(@s, 1, @i+4, ''); -- remove everything through ".pdf" plus the one character after it
  set @i = charindex(N'.pdf', @s);
end
return
end
GO

A demo of using that function:

declare @t table (html varchar(max));
insert @t values
  ('
<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'),
  ('
<p>A deferred tuition payment plan, 
or view the <a href="Two files here-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>And I use single quotes
   <a href=''/look/path/The second file.pdf''
target="_blank">list</a>');

select t.*, p.pdf
from @t t
cross apply dbo.extract_filenames_from_a_tags(html) p;

Results:

|HTML                  |                                       PDF |
--------------------------------------------------------------------
|<p>A deferred tui.... |        Tuition-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... | Two files here-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... |                       The second file.pdf |

SQL Fiddle Demo

Well, it's not pretty, but this works using standard Transact-SQL:
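To get the single list the question asks for, the function can be pointed at the real table instead of the demo table variable. A minimal sketch, assuming the content sits in a table called dbo.cms_content with a column named html (both names are placeholders, not from the question). Like the demo, it relies on CROSS APPLY and nvarchar(max), which were introduced in SQL Server 2005, so it will not run as-is on SQL Server 2000.

```sql
-- dbo.cms_content and its html column are hypothetical names;
-- substitute the actual table and field that hold the HTML blocks.
select distinct p.pdf
from dbo.cms_content c
cross apply dbo.extract_filenames_from_a_tags(c.html) p
order by p.pdf;
```

DISTINCT collapses files that are linked from more than one content block, so each file name should appear once.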

SELECT CASE WHEN CHARINDEX('.pdf', html) > 0
            THEN SUBSTRING(
                     html,
                     CHARINDEX('.pdf', html) -
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1,
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3)
            ELSE NULL
       END AS filename
FROM mytable

You could expand the list of delimiting characters before the filename from ["/] (which matches either a double quotation mark or a slash) if you like.

See SQL Fiddle demo

How about treating that HTML as XML?
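As a sketch of that expansion, the variant below adds a single quote to the delimiter set so that hrefs written with single quotes are handled as well. It reuses the mytable and html names from the query above. It shares the original's limits: it returns only the first ".pdf" found in each row, and it assumes one of the delimiters appears somewhere before the file name.

```sql
-- Same query as above, with ' added to the delimiter set.
-- Inside a T-SQL string literal a single quote is written as ''.
SELECT CASE WHEN CHARINDEX('.pdf', html) > 0
            THEN SUBSTRING(
                     html,
                     -- start: one character after the nearest delimiter before ".pdf"
                     CHARINDEX('.pdf', html) -
                     PATINDEX(
                         '%["''/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1,
                     -- length: characters between the delimiter and ".pdf", plus ".pdf"
                     PATINDEX(
                         '%["''/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3)
            ELSE NULL
       END AS filename
FROM mytable;
```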

declare @t table (html varchar(max));
insert @t 
    select '
    <p>A deferred tuition payment plan, 
    or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
    target="_blank">list</a>.</p>'
    union all
    select '
    <p>A deferred tuition payment plan, 
    or view the <a href="Two files here-Reimbursement-Deferred.pdf"
    target="_blank">list</a>.</p>And I use single quotes
       <a href=''/look/path/The second file.pdf''
    target="_blank">list</a>'

select  [filename] = reverse(left(reverse('/'+p.n.value('@href', 'varchar(100)')), charindex('/',reverse('/'+p.n.value('@href', 'varchar(100)')), 1) - 1))
from    (   select  cast(html as xml)
            from    @t
        ) x(doc)
cross
apply doc.nodes('//a') p(n);

Results:

filename
---------------------------------------------------------------
Tuition-Reimbursement-Deferred.pdf
Two files here-Reimbursement-Deferred.pdf
The second file.pdf
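The query above returns the last path segment of every link, whatever it points to. If only PDF links are wanted, a WHERE clause on the href can filter the rest out. A minimal sketch with made-up sample data (the /about/index.html link is there only to show the filter); varchar(400) is an arbitrary length chosen to leave room for longer paths. Note that the cast to xml raises an error when the stored HTML is not well-formed XML, for example when it contains an unclosed <br> or an entity such as &nbsp;.

```sql
declare @t table (html varchar(max));
insert @t
    select '<p>See the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf">list</a>
            or the <a href="/about/index.html">about page</a>.</p>';

select  [filename] = reverse(left(reverse('/' + p.n.value('@href', 'varchar(400)')),
                     charindex('/', reverse('/' + p.n.value('@href', 'varchar(400)'))) - 1))
from    (   select  cast(html as xml)
            from    @t
        ) x(doc)
cross
apply   doc.nodes('//a') p(n)
-- keep only links whose target ends in .pdf; anchors without an href yield NULL and drop out
where   p.n.value('@href', 'varchar(400)') like '%.pdf';
```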

Try this one:

DECLARE @XML XML = 
'<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'

SELECT 
      ref_text = t.p.value('./a[1]', 'NVARCHAR(50)')
    , ref_filename = REVERSE(
                        LEFT(REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 
                        CHARINDEX('/',REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 1) - 1))
FROM @XML.nodes('/p') t(p)
