简体   繁体   中英

How to replace relative path URLs with absolute path URLs

I have HTML content stored in a database and I'm looking to convert all the relative asset references to use absolute paths instead. For instance, all of my image tags are looking something like this:

<img src=\"/system/images/146/original/03.png?1362691463\">

I'm trying to prepend "http://example.com" to the "/system/images/" path. I had the following code that I was hoping to handle that but sadly it doesn't seem to result in any changes:

text = "<img src=\"/system/images/146/original/03.png?1362691463\">"
text.gsub(%r{<img src=\\('|")\/system\/images\/}, "<img src=\"http://virtualrobotgames.com/system/images/")

Instead of manipulating the URL string using normal string manipulation, use a tool made for the job. Ruby includes the URI class, and there's the more thorough Addressable gem.

Here's what I'd do if I had some HTML with links I wanted to rewrite:

First, parse the document:

require 'nokogiri'
require 'uri'

SOURCE_SITE = URI.parse("http://virtualrobotgames.com")

html = '
<html>
<head></head>
<body>
  <img src="/system/images/146/original/03.png?1362691463">
  <script src="/scripts/foo.js"></script>
  <a href="/foo/bar.html">foo</a>
</body>
</html>
'
doc = Nokogiri::HTML(html)

Then you're in a position to walk through the document and modify tags like <a> , <img> , <script> and anything else that you want:

# find things using 'src' and 'href' parameters
tags = {
  'img'    => 'src',
  'script' => 'src',
  'a'      => 'href'
}
doc.search(tags.keys.join(',')).each do |node|

  url_param = tags[node.name]

  src = node[url_param]
  unless (src.empty?)
    uri = URI.parse(src)
    unless uri.host
      uri.scheme = SOURCE_SITE.scheme
      uri.host = SOURCE_SITE.host
      node[url_param] = uri.to_s
    end
  end
end

puts doc.to_html

Which, after running, outputs:

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# >> <body>
# >>   <img src="http://virtualrobotgames.com/system/images/146/original/03.png?1362691463">
# >>   <script src="http://virtualrobotgames.com/scripts/foo.js"></script>
# >>   <a href="http://virtualrobotgames.com/foo/bar.html">foo</a>
# >> </body>
# >> </html>

This isn't meant to be a complete, fully-working, example. This is working with absolute links, but you'll have to deal with relative links, links with sibling/peer hostnames, missing parameters.

You'll also want to check the errors method for your "doc" after parsing to make sure it is valid HTML. A parser can rewrite/trim nodes in invalid HTML trying to make sense of it.

Can't you just use the 'base' HTML tag to do this?. Assuming you read the HTML content directly from a URL, you could do something like:

response = RestClient.get(<original_url>)
base_url = '<your_base_url>'
html_content = response.body
if html_content.index('<head>')
    html_content = html_content.gsub!('<head>', "<head><base href='#{base_url}'>")
end

Apparently it was an issue with the search argument I was passing, the escape sequences weren't required.

%r{<img src=\\('|")\/system\/images\/}

Becomes simply:

%r{<img src="/system/images/}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM