I want to extract email addresses from a few different websites. If they are in active link format, I can do this using:
//A[starts-with(@href, 'mailto:')]
But some of them are plain text (example@domain.com), not a link, so I would like to select a path to an element that contains @ inside.
I would like to select a path to element that contains @ inside
Use:
//*[contains(., '@')]
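It seems to me that what you actually want is to select elements that have a text-node child containing "@". If so, use:
//*[contains(text(), '@')]
Note that in XPath 1.0, contains(text(), '@') converts the node-set to a string by taking only the first text-node child, so an element whose "@" appears in a later text node is not selected. To test every text-node child, use //*[text()[contains(., '@')]] instead.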
XSLT-based verification:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <xsl:copy-of select="//*[contains(text(), '@')]"/>
  </xsl:template>
</xsl:stylesheet>
When this transformation is applied to the following XML document:
<html>
<body>
<a href="xxx.com">xxx.com</a>
<span>someone@xxx.com</span>
</body>
</html>
the XPath expression is evaluated and the selected nodes are copied to the output:
<span>someone@xxx.com</span>
You'll probably want to use a regular expression. It will let you extract the email addresses regardless of their context within a document. Here is a little test-driven example to get you started:
require "minitest/spec"
require "minitest/autorun"

module Extractor
  # Deliberately simple pattern: word characters, "@", word characters, a dot, word characters.
  EMAIL_REGEX = /\w+@\w+\.\w+/

  # Returns an array of matches, or false when the document contains none.
  def self.emails(document)
    matches = document.scan(EMAIL_REGEX)
    matches.any? ? matches : false
  end
end

describe "Extractor" do
  it "should extract an email address from plaintext" do
    emails = Extractor.emails("email@example.com")
    _(emails).must_include "email@example.com"
  end
  it "should extract multiple email addresses from plaintext" do
    emails = Extractor.emails("email@example.com and email2@example2.com")
    # must_include takes one expected value (a second argument is the failure message),
    # so each address gets its own expectation.
    _(emails).must_include "email@example.com"
    _(emails).must_include "email2@example2.com"
  end
  it "should extract an email address from the href attribute of an anchor" do
    emails = Extractor.emails("<a href='mailto:email3@example3.com'>Email!</a>")
    _(emails).must_include "email3@example3.com"
  end
  it "should extract multiple email addresses from both plaintext and within HTML" do
    emails = Extractor.emails("my@email.com OR <a href='mailto:email4@example4.com'>Email!</a>")
    _(emails).must_include "email4@example4.com"
    _(emails).must_include "my@email.com"
  end
  it "should not extract an email address if there isn't one" do
    emails = Extractor.emails("email(at)address(dot)com")
    _(emails).must_equal false
  end
  it "should extract email addresses" do
    emails = Extractor.emails("email.address@domain.co.uk")
    _(emails).must_include "email.address@domain.co.uk"
  end
end
The last test fails because the regular expression doesn't anticipate many valid email addresses, such as those with a dot in the local part or a multi-part domain like .co.uk. See if you can use this as a starting point to come up with or find a better regular expression. To help build your regular expressions, check out Rubular.
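As one possible next step, here is a minimal sketch of a broader pattern that also accepts dots, plus signs, and hyphens in the local part and multi-part domains. The module name BroaderExtractor and the pattern itself are assumptions made for illustration. It is not a full RFC 5322 validator, and it will still miss or over-match some unusual addresses.

```ruby
# Hypothetical module (not part of the example above); a sketch, not a complete validator.
module BroaderExtractor
  # Local part: word characters, dots, plus signs, hyphens.
  # Domain: one label followed by one or more dot-separated labels.
  EMAIL_REGEX = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

  # Returns an array of unique matches in order of appearance
  # (an empty array when nothing is found).
  def self.emails(document)
    document.scan(EMAIL_REGEX).uniq
  end
end

p BroaderExtractor.emails("email.address@domain.co.uk")
```

Returning an empty array instead of false is a design choice in this sketch: callers can iterate over the result without checking its type first.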