
Unable to Replace "\r\n-" in Text Extracted from a PDF File Using readtext() from the readtext Package in R

I am trying to remove "\r\n-" from text that I extracted from a PDF file using readtext() from the readtext package in RStudio. Below is my R code:

    library(readtext)
    jd <- readtext("C:/Users/HomeUser/Documents/Sales Manager.pdf")
    jd_text <- jd$text
    jd_text2 <- gsub(pattern = "\r\n-?|•", replacement = " ", jd_text)

Below is the original extracted text, jd_text:

"Sales Manager\r\nCFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with\r\nlarge enterprises in their Digital Transformation journey and help them and their employees thrive\r\nin the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision\r\nis to make work enjoyable. For more information, please visit www.cfb-bots.com.\r\nWe are looking for a high performing frontrunner to blaze the trail and make new connections for\r\nour growing business. As a Sales Manager, you will play a vital role in keeping the Company\r\ncompetitive by achieving our customer acquisition and revenue growth targets. You will be the key\r\nliaison in every stage of the sales process, from planning to closing the sales.\r\nIf you are passionate about technology and are motivated by a hunger to solve our clients'\r\nchallenges, read on to find out more.\r\nYou can gain: \r\n− Incentive for achieving sales targets\r\n− Exposure to the latest industry trends and technologies\r\n− Endless learning and growth opportunities\r\n− Sharpen sales planning, analytical and management skills\r\n− Flexible work-life benefits \r\nYou will do:\r\nSales Strategy\r\n- Develop..."

I was able to remove many of the "\r\n-" sequences in jd_text using gsub(). The output, jd_text2, is below:

"Sales Manager CFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with large enterprises in their Digital Transformation journey and help them and their employees thrive in the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision is to make work enjoyable. For more information, please visit www.cfb-bots.com. We are looking for a high performing frontrunner to blaze the trail and make new connections for our growing business. As a Sales Manager, you will play a vital role in keeping the Company competitive by achieving our customer acquisition and revenue growth targets. You will be the key liaison in every stage of the sales process, from planning to closing the sales. If you are passionate about technology and are motivated by a hunger to solve our clients' challenges, read on to find out more. You can gain: − Incentive for achieving sales targets − Exposure to the latest industry trends and technologies − Endless learning and growth opportunities − Sharpen sales planning, analytical and management skills − Flexible work-life benefits You will do: Sales Strategy Develop..."

As you can see, the "\r\n-" after "Flexible work-life benefits" was removed, but the "-" from the first few "\r\n-" sequences remained. However, when I pasted the text directly from the display of jd_text in the RStudio console into a new variable, jd_test, and applied gsub() again, I achieved my goal:

    jd_test <- "Sales Manager\r\nCFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with\r\nlarge enterprises in their Digital Transformation journey and help them and their employees thrive\r\nin the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision\r\nis to make work enjoyable. For more information, please visit www.cfb-bots.com.\r\nWe are looking for a high performing frontrunner to blaze the trail and make new connections for\r\nour growing business. As a Sales Manager, you will play a vital role in keeping the Company\r\ncompetitive by achieving our customer acquisition and revenue growth targets. You will be the key\r\nliaison in every stage of the sales process, from planning to closing the sales.\r\nIf you are passionate about technology and are motivated by a hunger to solve our clients’\r\nchallenges, read on to find out more.\r\nYou can gain:\r\n− Incentive for achieving sales targets\r\n− Exposure to the latest industry trends and technologies\r\n− Endless learning and growth opportunities\r\n− Sharpen sales planning, analytical and management skills\r\n− Flexible work-life benefits\r\nYou will do:\r\nSales Strategy\r\n-    Develop ..."

    jd_test2 <- gsub(pattern = "\r\n-?|•", replacement = " ", jd_test)

Output from jd_test2:

"Sales Manager CFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with large enterprises in their Digital Transformation journey and help them and their employees thrive in the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision is to make work enjoyable. For more information, please visit www.cfb-bots.com. We are looking for a high performing frontrunner to blaze the trail and make new connections for our growing business. As a Sales Manager, you will play a vital role in keeping the Company competitive by achieving our customer acquisition and revenue growth targets. You will be the key liaison in every stage of the sales process, from planning to closing the sales. If you are passionate about technology and are motivated by a hunger to solve our clients' challenges, read on to find out more. You can gain: Incentive for achieving sales targets Exposure to the latest industry trends and technologies Endless learning and growth opportunities Sharpen sales planning, analytical and management skills Flexible work-life benefits You will do: Sales Strategy Develop..."

Does anyone have any idea what the problem is and how I should go about fixing it? I have tried another function, pdf_text() from the pdftools package, but it yielded the same frustrating result. At first I thought the "-" in the first few "\r\n-" sequences was slightly longer than in the later ones, but the direct copy-paste attempt seems to contradict this observation. Is there something "hidden" in the object that is not carried over during the copy-paste? Any suggestions are greatly appreciated!
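One way to look for "hidden" differences between characters that display alike is to print their Unicode code points. The following is only a minimal sketch: `sample_text` is a hypothetical stand-in for a fragment of jd_text, written with `\u` escapes so it does not depend on how the file is saved, and it assumes that the longer dashes in the extracted text are U+2212 MINUS SIGN rather than the ASCII hyphen-minus (U+002D).

```r
# Hypothetical stand-in for a fragment of jd_text (an assumption, not the real data).
sample_text <- "You can gain:\r\n\u2212 Incentive\r\nSales Strategy\r\n- Develop"

# Declared encoding of the string: "UTF-8", "latin1", "bytes" or "unknown".
Encoding(sample_text)

# Convert the whole string to a vector of Unicode code points.
codes <- utf8ToInt(enc2utf8(sample_text))

# Keep only the non-ASCII code points and print them in U+XXXX form.
non_ascii <- codes[codes > 127]
sprintf("U+%04X", non_ascii)
```

Running the same two calls (`Encoding()` and `utf8ToInt()`) on the real jd_text and on the pasted jd_test would show whether the two strings actually contain the same dash character.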

I found a likely answer to my question. It seems the text extracted from the PDF document was not in an encoding that RStudio could recognise, which would explain why the first few "-" characters were not removed. After I applied jd_text <- iconv(jd_text, "UTF-8"), my problem was solved and I was able to remove "\r\n-" completely. Note that the second positional argument of iconv() is from, so this call treats the text as UTF-8 and converts it to the native encoding rather than converting it to UTF-8.
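An alternative that does not rely on re-encoding is to match the longer dash explicitly in the pattern. This is a sketch under assumptions, not the method used above: `jd_sample` is a hypothetical stand-in for jd_text, and it assumes the leftover dashes are U+2212 MINUS SIGN and the bullet is U+2022.

```r
# Hypothetical stand-in for jd_text, with \u escapes for the non-ASCII characters.
jd_sample <- "You can gain:\r\n\u2212 Incentive\r\n\u2212 Exposure\r\nYou will do:\r\n- Develop \u2022 Plan"

# Make sure the string is marked as UTF-8 before matching.
jd_sample <- enc2utf8(jd_sample)

# A line break optionally followed by U+2212 or "-", or a bullet (U+2022).
# The "-" is placed last in the character class so it is taken literally.
pattern <- "\r\n[\u2212-]?|\u2022"
jd_clean <- gsub(pattern = pattern, replacement = " ", jd_sample)

# Collapse the runs of spaces left behind by the replacements.
jd_clean <- gsub(" +", " ", jd_clean)
jd_clean
```

Writing the non-ASCII characters as `\u` escapes keeps the pattern independent of the encoding the script file happens to be saved in.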
