
Split text based on words using Python code

I have a long text like the one below. I need to split it based on some words, say ("In", "On", "These").

Below is sample data:

On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains.

Can this problem be solved with code? I have 1000 rows in a CSV file.

As per my comment, I think a good option is to use a regular expression with this pattern:

 re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', YourStringVariable)
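To see how this pattern behaves, here is a minimal sketch (the shortened sample string and the variable names are illustrative, not from the question). The lookahead `(?=...)` matches the empty position just before a keyword, so the keyword stays at the start of each piece, and `(?<!^)` avoids an empty first element when the text itself begins with a keyword. Splitting on an empty match needs Python 3.7 or later.

```python
import re

# Keywords from the question; matching is case-sensitive, so a lowercase
# "in" or "these" in the middle of a sentence is left alone.
pattern = re.compile(r'(?<!^)\b(?=(?:On|In|These)\b)')

# Illustrative sample, shortened from the question's text.
text = ("On the other hand, we denounce men who are beguiled. "
        "These cases are perfectly simple. "
        "In a free hour, every pleasure is to be welcomed.")

parts = pattern.split(text)
for part in parts:
    print(repr(part))
```

Each piece keeps its trailing space; call `part.strip()` if that is not wanted.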

Yes, this can be done in Python. You can load the text into a variable and use the built-in split() method of strings. For example:

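# filename is assumed to hold the path to your text file
with open(filename, 'r') as file:
    text = file.read()

# parts is a list of strings, split wherever 'These' occurred;
# note that str.split removes the delimiter 'These' from the pieces
parts = text.split('These')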
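Because the question mentions about 1000 rows in a CSV file, the same idea can be applied row by row. The following is only a sketch under assumptions: the text is assumed to sit in the first column, an in-memory `io.StringIO` stands in for the real file, and the helper name `split_keep_keywords` is made up for illustration. It swaps `str.split` for `re.split` with a lookahead, so that all three keywords are handled at once and each keyword is kept at the start of its piece.

```python
import csv
import io
import re

KEYWORDS = ("In", "On", "These")  # words from the question

# Zero-width split point just before any keyword (needs Python 3.7+).
SPLIT_RE = re.compile(
    r'(?<!^)\b(?=(?:' + '|'.join(map(re.escape, KEYWORDS)) + r')\b)'
)

def split_keep_keywords(text):
    """Hypothetical helper: split before each keyword, keeping the keyword."""
    return [piece.strip() for piece in SPLIT_RE.split(text) if piece.strip()]

# Stand-in for open('your_file.csv', newline=''); the file name and the use of
# column 0 are assumptions.
fake_csv = io.StringIO(
    '"On the one hand, yes. These are fine."\n'
    '"In short, no."\n'
)

rows = [split_keep_keywords(row[0]) for row in csv.reader(fake_csv) if row]
print(rows)
```

With a real file, `fake_csv` would be replaced by the file object returned from `open(...)`, and the column index adjusted to wherever the text lives.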

To find whole words that are not part of larger words, I like using the regular expression: [^\w]word[^\w]

Sample Python code, assuming the text is in a variable named text:

import re
exp = re.compile(r'[^\w]in[^\w]', flags=re.IGNORECASE)
all_occurrences = list(exp.finditer(text))
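One caveat with `[^\w]in[^\w]` is that it needs a non-word character on both sides, so a word at the very start or end of the text is not matched, and each match includes those neighbouring characters. A possible alternative, shown here only as a sketch, is the word-boundary anchor `\b`. The sample string and the slicing step are illustrative additions, not part of the answer above.

```python
import re

text = "In a free hour, in these matters, nothing is in doubt within reason. In"

# \b matches at word edges without consuming characters, so "within" and
# "nothing" are skipped, while occurrences at the start or end of the text
# are still found.
exp = re.compile(r'\bin\b', flags=re.IGNORECASE)
starts = [m.start() for m in exp.finditer(text)]

# Cut the text at each match start so that every piece begins with the
# matched word (the first piece starts at index 0 regardless).
bounds = [0] + [s for s in starts if s > 0] + [len(text)]
pieces = [text[a:b] for a, b in zip(bounds, bounds[1:])]
print(pieces)
```

Dropping `re.IGNORECASE` restricts the matches to the exact capitalisation given in the pattern.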
