
Split a text file into words based on words and special characters in the file

import os
import re
import shutil
import tempfile
import csv
from io import StringIO
import sqlite3


### SQL lite




path = "H:/query.txt"
# Read the whole file and lowercase it
with open(path, 'r') as f:
    text = f.read().lower()
# Split on runs of non-word characters
text = re.split(r'\W+', text)
print(text)

I am using the script above to split a file into a list of all its words, but I want the special characters (., #, _) to be kept in the list.

For example, if the word is p.player, I want it kept as p.player, not split into p and player. The same goes for # and _.

What changes should I make to this script?

Thanks in advance

re.split(r'[\x7b-\x7f \x20-\x22 \x24-\x40]', <string_here>)

Basically, I took the ranges of everything outside the upper- and lowercase letter ranges and left out '#' (\x23). \x lets you match a specific ASCII/Unicode character by its hex code.
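To see what this character class actually splits on, here is a small sketch. The sample string is made up for illustration. Note that the range \x24-\x40 also covers '.', ',' and the digits, so '#' and '_' survive but p.player is still broken apart.

```python
import re

# Character class from the answer: '{'..DEL, space..'"', and '$'..'@'.
# '#' (\x23) and '_' (\x5f) fall outside these ranges, so they are kept.
SEPARATORS = re.compile(r'[\x7b-\x7f \x20-\x22 \x24-\x40]')

sample = "select p.player, t#id from my_table"  # hypothetical input
tokens = SEPARATORS.split(sample)
print(tokens)
# ['select', 'p', 'player', '', 't#id', 'from', 'my_table']
```

The empty string appears because ',' and the following space are two adjacent separators; adding + after the class would collapse such runs.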

Edit: I just realized that you want to keep more than just '#'. If there are many special characters to keep, you can go the other way and use a negated character class that lists what to keep instead. It would look something like this:

re.split(r'[^\w.#_]+', <string_here>)

This turns out to be a lot cleaner in this case.
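A minimal sketch of the negated-class approach, assuming the characters to keep are '.', '#' and '_' as in the question. The helper name split_keep_specials and the sample string are hypothetical. Since \w already includes '_', only '.' and '#' need to be listed next to it.

```python
import re

def split_keep_specials(text):
    """Split on runs of characters that are not word characters, '.', or '#'."""
    # \w already covers letters, digits, and '_'.
    parts = re.split(r'[^\w.#]+', text)
    # Drop empty strings produced by leading or trailing separators.
    return [part for part in parts if part]

print(split_keep_specials("select p.player, t#id from my_table"))
# ['select', 'p.player', 't#id', 'from', 'my_table']
```

One thing to watch: a sentence-ending period stays attached to the last word, so "as p and player." yields 'player.' rather than 'player'.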

Just create a regex that matches exactly what you're looking for and use re.findall.

This expression matches every word of one or more characters that may contain a ., #, or _ inside it. Each word must start and end with a letter.

[a-z](?:[a-z.#_]*[a-z])?


Sample Python Script

import re
regex = r"[a-z](?:[a-z.#_]*[a-z])?"
line = "word is p.player I want to make sure the word is split as p.player not as p and player."
words = re.findall(regex, line, re.IGNORECASE)
print(words)

Sample Output

['word', 'is', 'p.player', 'I', 'want', 'to', 'make', 'sure', 'the', 'word', 'is', 'split', 'as', 'p.player', 'not', 'as', 'p', 'and', 'player']
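Applied to the file from the question, the same pattern could be used roughly as follows. This is only a sketch: the function name words_from_file is made up here, and the path is the one given in the question.

```python
import re

# Same pattern as above: a letter, optionally followed by letters, '.', '#'
# or '_' and ending in a letter.
WORD_RE = re.compile(r"[a-z](?:[a-z.#_]*[a-z])?", re.IGNORECASE)

def words_from_file(path):
    """Return all words in the file, lowercased, keeping inner '.', '#', '_'."""
    with open(path, "r") as f:
        text = f.read().lower()
    return WORD_RE.findall(text)

# Usage, with the path from the question:
# print(words_from_file("H:/query.txt"))
```

Because the pattern only contains letter classes, digits are not matched; add 0-9 to the character classes if identifiers in the file can contain them.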

