import os
import re
import shutil
import tempfile
import csv
from StringIO import StringIO
import sqlite3
### SQL lite
file="H:/query.txt"
file = open(file, 'r')
text = file.read().lower()
file.close()
text = re.split('\W+',text)
print text
I am using the script above to split a file into a list of all its words, but I want the special characters ., # and _ to be kept in the list.
For example, the word p.player should stay as p.player rather than being split into p and player. The same goes for # and _.
What changes should I make to this script?
Thanks in advance.
re.split('[\x7b-\x7f \x20-\x22 \x24-\x40]',<string_here>)
Basically, I took the ranges of everything outside the upper/lower case letter range and also excluded '#' from those ranges. \x lets you match a specific ASCII/Unicode character by its corresponding hex number.
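To make the hex ranges concrete, here is a small sketch that lists what \x24-\x40 actually covers and runs the pattern on a short made-up string. The sample string is an assumption for illustration only. Note that the range includes '.', so this particular pattern still splits p.player.

```python
import re

# '\x23' is simply another way of writing '#'.
assert '\x23' == '#'

# Every character covered by the \x24-\x40 part of the character class.
covered = ''.join(chr(c) for c in range(0x24, 0x41))
print(covered)

# The pattern from the answer, applied to a hypothetical sample string.
pattern = r'[\x7b-\x7f \x20-\x22 \x24-\x40]'
parts = re.split(pattern, 'p.player #tmp')
print(parts)
```

Because '#' (\x23) sits in the gap between the second and third ranges, #tmp survives, while the '.' inside \x24-\x40 is treated as a separator.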
Edit: I just realized that there was more than just a "#" in your included range. You could also go the other way around and use an excluded range instead, if you have too many special characters you want to include. It would look something like this:
re.split(r'[^\w.#]+',<string_here>)
Which turns out to be a lot cleaner in this case
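As a rough sketch of the excluded-range idea, assuming the characters to keep are ., # and _ as listed in the question: the class below splits on runs of anything that is not a word character, a period or a hash. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text standing in for the contents of the query file.
text = "select p.player, t.team_id from #temp_table t where p.score > 10"

# Split on runs of characters that are NOT word characters, '.' or '#'.
# '_' needs no entry of its own because \w already matches it.
# The filter drops empty strings that appear when the text starts or
# ends with a separator.
tokens = [t for t in re.split(r'[^\w.#]+', text.lower()) if t]
print(tokens)
```

One caveat with this approach: a sentence-ending period stays attached to the word before it, since '.' is never treated as a separator.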
Just create a regex that matches exactly what you're looking for and use the findall command.
This expression will match all words 1 character or longer that might have a ., # or _ inside the word.
[a-z](?:[a-z.#_]*[a-z])?
Sample Python Script
import re
regex = r"[a-z](?:[a-z.#_]*[a-z])?"
line = "word is p.player I want to make sure the word is split as p.player not as p and player."
words = re.findall(regex, line, re.IGNORECASE)
print(words)
Sample Output
['word', 'is', 'p.player', 'I', 'want', 'to', 'make', 'sure', 'the', 'word', 'is', 'split', 'as', 'p.player', 'not', 'as', 'p', 'and', 'player']
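The same pattern can be applied to a whole file, as in the original question. This is only a sketch: the function name words_in_file is hypothetical, and it assumes the file is small enough to read into memory at once.

```python
import re

# Same pattern as in the answer above: a word must start and end with a
# letter, with '.', '#' or '_' allowed only in between.
WORD_RE = re.compile(r"[a-z](?:[a-z.#_]*[a-z])?", re.IGNORECASE)

def words_in_file(path):
    """Return the lowercased words found in the file at `path`."""
    # 'with' closes the file even if reading fails.
    with open(path, "r") as handle:
        return WORD_RE.findall(handle.read().lower())
```

Calling words_in_file("H:/query.txt") would stand in for the open/read/split lines in the question. Because the pattern requires a letter at both ends, digits and a leading # are dropped, and so is a trailing period.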