
Extracting words from txt file using python

I want to extract all the words that are between single quotation marks from a text file. The text file looks like this:

u'MMA': 10,
=u'acrylic'= : 19,
== u'acting lessons': 2,
=u'aerobic': 141,
=u'alto': 2= 4,
=u&#= 39;art therapy': 4,
=u'ballet': 939,
=u'ballroom'= ;: 234,
= =u'banjo': 38,

And ideally, my output would look like this:

MMA,
acrylic,
acting lessons,
...

From browsing posts, it seems like I should use some combination of NLTK / regex for python to accomplish this. I've tried the following:

import re

file = open('artsplus_categories.txt', 'r').readlines()

for line in file:
    list = re.search('^''$', file)

file.close()

And get the following error:

  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

I think the error might be caused by how I'm looking for the pattern. My logic is that I search for everything inside of the '....'.

What's tripping up re.py?

Thanks!

--------------------------------

Following Ashwini's comment:

import re

file = open('artsplus_categories.txt', 'r').readlines()

for line in file:
    list = re.search('^''$', line)

print list

#file.close()

But the output contains nothing:

Samuel-Finegolds-MacBook-Pro:~ samuelfinegold$ /var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup\ At\ Startup/artsplus_categories_clean-393952531.278.py.command ; exit;
None
logout


@Rasco: here's the error I'm getting:

File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
logout

I'm using this code:

file2 = open('artsplus_categories.txt', 'r').readlines()
list = re.findall("'[^']*'", file2)
for x in list:
    print (x)

Instead of passing each line to the regex, you passed it the whole list (file). Pass line to re.search, not file.

for line in file:
    lis = re.search('^''$', line) # line not file

Don't use list or file as variable names. They shadow Python built-ins (list is a built-in type, and file is a built-in in Python 2).
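
To make the cause of the TypeError concrete, here is a minimal sketch. readlines() returns a list of strings, while re.search accepts a single string, so each line has to be passed on its own. The sample lines and the names lines, raised and words are made up for illustration and only stand in for the contents of the real file.

```python
import re

# Stand-in for open('artsplus_categories.txt').readlines(),
# which returns a list of strings (one per line).
lines = ["u'MMA': 10,\n", "u'acrylic': 19,\n", "u'acting lessons': 2,\n"]

# Passing the whole list raises TypeError, as in the question.
try:
    re.search(r"'([^']*)'", lines)
    raised = False
except TypeError:
    raised = True

# Passing one line at a time works. The parentheses capture
# the text between the two quotes.
words = []
for line in lines:
    match = re.search(r"'([^']*)'", line)
    if match:  # re.search returns None when a line has no quoted text
        words.append(match.group(1))

print(raised)
print(words)
```

The if match check is there because calling .group(1) on None would raise AttributeError on any line that has no quoted text.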

Update:

with open('artsplus_categories.txt') as f:
    for line in f:
        print re.search(r"'(.*)'", line).group(1)

Output:

MMA
acrylic
acting lessons
aerobic
alto
art therapy
ballet
ballroom
banjo
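
The following sketch may help explain why the earlier attempt printed None and how the pattern above behaves. Python concatenates adjacent string literals, so '^''$' is really the pattern '^$', which only matches an empty line. The test strings and variable names below are invented for illustration.

```python
import re

line = "u'acting lessons': 2,\n"

# Adjacent string literals are concatenated, so '^''$' is the pattern '^$'.
same_pattern = ('^''$' == '^$')

# '^$' only matches an empty line, so a line with content gives None.
empty_match = re.search('^''$', line)

# r"'(.*)'" is greedy: it captures from the first quote on the line
# up to the last quote on the line.
greedy = re.search(r"'(.*)'", "u'a': 'b',").group(1)

# r"'([^']*)'" stops at the next quote instead.
first_only = re.search(r"'([^']*)'", "u'a': 'b',").group(1)

print(same_pattern, empty_match)
print(greedy)
print(first_only)
```

For lines that contain exactly one quoted value, both patterns capture the same text. The difference only matters if a line can contain more than two quotation marks.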

Try this code example:

import re

file =  """u'MMA': 10,
        =u'acrylic'= : 19,
        == u'acting lessons': 2,
        =u'aerobic': 141,
        =u'alto': 2= 4,
        =u&#= 39;art therapy': 4,
        =u'ballet': 939,
        =u'ballroom'= ;: 234,
        = =u'banjo': 38,"""

list = re.findall("'[^']*'", file)
for x in list:
    print (x)

This prints each quoted value, with the surrounding quotes included. Keep in mind that one of the values in your example doesn't open its quote correctly, so the matches break at that point.
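
If you want to run findall against the file itself, note that it needs one string, which is why passing the result of readlines() raised the TypeError shown in the question. A minimal sketch, assuming the file is read with read() instead; the text variable below is a made-up stand-in for the file contents, and the capture group is an addition that drops the quotes from the results.

```python
import re

# Stand-in for open('artsplus_categories.txt').read(). read() returns one
# string, whereas readlines() returns a list, which re.findall rejects.
text = """u'MMA': 10,
u'acrylic': 19,
u'acting lessons': 2,
"""

# With a capture group, findall returns only the captured text,
# so the surrounding quotes are not included.
words = re.findall(r"'([^']*)'", text)

# One word per line, each followed by a comma, as in the desired output.
output = "\n".join(word + "," for word in words)
print(output)
```

This still depends on every value being enclosed in a matching pair of quotes, so a damaged line would shift the matches in the same way as described above.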
