
How can I use the split function in Python to split parts of text and save them to a different file?

Hello, I have a problem using the split function in Python. I collected some tweets using a crawler, and I need to extract parts of each tweet, specifically the ID and the hashtags (#), and save them to a different .json file. I have been using the split function with no success. What am I doing wrong? I want to save what comes after "id" and "text" to a different .json file.
The text looks like this:

{"created_at":"Fri Oct 20 16:35:36 +0000 2017","id":921414607302025216,"id_str":"921414607302025216","text":"@IdrisAhmed16 loooooool who said I was indirecting you??

def on_data(self, data):
    try:
        #print data
        with open('Bologna_streams.json', 'r') as f:
            for line in f:

                tweet = data.spit(',"text":"')[1].split('",""source"')[0]
                print (tweet)

                saveThis = str(time.time()) + '::' +tweet

                saveFile = open('Bologna_text_preprocessing.json', 'w')
                json.dump(data)
                saveFile.write(saveThis)
                saveFile.write(tweet)
                saveFile.write('\n')
                saveFile.close()
                f.close()
        return True
    except BaseException as e:
        print("Error on_data: %s" % str(e))
        time.sleep(5)

def on_error(self, status):
    print (status)

I think you should experiment with Python on the command-line, either interactively or in a small script.

Consider this:

text="""
{"created_at":"Fri Oct 20 16:35:36 +0000 2017","id":921414607302025216,"id_str":"921414607302025216","text":"@IdrisAhmed16 learn #python"}
""".strip()

print(text.split(":"))

That will print in the console:

['{"created_at"', '"Fri Oct 20 16', '35', '36 +0000 2017","id"', '921414607302025216,"id_str"', '"921414607302025216","text"', '"@IdrisAhmed16 learn #python"}']

Or, to print each piece of the split on a new line:

print("splits:\n")
for item in text.split(":"):
  print(item)
print("\n---")

which will print this:

splits:

{"created_at"
"Fri Oct 20 16
35
36 +0000 2017","id"
921414607302025216,"id_str"
"921414607302025216","text"
"@IdrisAhmed16 learn #python"}

---

In other words, split has done what it should: found every ":" and split the string around those characters.
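
A related pitfall is worth checking interactively. The sketch below uses a shortened, made-up sample string (an assumption for illustration, not the real tweet). It shows that when the separator does not occur in the string, split returns the whole string as a one-element list, so indexing the result with [1] would raise an IndexError.

```python
# Shortened sample tweet, made up for illustration.
sample = '{"id":1,"text":"learn #python","source":"web"}'

# The separator occurs once, so split returns two pieces.
pieces = sample.split(',"text":"')
print(len(pieces))   # 2

# Take what follows "text" and cut it off at the next field.
tweet_text = pieces[1].split('","source"')[0]
print(tweet_text)    # learn #python

# A separator that never occurs leaves the string in one piece,
# so indexing the result with [1] would raise IndexError.
missing = sample.split('",""source"')
print(len(missing))  # 1
```

This only works while the fields appear in exactly this order and the text contains no escaped quotes, which is why string splitting is fragile for JSON.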

What you want to do is parse the JSON:

import json

parsed = json.loads(text)
print("parsed:", parsed)

The parsed variable is a normal Python dictionary. Result (wrapped across several lines here for readability):

parsed: {
  'created_at': 'Fri Oct 20 16:35:36 +0000 2017',
  'id': 921414607302025216,
  'id_str': '921414607302025216',
  'text': '@IdrisAhmed16 learn #python'
}

Now you can do operations on the data, including retrieving the text item and splitting it.
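
As a minimal sketch of that step, using the same sample tweet as above (repeated so the snippet runs on its own), the id and text can be read by key and the text split into words:

```python
import json

# Same sample tweet as in the answer, repeated so this runs standalone.
text = '{"created_at":"Fri Oct 20 16:35:36 +0000 2017","id":921414607302025216,"id_str":"921414607302025216","text":"@IdrisAhmed16 learn #python"}'

parsed = json.loads(text)

# Dictionary lookups replace the fragile string splitting.
tweet_id = parsed["id"]
tweet_text = parsed["text"]

# split() with no argument splits on any run of whitespace.
words = tweet_text.split()

print(tweet_id)  # 921414607302025216
print(words)     # ['@IdrisAhmed16', 'learn', '#python']
```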

However, if the objective is to find all hashtags, you're better off using a regular expression:

import re
# Raw string, so the backslash in \w reaches the regex engine unchanged.
hashtag_pattern = re.compile(r'#(\w+)')
matches = hashtag_pattern.findall(parsed['text'])
print("All hashtags in tweet:", matches)

print("Another example:", hashtag_pattern.findall("ok #learn #python #stackoverflow!"))

Result:

All hashtags in tweet: ['python']
Another example: ['learn', 'python', 'stackoverflow']
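
Putting the pieces together for the original question, here is a rough sketch, not a definitive implementation. It assumes the input file holds one JSON tweet per line, which the question does not confirm. The function name extract_ids_and_hashtags and the output layout (one {"id", "hashtags"} object per line) are made up for illustration.

```python
import json
import re

HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_ids_and_hashtags(in_path, out_path):
    """Read one JSON tweet per line and write its id and hashtags.

    Hypothetical helper: the one-tweet-per-line input and the output
    layout are assumptions, not something stated in the question.
    """
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if not isinstance(tweet, dict):
                continue  # skip JSON values that are not objects
            if "id" not in tweet or "text" not in tweet:
                continue  # skip records without the fields we need
            record = {
                "id": tweet["id"],
                "hashtags": HASHTAG_PATTERN.findall(tweet["text"]),
            }
            # One JSON object per output line.
            dst.write(json.dumps(record) + "\n")
```

With the file names from the question, the call would be extract_ids_and_hashtags('Bologna_streams.json', 'Bologna_text_preprocessing.json'). The output file is opened once, outside the loop, so each tweet adds a line instead of overwriting the previous one.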

