How to fetch a substring from a text file in Python?
I have a bunch of tweets in plain-text form, as shown below. I want to extract only the text part of each one.
Sample data from the file -
Fri Nov 13 20:27:16 +0000 2015 4181010297 rt we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
Fri Nov 13 20:27:16 +0000 2015 2891325562 this album is wonderful, i'm so proud of you, i loved this album, it really is the best. -273
Fri Nov 13 20:27:19 +0000 2015 2347993701 international break is garbage smh. it's boring and your players get injured
Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19
Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible
Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!
Fri Nov 13 20:27:17 +0000 2015 309233999 new post: henderson memorial public library
Fri Nov 13 20:27:21 +0000 2015 291806707 who's going to next week?
Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue? @ golden bee
This is my attempt at the preprocessing stage -
for filename in glob.glob('*.txt'):
    with open("plain text - preprocesshurricane.txt",'a') as outfile ,open(filename, 'r') as infile:
        for tweet in infile.readlines():
            temp=tweet.split(' ')
            text=""
            for i in temp:
                x=str(i)
                if x.isalpha() :
                    text += x + ' '
            print(text)
Output -
Fri Nov rt treating one of you lads to this denim simply follow rt to
Fri Nov this album is so proud of i loved this it really is the
Fri Nov international break is garbage boring and your players get
Fri Nov get weather updates from the weather
Fri Nov woah what happened to twitter this update is
Fri Nov completed the daily quest in paradise island
Fri Nov new henderson memorial public
Fri Nov going to next
Fri Nov why so golden
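This is not the desired output, because:
1. It drops any numbers/digits that appear in the text part of the tweet.
2. Every line starts with Fri Nov.
Could you suggest a better way to achieve the same thing? I'm not very familiar with regular expressions, but I think we could use something like re.search(r'2015(magic to remove tweetID)/w*',tweet)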
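For reference, the attempt above loses tokens because str.isalpha() returns True only when every character in the string is a letter. A minimal sketch on a shortened sample line (the variable names are illustrative only):

```python
# Tokens containing digits, punctuation or an apostrophe fail isalpha(),
# while "Fri" and "Nov" pass, which explains both observed problems.
tokens = "Fri Nov 13 20:27:16 +0000 2015 4181010297 rt we're treating".split(" ")
kept = [t for t in tokens if t.isalpha()]
print(kept)
```

So a character-class filter cannot tell the date header from the tweet text; the position of the field on the line is what distinguishes them.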
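In this case you can avoid regular expressions altogether. The lines you provided are consistent in the number of spaces that precede the tweet text, so just use split():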
>>> data = """
lines with tweets here
"""
>>> for line in data.splitlines():
...     print(line.split(" ", 7)[-1])
...
rt we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best. -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to next week?
why so blue? @ golden bee
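The same idea can be wrapped in a small helper that also tolerates blank or short lines. This is only a sketch: the name tweet_text and the n_header_fields parameter are made up for illustration, and it assumes exactly seven space-separated header fields (six for the date, one for the tweet ID) as in the sample data.

```python
def tweet_text(line, n_header_fields=7):
    """Return everything after the first n_header_fields space-separated fields."""
    # maxsplit leaves the remainder of the line untouched in the last element
    parts = line.rstrip("\n").split(" ", n_header_fields)
    # Blank or malformed lines have fewer fields, so there is no text part.
    if len(parts) <= n_header_fields:
        return ""
    return parts[-1]

sample = "Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!"
print(tweet_text(sample))
```

Because the split stops after seven separators, digits and punctuation inside the tweet text are preserved as-is.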
You can do it without regular expressions:
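import glob
for filename in glob.glob('file.txt'):
    with open("plain text - preprocesshurricane.txt",'a') as outfile ,open(filename, 'r') as infile:
        for tweet in infile.readlines():
            # strip the trailing newline so print() does not emit a blank line
            temp=tweet.rstrip('\n').split(' ')
            # fields 0-6 are the date and tweet ID; the rest is the tweet text
            print('{}'.format(' '.join(temp[7:])))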
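The snippet above opens an output file but only prints. If the goal is to save the extracted text, a sketch along these lines might work; write_tweet_texts and the file names are hypothetical, and the demo runs in a temporary directory so it does not touch real data.

```python
import os
import tempfile

def write_tweet_texts(in_path, out_path, n_header_fields=7):
    """Append the text part of every tweet line in in_path to out_path."""
    with open(in_path, "r") as infile, open(out_path, "a") as outfile:
        for line in infile:
            parts = line.rstrip("\n").split(" ", n_header_fields)
            if len(parts) > n_header_fields:  # skip blank or malformed lines
                outfile.write(parts[-1] + "\n")

# Small demo with two of the sample lines
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "tweets.txt")
    dst = os.path.join(tmp, "texts.out")
    with open(src, "w") as f:
        f.write("Fri Nov 13 20:27:21 +0000 2015 291806707 who's going to next week?\n")
        f.write("Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue? @ golden bee\n")
    write_tweet_texts(src, dst)
    with open(dst) as f:
        result = f.read()
print(result)
```

One thing to watch when combining this with glob.glob('*.txt'): an output file that ends in .txt and lives in the same directory will itself be matched by the glob, so a different extension or directory for the output is safer.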
I propose a more specific pattern than @Rushy Panchal's, to avoid problems when the tweet text itself contains numbers: .+ \+(\d+ ){3}
Use it with the re.sub function:
>>> import re
>>> with open('your_file.txt', 'r') as file:
...     data = file.read()
...     print(re.sub(r'.+ \+(\d+ ){3}', '', data))
Output
rt we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best. -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to next week?
why so blue? @ golden bee
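If you want the pattern to be stricter still, one option is to anchor it at the start of each line and spell out the whole header. The following is a sketch, not a tested parser for every tweet dump: the HEADER pattern assumes the exact date layout of the sample lines (weekday, month, day, time, UTC offset, year, tweet ID).

```python
import re

# ^ with re.MULTILINE anchors the match at the start of every line, so
# numbers that appear later in the tweet text can never be consumed.
HEADER = re.compile(
    r"^\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4} \d{4} \d+ ",
    re.MULTILINE,
)

data = (
    "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19\n"
    "Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!\n"
)
texts = HEADER.sub("", data)
print(texts)
```

Lines that do not start with a header in this layout are left unchanged, which makes malformed input easy to spot in the result.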
The pattern you are looking for is .+ \d+:
import re
p = re.compile(r".+ \d+")
tweets = p.sub('', data)  # data is the original string
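Breakdown of the pattern:
. matches any character and + matches one or more, so .+ matches one or more characters. If we left it at that, however, we would remove all of the text.
So we end the pattern with \d+. \d matches any digit, so this matches any consecutive run of digits, the last of which is the tweet ID (provided the tweet text itself contains no number preceded by a space).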
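Because .+ is greedy, the match runs to the last space-preceded digit run on the line, which is not always the tweet ID. The sketch below shows this on one of the sample lines and compares it with an anchored alternative that simply drops the first seven space-delimited fields; the alternative pattern is a suggestion made here, not part of the answer above.

```python
import re

line = ("Fri Nov 13 20:27:20 +0000 2015 3168571911 "
        "get weather updates from the weather channel. 15:27:19")

# Greedy pattern: consumes everything up to and including " 15"
greedy = re.sub(r".+ \d+", "", line)

# Anchored pattern: removes exactly seven "field + space" groups from the start
anchored = re.sub(r"^(?:\S+ ){7}", "", line)

print(repr(greedy))
print(repr(anchored))
```

The greedy version leaves only the tail of the timestamp inside the tweet, while the anchored one keeps the full text.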