Extract text between characters, strings or brackets
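I have a DataFrame column like this:
df = pd.DataFrame({"Hashtags": ["[]", "[u'AAPHealthCare4All']", "[u'CBI',", "u'Delhi',", "u'Emergency']"]})
and I want it to look like this:
pd.DataFrame({"Hashtags": [" ", "AAPHealthCare4All", "CBI", "Delhi", "Emergency"]})
That is, with none of the brackets, commas, or quotes. An empty [] should be replaced by a space. Basically, I want to strip off every "[", "]", "[u'", and so on. I used the following code, but to no avail: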
for index, row in df.iterrows():
    if "RT @" in row["Tweet"]:
        df['Hashtags'] = df['Hashtags'].str.replace(r'[^[]]*\[|\][^]*|\[u\'*\'\]|\[\'*\'\]', '')
df.to_csv('string_HT.csv', index=False)
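You can apply the following expression to your hashtags:
import ast

# Join all fragments into one string, turn it into a tuple of list literals,
# parse it, and flatten it, replacing each empty list with [" "].
df['Hashtags'] = sum([x if x else [" "] for x
                      in ast.literal_eval(''.join(df['Hashtags'])
                                          .replace('][', '],['))],
                     [])
Result:
[' ', 'AAPHealthCare4All', 'CBI', 'Delhi', 'Emergency']
Note, however, that in general the number of rows in the DataFrame will change and the index will not be preserved. You may be using the wrong kind of DataFrame for this data.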
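To illustrate the caveat about row counts, here is a minimal sketch. It assumes a hypothetical column in which every row holds one complete list literal (unlike the fragmented rows in the question), and it uses `Series.explode`, which is not part of the answer above, to keep the original index label on each tag.

```python
import ast
import pandas as pd

# Hypothetical data: each row is one complete list literal.
df = pd.DataFrame({"Hashtags": ["[]",
                                "[u'AAPHealthCare4All']",
                                "[u'CBI', u'Delhi', u'Emergency']"]})

# Parse each row separately, so the result still has one entry per row.
parsed = df["Hashtags"].apply(ast.literal_eval)

# Flattening gives one element per tag, which no longer matches the row count.
flat = sum([tags if tags else [" "] for tags in parsed], [])
print(len(df), len(flat))

# explode() repeats the original index label for every tag of a row;
# an empty list becomes NaN, which is replaced by a space here.
exploded = parsed.explode().fillna(" ")
print(exploded.index.tolist())
print(exploded.tolist())
```

Here three rows turn into five tags, so the flattened list cannot be assigned back to the same column, while the exploded Series still records which row each tag came from.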
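You can use the str.extract function (in recent pandas versions, pass expand=False to get a Series back rather than a DataFrame):
df.Hashtags.str.extract("'(.*)'", expand=False).fillna('')
Out[1052]:
0
1    AAPHealthCare4All
2                  CBI
3                Delhi
4            Emergency
Name: Hashtags, dtype: object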
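For comparison, the same cleanup can be sketched as a substitution instead of an extraction. The pattern and the `clean_tag` helper below are illustrative assumptions, not something taken from the question or the answers; they only handle fragments shaped like the ones shown in the question.

```python
import re

# Assumed pattern: an optional "[", "u" and opening quote at the start,
# or an optional closing quote, "," and "]" at the end of the fragment.
WRAPPER = re.compile(r"^\[?u?'?|'?,?\]?$")

def clean_tag(raw):
    """Remove the list/unicode wrapper; an empty result becomes a single space."""
    return WRAPPER.sub("", raw) or " "

raw_tags = ["[]", "[u'AAPHealthCare4All']", "[u'CBI',", "u'Delhi',", "u'Emergency']"]
print([clean_tag(tag) for tag in raw_tags])
```

Because the pattern is anchored with `^` and `$`, it removes at most one wrapper at each end and leaves any `u` that belongs to the tag itself alone. The same pattern should also work with `Series.str.replace(..., regex=True)`.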
I think the simplest approach is a double strip followed by replace:
df['Hashtags'] = df['Hashtags'].str.strip("[u,]").str.strip("'").replace('', ' ')
print (df['Hashtags'].tolist())
[' ', 'AAPHealthCare4All', 'CBI', 'Delhi', 'Emergency']
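The double strip is necessary because a single strip with all the characters at once would also remove every u at the beginning and end of the tag itself: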
df = pd.DataFrame({"Hashtags": ["[]", "[u'uuAAPHealthCare4All']",
                                "[u'uCBIuu',", "u'uDelhi',", "u'Emergency']"]})
print (df)
                   Hashtags
0                        []
1  [u'uuAAPHealthCare4All']
2               [u'uCBIuu',
3                u'uDelhi',
4             u'Emergency']
df['Hashtags'] = df['Hashtags'].str.strip("[u,']")
print (df['Hashtags'])
0
1    AAPHealthCare4All
2                  CBI
3                Delhi
4            Emergency
Name: Hashtags, dtype: object
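By contrast, running the double strip on the same data should keep the u characters that are part of the tags. This is a small self-contained sketch of that check; the variable name `cleaned` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Hashtags": ["[]", "[u'uuAAPHealthCare4All']",
                                "[u'uCBIuu',", "u'uDelhi',", "u'Emergency']"]})

# The first strip removes only the wrapper characters outside the quotes:
# the quote characters are not in its set, so it stops there and never
# reaches a u inside the tag. The second strip then removes the quotes.
cleaned = df["Hashtags"].str.strip("[u,]").str.strip("'").replace("", " ")
print(cleaned.tolist())
# [' ', 'uuAAPHealthCare4All', 'uCBIuu', 'uDelhi', 'Emergency']
```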