简体   繁体   中英

How do I get text I compiled from a txt file into its corresponding row in a dataframe?

Here is part of the txt file I'm working with:

<PERSON ID="S1" LANG="NRN" ROLE="JU, Student" SEX="M" RESTRICT="NONE" AGE="1">
<FIRSTLANG> HIN </FIRSTLANG>
</PERSON>
<PERSON ID="S2" LANG="NS" ROLE="ST, Advisor" SEX="F" RESTRICT="NONE" AGE="3"/>
<PERSON ID="S3" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S4" LANG="NS" ROLE="JU, Student" SEX="M" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S5" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S6" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S7" LANG="NS" ROLE="UN, Unknown" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="R1" LANG="NS" ROLE="ST, MICASE Researcher" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="SU-f" LANG="NS" ROLE="UN, Unknown" SEX="F" RESTRICT="NONE" AGE="0"/>
<PERSON ID="SS" LANG="NS" ROLE="UN, Unkown" SEX="F" RESTRICT="NONE" AGE="0"/>

And here is the dataframe I have that I want to add columns of information to:

 speaker_id                                         utterances  #_of_words  \
0         S1  [alright, sue, now, it, s, like, uh, i, droppe...        2570   
1         S2  [this, year, this, term, ri, oh, but, you, dro...       20475   
2         S3  [yeah, hi, hi, yeah, i, already, signed, s2, o...         945   
3         S4  [back, in, i, was, like, w, what, is, that, ye...        2133   
4         S5  [okay, well, i, m, not, here, for, a, drop, ad...        1229   
5         S6  [me, yeah, that, s, right, i, have, a, questio...        1027   
6         S7  [hello, hi, what, was, your, name, i, thought,...          93   

   1p_sg  1p_pl    2p  #_of_pronouns  
0    220      6    31            257  
1    575     37  1534           2146  
2    102      0    12            114  
3    181     11    60            252  
4    120      3    17            140  
5     97      1    11            109  
6      6      1     3             10 

I'm trying to add two columns, 'role' and 'gender', to my dataframe. I want to extract that information from the txt file I have above. As you can see, there is a speaker id that is associated with a specific role and gender. So, for example, in the the first row where it says S1, I would want the 'role' column to say "JU, Student" and the 'gender' column to say "M" since those are the corresponding role and gender. I already have the following commands to compile that info:

role=re.compile('ROLE="(.+?)"')
gender=re.compile('SEX="(.+?)"')

I just don't know how to get it to the corresponding row in the dataframe. How do I do this?

You were very close but this is how you would do it using regex:

xml_file = """<PERSON ID="S1" LANG="NRN" ROLE="JU, Student" SEX="M" RESTRICT="NONE" AGE="1">
<FIRSTLANG> HIN </FIRSTLANG>
</PERSON>
<PERSON ID="S2" LANG="NS" ROLE="ST, Advisor" SEX="F" RESTRICT="NONE" AGE="3"/>
<PERSON ID="S3" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S4" LANG="NS" ROLE="JU, Student" SEX="M" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S5" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S6" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S7" LANG="NS" ROLE="UN, Unknown" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="R1" LANG="NS" ROLE="ST, MICASE Researcher" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="SU-f" LANG="NS" ROLE="UN, Unknown" SEX="F" RESTRICT="NONE" AGE="0"/>
<PERSON ID="SS" LANG="NS" ROLE="UN, Unkown" SEX="F" RESTRICT="NONE" AGE="0"/>"""

person_list = re.findall(r'(ID=\".+\")\s(LANG=\".+\")\s(ROLE=\".+\")\s(SEX=\".+\")\s(RESTRICT=\".+\")\s(AGE=\".+\")', xml_file)

df2 = pd.DataFrame([{x.split('=')[0] : x.split('=')[1].replace('"', '') for x in person} for person in person_list])

# df2 = df2[['ID', 'ROLE', 'GENDER']] #Choose which columns you want to keep.

df_merge = pd.merge(left=df1, right=df2, how='left', right_on='ID', left_on='speaker_id')

Let me know if that works for you.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM