繁体   English   中英

在 pandas 中搜索 df 并在特定字符串之后查找数字

[英]Searching df and looking for number after specific string in pandas

我希望从 dataframe 中提取特定字符串之后的数字。 我需要扫描整个 dataframe 并查找名为“Concession Type:”的特定字符串,然后获取结果(通常是 Concession Type:CC 或 None)并基于该字符串创建一个列。 此列将填充“CC”或“None”。 如果它具有 CC 特许类型,我想创建另一列并拉出一个字符串(框架中的另一个字符串,文本为“总金额:x”。我想从中拉出“x”。这些文本被埋在dataframe 中的各个列,所以没有一列我可以调用(dataframe 是通过从 pdf 中提取文本创建的,每个新行创建一个列。

我在下面的内容,查看 dataframe 中的所有文本并查找让步类型:无并创建让步类型列,与让步类型相同:$,然后检查它是否满足下面列出的某些条件,然后创建“让步检查”专栏 这是 dataframe 的样本。

6/9/2020 1 Per Page - Listing Report**IRES MLS  : 91 PRICE: $59,900**12 Warrior Way**ATTACHED DWELLING ACTIVE / BACKUP**Locale: Lafa County: Bould**Area/SubArea: 3/0**Subdivision: Lafayett Greens Townhomes**School District: Bould Vall Dist New Const: No**Builder: Model:**Lot SqFt: 625 Approx. Acres: 0.01**New Const Notes:**Elec: Xcel Water: City of Lafay**Gas: Xcel Taxes: $1,815/2019 Listing Comments: Bright, Modern and Cozy!
6/9/2020 1 Per Page - Listing Report**IRES MLS : 906 PRICE: $350,000**15 Calks Ave, Long 80501**RESIDENTIAL-DETACHED SOLD**Locale: Longmont County: Bould**Area/SubArea: 4/6**Sold Date: 04/01/2020 Sold Price: $360,000**Bedrooms: 3 Baths: 2 Rough Ins: 0**Terms: VA FIX DOM: 1 DTO: 1 DTS: 24**Baths Bsmt Lwr Main Upr Addl Total Down Pmt Assist: N**Full 0 0 0 1 0 1 Concession Type: None**3/4 0 1 0 0 0 1****https://www.iresis.com/MLS/Search/index.cfm?Action=LaunchReports 249/250
6/9/2020 1 Per Page - Listing Report**IRES MLS : 908 PRICE: $360,000**7 S Roosevelt Ave, Lafa 80026**RESIDENTIAL-DETACHED SOLD**Locale: Lafay County: Boul**Area/SubArea: 3/0**Sold Date: 05/08/2020 Sold Price: $360,000**Bedrooms: 2 Baths: 1 Rough Ins: 0**Terms: CONV FIX DOM: 5 DTO: 5 DTS: 34**Baths Bsmt Lwr Main Upr Addl Total Down Pmt Assist: N**Full 0 0 1 0 0 1 Concession Type: None**3/4 0 0 0 0 0 0**Property Features**1/2 0 0 0 0 0 0 Style: 1 Story/Ranch Construction: Wood/Frame, Metal Siding Roof:**https://www.iresis.com/MLS/Search/index.cfm?Action=LaunchReports 250/250

df = pd.DataFrame([sub.split("**") for sub in df])
df[['MLS #', 'Price']] = df[1].str.split('PRICE:', n=1, expand=True)
df[['Prop Type', 'Status']] = df[3].str.rsplit(' ', n=1, expand=True)
df['Concession Type'] = df.apply(lambda row: row.astype(str).str.contains('Concession Type: None', regex=False).any(), axis=1)
df['Concession Type'] = df.apply(lambda row: row.astype(str).str.contains('Concession Type: $', regex=False).any(), axis=1)
conditions = [(df['Concession Type'] == True) & (df['Status'] == 'SOLD'),
             (df['Concession Type'] == False) & (df['Status'] == 'SOLD')]
choices = ['no concession', 'concession']
df['Concession_check'] = np.select(conditions, choices, default='Active/Pending/Withdrawn')

我没有足够的关于输入数据结构的信息。 我假设每一行都是数组中的一个元素:

df = ["row1" , "row2" , "row3"] # First code block in your question
df = pd.DataFrame([sub.split("**") for sub in df])
dx =  [df[i].str.contains("Concession") for i in df]
df[pd.DataFrame(dx).T.fillna(False)] # Fill None values because it errors out without boolean values

从这里您可以添加更多检查。

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM