Python: extracting values from multiple substrings

Question

I have a dataframe named df with a column named "text". Each row contains a string like the one below, which is in the MARC data format.

d20s 22 i2as¶001VNINDEA455133910000005¶008180529c 1996 frmmm wz 7b ¶009se z 1 m mm c¶008a ¶008at ¶008ap ¶008a ¶0441 $a2609-2565$c2609-2565¶0410 $afre$aeng$apor ¶0569 $a2758-8965$c4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$atdi$bc¶110 $317737535$w20..b.....$astock market situation¶3330 $aimport and export agency ABC¶7146 $q1$uwwww.abc.org$ma1¶7146 $q9$uAgency XYZ¶8799 $q1$uAgency ABC$fHTML$

Here, I want to extract the information that follows $u in zone ¶7146, or that follows $c in zone ¶0441.

The result table would look like this:

¶7146$u	¶0441$c
www.abc.org	2609-2565
Agency XYZ	2609-2565

Here is the code I wrote:

import os
import pandas as pd
import numpy as np
import requests


df = pd.read_csv('dataset.csv')

def extract(text, start_pattern, sc):
    ist = text.find(start_pattern)
    if ist < 0:
        return ""
    ist = text.find(sc, ist)
    if ist < 0:
        return ""
    im = text.find("$", ist + len(sc))
    iz = text.find("¶", ist + len(sc))
    if im >= 0:
        if iz >= 0:
            ie = min(im, iz)
        else:
            ie = im
    else:
        ie = iz
    if ie < 0:
        return ""
    return text[ist + len(sc): ie]

def extract_text(row, list_in_zones):
    text = row["text"]
    if pd.isna(text):
        return [""] * len(list_in_zones)
    patterns = [("¶" + p, "$" + c) for p, c in [zone.split("$") for zone in list_in_zones]]
    return [extract(text, pattern, sc) for pattern, sc in patterns]


list_in_zones = ["7146$u", "0441$u", "200$y"]


df[list_in_zones] = df.apply(lambda row: extract_text(row, list_in_zones),
                             axis=1,
                             result_type="expand")

df.to_excel("extract.xlsx", index = False)

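For zone ¶7146 and the text after $u, my code only extracts "www.abc.org". It fails to extract the repeated occurrence whose value is "Agency XYZ". What is wrong here?

Additional note on the structure of the string: each zone starts with the character ¶ (for example ¶7146, ¶0441, ...), each field starts with $ (for example $u, $c), and a field ends at the next $ or ¶. I want to extract the information held in the $ fields.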

Answer 1

You can try splitting and then cleaning the strings, as follows:

import pandas as pd
text = ('d20s 22 i2as¶001VNINDEA455133910000005¶008180529c 1996 frmmm wz 7b ¶009se z 1 m mm c¶008a ¶008at ¶008ap ¶008a ¶0441 $a2609-2565$c2609-2565¶0410 $afre$aeng$apor ¶0569 $a2758-8965$c4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$atdi$bc¶110 $317737535$w20..b.....$astock market situation¶3330 $aimport and export agency ABC¶7146 $q1$uwwww.abc.org$ma1¶7146 $q9$uAgency XYZ¶8799 $q1$uAgency ABC$fHTML$')
u = text.split('$u')[1:3] # Taking just the second and third elements in the array because they match your desired output
c = text.split('$c')[1:3]

pd.DataFrame([u,c]).T

OUTPUT

    0   1
0   wwww.abc.org$ma1¶7146 $q9   2609-2565¶0410 $afre$aeng$apor ¶0569 $a2758-8965
1   Agency XYZ¶8799 $q1 4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$a...

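From here you can keep cleaning the strings until they match the desired output.

It would help if we could understand the logic behind this data structure: when do certain fields start and end?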
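
One possible way to do that cleanup, as a rough sketch: cut each piece at the first "$" or "¶" that follows it. The helper name clean and the shortened sample string are only illustrative (the sample is an excerpt of the record in the question), and the [1:3] slice is still hand-picked to match the desired output.

```python
import re

# Excerpt of the sample record from the question (shortened for readability).
text = ("¶0441 $a2609-2565$c2609-2565¶0410 $afre$aeng$apor "
        "¶0569 $a2758-8965$c4578-7854¶7146 $q1$uwwww.abc.org$ma1"
        "¶7146 $q9$uAgency XYZ¶8799 $q1$uAgency ABC$fHTML$")

def clean(piece):
    # Hypothetical helper: keep only the text before the next "$" or "¶".
    return re.split(r"[$¶]", piece, maxsplit=1)[0]

u_values = [clean(p) for p in text.split("$u")[1:3]]
c_values = [clean(p) for p in text.split("$c")[1:3]]

print(u_values)  # ['wwww.abc.org', 'Agency XYZ']
print(c_values)  # ['2609-2565', '4578-7854']
```

Note that splitting on "$u" alone ignores which zone a field belongs to, so without the slice the $u of zone ¶8799 would be picked up as well.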
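
Going by the structure described in the question (zones start with ¶, fields start with $ and end at the next $ or ¶), here is a minimal sketch that collects every occurrence. The extract function in the question uses text.find, which only ever returns the first match, so a repeated ¶7146 zone is never reached. The function name extract_all and the shortened sample are assumptions made for illustration, and the sketch assumes the zone tag is everything before the first $ in a zone.

```python
import re

def extract_all(text, zone, subfield):
    """Return every value of `subfield` in every zone whose tag equals `zone`."""
    values = []
    # Zones start with "¶", so splitting on it yields one chunk per zone.
    for chunk in text.split("¶"):
        # Assumption: the tag is the part before the first "$" (e.g. "7146 ").
        tag = chunk.split("$", 1)[0].strip()
        if tag != zone:
            continue
        # A field runs from "$<code>" to the next "$" or the end of the zone.
        pattern = re.escape("$" + subfield) + r"([^$]*)"
        values.extend(v.strip() for v in re.findall(pattern, chunk))
    return values

# Excerpt of the sample record from the question (shortened for readability).
sample = ("¶0441 $a2609-2565$c2609-2565¶0410 $afre$aeng$apor "
          "¶7146 $q1$uwwww.abc.org$ma1¶7146 $q9$uAgency XYZ"
          "¶8799 $q1$uAgency ABC$fHTML$")

print(extract_all(sample, "7146", "u"))  # ['wwww.abc.org', 'Agency XYZ']
print(extract_all(sample, "0441", "c"))  # ['2609-2565']
```

With pandas, something like df["text"].apply(lambda t: extract_all(t, "7146", "u")) should give one list per row, which could then be joined into a single cell or expanded into separate rows with explode, depending on the layout you want.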
