String manipulation in polars

I have a DataFrame in polars that has no proper header yet. The header should be taken from the first row of the data. Before I set this row as the header, I want to clean up its entries.

        import polars as pl
        # Creating a dictionary with the data
        data = {
            "Column_1": ["ID", 4, 4, 4, 4],
            "Column_2": ["LocalValue", "B", "C", "D", "E"],
            "Column_3": ["Data\nField", "Q", "R", "S", "T"],
            "Column_4": [None, None, None, None, None],
            "Column_5": ["Global Value", "G", "H", "I", "J"],
        }
        # Creating the dataframe
        table = pl.DataFrame(data)
        print(table)
    
    shape: (5, 5)
    ┌──────────┬────────────┬──────────┬──────────┬──────────────┐
    │ Column_1 ┆ Column_2   ┆ Column_3 ┆ Column_4 ┆ Column_5     │
    │ ---      ┆ ---        ┆ ---      ┆ ---      ┆ ---          │
    │ str      ┆ str        ┆ str      ┆ f64      ┆ str          │
    ╞══════════╪════════════╪══════════╪══════════╪══════════════╡
    │ ID       ┆ LocalValue ┆ Data     ┆ null     ┆ Global Value │
    │          ┆            ┆ Field    ┆          ┆              │
    │ null     ┆ B          ┆ Q        ┆ null     ┆ G            │
    │ null     ┆ C          ┆ R        ┆ null     ┆ H            │
    │ null     ┆ D          ┆ S        ┆ null     ┆ I            │
    │ null     ┆ E          ┆ T        ┆ null     ┆ J            │
    └──────────┴────────────┴──────────┴──────────┴──────────────┘

First, I want to replace line breaks and spaces between words with an underscore. Furthermore, I want to split CamelCase words with an underscore (e.g. TestTest -> Test_Test). Finally, all entries should be lowercase. For this I wrote the following function:

def clean_dataframe_columns(df):
    header = list(df.head(1).transpose().to_series())
    cleaned_headers = []
    for entry in header:
        if entry:
            entry = (
                entry.str.replace("\n", "_")
                .replace("(?<=[a-z])(?=[A-Z])", "_")
                .replace("\s", "_")
                .to_lowercase()
            )
        else:
            entry = "no_column"
        cleaned_headers.append(entry)
    df.columns = cleaned_headers
    return df

Unfortunately, I get the following error. What am I doing wrong?

AttributeError                            Traceback (most recent call last)
Cell In[13], line 1
----> 1 df1 = clean_dataframe_columns(df)

Cell In[12], line 7, in clean_dataframe_columns(df)
      4 for entry in header:
      5     if entry:
      6         entry = (
----> 7             entry.str.replace("\n", "_")
      8             .replace("(?<=[a-z])(?=[A-Z])", "_")
      9             .replace("\s", "_")
     10             .to_lowercase()
     11         )
     12     else:
     13         entry = "no_column"

AttributeError: 'str' object has no attribute 'str'

The goal is this dataframe:

shape: (4, 5)
┌─────┬─────────────┬────────────┬───────────┬──────────────┐
│ id  ┆ local_value ┆ data_field ┆ no_column ┆ global_value │
│ --- ┆ ---         ┆ ---        ┆ ---       ┆ ---          │
│ i64 ┆ str         ┆ str        ┆ f64       ┆ str          │
╞═════╪═════════════╪════════════╪═══════════╪══════════════╡
│ 4   ┆ B           ┆ Q          ┆ null      ┆ G            │
│ 4   ┆ C           ┆ R          ┆ null      ┆ H            │
│ 4   ┆ D           ┆ S          ┆ null      ┆ I            │
│ 4   ┆ E           ┆ T          ┆ null      ┆ J            │
└─────┴─────────────┴────────────┴───────────┴──────────────┘

In for entry in header: you iterate over plain Python strings, so you should use the corresponding str methods (such as .lower() instead of .to_lowercase()). Note also that str.replace treats its first argument literally, so a regular expression such as the CamelCase pattern has to go through re.sub.
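A quick way to see the difference, using only the standard library. This is a minimal sketch; the sample strings are taken from the question's first row.

```python
import re

entry = "Global Value"  # a plain Python str, as yielded by the loop

# Plain strings have no polars-style .str namespace or .to_lowercase().
assert not hasattr(entry, "str")
assert not hasattr(entry, "to_lowercase")

# str.replace matches its first argument literally, so a regex pattern
# passed to it simply never matches.
literal = "LocalValue".replace("(?<=[a-z])(?=[A-Z])", "_")

# re.sub interprets the same pattern as a regular expression.
with_regex = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", "LocalValue")

lowered = entry.replace(" ", "_").lower()

print(literal)     # LocalValue
print(with_regex)  # Local_Value
print(lowered)     # global_value
```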


Rewritten solution:

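import re

import polars as pl


def get_cols(raw_col):
    # An empty header cell gets a placeholder name
    if raw_col is None:
        return "no_column"
    # Insert "_" at every lowercase-to-uppercase boundary (CamelCase -> Camel_Case)
    raw_col = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", raw_col)
    return raw_col.replace("\n", "_").replace(" ", "_").lower()


def clean_dataframe_columns(df):
    # The first row holds the raw header values
    raw_cols = df.head(1).transpose().to_series().to_list()

    # Rename the columns, drop the header row, then repair the "id" column
    return df.rename({
        col: get_cols(raw_col) for col, raw_col in zip(df.columns, raw_cols)
    }).slice(1).with_columns(pl.col("id").fill_null(4).cast(pl.Int32))
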
┌─────┬─────────────┬────────────┬───────────┬──────────────┐
│ id  ┆ local_value ┆ data_field ┆ no_column ┆ global_value │
│ --- ┆ ---         ┆ ---        ┆ ---       ┆ ---          │
│ str ┆ str         ┆ str        ┆ f64       ┆ str          │
╞═════╪═════════════╪════════════╪═══════════╪══════════════╡
│ 4   ┆ B           ┆ Q          ┆ null      ┆ G            │
│ 4   ┆ C           ┆ R          ┆ null      ┆ H            │
│ 4   ┆ D           ┆ S          ┆ null      ┆ I            │
│ 4   ┆ E           ┆ T          ┆ null      ┆ J            │
└─────┴─────────────┴────────────┴───────────┴──────────────┘

I solved it on my own with this approach:

import re

import polars as pl


def clean_select_columns(df: pl.DataFrame) -> pl.DataFrame:
    """
    Clean the column names of a dataframe.

    :param df: input Dataframe whose first row holds the raw header
    :return: Dataframe with cleaned column names

    The function takes a loaded Dataframe and performs the following operations:

        Transposes the first row of the dataframe to get the header
        Cleans the header names by:
            1. Converting CamelCase strings to snake_case strings
            2. Removing line breaks, spaces and question marks
            3. Converting all names to lowercase
            4. Naming columns with no names as "no_column_X", where X is the column index
        Returns the dataframe with the cleaned column names
        (the header row itself is not removed).
    """
    header = list(df.head(1).transpose().to_series())
    cleaned_headers = []
    for i, entry in enumerate(header):
        if entry:
            # Inner sub: insert "_" before each capitalised word (not at the start).
            # Outer sub: drop line breaks, spaces and question marks.
            entry = re.sub(
                r"[\n ?]", "", re.sub(r"(?<!^)(?=[A-Z][a-z])", "_", entry)
            ).lower()
        else:
            entry = f"no_column_{i}"
        cleaned_headers.append(entry)
    df.columns = cleaned_headers
    return df
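The two answers use different CamelCase patterns, and they do not always agree. The sketch below compares them with the standard library only. The names userID and HTTPResponse are hypothetical examples added for illustration; only LocalValue comes from the question.

```python
import re

# Pattern from the first answer: underscore between a lowercase and an uppercase letter.
LOWER_UPPER = re.compile(r"(?<=[a-z])(?=[A-Z])")
# Pattern from the second answer: underscore before any capitalised word except at the start.
BEFORE_CAPITALISED = re.compile(r"(?<!^)(?=[A-Z][a-z])")


def split_words(pattern, name):
    """Insert an underscore at every position the pattern matches."""
    return pattern.sub("_", name)


samples = ["LocalValue", "userID", "HTTPResponse"]
comparison = {
    name: (split_words(LOWER_UPPER, name), split_words(BEFORE_CAPITALISED, name))
    for name in samples
}
for name, (first, second) in comparison.items():
    print(f"{name}: {first} | {second}")
# LocalValue: Local_Value | Local_Value
# userID: user_ID | userID
# HTTPResponse: HTTPResponse | HTTP_Response
```

For the headers in the question both patterns give the same result; which one fits better depends on whether your real headers contain trailing acronyms (userID) or leading ones (HTTPResponse).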
