
Python: Iterate over every row of a CSV, count the tokens in every row, create a new CSV with the number of tokens for each row of the original CSV


I have a CSV-File that looks like this:


  • ID Content
  • Text 1 Here comes some text
  • Text 2 Her comes also some text
  • Text 3 And so on, and so on...


I want to write code that iterates over every row of this CSV table and counts the number of tokens in each row (i.e. in each text). It should then create a new CSV table as output that contains only the text ID and the number of tokens in that text.


The Output CSV-File should look like this:

  • ID NumberOfTokens
  • Text 1 8
  • Text 2 12
  • Text 3 15


So far I have this Code:

import csv
from textblob_de import TextBlobDE as TextBlob

data = open('myInputFile.csv', encoding="utf-8").readlines()

blob = TextBlob(str(data))


csv_file = open('myOutputFile.csv', 'w', encoding="utf-8")
csv_writer = csv.writer(csv_file)
# Define the Headers of the CSV
csv_writer.writerow(['Text-ID', 'Tokens'])


def numOfWordTokens(document):

    myList = []

    for eachRow in document:
        myList.append(eachRow)
        return "\n".join(myList)

        #return eachRow
        #print(eachRow)

        # Count Tokens
        #countTokens = len(wordTokens2.split()) # Output: integer
        #return countTokens
        #myList.append(str(countTokens))


wordTokens = numOfWordTokens(data)

# Write Content in the CSV-Table Rows
csv_writer.writerow([wordTokens])
csv_file.close()

So, first of all, I have the following question:

When I use return eachRow, I get no output in the shell and only the first row in the newly created CSV file. When I use print(eachRow) instead, every row is printed in the shell, but the newly created CSV file is empty!

That is the first part I am having trouble with, so I can't move on to the part where I actually count the tokens in each row and write the counts to the new CSV file.

It's super easy with pandas, but if you prefer not to use other modules, that's fine as well :) I've added code for both approaches: one using pandas and one that iterates over the data manually with the csv module:

import pandas as pd
import csv


def main_pandas(path_to_csv: str, target_path: str):
    df = pd.read_csv(path_to_csv, encoding='utf-8')
    df['tokens'] = df['Content'].apply(lambda x: len(x.split()))
    sub_df = df[['ID', 'tokens']]
    sub_df.to_csv(target_path, index=False)


def main_manual(path_to_csv: str, target_path: str):
    # newline='' lets the csv module handle line endings itself
    with open(path_to_csv, 'r', encoding='utf-8', newline='') as r_fp:
        csv_reader = csv.reader(r_fp)
        next(csv_reader)  # Skip headers
        with open(target_path, 'w', encoding='utf-8', newline='') as w_fp:
            csv_writer = csv.writer(w_fp)
            csv_writer.writerow(['ID', 'tokens'])  # Write headers
            for line in csv_reader:
                text_id, text_content = line
                csv_writer.writerow([text_id, len(text_content.split())])


if __name__ == '__main__':
    main_manual('text.csv', 'tokens.csv')
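As for the first part of the question (why return gives only the first row while print leaves the file empty): a return inside the for body ends the function during the first iteration, and a function that only prints returns None, so there is nothing useful to write afterwards. Here is a minimal sketch of the difference. The in-memory sample data and the function names first_row_only and count_tokens_per_row are made up for illustration, and whitespace splitting is assumed as the tokenizer, as in the code above.

```python
import csv
import io


def first_row_only(rows):
    # Mirrors the loop from the question: `return` sits inside the
    # `for` body, so the function exits during the first iteration.
    collected = []
    for row in rows:
        collected.append(row)
        return "\n".join(collected)


def count_tokens_per_row(rows):
    # Collect one (id, token_count) pair per row and return only
    # after the loop has finished.
    counts = []
    for text_id, content in rows:
        counts.append((text_id, len(content.split())))
    return counts


# Hypothetical sample data standing in for the input file.
sample = io.StringIO(
    "ID,Content\n"
    "Text 1,Here comes some text\n"
    "Text 2,Her comes also some text\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
rows = list(reader)

print(first_row_only([content for _, content in rows]))  # first text only
result = count_tokens_per_row(rows)
print(result)

# Write one output row per input row (to memory here, not to disk).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["ID", "NumberOfTokens"])
writer.writerows(result)
```

The key point is that the return statement is dedented to the level of the for loop, so it runs once, after every row has been processed. A real tokenizer (for example the one from textblob_de) would likely give different counts than str.split().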
