
How to load a txt file that uses indentation to mark observations into R

I am running analyses with county-level data, and I would like to include variables built from data on adjacent counties. Before I can do that, I need a file listing each county's adjacent counties.

From the census, I have such a txt file, but the format is... unique. While columns are tab delimited, each new source county is marked by indentation.

Example:

"Autauga County, AL"    01001   "Autauga County, AL"    01001
        "Chilton County, AL"    01021
        "Dallas County, AL" 01047
        "Elmore County, AL" 01051
        "Lowndes County, AL"    01085
        "Montgomery County, AL" 01101
"Baldwin County, AL"    01003   "Baldwin County, AL"    01003
        "Clarke County, AL" 01025
        "Escambia County, AL"   01053
        "Mobile County, AL" 01097
        "Monroe County, AL" 01099
        "Washington County, AL" 01129
        "Escambia County, FL"   12033  

I have no idea how to load this in. And there are too many counties in my study area to do it manually.

Would greatly appreciate any help!

For tab-delimited files, the field separator character is \t. Since your example has no header row, also set header = FALSE:

countries <- read.csv(file = ".../countries.txt", sep = "\t", header = FALSE)

I don't quite get this part:

each new source county is marked by indentation

If you really mean that every new row starts one tab stop further to the right, the data will look like this after you read it with read.csv() exactly as above:

                  V1                 V2                 V3   V4
1 Autauga County, AL              01001                      NA
2                    Autauga County, AL              01001   NA
3                                       Chilton County, AL 1021

In that case you can try something like the following, written on the assumption that your data has no column names, as your example suggests:

# Assumes row i holds its two values in columns i and i + 1
res <- data.frame()
for (i in seq_len(nrow(countries))) {
  new <- countries[i, c(i, i + 1)]
  # Give every slice the same names so rbind() can stack them
  colnames(new) <- c("county", "geoid")
  res <- rbind(res, new)
}

This should give you:

              county geoid
1 Autauga County, AL 01001
2 Autauga County, AL 01001
3 Chilton County, AL  1021
...
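
The 1021 in the last row above has lost its leading zero because read.csv() guessed a numeric type for that column. As a minimal base R sketch, assuming the indentation in the real file is simply two empty tab-separated fields at the start of each adjacent-county row (an assumption, not something confirmed by the question), the codes can be kept as text and the source county carried down. The column names and the fill_down helper are made up for illustration:

```r
# Sample lines in the question's layout (tab-delimited, quoted names).
# Assumption: adjacent-county rows start with two empty fields.
sample_lines <- c(
  "\"Autauga County, AL\"\t01001\t\"Autauga County, AL\"\t01001",
  "\t\t\"Chilton County, AL\"\t01021",
  "\t\t\"Dallas County, AL\"\t01047",
  "\"Baldwin County, AL\"\t01003\t\"Baldwin County, AL\"\t01003",
  "\t\t\"Escambia County, FL\"\t12033"
)

adj <- read.delim(
  text = sample_lines,
  header = FALSE,
  colClasses = "character",  # keep codes such as "01021" as text
  na.strings = "",           # empty leading fields become NA
  col.names = c("county", "geoid", "adj_county", "adj_geoid")
)

# Hypothetical helper: carry the last non-missing value downwards.
# It assumes the first element of x is not NA.
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))
  x[!is.na(x)][idx]
}

adj$county <- fill_down(adj$county)
adj$geoid  <- fill_down(adj$geoid)
```

For the real file, replace text = sample_lines with the path to the file.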

Can you tell us what the output should look like? Are the indented lines subordinate to the unindented ones? Are you expecting an output where, for example, "Autauga County" is in the first column and each indented county gets a row of its own with "Autauga County" as the parent? We need more information to understand what you are expecting. Reading in the data will not be hard once we know what the output should look like.

The page describing the layout of the file (County Adjacency File Record Layout) specifies that the file is tab delimited, so you can just use read_tsv. You can then use fill to carry each main county down to all of its adjacent counties.

    library(tidyverse)

    read_tsv("county_adjacency.txt", col_names = c("county", "geoid", "adj_county", "adj_geoid")) %>% 
       fill(county:geoid, .direction = "down")

Result:

   county             geoid adj_county            adj_geoid
   <chr>              <chr> <chr>                 <chr>    
 1 Autauga County, AL 01001 Autauga County, AL    01001    
 2 Autauga County, AL 01001 Chilton County, AL    01021    
 3 Autauga County, AL 01001 Dallas County, AL     01047    
 4 Autauga County, AL 01001 Elmore County, AL     01051    
 5 Autauga County, AL 01001 Lowndes County, AL    01085    
 6 Autauga County, AL 01001 Montgomery County, AL 01101    
 7 Baldwin County, AL 01003 Baldwin County, AL    01003    
 8 Baldwin County, AL 01003 Clarke County, AL     01025    
 9 Baldwin County, AL 01003 Escambia County, AL   01053    
10 Baldwin County, AL 01003 Mobile County, AL     01097   
# … with 22,190 more rows
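
Once the table is in this long form, it can feed the adjacent-county variables mentioned in the question. The following is only a sketch in base R under stated assumptions: adj is a small stand-in for the filled table, and county_data with its income column is entirely hypothetical, standing in for whatever county-level data you have.

```r
# Tiny stand-in for the filled adjacency table (geoid columns only)
adj <- data.frame(
  geoid     = c("01001", "01001", "01001", "01003", "01003"),
  adj_geoid = c("01001", "01021", "01047", "01003", "01025"),
  stringsAsFactors = FALSE
)

# Hypothetical county-level data; `income` is a made-up variable
county_data <- data.frame(
  geoid  = c("01001", "01003", "01021", "01025", "01047"),
  income = c(50, 60, 40, 35, 20),
  stringsAsFactors = FALSE
)

# Each county is also listed as adjacent to itself; drop those rows
neighbors <- adj[adj$geoid != adj$adj_geoid, ]

# Attach each neighbor's value, then average per source county
neighbors <- merge(neighbors, county_data,
                   by.x = "adj_geoid", by.y = "geoid")
adj_income <- aggregate(income ~ geoid, data = neighbors, FUN = mean)
names(adj_income)[2] <- "adj_income_mean"

# Join back onto the county data (NA where no neighbor rows exist)
result <- merge(county_data, adj_income, by = "geoid", all.x = TRUE)
```

Whether to keep or drop the self-adjacency rows depends on whether a county's own value should count toward its neighborhood average; the sketch drops them.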
