Collecting strings from files and writing to an output file using Julia and Regex

(Julia and general programming newb)

I'm trying to read a directory full of JSON files containing lots of HTML pages (about 30), regex-match short strings (many per file, up to 60k in total) and write them to one big file, which I'll try to parse later so I can add them to a MySQL DB.

Here's my code:

patFilename = r"[0-9]+_[0-9]+.json"
patID = r"\/entry\/[0-9]+\/go"

filenames = readdir("C:/getentries/data/")

caseIDs = []

for filename in filenames
    if match(patFilename, filename) === nothing
        continue
    end

    file = open("C:/getentries/data/" * filename)
    case = read(file, String)

    push!(caseIDs, match(patID, case))

end

println(caseIDs)

touch("C:/getentries/data/caseIDs.txt")
open("C:/getentries/data/caseIDs.txt", "w") do caseID
    println(caseID, caseIDs)
end

No errors are thrown, but only a few strings are written to the file. So I'm assuming something's going wrong when I try to collect all the strings.

I thought I could try the approach suggested in my last question, but this didn't help, although that's likely due to my complete inexperience!

May I ask if anyone has any thoughts?

It's hard to say without a minimal, reproducible example. But my guess is that, since you're calling `match` once per file, you're only getting the first match in each file. Instead, you could call `eachmatch` to get an iterator over all matches in the file contents.
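To illustrate the difference, here is a minimal sketch on a made-up string (the sample text and the names `first_hit` and `all_hits` are only for illustration). The pattern is your `patID` written without the backslashes, which should be equivalent since `/` needs no escaping in a Julia regex:

```julia
# Hypothetical sample text, made up for illustration
html = "see /entry/101/go and /entry/202/go and /entry/303/go"
patID = r"/entry/[0-9]+/go"

# `match` stops at the first hit (or returns `nothing` if there is none)
first_hit = match(patID, html)
println(first_hit.match)

# `eachmatch` iterates over every non-overlapping hit
all_hits = [m.match for m in eachmatch(patID, html)]
println(length(all_hits))
```

With three IDs in the string, `match` only ever sees the first one, which would explain why you end up with roughly one string per file.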

Applied to your loop, this would look something like the following:

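matches = RegexMatch[]

for filename in filenames
    # Skip files that are not data files, as in your original loop
    match(patFilename, filename) === nothing && continue

    # Note that you forgot to close the file in your original example
    # Using higher-level functions such as this method of `read` may be safer
    # (`readdir` returns bare names, so the directory has to be prepended)
    str = read(joinpath("C:/getentries/data/", filename), String)

    # Loop over all matches of the regexp found in the string
    for m in eachmatch(patID, str)
        push!(matches, m)
    end
end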
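For completeness, here is a rough end-to-end sketch that also writes the results out, one ID per line, which may be easier to parse later than the printed array. The function name `collect_ids`, the temporary directory and the file contents are all made up for the demo. The filename pattern is also slightly tightened compared to yours (anchored, with the dot escaped), which is an assumption about what you intended:

```julia
# Collect every match from every data file in `dir`, write one per line to `outfile`.
# `collect_ids` is a hypothetical helper name, not part of any library.
function collect_ids(dir, outfile;
                     patFilename = r"^[0-9]+_[0-9]+\.json$",
                     patID = r"/entry/[0-9]+/go")
    ids = String[]
    for filename in readdir(dir)
        # Skip files whose names do not look like data files
        occursin(patFilename, filename) || continue
        # `read(path, String)` opens and closes the file for us
        str = read(joinpath(dir, filename), String)
        for m in eachmatch(patID, str)
            push!(ids, String(m.match))
        end
    end
    # Write one ID per line instead of printing the whole array at once
    open(outfile, "w") do io
        for id in ids
            println(io, id)
        end
    end
    return ids
end

# Self-contained demo on a temporary directory with made-up file contents
dir = mktempdir()
write(joinpath(dir, "1_1.json"), "x /entry/11/go y /entry/12/go")
write(joinpath(dir, "2_1.json"), "/entry/21/go")
write(joinpath(dir, "notes.txt"), "/entry/99/go")  # ignored: name does not match
outfile = joinpath(dir, "caseIDs.txt")
ids = collect_ids(dir, outfile)
println(ids)
```

In your case you would call it with `"C:/getentries/data/"` and the path of your output file instead of the temporary directory.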
