没有数据分析师,你才会学到的。
一、任务
处理一个文件信息内容,对全部的excel文件,按照size列去重,然后按照size升序,最后输出去重排序后的完整内容。
二、文件内容
fid,ext,size,action,date,time1,time2,time3
cat total_page_time_2020-10-19.csv F640E8C2D75750B66A8661FFCBB688AD,xlsx,13300,,2020-10-18 19:02:05,1488,719,769 F640E8C2D75750B66A8661FFCBB688AD,xlsx,13300,,2020-10-18 19:02:17,1488,719,769 F640E8C2D75750B66A8661FFCBB688AD,xlsx,13300,,2020-10-18 19:02:30,1488,719,769 99A94A4E27B0C6994710C598CD785CD5,doc,1247744,,2020-10-18 19:04:46,1488,719,769 939EE932559677149FF9EDEA3EEDCC55,xlsx,10177,,2020-10-18 19:03:52,1488,719,769 6297EF1A95AFEFA283AE4A7560EC4E14,docx,31186,,2020-10-18 19:04:06,1488,719,769 0cc59242461bb3999fc2d72412a509e9,pptx,62301,,2020-10-18 19:03:31,1488,719,769 a456d95e5c6fd3e7ca44b4a069715ee7,pptx,66219,,2020-10-18 19:03:31,1488,719,769 B8B14D3245A6F6C60C273C736FD7EB46,pdf,55124,,2020-10-18 19:02:43,1488,719,769 a93bf90fe8e67b1cda5b56b709588126,xlsx,9861,,2020-10-18 19:02:51,1488,719,769 15dc3167f2b852ce6946112ce29b4388,pptx,364470,,2020-10-18 19:02:50,1488,719,769 9E50772FA4D68E4E0800A0E469D3C83A,xlsx,11275,,2020-10-18 19:01:48,1488,719,769 EC749C75A5C6978143A85620ED9FD155,docx,16978,,2020-10-18 19:01:51,1488,719,769 D92DF1D5E554BCF7E1E49DC70313A939,xlsx,2669340,,2020-10-18 19:00:14,1488,719,769 ACB3950E8C4A51BC0D48D91ACA3B1191,pptx,3007531,,2020-10-18 19:00:28,1488,719,769 3176A266A87CA06E2B022152DBB2854B,pptx,3144658,,2020-10-18 19:00:30,1488,719,769 e626e31f2ac0058e92a1f0d06db0f6c6,pptx,14210440,,2020-10-18 19:01:20,1488,719,769 7CCEEC736E75D93A1D653F98AFE58D49,xlsx,30290,,2020-10-18 19:00:08,1488,719,769 F8583CC40EB3C4D1EF92D221A43AD1DD,docx,102265,,2020-10-18 19:01:04,1488,719,769 4DF67EB107895BB6B8EE57F0896F853F,pptx,7281561,,2020-10-18 19:00:52,1488,719,769
F640E8C2D75750B66A8661FFCBB688AD xlsx 13300
F640E8C2D75750B66A8661FFCBB688AD xlsx 13300
三、处理脚本
cat total_page_time_2020-10-19.csv | grep -E 'csv|xls|et|xlt|wps' | sort -t ',' -k 3,3 -u -n > xls.csv # 处理后的数据 cat xls.csv a93bf90fe8e67b1cda5b56b709588126,xlsx,9861,,2020-10-18 19:02:51,1488,719,769 939EE932559677149FF9EDEA3EEDCC55,xlsx,10177,,2020-10-18 19:03:52,1488,719,769 9E50772FA4D68E4E0800A0E469D3C83A,xlsx,11275,,2020-10-18 19:01:48,1488,719,769 F640E8C2D75750B66A8661FFCBB688AD,xlsx,13300,,2020-10-18 19:02:05,1488,719,769 7CCEEC736E75D93A1D653F98AFE58D49,xlsx,30290,,2020-10-18 19:00:08,1488,719,769 D92DF1D5E554BCF7E1E49DC70313A939,xlsx,2669340,,2020-10-18 19:00:14,1488,719,769
四、总结
1、grep -E 指定匹配多个字符串
2、sort 小知识, -k和-u配合使用才能去重
-t 指定列之间的分隔符;
-k 从第几列到第几列作为去重标准;
-u 去重
-n 升序
-r 降序
3、如果只有一列去重,用uniq即可
uniq tmp.csv # 以下可以去重复 F640E8C2D75750B66A8661FFCBB688AD 13300 F640E8C2D75750B66A8661FFCBB688AD 13300 # 以下不可以去重复 F640E8C2D75750B66A8661FFCBB688AD 13301 F640E8C2D75750B66A8661FFCBB688AD 13300
五、sort help
sort --help Usage: sort [OPTION]... [FILE]... or: sort [OPTION]... --files0-from=F Write sorted concatenation of all FILE(s) to standard output. Mandatory arguments to long options are mandatory for short options too. Ordering options: -b, --ignore-leading-blanks ignore leading blanks -d, --dictionary-order consider only blanks and alphanumeric characters -f, --ignore-case fold lower case to upper case characters -g, --general-numeric-sort compare according to general numerical value -i, --ignore-nonprinting consider only printable characters -M, --month-sort compare (unknown) < `JAN' < ... < `DEC' -h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G) -n, --numeric-sort compare according to string numerical value -R, --random-sort sort by random hash of keys --random-source=FILE get random bytes from FILE -r, --reverse reverse the result of comparisons --sort=WORD sort according to WORD: general-numeric -g, human-numeric -h, month -M, numeric -n, random -R, version -V -V, --version-sort natural sort of (version) numbers within text Other options: --batch-size=NMERGE merge at most NMERGE inputs at once; for more use temp files -c, --check, --check=diagnose-first check for sorted input; do not sort -C, --check=quiet, --check=silent like -c, but do not report first bad line --compress-program=PROG compress temporaries with PROG; decompress them with PROG -d --files0-from=F read input from the files specified by NUL-terminated names in file F; If F is - then read names from standard input -k, --key=POS1[,POS2] start a key at POS1 (origin 1), end it at POS2 (default end of line) -m, --merge merge already sorted files; do not sort -o, --output=FILE write result to FILE instead of standard output -s, --stable stabilize sort by disabling last-resort comparison -S, --buffer-size=SIZE use SIZE for main memory buffer -t, --field-separator=SEP use SEP instead of non-blank to blank transition -T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories -u, --unique with -c, check for strict ordering; without -c, output only the first of an equal run -z, --zero-terminated end lines with 0 byte, not newline --help display this help and exit --version output version information and exit POS is F[.C][OPTS], where F is the field number and C the character position in the field; both are origin 1. If neither -t nor -b is in effect, characters in a field are counted from the beginning of the preceding whitespace. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key. SIZE may be followed by the following multiplicative suffixes: % 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y. With no FILE, or when FILE is -, read standard input. *** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values. Report sort bugs to bug-coreutils@gnu.org GNU coreutils home page: <http://www.gnu.org/software/coreutils/> General help using GNU software: <http://www.gnu.org/gethelp/> For complete documentation, run: info coreutils 'sort invocation'