sort按列去重并排序

没有数据分析师,你才会学到的。

一、任务
处理一个文件信息内容,对全部的excel文件,按照size列去重,然后按照size升序,最后输出去重排序后的完整内容。

二、文件内容
fid,ext,size,action,date,time1,time2,time3

cat total_page_time_2020-10-19.csv

F640E8C2D75750B66A8661FFCBB688AD,xlsx,13300,,2020-10-18 19:02:05,1488,719,769
F640E8C2D75750B66A8661FFCBB688AD,xlsx,13300,,2020-10-18 19:02:17,1488,719,769
F640E8C2D75750B66A8661FFCBB688AD,xlsx,13300,,2020-10-18 19:02:30,1488,719,769
99A94A4E27B0C6994710C598CD785CD5,doc,1247744,,2020-10-18 19:04:46,1488,719,769
939EE932559677149FF9EDEA3EEDCC55,xlsx,10177,,2020-10-18 19:03:52,1488,719,769
6297EF1A95AFEFA283AE4A7560EC4E14,docx,31186,,2020-10-18 19:04:06,1488,719,769
0cc59242461bb3999fc2d72412a509e9,pptx,62301,,2020-10-18 19:03:31,1488,719,769
a456d95e5c6fd3e7ca44b4a069715ee7,pptx,66219,,2020-10-18 19:03:31,1488,719,769
B8B14D3245A6F6C60C273C736FD7EB46,pdf,55124,,2020-10-18 19:02:43,1488,719,769
a93bf90fe8e67b1cda5b56b709588126,xlsx,9861,,2020-10-18 19:02:51,1488,719,769
15dc3167f2b852ce6946112ce29b4388,pptx,364470,,2020-10-18 19:02:50,1488,719,769
9E50772FA4D68E4E0800A0E469D3C83A,xlsx,11275,,2020-10-18 19:01:48,1488,719,769
EC749C75A5C6978143A85620ED9FD155,docx,16978,,2020-10-18 19:01:51,1488,719,769
D92DF1D5E554BCF7E1E49DC70313A939,xlsx,2669340,,2020-10-18 19:00:14,1488,719,769
ACB3950E8C4A51BC0D48D91ACA3B1191,pptx,3007531,,2020-10-18 19:00:28,1488,719,769
3176A266A87CA06E2B022152DBB2854B,pptx,3144658,,2020-10-18 19:00:30,1488,719,769
e626e31f2ac0058e92a1f0d06db0f6c6,pptx,14210440,,2020-10-18 19:01:20,1488,719,769
7CCEEC736E75D93A1D653F98AFE58D49,xlsx,30290,,2020-10-18 19:00:08,1488,719,769
F8583CC40EB3C4D1EF92D221A43AD1DD,docx,102265,,2020-10-18 19:01:04,1488,719,769
4DF67EB107895BB6B8EE57F0896F853F,pptx,7281561,,2020-10-18 19:00:52,1488,719,769

F640E8C2D75750B66A8661FFCBB688AD xlsx 13300
F640E8C2D75750B66A8661FFCBB688AD xlsx 13300

三、处理脚本

cat total_page_time_2020-10-19.csv | grep  -E 'csv|xls|et|xlt|wps' | sort  -t ',' -k 3,3 -u -n > xls.csv

# 处理后的数据
cat xls.csv

a93bf90fe8e67b1cda5b56b709588126,xlsx,9861,,2020-10-18 19:02:51,1488,719,769
939EE932559677149FF9EDEA3EEDCC55,xlsx,10177,,2020-10-18 19:03:52,1488,719,769
9E50772FA4D68E4E0800A0E469D3C83A,xlsx,11275,,2020-10-18 19:01:48,1488,719,769
F640E8C2D75750B66A8661FFCBB688AD,xlsx,13300,,2020-10-18 19:02:05,1488,719,769
7CCEEC736E75D93A1D653F98AFE58D49,xlsx,30290,,2020-10-18 19:00:08,1488,719,769
D92DF1D5E554BCF7E1E49DC70313A939,xlsx,2669340,,2020-10-18 19:00:14,1488,719,769

四、总结
1、grep -E 指定匹配多个字符串

2、sort 小知识, -k和-u配合使用才能去重
-t 指定列之间的分隔符;
-k 从第几列到第几列作为去重标准;
-u 去重
-n 升序
-r 降序

3、如果只有一列去重,用uniq即可

uniq tmp.csv

# 以下可以去重复
F640E8C2D75750B66A8661FFCBB688AD 13300
F640E8C2D75750B66A8661FFCBB688AD 13300

# 以下不可以去重复
F640E8C2D75750B66A8661FFCBB688AD 13301
F640E8C2D75750B66A8661FFCBB688AD 13300

五、sort help

sort --help
Usage: sort [OPTION]... [FILE]...
  or:  sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.

Mandatory arguments to long options are mandatory for short options too.
Ordering options:

  -b, --ignore-leading-blanks  ignore leading blanks
  -d, --dictionary-order      consider only blanks and alphanumeric characters
  -f, --ignore-case           fold lower case to upper case characters
  -g, --general-numeric-sort  compare according to general numerical value
  -i, --ignore-nonprinting    consider only printable characters
  -M, --month-sort            compare (unknown) < `JAN' < ... < `DEC'
  -h, --human-numeric-sort    compare human readable numbers (e.g., 2K 1G)
  -n, --numeric-sort          compare according to string numerical value
  -R, --random-sort           sort by random hash of keys
      --random-source=FILE    get random bytes from FILE
  -r, --reverse               reverse the result of comparisons
      --sort=WORD             sort according to WORD:
                                general-numeric -g, human-numeric -h, month -M,
                                numeric -n, random -R, version -V
  -V, --version-sort          natural sort of (version) numbers within text

Other options:

      --batch-size=NMERGE   merge at most NMERGE inputs at once;
                            for more use temp files
  -c, --check, --check=diagnose-first  check for sorted input; do not sort
  -C, --check=quiet, --check=silent  like -c, but do not report first bad line
      --compress-program=PROG  compress temporaries with PROG;
                              decompress them with PROG -d
      --files0-from=F       read input from the files specified by
                            NUL-terminated names in file F;
                            If F is - then read names from standard input
  -k, --key=POS1[,POS2]     start a key at POS1 (origin 1), end it at POS2
                            (default end of line)
  -m, --merge               merge already sorted files; do not sort
  -o, --output=FILE         write result to FILE instead of standard output
  -s, --stable              stabilize sort by disabling last-resort comparison
  -S, --buffer-size=SIZE    use SIZE for main memory buffer
  -t, --field-separator=SEP  use SEP instead of non-blank to blank transition
  -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp;
                              multiple options specify multiple directories
  -u, --unique              with -c, check for strict ordering;
                              without -c, output only the first of an equal run
  -z, --zero-terminated     end lines with 0 byte, not newline
      --help     display this help and exit
      --version  output version information and exit

POS is F[.C][OPTS], where F is the field number and C the character position
in the field; both are origin 1.  If neither -t nor -b is in effect, characters
in a field are counted from the beginning of the preceding whitespace.  OPTS is
one or more single-letter ordering options, which override global ordering
options for that key.  If no key is given, use the entire line as the key.

SIZE may be followed by the following multiplicative suffixes:
% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

With no FILE, or when FILE is -, read standard input.

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.

Report sort bugs to bug-coreutils@gnu.org
GNU coreutils home page: <http://www.gnu.org/software/coreutils/>
General help using GNU software: <http://www.gnu.org/gethelp/>
For complete documentation, run: info coreutils 'sort invocation'

发表评论

电子邮件地址不会被公开。 必填项已用*标注