Taking the QQ database as the example:
The source data arrives as an unordered plain-text (txt) file, one record per line in the form qq,phone, comma-separated.
Step 1: Split, sort, and merge the source file, using the qq field as the key (phone works too).
I wrote my own script for this step (the Python source is at the end of this post); it is meant to be used together with EmEditor.
Use EmEditor to split the source file by line count, 75 million lines per file, which gives roughly 10 files.
Merge-sort these ten files; what you need at the end is a single sorted source file.
Step 2: Use EmEditor to split the sorted source file by line count again, this time 1 million lines per file, giving roughly 720 files.
These files are the data sources for the tables created later: one file per table, so 720 tables in total. (If you would rather not use EmEditor for this step, a small Python sketch of the split follows below.)
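The split can also be done with a short Python script. The sketch below is my own illustration, not part of the original workflow; it assumes the sorted source file is named MargedFileOutPut.txt and writes chunks named MargedFileOutPut_<n>.txt into an OutPut directory, matching the file names the step-3 shell script expects.

import os

def split_sorted_file(src="MargedFileOutPut.txt", out_dir="OutPut", lines_per_file=1_000_000):
    """Split the sorted source file into 1,000,000-line chunk files (assumed names/paths)."""
    os.makedirs(out_dir, exist_ok=True)
    index, buf = 1, []
    with open(src, "r") as f:
        for line in f:
            buf.append(line)
            if len(buf) == lines_per_file:
                # One chunk per future table, named the way the step-3 script expects.
                with open(os.path.join(out_dir, f"MargedFileOutPut_{index}.txt"), "w") as out:
                    out.writelines(buf)
                buf, index = [], index + 1
    if buf:  # write the final, partial chunk
        with open(os.path.join(out_dir, f"MargedFileOutPut_{index}.txt"), "w") as out:
            out.writelines(buf)

split_sorted_file()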
Step 3: Batch-create the database and import the data.
First create a lookup table used to find the right data table by key, with the fields database_name, begin, end (a sketch of its creation is shown below).
begin and end hold the first and last key of each sorted table.
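For reference, a minimal sketch of creating this lookup table, which the step-3 script inserts into as qq_database_index. This is my own illustration: it assumes the pymysql package, the connection parameters used in the shell script below, and my own choice of column types and primary key; fill in the password and database name just as in that script.

import pymysql

DB_NAME = ""  # fill in, same as DB_NAME in the shell script below

conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", password="", database=DB_NAME)
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS qq_database_index (
            database_name VARCHAR(64) NOT NULL,  -- name of a per-chunk table, e.g. qq_database_1
            begin BIGINT UNSIGNED NOT NULL,      -- smallest qq in that table
            end   BIGINT UNSIGNED NOT NULL,      -- largest qq in that table
            PRIMARY KEY (database_name)
        ) ENGINE = MyISAM
    """)
conn.commit()
conn.close()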
Taking a Linux system as the example: batch-create the database and its compressed tables (compressing the tables reduces disk usage and speeds up queries).
Note that compressed tables cannot be modified afterwards. The shell script is pasted below; it assumes some basic shell knowledge and will need adapting to your environment.
#!/bin/bash
# @download: www.8gws.com
# Fill in PASSWD and DB_NAME before running. The chunk files must be readable by
# MySQL's LOAD DATA INFILE (on many installs that means the secure_file_priv
# directory, e.g. /var/lib/mysql-files).
index=1
USER_NAME="root"
PASSWD=""
DB_NAME=""
HOST_NAME="127.0.0.1"
DB_PORT="3306"
endIndex=720
MYSQL_ETL="mysql -h${HOST_NAME} -P${DB_PORT} -u${USER_NAME} -p${PASSWD} ${DB_NAME} -s -e"

for ((i=$index; i<=$endIndex; i++))
do
    table_name="qq_database_"$i""
    database_path="/var/lib/mysql-files/qq_database/MargedFileOutPut_"$i".txt"
    times=$(date "+%Y-%m-%d %H:%M:%S")
    echo "[${times}] Insert Data ${table_name}"

    # One MyISAM table per chunk: qq is the primary key, phone gets a BTREE secondary index.
    create_table="CREATE TABLE ${table_name} (qq bigint UNSIGNED NOT NULL, phone bigint UNSIGNED NOT NULL, PRIMARY KEY (qq), INDEX phone_index(phone) USING BTREE) ENGINE = MyISAM;"
    exec_create_table=$($MYSQL_ETL "${create_table}")

    # Bulk-load the comma-separated qq,phone chunk file.
    load_data="LOAD DATA INFILE '${database_path}' REPLACE INTO TABLE ${table_name} FIELDS TERMINATED BY ',' enclosed by '' lines terminated by '\n' (qq,phone);"
    exec_load_data=$($MYSQL_ETL "${load_data}")

    # Record this table's smallest and largest qq in the lookup table qq_database_index.
    query_begin="select * from ${table_name} limit 1;"
    query_end="select * from ${table_name} order by qq desc limit 1;"
    query_begin_done=$($MYSQL_ETL "${query_begin}")
    query_end_done=$($MYSQL_ETL "${query_end}")
    array=(${query_begin_done// / })
    begin=${array[0]}
    array=(${query_end_done// / })
    end=${array[0]}
    insert_index="INSERT INTO qq_database_index (database_name, begin, end) VALUES ('${table_name}',${begin},${end});"
    insert_index_done=$($MYSQL_ETL "${insert_index}")

    # pack: compress the MyISAM table (read-only afterwards) and rebuild its indexes.
    # Adjust the path below to your own MySQL data directory / database name.
    myisampack /var/lib/mysql/bind_search_service/${table_name}
    myisamchk -rq /var/lib/mysql/bind_search_service/${table_name}

    # update
    # remove file > /boot/bigfile
    rm ${database_path}

    times=$(date "+%Y-%m-%d %H:%M:%S")
    echo "[${times}] Insert Data ${table_name} Done!"
done
Step 4: Refresh the tables.
After the script finishes, run flush tables; so that MySQL picks up the packed table files.
Step 5: Querying.
First query the lookup table with begin <= key <= end; the matching row gives you the name of the table that holds the key.
Then run a second query against that table: SELECT * FROM <database_name> WHERE qq = <key>;
Thanks to the split tables and the indexes, queries are fast and the data footprint is small: a primary-key lookup takes roughly 0.05 s or less. Querying by the phone index, on the other hand, means querying every split table.
Write a loop that builds each table name in turn and handles the lookup logic; even then a query takes roughly 0.5 s or less (a sketch of both lookups follows below).
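To make step 5 concrete, here is a small sketch of both lookups. It is my own illustration, not code from the original post: it assumes pymysql, the connection parameters used earlier, and tables named qq_database_1 through qq_database_720 as created by the step-3 script; the function names are mine.

import pymysql

DB_NAME = ""  # fill in, same as in the shell script

conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", password="", database=DB_NAME)

def find_phone_by_qq(qq):
    """Forward lookup: locate the table via the range lookup table, then hit its primary key."""
    with conn.cursor() as cur:
        cur.execute("SELECT database_name FROM qq_database_index WHERE begin <= %s AND end >= %s", (qq, qq))
        row = cur.fetchone()
        if row is None:
            return None
        table_name = row[0]
        cur.execute(f"SELECT phone FROM {table_name} WHERE qq = %s", (qq,))
        hit = cur.fetchone()
        return hit[0] if hit else None

def find_qq_by_phone(phone, table_count=720):
    """Reverse lookup: the phone index exists per table, so every split table is queried."""
    matches = []
    with conn.cursor() as cur:
        for i in range(1, table_count + 1):
            cur.execute(f"SELECT qq FROM qq_database_{i} WHERE phone = %s", (phone,))
            matches.extend(row[0] for row in cur.fetchall())
    return matches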
Finally, here is the Python source for the merge sort used in step 1:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time : 2021/3/9 10:12
# @Author : Smida
# @FileName: sortDatabase.py
# @download: www.8gws.com
import os
import time


class SortDatabaseManager():
    dataPath = "E:\\ariDownload\\裤子\\q绑\\qqSearch_split_6\\OutPut"  # data directory
    dataFiles = [i for i in os.listdir(dataPath) if i[-3::] == 'txt']  # all txt file names in the directory
    theQQMaxMap = {}
    theSplitFlag = ','
    theDataPosition = 0
    timeScale = 0

    @staticmethod
    def printLog(msg):
        print(f"[{time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())}] -> {msg}")

    @staticmethod
    def caculateTimeSpan(fileSize, timeScale):
        # Estimate processing time from file size and the measured MB-per-second rate.
        return fileSize / timeScale if timeScale else "Null"

    @staticmethod
    def getFileSize(filePath):
        # File size in MB, rounded to two decimals.
        return round(os.path.getsize(filePath) / float(1024 * 1024), 2)

    def sortFile(self, path, chunk):
        # Split one large file into sorted chunks of roughly `chunk` bytes each.
        self.printLog(f"Splitting file {path}\n chunk size {chunk}")
        baseDir, baseFile = os.path.split(path)
        fileIndex = 1
        files = []
        with open(path, 'r') as f:
            while True:
                lines = f.readlines(chunk)
                lines.sort(key=lambda x: int(x.split(",")[0]))
                if lines:
                    newFileName = os.path.join(baseDir, f"{baseFile[1:-4]}_{fileIndex}.txt")
                    with open(newFileName, 'a') as sf:
                        sf.write(''.join(lines))
                    files.append(newFileName)
                    fileIndex += 1
                else:
                    break
        return files

    def mergeFiles(self, fileList: list, filePath: str) -> str:
        """
        k-way merge of already-sorted files.
        :param fileList: a list of file absolute paths
        :return: the merged file's absolute path
        """
        self.printLog(f"Merging files, output (overwrite) to {filePath}")
        fs = [open(file_, 'r') for file_ in fileList]
        tempDict = {}
        mergedFile = open(filePath, 'w+')
        for f in fs:
            initLine = f.readline()
            if initLine:
                tempDict[f] = initLine
        while tempDict:
            # Pick the file whose current line has the smallest qq and write that line out.
            min_item = min(tempDict.items(), key=lambda x: int(x[1].split(",")[0]))
            mergedFile.write(min_item[1])
            nextLine = min_item[0].readline()
            if nextLine:
                tempDict[min_item[0]] = nextLine
            else:
                del tempDict[min_item[0]]
                min_item[0].close()
        mergedFile.close()
        for file_ in fileList:
            self.printLog(f"Removing temporary chunk file {file_}")
            os.remove(file_)
        return os.path.join(filePath)

    def getFilePaths(self):
        pathList = []
        for fileName in self.dataFiles:
            pathList.append(f"{self.dataPath}\\{fileName}")
        return pathList

    def setTimeScale(self, fileSize, timeSpan):
        self.timeScale = fileSize // timeSpan

    # Sort every file in the directory (split + merge), then merge all sorted files into one output.
    def startSortFile(self):
        allStartTime = time.time()
        filePathList = []
        for fileName in self.dataFiles:
            filePath = f"{self.dataPath}\\{fileName}"
            if fileName == "qqSearch_1.txt":
                continue
            fileSize = self.getFileSize(filePath)
            startTime = time.time()
            self.printLog(f"Processing file {fileName}, estimated time: {self.caculateTimeSpan(fileSize, self.timeScale)}s")
            self.mergeFiles(self.sortFile(filePath, 1024 * 1024 * 500), filePath)
            endTime = time.time()
            self.setTimeScale(fileSize, endTime - startTime)
        self.printLog("Starting the final merge...")
        for i in self.dataFiles:
            filePathList.append(f"{self.dataPath}\\{i}")
        self.mergeFiles(filePathList, "MargedFileOutPut.txt")
        allEndTime = time.time()
        self.printLog(f"Done! Elapsed {allEndTime - allStartTime}s")


oj = SortDatabaseManager()
path = oj.startSortFile()

As with the QQ database, any data whose keys fit in a bigint can be handled with the same approach; it is worth trying when the server hardware is limited.