「Crawling Front-End Pages with Java」Java Crawler Projects
This article covers crawling front-end pages with Java, along with the related topic of Java crawler projects. Hopefully it is useful to you.
In this article:
- 1. How do you implement a web crawler in Java?
- 2. How can Java receive front-end data when the condition object has no get and set methods?
- 3. How does a Java server receive JSON data sent from the front end?
- 4. How do you receive, process, and return video from the front end in Java?
How do you implement a web crawler in Java?
A web crawler is a program that automatically fetches web pages; it downloads pages from the World Wide Web for a search engine and is a core component of one. A traditional crawler starts from the URLs of one or more seed pages, extracts the URLs found on those pages, and while crawling keeps pulling new URLs from the current page into a queue until some stop condition is met. For vertical search, a focused crawler, that is, one that only fetches pages on a specific topic, is the better fit.

Below is the core code of a simple crawler implemented in Java:

// Requires java.util.*, java.util.regex.*, java.io.*, java.net.URL,
// and Apache HttpClient 4.x (org.apache.http.*) on the classpath.
public void crawl() throws Throwable {
    while (continueCrawling()) {
        CrawlerUrl url = getNextUrl(); // take the next URL from the crawl queue
        if (url != null) {
            printCrawlInfo();
            String content = getContent(url); // download the page body

            // A focused crawler only keeps pages relevant to its topic;
            // a simple regex match is used here.
            if (isContentRelevant(content, this.regexpSearchPattern)) {
                saveContent(url, content); // save the page locally

                // extract the links in the page and add them to the crawl queue
                Collection<String> urlStrings = extractUrls(content, url);
                addUrlsToUrlQueue(url, urlStrings);
            } else {
                System.out.println(url + " is not relevant, ignoring ...");
            }

            // throttle requests so the target site does not block us
            Thread.sleep(this.delayBetweenUrls);
        }
    }
    closeOutputStream();
}

private CrawlerUrl getNextUrl() throws Throwable {
    CrawlerUrl nextUrl = null;
    while ((nextUrl == null) && (!urlQueue.isEmpty())) {
        CrawlerUrl crawlerUrl = this.urlQueue.remove();
        // doWeHavePermissionToVisit: may we visit this URL? A polite crawler honors
        //   the rules the site publishes in "robots.txt".
        // isUrlAlreadyVisited: has the URL been visited? Large search engines usually
        //   deduplicate with a Bloom filter; a HashMap is enough here.
        // isDepthAcceptable: has the depth limit been reached? Crawlers generally work
        //   breadth-first; some sites build crawler traps (auto-generated dead links
        //   that send a crawler into an endless loop), which a depth limit avoids.
        if (doWeHavePermissionToVisit(crawlerUrl)
                && (!isUrlAlreadyVisited(crawlerUrl))
                && isDepthAcceptable(crawlerUrl)) {
            nextUrl = crawlerUrl;
            // System.out.println("Next url to be visited is " + nextUrl);
        }
    }
    return nextUrl;
}

private String getContent(CrawlerUrl url) throws Throwable {
    // HttpClient 4.1 is invoked differently from earlier versions
    HttpClient client = new DefaultHttpClient();
    HttpGet httpGet = new HttpGet(url.getUrlString());
    StringBuffer strBuf = new StringBuffer();
    HttpResponse response = client.execute(httpGet);
    if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) {
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(entity.getContent(), "UTF-8"));
            String line = null;
            if (entity.getContentLength() > 0) {
                strBuf = new StringBuffer((int) entity.getContentLength());
                while ((line = reader.readLine()) != null) {
                    strBuf.append(line);
                }
            }
        }
        if (entity != null) {
            entity.consumeContent();
        }
    }
    // mark the URL as visited
    markUrlAsVisited(url);
    return strBuf.toString();
}

public static boolean isContentRelevant(String content,
        Pattern regexpPattern) {
    boolean retValue = false;
    if (content != null) {
        // does the page match the topic regex?
        Matcher m = regexpPattern.matcher(content.toLowerCase());
        retValue = m.find();
    }
    return retValue;
}

public List<String> extractUrls(String text, CrawlerUrl crawlerUrl) {
    Map<String, String> urlMap = new HashMap<String, String>();
    extractHttpUrls(urlMap, text);
    extractRelativeUrls(urlMap, text, crawlerUrl);
    return new ArrayList<String>(urlMap.keySet());
}

private void extractHttpUrls(Map<String, String> urlMap, String text) {
    // httpRegexp is a Pattern field; its definition is elided in the original
    Matcher m = httpRegexp.matcher(text);
    while (m.find()) {
        String url = m.group();
        String[] terms = url.split("a href=\"");
        for (String term : terms) {
            // System.out.println("Term = " + term);
            if (term.startsWith("http")) {
                int index = term.indexOf("\"");
                if (index > 0) {
                    term = term.substring(0, index);
                }
                urlMap.put(term, term);
                System.out.println("Hyperlink: " + term);
            }
        }
    }
}

private void extractRelativeUrls(Map<String, String> urlMap, String text,
        CrawlerUrl crawlerUrl) {
    Matcher m = relativeRegexp.matcher(text);
    URL textURL = crawlerUrl.getURL();
    String host = textURL.getHost();
    while (m.find()) {
        String url = m.group();
        String[] terms = url.split("a href=\"");
        for (String term : terms) {
            if (term.startsWith("/")) {
                int index = term.indexOf("\"");
                if (index > 0) {
                    term = term.substring(0, index);
                }
                // resolve the relative path against the page's host
                String s = "http://" + host + term;
                urlMap.put(s, s);
                System.out.println("Relative url: " + s);
            }
        }
    }
}

public static void main(String[] args) {
    try {
        String url = ""; // seed URL (elided in the original)
        Queue<CrawlerUrl> urlQueue = new LinkedList<CrawlerUrl>();
        String regexp = "java";
        urlQueue.add(new CrawlerUrl(url, 0));
        NaiveCrawler crawler = new NaiveCrawler(urlQueue, 100, 5, 1000L,
                regexp);
        // boolean allowCrawl = crawler.areWeAllowedToVisit(url);
        // System.out.println("Allowed to crawl: " + url + " " + allowCrawl);
        crawler.crawl();
    } catch (Throwable t) {
        System.out.println(t.toString());
        t.printStackTrace();
    }
}
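The link-extraction methods above reference httpRegexp and relativeRegexp, two Pattern fields whose definitions the original omits. A minimal sketch of how they might be declared; the exact expressions are assumptions, chosen to match the split("a href=\"") calls in the code:

// Hypothetical Pattern fields assumed by extractHttpUrls / extractRelativeUrls;
// the original article does not show their definitions.
private static final Pattern httpRegexp =
        Pattern.compile("<a href=\"http[^>]*>", Pattern.CASE_INSENSITIVE);
private static final Pattern relativeRegexp =
        Pattern.compile("<a href=\"/[^>]*>", Pattern.CASE_INSENSITIVE);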
How can Java receive front-end data when the condition object has no get and set methods?
Use @RequestBody. When @RequestBody receives data from the front end, the front end cannot submit the data with a GET request, because the JSON has to travel in the request body; and a handler method may declare only one @RequestBody parameter, although it can be combined with @PathVariable.
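A minimal sketch of this, assuming Spring MVC; QueryCondition and its keyword field are hypothetical. Because @RequestBody binds through the JSON message converter, the condition class does not need getters or setters as long as its fields are visible to Jackson (public here):

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ConditionController {

    // the front end POSTs a JSON body such as {"keyword": "java"}
    @PostMapping("/query")
    public String query(@RequestBody QueryCondition condition) {
        // the message converter fills the fields directly from the JSON body
        return "received: " + condition.keyword;
    }
}

class QueryCondition {
    public String keyword; // hypothetical field; public so Jackson can bind it without a setter
}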
How does a Java server receive JSON data sent from the front end?
You can do it with the JSONObject class.
First, download the following jars and put them in your project (skip any you already have):
1. commons-lang.jar
2. commons-beanutils.jar
3. commons-collections.jar
4. commons-logging.jar
5. ezmorph.jar
6. json-lib-2.2.2-jdk15.jar
For data in a form like yours, go through JSONArray, e.g.:
JSONArray datasJson = JSONArray.fromObject(datas); // it is safest to call toString() on datas first
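A minimal end-to-end sketch, assuming a plain servlet and json-lib (net.sf.json); the field names "name" and "items" are hypothetical:

import java.io.BufferedReader;
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import net.sf.json.JSONArray;
import net.sf.json.JSONObject;

public class JsonServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // read the raw JSON body sent by the front end
        StringBuilder body = new StringBuilder();
        BufferedReader reader = req.getReader();
        String line;
        while ((line = reader.readLine()) != null) {
            body.append(line);
        }
        // parse the body and pull out the fields
        JSONObject json = JSONObject.fromObject(body.toString());
        String name = json.getString("name");         // hypothetical field
        JSONArray items = json.getJSONArray("items"); // hypothetical field
        resp.getWriter().write("received " + items.size() + " items for " + name);
    }
}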
How do you receive, process, and return video from the front end in Java?
1. Receiving multiple uploaded files through MultipartHttpServletRequest
/**
 * Receive multiple files
 */
@RequestMapping("/upload")
public R uploadFile(@RequestParam Map<String, Object> params, HttpServletRequest request) {
    // cast to the multipart-aware request type
    MultipartHttpServletRequest mRequest = (MultipartHttpServletRequest) request;
    // iterate over the names of the uploaded files
    Iterator<String> files = mRequest.getFileNames();
    while (files.hasNext()) {
        // fetch the uploaded file object by name
        MultipartFile mFile = mRequest.getFile(files.next());
        if (mFile != null) {
            // original file name
            String oldfile = mFile.getOriginalFilename();
            // suffix including the dot, e.g. ".mp4"
            String suffix = oldfile.substring(oldfile.lastIndexOf('.'));
            // suffix without the dot, e.g. "mp4"
            String suffix2 = oldfile.substring(oldfile.lastIndexOf('.') + 1);
            /*************** process the file *********************/
        }
    }
    return null; // the original elides the return value; R is the project's response wrapper
}
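Where the "process the file" placeholder sits, the usual next step is to persist the upload. A minimal sketch of that step; the /data/uploads directory is a hypothetical example:

// A minimal sketch of saving an upload to disk; the directory path is a
// hypothetical example. transferTo may throw IOException or IllegalStateException.
import java.io.File;
import java.io.IOException;
import org.springframework.web.multipart.MultipartFile;

public class UploadStore {
    public static File save(MultipartFile mFile, String suffix) throws IOException {
        File target = new File("/data/uploads", System.currentTimeMillis() + suffix);
        target.getParentFile().mkdirs(); // ensure the upload directory exists
        mFile.transferTo(target);        // Spring streams the upload to the target file
        return target;
    }
}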
2. Receiving uploads through CommonsMultipartResolver
/**
 * Receive attachments
 * @param request
 */
@ResponseBody
@RequestMapping(value = "fileupload", method = RequestMethod.POST)
public void springUpload(HttpServletRequest request) {
    // initialize a CommonsMultipartResolver (multipart resolver) with the current
    // context; it requires Apache Commons FileUpload on the classpath
    CommonsMultipartResolver multipartResolver = new CommonsMultipartResolver(
            request.getSession().getServletContext());
    // check whether the form was submitted with enctype="multipart/form-data"
    if (multipartResolver.isMultipart(request)) {
        // cast the request to a multipart request
        MultipartHttpServletRequest multiRequest = (MultipartHttpServletRequest) request;
        // get the names of all files in multiRequest
        Iterator<String> iter = multiRequest.getFileNames();
        while (iter.hasNext()) {
            // iterate over the files one by one
            MultipartFile file = multiRequest.getFile(iter.next());
            // original name of the uploaded file
            String oldFilename = file.getOriginalFilename();
            // suffix of the original name
            String fileSuffix = oldFilename.substring(oldFilename.lastIndexOf(".") + 1);
            /*************** process the file *********************/
        }
    }
}
3. Receiving a single upload as a MultipartFile parameter
/**
 * Receive a file
 *
 * @param file
 * @throws IOException
 * @throws IllegalStateException
 */
@RequestMapping(value = "imageupload")
public void imageUpload(MultipartFile file) throws IllegalStateException, IOException {
    // original file name
    String realFileName = file.getOriginalFilename();
    // file suffix
    String suffix = realFileName.substring(realFileName.lastIndexOf(".") + 1);
    /*************** process the file *********************/
}
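The three snippets above only cover receiving the upload. For the "return video" half of the question, one straightforward option is to stream the stored file back through the response. A minimal sketch, assuming the video was saved to a known path (the path here is hypothetical) and ignoring HTTP Range requests, which real video players often need:

// A minimal sketch of streaming a stored video back to the front end.
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.servlet.http.HttpServletResponse;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class VideoController {

    @GetMapping("/video")
    public void video(HttpServletResponse response) throws IOException {
        File video = new File("/data/uploads/demo.mp4"); // hypothetical stored file
        response.setContentType("video/mp4");
        response.setContentLengthLong(video.length());
        // copy the file bytes straight into the response stream
        Files.copy(video.toPath(), response.getOutputStream());
        response.flushBuffer();
    }
}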
That is all for this introduction to crawling front-end pages with Java. Thanks for taking the time to read; more information on Java crawler projects and crawling front-end pages with Java can be found elsewhere on this site.
Published 2022-12-15. Unless otherwise noted, this is an original article; please credit the source when reposting.