The Web Extractor API is a powerful tool designed to extract clean, structured text from web pages. With its two dedicated endpoints, it lets users retrieve meaningful content free of ads, navigation elements, and other extraneous details. In addition, its markdown conversion endpoint turns web pages into structured markdown documents, ideal for blogging, content management, and integration with markdown-based platforms. The API works with both static and dynamic pages and adapts to complex web structures to ensure consistently high-quality results.
To use this endpoint, send a request containing the URL of a web page and receive the clean text extracted from that page's content.
Retrieve Clean Text - Endpoint Features
| Object | Description |
|---|---|
| Request Body | [Required] JSON |
{"response":"Spark Basics\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. Distributed Computing removes this limitation of vertical scaling by distributing the processing across cluster of machines. Now, a group of machines alone is not powerful, you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\nSpark Basics\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\nSpark timeline\nGoogle was first to introduce large scale distributed computing solution with MapReduce and its own distributed file system i.e., Google File System(GFS). GFS provided a blueprint for the Hadoop File System (HDFS), including the MapReduce implementation as a framework for distributed computing. Apache Hadoop framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the Spark. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. It incorporates libraries with composable APIs for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\nSpark Application\nSpark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster. The executors are responsible for actually carrying out the work that the driver assigns them. 
The driver and executors are simply processes, which means that they can live on the same machine or different machines.\nThere is a SparkSession object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\nSpark’s language APIs make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the low-level “unstructured” APIs (RDDs), and the higher-level structured APIs (Dataframes, Datasets).\nSpark Toolsets\nA DataFrame is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster.\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a transformation and if it doesn’t return anything then it’s an action. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\nTransformation are of types narrow and wide. Narrow transformations are those for which each input partition will contribute to only one output partition. Wide transformation will have input partitions contributing to many output partitions.\nSparks performs a lazy evaluation which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\nSpark-submit\nReferences\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
curl --location --request POST 'https://zylalabs.com/api/5660/web+extractor+api/7369/retrieve+clean+text' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
}'
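The same request can be issued from Python. The sketch below is a minimal, hypothetical equivalent of the curl call above (it assumes the `requests` library); the endpoint URL, header, and body mirror the example, and the `response` key comes from the sample response shown above.

```python
import requests

API_KEY = "YOUR_API_KEY"  # the access key shown under "Your API Access Key" after subscribing

# Hypothetical Python equivalent of the curl example above.
resp = requests.post(
    "https://zylalabs.com/api/5660/web+extractor+api/7369/retrieve+clean+text",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": "https://techtalkverse.com/post/software-development/spark-basics/"},
)
resp.raise_for_status()

# Per the sample response above, the clean text is returned under the "response" key.
clean_text = resp.json()["response"]
print(clean_text[:200])
```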
To use this endpoint, send a request containing the URL of a web page and receive that page's content converted to markdown format.
Text Content Extract - Endpoint Features
| Object | Description |
|---|---|
| Request Body | [Required] JSON |
{"response":"---\ntitle: Spark Basics\nurl: https://techtalkverse.com/post/software-development/spark-basics/\nhostname: techtalkverse.com\ndescription: Suppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nsitename: techtalkverse.com\ndate: 2023-05-01\ncategories: ['post']\n---\n# Spark Basics\n\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\n\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. **Distributed Computing** removes this limitation of vertical scaling by distributing the processing across cluster of machines.\nNow, a group of machines alone is not powerful, you need a framework to\ncoordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\n\n## Spark Basics\n\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\n\n```\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\n```\n\n\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\n\n## Spark timeline\n\nGoogle was first to introduce large scale distributed computing solution with **MapReduce** and its own distributed file system i.e., **Google File System(GFS)**. GFS provided a blueprint for the **Hadoop File System (HDFS)**, including the MapReduce implementation as a framework for distributed computing. **Apache Hadoop** framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the **Spark**. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. 
It incorporates libraries with composable APIs for\nmachine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\n\n## Spark Application\n\n**Spark Applications** consist of a driver process and a set of executor processes. The **driver** process runs your main() function, sits on a node in the cluster. The **executors** are responsible for actually carrying out the work that the driver assigns them. The driver and executors are simply processes, which means that they can live on the same machine or different machines.\n\nThere is a **SparkSession** object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\n**Spark’s language APIs** make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the **low-level “unstructured” APIs** (RDDs), and the **higher-level structured APIs** (Dataframes, Datasets).\n\n## Spark Toolsets\n\nA **DataFrame** is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A **partition** is a collection of rows that sit on one physical machine in your cluster.\n\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a **transformation** and if it doesn’t return anything then it’s an **action**. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\n\nTransformation are of types narrow and wide. **Narrow transformations** are those for which each input partition will contribute to only one output partition. **Wide transformation** will have input partitions contributing to many output partitions.\n\nSparks performs a **lazy evaluation** which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\n\n## Spark-submit\n\n## References\n\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
curl --location --request POST 'https://zylalabs.com/api/5660/web+extractor+api/7370/text+content+extract' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
}'
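Similarly, a minimal Python sketch for this endpoint (again assuming the `requests` library) might fetch the markdown and write it to a file ready for a markdown-based blog or CMS. The `response` key and the front-matter-plus-markdown layout follow the sample response above.

```python
import requests

API_KEY = "YOUR_API_KEY"

# Hypothetical Python equivalent of the curl example above.
resp = requests.post(
    "https://zylalabs.com/api/5660/web+extractor+api/7370/text+content+extract",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": "https://techtalkverse.com/post/software-development/spark-basics/"},
)
resp.raise_for_status()

# Per the sample response above, the markdown (with its front matter) is under "response".
markdown = resp.json()["response"]

# Save the page so it can be dropped into a markdown-based blog or CMS.
with open("spark-basics.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```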
| Header | Description |
|---|---|
| Authorization | [Required] Should be Bearer access_key. After subscribing, see "Your API Access Key" above. |
No long-term commitment. Upgrade, downgrade, or cancel anytime. The free trial includes up to 50 requests.
The Web Extractor API is designed to extract clean, structured text and markdown from web pages, letting users analyze, document, or present content without interference from extraneous details such as ads or navigation elements.
The API works with both static and dynamic web pages and adapts to complex page structures to ensure consistently high-quality results during content extraction.
Key features include two dedicated endpoints: one for extracting clean text and one for converting web pages into structured markdown documents, making the API well suited to blogging and content management.
Yes, the Web Extractor API is designed to handle complex web page structures, ensuring that it accurately extracts meaningful content regardless of page layout.
The markdown conversion endpoint formats extracted content as structured markdown documents that integrate easily with markdown-based platforms for seamless content management.
The "Retrieve Clean Text" endpoint returns unformatted text extracted from the web page, while the "Text Content Extract" endpoint returns structured markdown that includes metadata such as the title, URL, description, and categories.
The "Retrieve Clean Text" response contains the clean text content. The "Text Content Extract" response includes fields such as the title, URL, hostname, description, site name, date, and categories, providing comprehensive context for the extracted content.
The "Retrieve Clean Text" response is a simple JSON object with a single "response" key containing the text. The "Text Content Extract" response is richer: its "response" key carries the metadata alongside the markdown content, structured for easy integration.
The "Retrieve Clean Text" endpoint provides the main text content of a web page, while the "Text Content Extract" endpoint provides both the text and additional metadata such as the page's title, URL, and categories, enriching the content's context.
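To illustrate this difference, the sketch below uses two hypothetical helper functions (not part of the API): one takes the plain text straight from the "response" key of a Retrieve Clean Text payload, and the other splits a Text Content Extract payload into its front-matter metadata (title, url, hostname, date, and so on) and markdown body, assuming the "---"-delimited front matter shown in the sample response.

```python
def parse_clean_text(payload: dict) -> str:
    # Retrieve Clean Text: the payload is simply {"response": "<plain text>"}.
    return payload["response"]


def parse_text_content_extract(payload: dict) -> tuple[dict, str]:
    # Text Content Extract: split the "---"-delimited front matter (as in the
    # sample response) from the markdown body and parse its "key: value" lines.
    text = payload["response"]
    metadata: dict = {}
    body = text
    if text.startswith("---"):
        _, front_matter, body = text.split("---", 2)
        for line in front_matter.strip().splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                metadata[key.strip()] = value.strip()
    return metadata, body.lstrip()
```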
Users can customize requests by specifying different URLs to extract. The API processes the provided URL and returns the corresponding clean text or markdown, allowing flexibility in content sourcing.
Typical use cases include content analysis, documentation, and blogging. Users can extract clean text for research, or convert web pages to markdown for easy integration into content management systems or blogs.
The Web Extractor API uses parsing algorithms to extract relevant content while filtering out ads and navigation elements, ensuring high accuracy in the extracted data. Ongoing updates to the extraction logic help maintain quality.
Users can expect the "Retrieve Clean Text" endpoint to return coherent paragraphs, while "Text Content Extract" produces structured markdown with clear metadata. Both outputs are designed to be clean and easy to use across a range of applications.